dbscan:是一种简单的,基于密度的聚类算法。本次实现中,dbscan使用了基于中心的方法。在基于中心的方法中,每个数据点的密度通过对以该点为中心以边长为2*eps的网格(邻域)内的其他数据点的个数来度量。根据数据点的密度分为三类点:
核心点:该点在邻域内的密度超过给定的阀值minps。
边界点:该点不是核心点,但是其邻域内包含至少一个核心点。
噪音点:不是核心点,也不是边界点。
有了以上对数据点的划分,聚合可以这样进行:各个核心点与其邻域内的所有核心点放在同一个簇中,把边界点跟其邻域内的某个核心点放在同一个簇中。
# scoding=utf-8import pylab as plfrom collections import defaultdict,counterpoints = [[int(eachpoint.split(#)[0]), int(eachpoint.split(#)[1])] for eachpoint in open(points,r)]# 计算每个数据点相邻的数据点,邻域定义为以该点为中心以边长为2*eps的网格eps = 10surroundpoints = defaultdict(list)for idx1,point1 in enumerate(points): for idx2,point2 in enumerate(points): if (idx1 < idx2): if(abs(point1[0]-point2[0])<=eps and abs(point1[1]-point2[1])=minpts]# 邻域内包含某个核心点的非核心点,定义为边界点borderpointidx = []for pointidx,surpointidxs in surroundpoints.iteritems(): if (pointidx not in corepointidx): for onesurpointidx in surpointidxs: if onesurpointidx in corepointidx: borderpointidx.append(pointidx) break# 噪音点既不是边界点也不是核心点noisepointidx = [pointidx for pointidx in range(len(points)) if pointidx not in corepointidx and pointidx not in borderpointidx]corepoint = [points[pointidx] for pointidx in corepointidx] borderpoint = [points[pointidx] for pointidx in borderpointidx]noisepoint = [points[pointidx] for pointidx in noisepointidx]# pl.plot([eachpoint[0] for eachpoint in corepoint], [eachpoint[1] for eachpoint in corepoint], 'or')# pl.plot([eachpoint[0] for eachpoint in borderpoint], [eachpoint[1] for eachpoint in borderpoint], 'oy')# pl.plot([eachpoint[0] for eachpoint in noisepoint], [eachpoint[1] for eachpoint in noisepoint], 'ok')groups = [idx for idx in range(len(points))]# 各个核心点与其邻域内的所有核心点放在同一个簇中for pointidx,surroundidxs in surroundpoints.iteritems(): for onesurroundidx in surroundidxs: if (pointidx in corepointidx and onesurroundidx in corepointidx and pointidx < onesurroundidx): for idx in range(len(groups)): if groups[idx] == groups[onesurroundidx]: groups[idx] = groups[pointidx]# 边界点跟其邻域内的某个核心点放在同一个簇中for pointidx,surroundidxs in surroundpoints.iteritems(): for onesurroundidx in surroundidxs: if (pointidx in borderpointidx and onesurroundidx in corepointidx): groups[pointidx] = groups[onesurroundidx] break# 取簇规模最大的5个簇wantgroupnum = 3finalgroup = counter(groups).most_common(3)finalgroup = [onecount[0] for onecount in finalgroup]group1 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalgroup[0]]group2 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalgroup[1]]group3 = [points[idx] for idx in xrange(len(points)) if groups[idx]==finalgroup[2]]pl.plot([eachpoint[0] for eachpoint in group1], [eachpoint[1] for eachpoint in group1], 'or')pl.plot([eachpoint[0] for eachpoint in group2], [eachpoint[1] for eachpoint in group2], 'oy')pl.plot([eachpoint[0] for eachpoint in group3], [eachpoint[1] for eachpoint in group3], 'og')# 打印噪音点,黑色pl.plot([eachpoint[0] for eachpoint in noisepoint], [eachpoint[1] for eachpoint in noisepoint], 'ok') pl.show()
运行效果截图如下:
希望本文所述对大家python程序设计有所帮助。
