前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Unsupervised Classification - Sprawl Classification Algorithm

Unsupervised Classification - Sprawl Classification Algorithm

作者头像
绿巨人
发布2018-05-16 17:47:41
6620
发布2018-05-16 17:47:41
举报
文章被收录于专栏:绿巨人专栏绿巨人专栏

Idea

Points (data) in same cluster are near each others, or are connected by each others. So:

  • For a distance d,every points in a cluster always can find some points in the same cluster.
  • Distances between points in difference clusters are bigger than the distance d. The above condition maybe not correct totally, e.g. in the case of clusters which have common points, the condition will be incorrect. So need some improvement.

Sprawl Classification Algorithm

  • Input:
    • data: Training Data
    • d: The minimum distance between clusters
    • minConnectedPoints: The minimum connected points:
  • Output:
    • Result: an array of classified data
  • Logical:
代码语言:javascript
复制
Load data into TotalCache.
i = 0
while (TotalCache.size > 0) 
{
    Find a any point A from TotalCache, put A into Cache2.
    Remove A from TotalCache
    In TotalCache, find points 'nearPoints' less than d from any point in the Cache2.
    Put Cache2 points into Cache1.
    Clear Cache2.
    Put nearPoints into Cache2.
    Remove nearPoints from TotalCache.
    if Cache2.size = 0, add Cache1 points into Result[i].
    Clear Cache1.
    i++
}
Return Result

Note: As the algorithm need to calculating the distances between points, maybe need to normalize data first to each feature has same weight.

Improvement

A big problem is the method need too much calculation for the distances between points. The max times is frac{n * (n - 1)}{2}

Improvement ideas:

  • Check distance for one feature first maybe quicker. We need not to calculate the real distance for each pair, because we only need to make sure whether the distance is less than \(d\), if points x1, x2, the distance will be bigger or equals to \(d\) when there is a $ \vert x1[i] - x2[i] \vert \geqslant d$.
  • Split data in multiple area For n dimensions (features) dataset, we can split the dataset into multiple smaller datasets, each dataset is in a n dimension space whose size \(d^{n}\). We can image that each small space is a n dimensions cube and adjoin each other. so we only need to calculate points in the current space and neighbour spaces.

Cons

  • Need a amount of calculating.
  • Need to improve to handle clusters which have common points.
本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
原始发表:2016-08-12 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • Idea
  • Sprawl Classification Algorithm
  • Improvement
    • Improvement ideas:
    • Cons
    领券
    问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档