Lecture 7: Clustering and clustering visualisation

-be able to explain why it is useful to perform clustering on a dataset and understand the challenges involved

  • clustering is used to find structure in unlabelled data
  • it can discover which sets of data show similar patterns
  • clustering is a major task in data analysis and visualisation
  • what are the challenges?
    • a bad clustering may mislead us about the structure of the data
    • how to define K (the number of clusters)
    • which distance formula to use
    • which features to use
  • For datasets with more than 4 dimensions (features)
    • Difficult to visualise
  • How can we determine what the significant groups/segments/communities are?
    • If we have this information
      • Can understand the data better
      • Apply separate interventions to each group (e.g. marketing campaign)
  • Challenges
    • Data Distribution
      • Large number of samples.
        • The number of samples to be processed is very high, so algorithms have to be conscious of scaling issues. Like many interesting problems, clustering in general is NP-hard; practical and successful data-mining algorithms usually scale linearly or log-linearly. Quadratic or cubic scaling may be acceptable, but linear behaviour is highly desirable.
      • High dimensionality.
        • The number of features is very high and may even exceed the number of samples, so one has to face the curse of dimensionality (a short demonstration follows this list).
      • Sparsity.
        • Most features are zero for most samples, i.e. the object-feature matrix is sparse. This property strongly affects similarity measurements and computational complexity.
      • Strong non-Gaussian distribution of feature values.
        • The data is so skewed that it cannot be safely modelled by normal distributions.
      • Significant outliers.
        • Outliers may have significant importance. Finding these outliers is highly non-trivial, and removing them is not necessarily desirable.
    • Application context
      • Legacy clusterings.
        • Previous cluster analysis results are often available. This knowledge should be reused instead of starting each analysis from scratch.
      • Distributed data.
        • Large systems often have heterogeneous distributed data sources. Local cluster analysis results have to be integrated into global models.
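
A quick way to see the curse of dimensionality mentioned above is to watch pairwise distances concentrate as the number of features grows. A minimal sketch, assuming NumPy and SciPy are available (the sample size and dimensions are arbitrary choices for illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((200, dim))   # 200 random points in the unit hypercube
    d = pdist(X)                 # all pairwise Euclidean distances
    print(f"dim={dim:4d}  min/max distance ratio = {d.min() / d.max():.3f}")
```

As the dimension grows, the ratio of the smallest to the largest pairwise distance approaches 1, so "nearest" and "farthest" become almost indistinguishable and distance-based clustering gets harder.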

-understand the steps of the k-means algorithm

  • Step 1: Select the number of clusters you want to identify in your data. This is the K in "K-means clustering"
  • Step 2: Randomly select K distinct data points; these are the initial cluster centres
  • Step 3: Measure the distance between the 1st point and the K clusters
  • Step 4: Assign the 1st point to the nearest cluster. Do the same for the next point, until all of the points are in clusters
  • Step 5: Calculate the mean of each cluster; this becomes the new cluster centre
  • Step 6: Repeat steps 3-5 until the assignments no longer change (a code sketch follows this list)
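
The six steps translate almost directly into code. Below is a minimal NumPy sketch of this procedure (often called Lloyd's algorithm); the toy dataset and the empty-cluster handling are illustrative assumptions, not part of the lecture notes:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) following the six steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly select k distinct data points as the initial centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3-4: assign every point to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centre as the mean of its cluster
        # (keep the old centre if a cluster happens to end up empty)
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # Step 6: stop when the centres no longer change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Toy usage: two well-separated blobs, K = 2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centres = kmeans(X, k=2)
print(centres)   # roughly (0, 0) and (5, 5)
```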

-be able to identify scenarios where the k-means algorithm may perform poorly

  • Different runs of the algorithm will produce different results (it converges to a local optimum)
  • Different values of K can lead to different results
  • The initial seed points are typically chosen randomly
    • Hence different runs produce different results (see the sketch after this list)
  • Closeness is measured by Euclidean distance (other distance functions can also be used)
  • The algorithm can be shown to converge (to a local optimum) and typically doesn't require many iterations
  • An outlier is expected to be far away from any group of normal objects
  • Each object is associated with exactly one cluster, and its outlier score is equal to the distance from its cluster centre
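
A small sketch of the seed sensitivity, assuming scikit-learn is installed: single runs (n_init=1) started from different random seeds can converge to different local optima, visible as different final inertia (total within-cluster squared distance). The three-blob dataset is an illustrative assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three separated blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in (0, 6, 12)])

for seed in range(5):
    # a single run from one random initialisation per seed
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
```

Runs that land in a poor local optimum (e.g. two centres inside one blob) report a noticeably larger inertia, which is why practical implementations restart from several seeds and keep the best result.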

-be able to interpret a heat map visualisation of a dissimilarity matrix

  • The diagonal of D is all zeros
  • D is symmetric about its leading diagonal
    • D(i,j)=D(j,i) for all i and j
    • Objects follow the same order along rows and columns
  • In general, visualising the (raw) dissimilarity matrix may not reveal enough useful information
    • Further processing, such as reordering, is needed (a plotting sketch follows this list)
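
A minimal heat-map sketch, assuming NumPy, SciPy and Matplotlib; the two-blob dataset is an illustrative assumption. The rows are shuffled so that, as noted above, the raw matrix shows little obvious structure even though two tight groups exist:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
rng.shuffle(X)                      # unordered rows hide the block structure

D = squareform(pdist(X))            # n x n, symmetric, zero diagonal
plt.imshow(D, cmap="gray")          # dark = low dissimilarity (similar objects)
plt.colorbar(label="dissimilarity")
plt.title("Raw (unordered) dissimilarity matrix")
plt.show()
```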

-understand the steps (pseudo code) for reordering a dissimilarity matrix using the VAT algorithm

  • The VAT algorithm will not be effective in every situation
  • For datasets with complex cluster shapes (significant overlap between clusters, or irregular cluster geometries), the quality of the VAT image may degrade significantly

-understand why the VAT algorithm is useful and how to interpret a dissimilarity matrix that has been reordered using the VAT algorithm

  • Reordering the matrix reveals the clusters
    • Nearby objects in the ordering are similar to each other
    • Producing large dark blocks
    • Diagonal dark block = tight group exists in the data (low within-cluster dissimilarities)
  • VAT algorithm (a code sketch follows this list)
    • Choose the pair of objects that are furthest apart; start the ordering with one of them
    • Choose the next object as the one closest to any already-selected object
    • Repeat, always selecting the remaining object closest to the selected set
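
A minimal sketch of that reordering, assuming NumPy and a symmetric dissimilarity matrix D with a zero diagonal; this follows the Prim-like selection described above and is an illustration, not a definitive implementation:

```python
import numpy as np

def vat(D):
    """Reorder a symmetric dissimilarity matrix D with the VAT heuristic."""
    n = D.shape[0]
    # start from one end of the largest pairwise dissimilarity
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order, remaining = [i], [j for j in range(n) if j != i]
    while remaining:
        # pick the remaining object closest to any already-selected object
        sub = D[np.ix_(order, remaining)]
        j = remaining[int(np.argmin(sub.min(axis=0)))]
        order.append(j)
        remaining.remove(j)
    order = np.asarray(order)
    return D[np.ix_(order, order)], order   # reordered matrix + permutation
```

Re-plotting vat(D)[0] with the heat-map code above should reveal dark blocks along the diagonal, one per tight group.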

-understand how VAT may be used to estimate the number of clusters in a dataset

  • Each dark block along the diagonal of the reordered matrix corresponds to one tight group, so counting the dark diagonal blocks gives an estimate of the number of clusters
