
Hierarchical clustering and dimension reduction

Lecture 8: Hierarchical clustering and dimension reduction

-be able to explain the steps of (agglomerative) hierarchical clustering, using single linkage (min)

  • Two main types of hierarchical clustering
    • Agglomerative:
      • Start with the points as individual clusters
      • At each step, merge the closest pair of clusters until only one cluster (or k clusters) left
    • Divisive:
      • Start with one, all-inclusive cluster
      • At each step, split a cluster until each cluster contains only one point (or there are k clusters)
  • Traditional hierarchical algorithms use a similarity or distance matrix
    • Merge or split one cluster at a time
  • Basic algorithm is straightforward

1. Compute the proximity matrix

2. Let each data point be a cluster

3. Repeat:

-Merge the two closest clusters

-Update the proximity matrix

4. Until only a single cluster remains

  • Key operation is the computation of the proximity of two clusters
    • Different approaches to defining the distance between clusters distinguish the different algorithms
  • We define the similarity to be the minimum distance between the clusters. This is also known as single linkage.
  • Single linkage
    • Similarity of two clusters is based on the two most similar (closest) points in the different clusters
  • Determined by one pair of points, i.e., by one link in the proximity graph.
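
To make the steps concrete, here is a minimal sketch of single-linkage agglomerative clustering using SciPy; the toy data points and the choice of k = 2 flat clusters are assumptions made purely for illustration.

```python
# Minimal sketch: agglomerative clustering with single linkage (MIN) in SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Illustrative toy data: 5 points in 2-D
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Step 1: compute the proximity (distance) matrix, in condensed form
D = pdist(X, metric='euclidean')

# Steps 2-3: each point starts as its own cluster; the closest pair of
# clusters is merged repeatedly. 'single' = minimum inter-cluster distance.
Z = linkage(D, method='single')

# Cut the merge tree to obtain, e.g., k = 2 flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```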

-understand how a hierarchical clustering corresponds to a tree structure (dendrogram)

  • define inter-cluster similarity
    • MIN (single linkage) = similarity of two clusters is based on the two most similar (closest) points in different clusters
      • Strength: can handle non-elliptical shapes
      • Limitations: sensitive to noise & outliers
    • MAX (Complete Linkage) = similarity of two clusters is based on the two least similar (most distant) points in different clusters
      • Strength: less susceptible to noise & outliers
      • Limitations: tends to break large clusters
    • Problems & Limitations
      • Once a decision is made to combine two clusters, it cannot be undone
      • No objective function is directly minimized
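
The effect of the linkage choice is easiest to see in the dendrograms themselves. The sketch below plots single (MIN) and complete (MAX) linkage side by side on made-up toy data.

```python
# Sketch: compare MIN (single) and MAX (complete) linkage via dendrograms.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative toy data
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, method in zip(axes, ['single', 'complete']):
    Z = linkage(X, method=method)   # linkage also accepts raw observations
    dendrogram(Z, ax=ax)            # each merge becomes a branch of the tree
    ax.set_title(f'{method} linkage')
plt.tight_layout()
plt.show()
```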

-be able to explain why knowledge of clustering can assist in outlier detection

  • An outlier is expected to be far away from any groups of normal objects
  • Each object is associated with exactly one cluster, and its outlier score is equal to its distance from that cluster's centre.
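
A minimal sketch of this idea, assuming k-means is used to find the groups of normal objects: each object's outlier score is its distance to the centre of its own cluster, so the injected far-away point receives the highest score.

```python
# Sketch: cluster-based outlier scoring via distance to the cluster centre.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # two tight groups of normal objects
               rng.normal(5, 0.5, (50, 2)),
               [[10.0, 10.0]]])               # one far-away (outlying) point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centres = km.cluster_centers_[km.labels_]           # centre of each object's own cluster
outlier_score = np.linalg.norm(X - centres, axis=1)

print(X[np.argmax(outlier_score)])  # the far-away point gets the highest score
```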

-understand the concept of a dissimilarity matrix and the steps for its construction

  • Clustering methods are typically distance based. Represent each object/instance as a vector (a row in the data, with a value for each feature); Euclidean distances can then be computed between pairs of vectors.
  • Commonly normalise (scale) each attribute into range [0,1] via a pre-processing step before computing distances
  • Compute all pairwise distances between objects. This gives a dissimilarity matrix D
  • The diagonal of D is all zeros
  • D is symmetric about its leading diagonal
    • D(i,j)=D(j,i) for all i and j
    • Objects follow the same order along rows and columns
  • In general, visualising the (raw) dissimilarity matrix may not reveal enough useful information
    • Further processing is needed
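
The construction steps above can be sketched as follows; the three-object toy dataset is an assumption for illustration only.

```python
# Sketch: build a dissimilarity matrix D from scaled attributes.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial.distance import pdist, squareform

X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 55.0]])  # toy objects

X_scaled = MinMaxScaler().fit_transform(X)           # scale each attribute into [0, 1]
D = squareform(pdist(X_scaled, metric='euclidean'))  # all pairwise distances

assert np.allclose(np.diag(D), 0)   # the diagonal is all zeros
assert np.allclose(D, D.T)          # symmetric: D[i, j] == D[j, i]
print(np.round(D, 3))
```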

-appreciate the relative advantages and disadvantages of k-means versus hierarchical clustering

  • K-Means:
    • Advantages
      • Easy to implement
      • With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
      • k-Means may produce tighter clusters than hierarchical clustering
      • An instance can change cluster (move to another cluster) when the centroids are recomputed.
    • Disadvantages
      • Difficult to predict the number of clusters (K-Value)
      • Initial seeds have a strong impact on the final results
      • The order of the data has an impact on the final results
      • Sensitive to scale: rescaling your dataset (normalization or standardization) will completely change the results. While this itself is not bad, not realizing that you have to pay extra attention to scaling your data might be bad.
  • Hierarchical Clustering:
    • Advantages
      • Hierarchical clustering outputs a hierarchy, i.e., a structure that is more informative than the unstructured set of flat clusters returned by k-means. It is therefore easier to decide on the number of clusters by looking at the dendrogram (see the suggestion on how to cut a dendrogram in Lab 8).
      • Easy to implement
    • Disadvantages
      • It is not possible to undo the previous step: once the instances have been assigned to a cluster, they can no longer be moved around.
      • Time complexity: not suitable for large datasets
      • Initial seeds have a strong impact on the final results
      • The order of the data has an impact on the final results
      • Very sensitive to outliers

-appreciate the following motivations for dimensionality reduction i) reduce amount of time and memory required by data processing algorithms, ii) allow data to be more easily visualized, iii) help eliminate irrelevant features or noise

  • Motivation
    • The curse of dimensionality: data analysis techniques that work well in lower dimensions often perform poorly as the dimensionality (number of features) of the analysed data increases
    • As dimensionality increases:
      • data becomes increasingly sparse & all the distances between pairs of points begin to look the same
      • impacts any algorithm that is based on distances between objects
  • Purpose
    • avoid curse of dimensionality
    • reduce amount of time & memory required by data processing algorithms
    • allow data to be more easily visualised
    • may help to eliminate irrelevant features / reduce noise

-understand the concept of dimensionality reduction of a dataset (what is the input and what is the output and what is their relationship)

  • Method
    • Input: a dataset with N features & K objects
      • one option is to select a subset of the original features
    • Output: a transformed dataset with n < N features & the same K objects
      • n is often 2 or 3 -> easily visualised
      • the n output features do not need to be a subset of the input N features
      • they can be new features whose values are constructed using some function applied to the input N features
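
A short sketch of this input/output relationship, with illustrative sizes (K = 100 objects, N = 10 features reduced to n = 2) and PCA as one possible way of constructing the new features:

```python
# Sketch: dimensionality reduction keeps the objects, shrinks the features.
import numpy as np
from sklearn.decomposition import PCA

K, N, n = 100, 10, 2
X = np.random.default_rng(0).normal(size=(K, N))   # K objects, N features

X_reduced = PCA(n_components=n).fit_transform(X)   # n new, constructed features
print(X.shape, '->', X_reduced.shape)              # (100, 10) -> (100, 2)
```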

-understand that dimensionality reduction may be performed by i) selecting a subset of the original features or ii) generating a small number of new features

-understand the purpose of using PCA for dimensionality reduction. Understand the potential benefits of using PCA for data visualisation

Principal Component Analysis (PCA)

  • Intuition: select fewer features while losing as little as possible of the information needed to describe the objects
  • Motivation: it is a systematic way to perform dimensionality reduction, much better than doing it manually
  • PCA idea: find a new axis (direction) along which the data has the largest variance
  • In a 2-D plot, not much information is lost when PC2 is dropped, since it captures the least variance
  • Disadvantages: can lose a lot of information if applied carelessly; can take too long for large datasets
  • Second principal component: orthogonal to the first, and captures as much of the remaining variability as possible
  • Find a new set of features that better captures the variability of the data
    • First dimension chosen to capture as much of the variability as possible
    • Second dimension is orthogonal to the first & captures as much of the remaining variability as possible
    • Third dimension is orthogonal to the first & second, and captures as much of the remaining variability as possible
  • Goal: find a projection that captures the largest amount of variation in data
  • Principal Component Analysis (PCA) is a well-established mathematical technique for reducing the dimensionality of data, while keeping as much variation as possible.
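
A minimal sketch of using PCA for 2-D visualisation; the Iris dataset from scikit-learn is an arbitrary choice of example data.

```python
# Sketch: project a dataset onto its first two principal components and plot it.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # standardise features before PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)             # coordinates along PC1 and PC2

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel('PC1')
plt.ylabel('PC2')
print(pca.explained_variance_ratio_)        # fraction of variance kept by PC1, PC2
plt.show()
```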

-understand the intuition of how PCA works. It is not necessary to understand the mathematical formulas used for PCA.

  • Select fewer features while losing as little as possible of the information needed to describe the objects

-Understand intuitively what is meant by "First Principal Component" and "Second Principal Component"

  • Find a new set of features that better captures the variability of the data
    • First dimension chosen to capture as much of the variability as possible.
    • The second dimension is orthogonal to the first and, subject to that constraint, captures as much of the remaining variability as possible.
    • The third dimension is orthogonal to the first and second, and subject to that constraint, captures as much of the remaining variability as possible.

-understand how to interpret a 2-D visualization of the first two principal components of a dataset

  • Step 1: Standardization
  • Step 2: Covariance matrix computation
  • Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
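
These three steps can be sketched directly in NumPy; the random data is an illustrative stand-in for any K-objects-by-N-features array.

```python
# Sketch: PCA step by step (standardise, covariance matrix, eigen-decomposition).
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 4))   # illustrative K x N data

# Step 1: standardization (zero mean, unit variance per feature)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardised features
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors/eigenvalues; the eigenvectors with the largest
# eigenvalues are the principal components
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]   # keep the first two principal components

X_2d = Z @ components                # project the data onto PC1 and PC2
print(X_2d.shape)                    # (100, 2)
```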
