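Unsupervised learning is defined in contrast to supervised learning: the samples consist of data only, with no labels, and the model discovers the relationships among the samples on its own. It can be used for clustering (clustering algorithms) and for dimensionality reduction (principal component analysis), among other tasks.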
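When ground-truth labels are available for the samples, the clustering can be evaluated with the ARI (Adjusted Rand Index), defined as $$RI = \cfrac{a + b}{C_{2}^{n_{sample}}}$$ $$ARI = \cfrac{RI - E(RI)}{\max(RI) - E(RI)}$$ where a is the number of sample pairs that fall in the same group under both the true labels and the predicted clustering, b is the number of sample pairs that fall in different groups under both, the denominator of RI is the total number of sample pairs in the dataset, and E(RI) is the expected value of RI under random labelling.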
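The larger the value, the better the clustering agrees with the true labels: a perfect match scores 1, and random assignments score close to 0.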
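To make the pair counting behind RI and ARI concrete, here is a minimal sketch that uses only the standard library. The function names `rand_index` and `adjusted_rand_index` are illustrative and not part of scikit-learn; the adjusted version uses the usual contingency-table form and assumes at least two samples.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree."""
    a = b = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            a += 1  # pair grouped together by both labelings
        elif not same_true and not same_pred:
            b += 1  # pair separated by both labelings
    return (a + b) / len(pairs)


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected index computed from the contingency table."""
    total_pairs = comb(len(labels_true), 2)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs  # expected index under chance
    max_index = (sum_rows + sum_cols) / 2         # largest attainable index
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]  # same partition, different label names
print(rand_index(truth, relabelled), adjusted_rand_index(truth, relabelled))
```

Both functions depend only on which samples are grouped together, so renaming the cluster labels does not change the score. This matters for k-means, whose cluster numbers are arbitrary.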
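The silhouette coefficient requires no prior knowledge such as ground-truth labels. For a single sample it is computed as follows: let a be the mean distance between the sample and all other samples in its own cluster, and let b be the mean distance between the sample and all samples in the nearest neighbouring cluster; the silhouette value of the sample is then $$s = \cfrac{b - a}{\max(a, b)}$$
Repeat this process for every sample and take the mean as the silhouette coefficient. The value lies between -1 and 1, and a larger value indicates tighter, better separated clusters.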
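The per-sample computation and the final averaging can be sketched as follows. This is a minimal standard-library illustration, not the scikit-learn implementation: the name `silhouette_coefficient` is hypothetical, Euclidean distance is assumed, and a sample that is alone in its cluster is given a score of 0 by convention.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all samples (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself contributes 0 to the sum)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_coefficient(points, [0, 0, 1, 1]))  # compact, well separated
print(silhouette_coefficient(points, [0, 1, 0, 1]))  # clusters cut across the gap
```

Only the feature vectors and the cluster assignments enter the computation, which is why the metric can be used when no true labels exist.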
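K-means clustering is a simple unsupervised learning model. It is distance-based: points that lie close together in feature space are grouped into the same cluster. Training a k-means model involves the following steps: first, choose k points as the initial cluster centers; second, assign every sample to its nearest center; third, recompute each center as the mean of the samples assigned to it; finally, repeat the assignment and update steps until the assignments stop changing or a maximum number of iterations is reached.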
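The training loop of k-means (initialise the centers, assign samples, update the centers, repeat) can be sketched in a few lines of plain Python. This is an illustrative toy version under stated assumptions, not what `sklearn.cluster.KMeans` does internally: the function name `kmeans`, the `n_iter` and `seed` parameters, and the choice of random samples as initial centers are all assumptions made for the example.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, n_iter=100, seed=0):
    """Toy k-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k samples as the initial cluster centers.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged: converged
        labels = new_labels
        # Step 3: move each center to the mean of its assigned samples.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, labels


samples = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
centers, assignment = kmeans(samples, k=2)
print(centers, assignment)
```

The result can depend on the initial centers, since k-means only finds a local optimum. Library implementations typically reduce this sensitivity with better initialisation and several restarts.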
import numpy as np
import pandas as pd
digits_train = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra', header=None)
digits_test = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes', header=None)
print(digits_test[:2])
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 \
0 0 0 5 13 9 1 0 0 0 0 ... 0 0 0 6 13 10 0 0
1 0 0 0 12 13 5 0 0 0 0 ... 0 0 0 0 11 16 10 0
63 64
0 0 0
1 0 1
[2 rows x 65 columns]
# Columns 0-63 hold the 8x8 pixel features; column 64 holds the digit label.
x_train = digits_train[np.arange(64)]
x_test = digits_test[np.arange(64)]
y_train = digits_train[64]
y_test = digits_test[64]
from sklearn.cluster import KMeans
kme = KMeans(n_clusters=10)
model = kme.fit(x_train)  # KMeans is unsupervised: fit() ignores y, so the labels are not passed
y_pre = kme.predict(x_test)
from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(y_test,y_pre)
0.66305779493265249
from sklearn.metrics import silhouette_score
# The silhouette coefficient is computed from the feature matrix and the predicted cluster labels, so pass x_test rather than y_test.
# Calling silhouette_score(y_test.values.reshape(-1,1), y_pre.reshape(-1,1), metric="euclidean") instead treats the true digit labels as a one-dimensional feature; that call returned 0.27296875226980805 together with a DataConversionWarning.
silhouette_score(x_test, y_pre, metric="euclidean")