This article is part of a code reproduction of Li Hang's《统计学习方法》(Statistical Learning Methods). Author: 黄海广. Note: all of the code can be downloaded from GitHub. I will be publishing the code on the official account "机器学习初学者" (Machine Learning Beginners), where it can be read online in this collection.
PCA (principal component analysis) uses the idea of dimensionality reduction to transform a large number of indicators into a small number of composite indicators.
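As a minimal illustration of that idea (using synthetic data, not the exercise dataset loaded below), the following sketch reduces two highly correlated indicators to a single composite coordinate via SVD:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic 2-D data: two highly correlated "indicators"
x = rng.normal(size=100)
X = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=100)])

# center the data, then take the SVD of its covariance matrix
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / Xc.shape[0]
U, S, V = np.linalg.svd(cov)

# project onto the first principal component: 2 indicators -> 1 composite index
Z = Xc @ U[:, :1]
print(Z.shape)          # (100, 1)
print(S[0] / S.sum())   # first component carries nearly all the variance
```

Because the two columns are almost linearly dependent, one component captures nearly all the variance, which is exactly the situation PCA exploits.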
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
# load the example dataset from the MATLAB file
data = loadmat('data/ex7data1.mat')
X = data['X']
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(X[:, 0], X[:, 1])
plt.show()
The PCA algorithm itself is fairly simple. After ensuring the data has been normalized, the output is simply the singular value decomposition of the covariance matrix of the original data.
def pca(X):
    # normalize the features
    X = (X - X.mean()) / X.std()

    # compute the covariance matrix
    X = np.matrix(X)
    cov = (X.T * X) / X.shape[0]

    # perform SVD
    U, S, V = np.linalg.svd(cov)

    return U, S, V
U, S, V = pca(X)
U, S, V
(matrix([[-0.79241747, -0.60997914],
         [-0.60997914,  0.79241747]]),
 array([1.43584536, 0.56415464]),
 matrix([[-0.79241747, -0.60997914],
         [-0.60997914,  0.79241747]]))
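The singular values S indicate how much variance each component explains. Using the values printed above (hardcoded here so the snippet stands alone):

```python
import numpy as np

# singular values of the covariance matrix, taken from the output above
S = np.array([1.43584536, 0.56415464])

# fraction of the total variance captured by each principal component
ratio = S / S.sum()
print(ratio)  # the first component explains about 72% of the variance
```

This ratio is the usual criterion for choosing k: keep enough components to reach a target fraction (say 95%) of the variance.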
Now that we have the principal components (the matrix U), we can use them to project the original data into a lower-dimensional space. For this task we will implement a function that computes the projection and keeps only the top k components, effectively reducing the dimensionality.
def project_data(X, U, k):
    # keep only the first k principal components and project onto them
    U_reduced = U[:, :k]
    return np.dot(X, U_reduced)
Z = project_data(X, U, 1)
Z
matrix([[-4.74689738],
[-7.15889408],
[-4.79563345],
[-4.45754509],
[-4.80263579],
[-7.04081342],
[-4.97025076],
[-8.75934561],
[-6.2232703 ],
[-7.04497331],
[-6.91702866],
[-6.79543508],
[-6.3438312 ],
[-6.99891495],
[-4.54558119],
[-8.31574426],
[-7.16920841],
[-5.08083842],
[-8.54077427],
[-6.94102769],
[-8.5978815 ],
[-5.76620067],
[-8.2020797 ],
[-6.23890078],
[-4.37943868],
[-5.56947441],
[-7.53865023],
[-7.70645413],
[-5.17158343],
[-6.19268884],
[-6.24385246],
[-8.02715303],
[-4.81235176],
[-7.07993347],
[-5.45953289],
[-7.60014707],
[-4.39612191],
[-7.82288033],
[-3.40498213],
[-6.54290343],
[-7.17879573],
[-5.22572421],
[-4.83081168],
[-7.23907851],
[-4.36164051],
[-6.44590096],
[-2.69118076],
[-4.61386195],
[-5.88236227],
[-7.76732508]])
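Because the columns of U are orthonormal, projecting onto the top k columns and then mapping back multiplies the data by U_k U_kᵀ, which is an orthogonal projection. A quick sanity check with the U computed earlier (values hardcoded from the output above):

```python
import numpy as np

# principal components, taken from the pca() output above
U = np.array([[-0.79241747, -0.60997914],
              [-0.60997914,  0.79241747]])

# U is orthogonal: U @ U.T is (numerically) the identity
print(np.allclose(U @ U.T, np.eye(2)))  # True

# projecting onto the first component and back is idempotent: P @ P == P
P = U[:, :1] @ U[:, :1].T
print(np.allclose(P @ P, P))  # True
```

Idempotence is what makes the recovery step below well defined: applying the projection twice changes nothing.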
We can also recover an approximation of the original data by reversing the transformation.
def recover_data(Z, U, k):
    # map the projected data back into the original space
    U_reduced = U[:, :k]
    return np.dot(Z, U_reduced.T)
X_recovered = recover_data(Z, U, 1)
X_recovered
matrix([[3.76152442, 2.89550838],
[5.67283275, 4.36677606],
[3.80014373, 2.92523637],
[3.53223661, 2.71900952],
[3.80569251, 2.92950765],
[5.57926356, 4.29474931],
[3.93851354, 3.03174929],
[6.94105849, 5.3430181 ],
[4.93142811, 3.79606507],
[5.58255993, 4.29728676],
[5.48117436, 4.21924319],
[5.38482148, 4.14507365],
[5.02696267, 3.8696047 ],
[5.54606249, 4.26919213],
[3.60199795, 2.77270971],
[6.58954104, 5.07243054],
[5.681006 , 4.37306758],
[4.02614513, 3.09920545],
[6.76785875, 5.20969415],
[5.50019161, 4.2338821 ],
[6.81311151, 5.24452836],
[4.56923815, 3.51726213],
[6.49947125, 5.00309752],
[4.94381398, 3.80559934],
[3.47034372, 2.67136624],
[4.41334883, 3.39726321],
[5.97375815, 4.59841938],
[6.10672889, 4.70077626],
[4.09805306, 3.15455801],
[4.90719483, 3.77741101],
[4.94773778, 3.80861976],
[6.36085631, 4.8963959 ],
[3.81339161, 2.93543419],
[5.61026298, 4.31861173],
[4.32622924, 3.33020118],
[6.02248932, 4.63593118],
[3.48356381, 2.68154267],
[6.19898705, 4.77179382],
[2.69816733, 2.07696807],
[5.18471099, 3.99103461],
[5.68860316, 4.37891565],
[4.14095516, 3.18758276],
[3.82801958, 2.94669436],
[5.73637229, 4.41568689],
[3.45624014, 2.66050973],
[5.10784454, 3.93186513],
[2.13253865, 1.64156413],
[3.65610482, 2.81435955],
[4.66128664, 3.58811828],
[6.1549641 , 4.73790627]])
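Since only k=1 component was kept, every recovered point lies on the line through the origin spanned by the first principal component, so the ratio of its two coordinates is constant. A check against a few of the values above (hardcoded so the snippet stands alone):

```python
import numpy as np

# a few recovered points, copied from the output above
X_rec = np.array([[3.76152442, 2.89550838],
                  [5.67283275, 4.36677606],
                  [6.1549641 , 4.73790627]])

# first principal component, from the earlier pca() output
u1 = np.array([-0.79241747, -0.60997914])

# each point's coordinate ratio matches u1[0] / u1[1]
ratios = X_rec[:, 0] / X_rec[:, 1]
print(np.allclose(ratios, u1[0] / u1[1]))  # True
```

This collinearity is why the scatter plot of the recovered data below forms a straight line.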
fig, ax = plt.subplots(figsize=(12, 8))
# the recovered points all lie along the first principal component
ax.scatter(list(X_recovered[:, 0]), list(X_recovered[:, 1]))
plt.show()
Download
https://github.com/fengdu78/lihang-code
[1] 《统计学习方法》: https://baike.baidu.com/item/统计学习方法/10430179
[2] 黄海广: https://github.com/fengdu78
[3] github: https://github.com/fengdu78/lihang-code
[4] wzyonggege: https://github.com/wzyonggege/statistical-learning-method
[5] WenDesi: https://github.com/WenDesi/lihang_book_algorithm
[6] 火烫火烫的: https://blog.csdn.net/tudaodiaozhale
[7] hktxt: https://github.com/hktxt/Learn-Statistical-Learning-Method