This article is part of the code reproduction of Li Hang's book Statistical Learning Methods. Author: Huang Haiguang. Note: all of the code can be downloaded from GitHub.
Probabilistic latent semantic analysis is a method that applies a probabilistic generative model to topic analysis of a text collection. It was proposed under the inspiration of latent semantic analysis, and the two can be related through matrix factorization.
Given a text collection, probabilistic latent semantic analysis yields, for each text, the conditional probability distribution over topics, and, for each topic, the conditional probability distribution over words.
The probabilistic latent semantic model comes in two equivalent forms: a generative model and a co-occurrence model. Its learning strategy is maximum likelihood estimation on the observed data, and its learning algorithm is the EM algorithm.
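Concretely (in the book's notation, with texts $d_i$, words $w_j$, topics $z_k$, and observed counts $n(d_i, w_j)$), the objective being maximized is the log-likelihood of the word-text counts:

$$L = \sum_{i=1}^{N}\sum_{j=1}^{M} n(d_i, w_j)\,\log P(w_j \mid d_i) = \sum_{i=1}^{N}\sum_{j=1}^{M} n(d_i, w_j)\,\log\!\Big[\sum_{k=1}^{K} P(w_j \mid z_k)\,P(z_k \mid d_i)\Big].$$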
Probabilistic latent semantic analysis (PLSA), also called probabilistic latent semantic indexing (PLSI), is an unsupervised learning method that applies a probabilistic generative model to topic analysis of a text collection.
The model's defining feature is that topics are represented by latent variables: the model as a whole describes the process in which texts generate topics and topics generate words, thereby producing the word-text co-occurrence data. It assumes that each text is determined by a distribution over topics, and each topic is determined by a distribution over words.
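Written out in the same notation, the generative model factorizes the probability of a word-text pair through the latent topic:

$$P(w, d) = P(d)\,P(w \mid d) = P(d)\sum_{z} P(z \mid d)\,P(w \mid z).$$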
18.1.3 The Co-occurrence Model
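The co-occurrence model expresses the same word-text probability symmetrically through the topic variable:

$$P(w, d) = \sum_{z} P(z)\,P(w \mid z)\,P(d \mid z),$$

which assigns the same likelihood to the co-occurrence data as the generative model; the two models are equivalent.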
Algorithm 18.1 (EM algorithm for parameter estimation of the probabilistic latent semantic model)
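The code below implements the two alternating steps of Algorithm 18.1. E step: compute the topic posterior for every text-word pair,

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\,P(z_k \mid d_i)}{\sum_{k=1}^{K} P(w_j \mid z_k)\,P(z_k \mid d_i)}.$$

M step: re-estimate the parameters with the counts $n(d_i, w_j)$ as weights,

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\,P(z_k \mid d_i, w_j)}{\sum_{j=1}^{M}\sum_{i=1}^{N} n(d_i, w_j)\,P(z_k \mid d_i, w_j)}, \qquad P(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j)\,P(z_k \mid d_i, w_j)}{n(d_i)},$$

where $n(d_i) = \sum_{j=1}^{M} n(d_i, w_j)$ is the total word count of text $d_i$.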
import numpy as np
# word-text matrix: 11 rows (words) x 9 columns (texts)
X = [[0, 0, 1, 1, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 1, 0, 0, 1],
     [0, 1, 0, 0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 0, 1, 0, 1],
     [1, 0, 0, 0, 0, 1, 0, 0, 0],
     [1, 1, 1, 1, 1, 1, 1, 1, 1],
     [1, 0, 1, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 1, 0, 1],
     [0, 0, 0, 0, 0, 2, 0, 0, 1],
     [1, 0, 1, 0, 0, 0, 0, 1, 0],
     [0, 0, 0, 1, 1, 0, 0, 0, 0]]
X = np.asarray(X); X
array([[0, 0, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 1],
[1, 0, 0, 0, 0, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 1],
[0, 0, 0, 0, 0, 2, 0, 0, 1],
[1, 0, 1, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 1, 1, 0, 0, 0, 0]])
X.shape
(11, 9)
X = X.T; X  # transpose so that rows are texts (d) and columns are words (w)
array([[0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 1, 1, 0, 0, 2, 0, 0],
[0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
[0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0]])
class PLSA:
    def __init__(self, K, max_iter):
        self.K = K                  # number of topics
        self.max_iter = max_iter    # number of EM iterations

    def fit(self, X):
        n_d, n_w = X.shape

        # P(z|d,w): topic posterior for each text-word pair
        p_z_dw = np.zeros((n_d, n_w, self.K))

        # P(z|d): topic distribution of each text
        p_z_d = np.random.rand(n_d, self.K)

        # P(w|z): word distribution of each topic
        p_w_z = np.random.rand(self.K, n_w)

        for i_iter in range(self.max_iter):
            # E step: P(z|d,w) is proportional to P(z|d) * P(w|z)
            for di in range(n_d):
                for wi in range(n_w):
                    sum_zk = np.zeros((self.K))
                    for zi in range(self.K):
                        sum_zk[zi] = p_z_d[di, zi] * p_w_z[zi, wi]
                    sum1 = np.sum(sum_zk)
                    if sum1 == 0:
                        sum1 = 1
                    for zi in range(self.K):
                        p_z_dw[di, wi, zi] = sum_zk[zi] / sum1

            # M step

            # update P(z|d): average the posteriors weighted by the counts n(d,w)
            for di in range(n_d):
                for zi in range(self.K):
                    sum1 = 0.
                    sum2 = 0.
                    for wi in range(n_w):
                        sum1 = sum1 + X[di, wi] * p_z_dw[di, wi, zi]
                        sum2 = sum2 + X[di, wi]
                    if sum2 == 0:
                        sum2 = 1
                    p_z_d[di, zi] = sum1 / sum2

            # update P(w|z): accumulate weighted posteriors, then normalize over words
            for zi in range(self.K):
                sum2 = np.zeros((n_w))
                for wi in range(n_w):
                    for di in range(n_d):
                        sum2[wi] = sum2[wi] + X[di, wi] * p_z_dw[di, wi, zi]
                sum1 = np.sum(sum2)
                if sum1 == 0:
                    sum1 = 1
                for wi in range(n_w):
                    p_w_z[zi, wi] = sum2[wi] / sum1

        return p_w_z, p_z_d

# adapted from https://github.com/lipiji/PG_PLSA/blob/master/plsa.py
model = PLSA(2, 100)  # K = 2 topics, 100 EM iterations
p_w_z, p_z_d = model.fit(X)
p_w_z
array([[0.64238757, 0.05486094, 0.18905573, 0.24047994, 0.41230822,
0.38136674, 0.81525232, 0.74314243, 0.32465342, 0.19798429,
0.72010476],
[0.6337431 , 0.79442181, 0.96755364, 0.22924392, 0.99367301,
0.20277986, 0.40513752, 0.51164374, 0.73750246, 0.22300907,
0.17339099]])
p_z_d
array([[7.14884177e-01, 2.85115823e-01],
[5.38307075e-02, 9.46169293e-01],
[1.00000000e+00, 3.40624611e-11],
[1.00000000e+00, 1.12459358e-24],
[1.00000000e+00, 5.00831891e-42],
[1.66511004e-19, 1.00000000e+00],
[1.00000000e+00, 8.02144289e-15],
[1.04149223e-02, 9.89585078e-01],
[5.96793031e-03, 9.94032070e-01]])
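Each row of p_z_d is one text's distribution over the two topics (the rows sum to one): in this run, texts 3, 4, 5, and 7 fall almost entirely under the first topic, texts 2, 6, 8, and 9 under the second, and text 1 is mixed. Since the parameters are initialized with np.random.rand and no seed is fixed, the exact values (and which topic gets which label) differ from run to run.

As a side note, the triple loops above can be collapsed into array operations. Below is a minimal vectorized sketch of the same EM updates; the function name plsa_em and the fixed seed are our own choices for reproducibility, not part of the original code.

import numpy as np

def plsa_em(X, K, max_iter=100, seed=0):
    # X: count matrix of shape (n_d, n_w), rows are texts, columns are words
    rng = np.random.default_rng(seed)  # assumption: a fixed seed, for repeatable runs
    n_d, n_w = X.shape
    p_z_d = rng.random((n_d, K))                 # P(z|d)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, n_w))                 # P(w|z)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # E step: P(z|d,w) proportional to P(z|d) * P(w|z), shape (n_d, n_w, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        denom = joint.sum(axis=2, keepdims=True)
        denom[denom == 0] = 1.0
        p_z_dw = joint / denom
        # M step: weight the posteriors by the counts n(d, w)
        weighted = X[:, :, None] * p_z_dw
        p_w_z = weighted.sum(axis=0).T           # shape (K, n_w)
        norm = p_w_z.sum(axis=1, keepdims=True)
        norm[norm == 0] = 1.0
        p_w_z /= norm                            # normalize each topic over words
        p_z_d = weighted.sum(axis=1) / X.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d

p_w_z2, p_z_d2 = plsa_em(X, K=2)  # same shapes as above: (2, 11) and (9, 2)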
Code: https://github.com/fengdu78/lihang-code
[1] Statistical Learning Methods (《统计学习方法》): https://baike.baidu.com/item/统计学习方法/10430179
[2] Huang Haiguang (黄海广): https://github.com/fengdu78
[3] github: https://github.com/fengdu78/lihang-code
[4] wzyonggege: https://github.com/wzyonggege/statistical-learning-method
[5] WenDesi: https://github.com/WenDesi/lihang_book_algorithm
[6] 火烫火烫的: https://blog.csdn.net/tudaodiaozhale
[7] hktxt: https://github.com/hktxt/Learn-Statistical-Learning-Method