# A Hands-On Guide to Latent Semantic Analysis in Python (with Code)

1. What is a topic model?

2. When should you use topic modeling?

3. An overview of Latent Semantic Analysis (LSA)

4. Implementing LSA in Python

• Reading and inspecting the data
• Data preprocessing
• The document-term matrix
• Topic modeling
• Topic visualization

5. Pros and cons of LSA

6. Other topic modeling techniques

Consider the word "novel" in the two sentences below. In the first it is a noun (a book); in the second it is an adjective (new, original). A model that treats each word as an independent token cannot tell these two senses apart, and this is exactly the kind of ambiguity that topic models such as LSA try to capture through context.

1. I liked his last novel quite a lot.

2. We would like to go for a novel marketing campaign.

Steps to implement LSA

• Build an m × n document-term matrix (m documents, n terms) whose entries are TF-IDF scores
• Then reduce the dimensionality of this matrix to k (the expected number of topics) using singular value decomposition (SVD)
• SVD factorizes a matrix into the product of three matrices. If we decompose a matrix A with SVD, we obtain the matrices U, S, and V^T (the transpose of V)

• SVD therefore gives us a k-dimensional vector for every document and every term in the data. Using cosine similarity on these vectors, we can find similar words and similar documents.
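The steps above can be sketched end to end on a toy corpus (the four document strings below are made up purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus: two "pets" documents, two "finance" documents
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stock markets fell sharply today",
        "investors sold stocks as markets fell"]

# m x n document-term matrix with TF-IDF scores
tfidf = TfidfVectorizer()
A = tfidf.fit_transform(docs)

# reduce to k = 2 topics via truncated SVD: A ≈ U S V^T
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(A)   # one k-dim vector per document
term_vectors = svd.components_.T     # one k-dim vector per term

# cosine similarity between document vectors finds related documents:
# the two finance documents end up much closer to each other
sim = cosine_similarity(doc_vectors)
print(sim[2, 3] > sim[0, 2])
```

This is the same pipeline the full example below runs on the 20 Newsgroups dataset, just at miniature scale.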

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_colwidth", 200)
```

```python
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data
len(documents)
```

Output: 11314

```python
dataset.target_names
```

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

```python
news_df = pd.DataFrame({'document': documents})

# remove everything except letters (and '#')
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)

# remove short words (length <= 3)
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))

# make everything lowercase
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())
```
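
As a quick sanity check, here is a standalone sketch of the same three cleaning operations applied to a single made-up string:

```python
import re

raw = "I liked his 2nd novel -- a LOT!"

# keep only letters and '#', as in the regex above
cleaned = re.sub("[^a-zA-Z#]", " ", raw)

# drop words of length <= 3, then lowercase
cleaned = ' '.join(w for w in cleaned.split() if len(w) > 3).lower()

print(cleaned)  # -> "liked novel"
```

Digits, punctuation, and short filler words are all stripped away, leaving only the longer content-bearing terms.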

```python
from nltk.corpus import stopwords  # may require nltk.download('stopwords') first

stop_words = stopwords.words('english')

# tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())

# remove stop words
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
```

```python
# de-tokenization: join the tokens back into strings
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

news_df['clean_doc'] = detokenized_doc
```

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000,  # keep the top 1000 terms
                             max_df=0.5,
                             smooth_idf=True)

X = vectorizer.fit_transform(news_df['clean_doc'])
```

```python
X.shape  # check the shape of the document-term matrix
```

(11314, 1000)

```python
from sklearn.decomposition import TruncatedSVD

# SVD represents documents and terms as vectors
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)
svd_model.fit(X)
```

```python
len(svd_model.components_)
```

20
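
To see what `svd_model.components_` actually contains (and, optionally, how much variance the components capture), here is a self-contained sketch; the random matrix `X_demo` is a made-up stand-in for the real 11314 × 1000 document-term matrix:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# random stand-in for the real document-term matrix
rng = np.random.default_rng(0)
X_demo = rng.random((100, 50))

svd = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)
svd.fit(X_demo)

# components_ has one row per topic and one column per term
print(svd.components_.shape)  # (20, 50)

# optional check: fraction of variance the 20 components capture
print(svd.explained_variance_ratio_.sum())
```

Each row of `components_` is a topic, expressed as a weight for every term in the vocabulary; the highest-weighted terms in a row are that topic's most representative words.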

The components of svd_model are our topics, and we can access them via svd_model.components_. Finally, let's print the most important words in each of the 20 topics to see what our model has learned.

```python
terms = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": ")
    for t in sorted_terms:
        print(t[0])
    print(" ")
```

Topic 0: like know people think good time thanks
Topic 1: thanks windows card drive mail file advance
Topic 2: game team year games season players good
Topic 3: drive scsi disk hard card drives problem
Topic 4: windows file window files program using problem
Topic 5: government chip mail space information encryption data
Topic 6: like bike know chip sounds looks look
Topic 7: card sale video offer monitor price jesus
Topic 8: know card chip video government people clipper
Topic 9: good know time bike jesus problem work
Topic 10: think chip good thanks clipper need encryption
Topic 11: thanks right problem good bike time window
Topic 12: good people windows know file sale files
Topic 13: space think know nasa problem year israel
Topic 14: space good card people time nasa thanks
Topic 15: people problem window time game want bike
Topic 16: time bike right windows file need really
Topic 17: time problem file think israel long mail
Topic 18: file need card files problem right good
Topic 19: problem file thanks used space chip sale

```python
import umap  # provided by the umap-learn package

X_topics = svd_model.fit_transform(X)
embedding = umap.UMAP(n_neighbors=150, min_dist=0.5, random_state=12).fit_transform(X_topics)
```

```python
plt.figure(figsize=(7, 5))
plt.scatter(embedding[:, 0], embedding[:, 1],
            c=dataset.target,  # color points by newsgroup label
            s=10,              # marker size
            edgecolor='none')
plt.show()
```

Pros and cons of LSA

Pros:

• LSA is fast and easy to implement.
• Its results are reasonably good, clearly better than a plain vector space model.

Cons:

• Because LSA is a linear model, it may perform poorly on datasets with non-linear dependencies.
• LSA assumes that the terms in the text are normally distributed, which may not hold for every problem.
• LSA relies on SVD, which is computationally intensive and hard to update as new data arrives.
