I followed a tutorial on LSA and switched the example over to a different list of strings, and I'm not sure whether the code is working as intended.
When I use the example input given in the tutorial, it produces reasonable answers. However, when I use my own input, I get very strange results.
For ease of comparison, here are the results for the example input:
And here is what the results look like when I use my own examples. It's also worth noting that I don't seem to get consistent results:
Any help figuring out why I'm getting these results would be greatly appreciated :)
Here is the code:
import sklearn
# Import all of the scikit learn stuff
from __future__ import print_function
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
import pandas as pd
import warnings
# Suppress warnings from pandas library
warnings.filterwarnings("ignore", category=DeprecationWarning,
module="pandas", lineno=570)
import numpy
example = ["Coffee brewed by expressing or forcing a small amount of nearly boiling water under pressure through finely ground coffee beans.",
           "An espresso-based coffee drink consisting of espresso with microfoam (steamed milk with small, fine bubbles with a glossy or velvety consistency)",
           "American fast-food dish, consisting of french fries covered in cheese with the possible addition of various other toppings",
           "Pounded and breaded chicken is topped with sweet honey, salty dill pickles, and vinegar-y iceberg slaw, then served upon crispy challah toast.",
           "A layered, flaky texture, similar to a puff pastry."]
'''
example = ["Machine learning is super fun",
"Python is super, super cool",
"Statistics is cool, too",
"Data science is fun",
"Python is great for machine learning",
"I like football",
"Football is great to watch"]
'''
vectorizer = CountVectorizer(min_df = 1, stop_words = 'english')
dtm = vectorizer.fit_transform(example)
pd.DataFrame(dtm.toarray(),index=example,columns=vectorizer.get_feature_names()).head(10)
# Get words that correspond to each column
vectorizer.get_feature_names()
# Fit LSA. Use algorithm = "randomized" for large datasets
lsa = TruncatedSVD(2, algorithm = 'arpack')
dtm_lsa = lsa.fit_transform(dtm.astype(float))
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
pd.DataFrame(lsa.components_,index = ["component_1","component_2"],columns = vectorizer.get_feature_names())
pd.DataFrame(dtm_lsa, index = example, columns = ["component_1","component_2"])
xs = [w[0] for w in dtm_lsa]
ys = [w[1] for w in dtm_lsa]
xs, ys
# Plot scatter plot of points
%pylab inline
import matplotlib.pyplot as plt
figure()
plt.scatter(xs,ys)
xlabel('First principal component')
ylabel('Second principal component')
title('Plot of points against LSA principal components')
show()
#Plot scatter plot of points with vectors
%pylab inline
import matplotlib.pyplot as plt
plt.figure()
ax = plt.gca()
ax.quiver(0,0,xs,ys,angles='xy',scale_units='xy',scale=1, linewidth = .01)
ax.set_xlim([-1,1])
ax.set_ylim([-1,1])
xlabel('First principal component')
ylabel('Second principal component')
title('Plot of points against LSA principal components')
plt.draw()
plt.show()
# Compute document similarity using LSA components
similarity = numpy.asarray(numpy.asmatrix(dtm_lsa) *
                           numpy.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity,index=example, columns=example).head(10)
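Since the rows of `dtm_lsa` were unit-normalized with `Normalizer` above, the matrix product in that last step is just pairwise cosine similarity. A minimal sketch with hypothetical 2-D vectors (not the actual LSA output) showing the equivalence against sklearn's `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import Normalizer

# Hypothetical 2-D LSA rows; the first two already have unit norm
rows = np.array([[0.8, 0.6],
                 [0.6, -0.8],
                 [3.0, 4.0]])

unit = Normalizer().fit_transform(rows)  # scale each row to norm 1
dot_sim = unit @ unit.T                  # dot products of unit rows
cos_sim = cosine_similarity(rows)        # same thing, computed directly

assert np.allclose(dot_sim, cos_sim)
```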
Posted on 2018-09-06 16:55:16
The problem looks like it comes from the combination of the small number of examples you are using and the normalization step. Because TruncatedSVD maps your count vectors to many very small numbers and one relatively large number, normalizing them produces some odd behavior. You can see this by looking at a scatter plot of your data.
dtm_lsa = lsa.fit_transform(dtm.astype(float))
fig, ax = plt.subplots()
for i in range(dtm_lsa.shape[0]):
ax.scatter(dtm_lsa[i, 0], dtm_lsa[i, 1], label=f'{i+1}')
ax.legend()
I'd say this plot represents your data, since the two coffee examples are off to the right (it's hard to say much more with so few examples). However, when you normalize the data
dtm_lsa = lsa.fit_transform(dtm.astype(float))
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
fig, ax = plt.subplots()
for i in range(dtm_lsa.shape[0]):
ax.scatter(dtm_lsa[i, 0], dtm_lsa[i, 1], label=f'{i+1}')
ax.legend()
this stacks some of the points on top of each other, which is what gives you similarities of 1. This problem will almost certainly go away as the variance increases, i.e. as you add more new samples.
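A minimal numeric sketch of that collapse (hypothetical values, not the actual SVD output): two rows dominated by one large component land on almost the same point of the unit circle once normalized, so their dot product reads as a similarity of essentially 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical LSA rows: one relatively large component, one tiny one
rows = np.array([[0.9, 0.001],
                 [0.7, -0.002]])

unit = Normalizer().fit_transform(rows)  # scale each row to unit length
similarity = unit @ unit.T

# Despite the tiny components differing in sign, both unit vectors sit
# almost exactly at (1, 0), so the off-diagonal similarity is ~1.
print(similarity)
```

With a larger, more varied corpus, the second component carries real variance, the normalized points spread back out around the circle, and the similarity matrix stops saturating at 1.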
https://stackoverflow.com/questions/52198701