Q: Latent Semantic Analysis (LSA) tutorial

Stack Overflow user

Asked 2013-08-26 07:58:11

Answers: 1, Views: 7.5K, Followers: 0, Votes: 5

I am trying to follow the LSA tutorial at this link (edit, July 2017: dead link removed).

Here is the code from the tutorial:

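# Imports added so the listing runs; the question does not show which
# modules were used, so taking everything from NumPy is an assumption.
from numpy import zeros, asarray, sum, log
from numpy.linalg import svd

titles = [doc1,doc2]  # doc1 and doc2 stand for the two document strings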
stopwords = ['and','edition','for','in','little','of','the','to']
ignorechars = ''',:'!'''

class LSA(object):
    def __init__(self, stopwords, ignorechars):
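        # Split the file contents into a set so that the membership test in
        # parse() matches whole words rather than substrings of one long string.
        self.stopwords = set(open('stop words.txt', 'r').read().split())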
        self.ignorechars = ignorechars
        self.wdict = {}
        self.dcount = 0        
    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower()
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1      
    def build(self):
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1
    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)
    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)        
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])
    def printA(self):
        print('Here is the count matrix')
        print(self.A)
    def printSVD(self):
        print('Here are the singular values')
        print(self.S)
        # The sign flip only changes the display; U and Vt are negated together.
        print('Here are the first 3 columns of the U matrix')
        print(-1*self.U[:, 0:3])
        print('Here are the first 3 rows of the Vt matrix')
        print(-1*self.Vt[0:3, :])

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()
mylsa.calc()
mylsa.printSVD()

I have read it over and over, but I cannot figure it out. When I run the code, I get the following output:

Here are the singular values
[  4.28485706e+01   3.36652135e-14]
Here are the first 3 columns of the U matrix
[[  3.30049181e-02  -9.99311821e-01   7.14336493e-04]
 [  6.60098362e-02   1.43697129e-03   6.53394384e-02]
 [  6.60098362e-02   1.43697129e-03  -9.95952378e-01]
 ..., 
 [  3.30049181e-02   7.18485644e-04   2.02381089e-03]
 [  9.90147543e-02   6.81929920e-03   6.35728804e-03]
 [  3.30049181e-02   7.18485644e-04   2.02381089e-03]]
Here are the first 3 rows of the Vt matrix
array([[ 0.5015178 ,  0.86514732],
   [-0.86514732,  0.5015178 ]])

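How can I compute the similarity between doc1 and doc2 from these matrices? The tf-idf algorithm I wrote myself gives me a single float, but here I have three matrices. Any suggestions?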

python

lsa

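1 Answer

Stack Overflow user

Posted 2013-11-01 22:57:20

One option is to compute the cosine similarity between the two matrices. I think you will find good information in a question I posted earlier. I also posted an answer to that question, and other people have given good answers there as well.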

Python: tf-idf-cosine: to find document similarity
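
As a rough illustration of the cosine-similarity suggestion, the sketch below turns each column of Vt into a document vector and compares the two. It is not taken from the tutorial or from the answer: the helper names `doc_vector` and `cosine_similarity` are hypothetical, scaling each coordinate by its singular value is only one common convention, and `k = 2` is an assumed number of dimensions. The values of `S` and `Vt` are copied from the output in the question.

```python
import math

def doc_vector(S, Vt, j, k):
    """Coordinates of document j in the k-dimensional LSA space.

    Assumption: coordinate i is S[i] * Vt[i][j], i.e. column j of
    diag(S) * Vt truncated to the first k rows.
    """
    return [S[i] * Vt[i][j] for i in range(k)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Singular values and Vt as printed in the question.
S = [4.28485706e+01, 3.36652135e-14]
Vt = [[0.5015178, 0.86514732],
      [-0.86514732, 0.5015178]]

k = 2  # assumed number of LSA dimensions to keep
doc1 = doc_vector(S, Vt, 0, k)
doc2 = doc_vector(S, Vt, 1, k)
similarity = cosine_similarity(doc1, doc2)
print(similarity)
```

The helpers index with `S[i]` and `Vt[i][j]`, so they should also accept the NumPy arrays `mylsa.S` and `mylsa.Vt` from the class in the question. With the numbers shown there, the second singular value is practically zero, so both document vectors lie almost entirely along the first axis and the result comes out very close to 1.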

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/18439316
