LSA: Overview and Example

User 1147754

Published 2018-01-03 15:02:32

Included in the column: YoungGy

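LSA Overview

In short, Latent Semantic Analysis (LSA) projects words and documents into a shared concept space and then clusters them there, which enables semantic-level retrieval and similar tasks.

The core of LSA consists of the following steps:

Parse: represent each document as a bag of words, ignoring stop words and punctuation. In the example below, parse(self, doc) fills a dictionary whose keys are words and whose values are lists of the indices of the documents containing each word (a word can occur several times in one document, so a list may contain repeated indices).
Build: construct the count matrix, with one row per word and one column per document; each entry is the number of times that word occurs in that document.
SVD: decompose the count matrix, keep the dimensions with the largest singular values, and project words and documents into the concept space.
Picture: visualize the result in two dimensions to reveal clustering patterns.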
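The parse and build steps above can be sketched in plain Python. This is a minimal illustration, not the article's class: the function names, the three toy titles, and the min_occurrences parameter are assumptions made for the example.

```python
def parse(titles, stopwords, ignorechars=",:'!"):
    """Map each kept word to the list of document indices where it occurs."""
    table = str.maketrans("", "", ignorechars)  # deletes the ignored characters
    wdict = {}
    for doc_id, title in enumerate(titles):
        for raw in title.split():
            word = raw.lower().translate(table)
            if not word or word in stopwords:
                continue
            # Repeated occurrences in one document repeat the index
            wdict.setdefault(word, []).append(doc_id)
    return wdict


def build(wdict, n_docs, min_occurrences=2):
    """Build a word-by-document count matrix as a list of rows."""
    keys = sorted(k for k, docs in wdict.items() if len(docs) >= min_occurrences)
    matrix = [[0] * n_docs for _ in keys]
    for i, key in enumerate(keys):
        for doc_id in wdict[key]:
            matrix[i][doc_id] += 1
    return keys, matrix


# Toy titles, made up for illustration
docs = ["Stock Market Investing", "Value Investing", "The Stock Book: Stock Tips"]
wdict = parse(docs, stopwords={"the"})
keys, counts = build(wdict, len(docs))
print(keys)
print(counts)
```

Words that occur only once are dropped here, mirroring the filter in the article's build method.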

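LSA rests on the following assumptions:

Documents are represented as bags of words: only the frequency of each word in a document matters, not the word order.
Words that express the same or similar concepts are assumed to cluster together.
Polysemy is ignored: each word is assumed to have exactly one meaning.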
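The bag-of-words assumption can be illustrated with a very small sketch. The helper name bag_of_words and the two sentences are made up for illustration; the point is only that texts with the same words in a different order get the same representation.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding word order."""
    return Counter(text.lower().split())


a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # the two sentences are indistinguishable as bags of words
```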

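Notes on LSA

Once the count matrix is built, it is best to apply TF-IDF weighting, which sets the importance of each word in each document.
The example below drops the first dimension, because it only reflects an average: roughly how many words a document contains, or in how many documents a word appears. It carries little information, so it is omitted. A more general approach is to normalize the columns of the count matrix first, in which case the first dimension need not be dropped; the drawback is that this turns the sparse matrix into a dense one.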
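A minimal sketch of the TF-IDF weighting, using the same formula as the TFIDF method in the article's listing: the count divided by the number of kept words in the document, times the log of the number of documents over the number of documents containing the word. The function name and the toy matrix are assumptions for illustration.

```python
from math import log


def tfidf(counts):
    """Return a TF-IDF weighted copy of a word-by-document count matrix."""
    n_rows, n_cols = len(counts), len(counts[0])
    words_per_doc = [sum(counts[i][j] for i in range(n_rows)) for j in range(n_cols)]
    docs_per_word = [sum(1 for c in row if c > 0) for row in counts]
    weighted = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            if counts[i][j]:  # zero counts stay zero and avoid division by zero
                tf = counts[i][j] / words_per_doc[j]
                idf = log(n_cols / docs_per_word[i])
                weighted[i][j] = tf * idf
    return weighted


# Toy count matrix: 2 words (rows) by 3 documents (columns)
counts = [[1, 1, 0],
          [1, 0, 2]]
weights = tfidf(counts)
print(weights)
```

A word that appears in every document gets weight log(1) = 0, so it stops influencing the decomposition.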

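Pros and Cons of LSA

Pros

Words and documents are mapped into the same concept space, so clustering can be done there, and words and documents can be queried against each other (for example, retrieving documents from a word's position in the concept space).
The concept space has far fewer dimensions than the original matrix, and those dimensions carry more information and less noise.
LSA is a global algorithm, which helps reveal patterns that are hard to observe directly.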
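Querying documents by a word in the concept space might look like the sketch below. It assumes NumPy is available, uses a made-up 4x4 count matrix with two obvious topics, and scales both word and document coordinates by the singular values, which is one common convention rather than something the article prescribes (its plots use the raw columns of U and rows of Vt).

```python
import numpy as np

# Toy count matrix (rows: words, columns: documents), made up for illustration.
# Words 0-1 occur only in documents 0-1; words 2-3 only in documents 2-3.
A = np.array([[2.0, 2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                  # number of concept dimensions kept
word_coords = U[:, :k] * s[:k]         # one row per word
doc_coords = Vt[:k, :].T * s[:k]       # one row per document


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Similarity of word 0 to every document in the concept space
sims = [cosine(word_coords[0], d) for d in doc_coords]
print(np.round(sims, 3))
```

Word 0 ends up aligned with the two documents of its own topic and orthogonal to the other two, which is the behavior the retrieval use case relies on.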

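Cons

It assumes a Gaussian distribution and the Frobenius norm, which do not suit every problem. For example, word counts in documents are better described by a Poisson distribution than by a Gaussian one.
It cannot handle polysemy; each word is assumed to have a single meaning.
It depends heavily on the SVD, which is relatively expensive to compute.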
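The Frobenius-norm assumption can be made concrete. Writing the SVD of the count matrix as $A = U \Sigma V^{T}$ with singular values $\sigma_1 \ge \sigma_2 \ge \dots$ and singular vectors $u_i$, $v_i$ (notation introduced here, not in the original text), keeping the first $k$ terms gives the best rank-$k$ approximation in the Frobenius norm (the Eckart-Young theorem):

```latex
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
\qquad
\|A - A_k\|_F^2
  = \min_{\operatorname{rank}(B) \le k} \|A - B\|_F^2
  = \sum_{i > k} \sigma_i^2 .
```

Minimizing a sum of squared errors corresponds to maximum likelihood under independent Gaussian noise, which is where the Gaussian assumption enters. The same identity shows why the squared singular values measure how much each dimension contributes.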

LSA Example

The nine document titles used are:

The Neatest Little Guide to Stock Market Investing
Investing For Dummies, 4th Edition
The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
The Little Book of Value Investing
Value Investing: From Graham to Buffett and Beyond
Rich Dad’s Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
Investing in Real Estate, 5th Edition
Stock Investing For Dummies
Rich Dad’s Advisors: The ABC’s of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

Count matrix (figure not included; printA in the code below prints it):

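After the SVD, the dimensions are ranked by the squares of the singular values on the diagonal of the matrix S (the bar chart is drawn by picture0 in the code below):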
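The ranking by squared singular values amounts to a few lines. In this sketch the function name and the three singular values are hypothetical, chosen only to show the arithmetic; they are not the values from the example.

```python
def importance(singular_values):
    """Share of each dimension, measured by its squared singular value."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    return [sq / total for sq in squares]


# Hypothetical singular values, not taken from the article's data
shares = importance([3.0, 2.0, 1.0])
print(shares)
```

The shares always sum to 1, so they can be read directly as the fraction of the total squared magnitude carried by each dimension.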

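A simple clustering of the book-title matrix using dimensions 2 and 3 gives the following result (red and blue are the cell colors in the heat map drawn by picture1):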

Dim2	Dim3	Titles
red	red	7,9
red	blue	6
blue	red	2,4,5,8
blue	blue	1,3
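The grouping in the table can be reproduced mechanically. The sketch below assumes that red stands for a positive value and blue for a non-positive one, which is roughly how the bwr colormap used by picture1 renders them. The function name and the coordinates are made up for illustration and are not the article's data.

```python
from collections import defaultdict


def cluster_by_sign(dim2, dim3):
    """Group 1-based item numbers by the color (sign) of two coordinates."""
    groups = defaultdict(list)
    for idx, (x, y) in enumerate(zip(dim2, dim3), start=1):
        key = ("red" if x > 0 else "blue", "red" if y > 0 else "blue")
        groups[key].append(idx)
    return dict(groups)


# Hypothetical coordinates for four titles in dimensions 2 and 3
dim2 = [-0.3, -0.2, 0.4, 0.5]
dim3 = [0.1, -0.4, 0.2, -0.1]
groups = cluster_by_sign(dim2, dim3)
print(groups)
```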

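The book-title matrix and the word matrix can be clustered together in the same way, again using dimensions 2 and 3 (the scatter plot is drawn by picture2 in the code below):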

# Python 3 version. In a Jupyter notebook, run "%matplotlib inline" first to show figures inline.
from math import log  # needed for TF-IDF

import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import svd

titles = ["The Neatest Little Guide to Stock Market Investing",
          "Investing For Dummies, 4th Edition",
          "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
          "The Little Book of Value Investing",
          "Value Investing: From Graham to Buffett and Beyond",
          "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
          "Investing in Real Estate, 5th Edition",
          "Stock Investing For Dummies",
          "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss"
          ]
stopwords = ['and', 'edition', 'for', 'in', 'little', 'of', 'the', 'to']
ignorechars = ",:'!"

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}    # word -> list of indices of the documents containing it
        self.dcount = 0    # number of documents parsed so far

    def parse(self, doc):
        # Translation table that deletes every character in ignorechars
        table = str.maketrans('', '', self.ignorechars)
        for w in doc.split():
            w = w.lower().translate(table)
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                # A word occurring twice in one document adds that index twice,
                # e.g. wdict['book'] == [0, 0] if 'book' appears twice in document 0
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    def build(self):
        # Keep only words that occur more than once across all titles
        self.keys = sorted(k for k in self.wdict if len(self.wdict[k]) > 1)
        self.A = np.zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)

    def picture0(self):
        '''
        Bar chart of the importance of each singular value (its squared share)
        '''
        importance = self.S**2 / np.sum(self.S**2)
        plt.bar(range(len(self.S)), importance, align="center")
        plt.xticks(range(len(self.S)))
        plt.title("The Importance of Each Singular Value")
        plt.xlabel("Singular Values")
        plt.ylabel("Importance")

    def picture1(self):
        '''
        Tile plot (heat map) of the top 3 dimensions of each book title
        '''
        labels = ['T%d' % i for i in range(1, self.dcount + 1)]
        # The sign of singular vectors is arbitrary; -1 flips them for display
        plt.pcolor(-1 * self.Vt[0:3, :], cmap='bwr')
        plt.colorbar()
        plt.yticks(np.arange(3) + 0.5, ['Dim1', 'Dim2', 'Dim3'])
        plt.xticks(np.arange(self.dcount) + 0.5, labels)
        plt.gca().invert_yaxis()
        plt.gca().set_aspect('equal')
        plt.xlabel("Book Titles")
        plt.ylabel("Dimensions")
        plt.title("Top 3 Dimensions of Each Book Title")

    def picture2(self):
        '''
        Annotated scatter plot of words and titles projected into the concept space
        '''
        TitleX = -1 * self.Vt[1, :]
        TitleY = -1 * self.Vt[2, :]
        WordX = -1 * self.U[:, 1]
        WordY = -1 * self.U[:, 2]

        # Plot and annotate the words
        Words = self.keys
        plt.plot(WordX, WordY, 'rs')
        for i in range(len(Words)):
            plt.annotate(Words[i], xy=(WordX[i], WordY[i]), xytext=(2, 6), textcoords='offset points', color='red')
        # Plot and annotate the titles
        Titles = ['T%d' % i for i in range(1, self.dcount + 1)]
        plt.plot(TitleX, TitleY, 'bo')
        for i in range(len(TitleX)):
            plt.annotate(Titles[i], xy=(TitleX[i], TitleY[i]), xytext=(2, 2), textcoords='offset points', color='blue')
        plt.title('XY plots of Words and Titles')
        plt.xlabel('Dimension 2')
        plt.ylabel('Dimension 3')

    def TFIDF(self):
        WordsPerDoc = np.sum(self.A, axis=0)
        DocsPerWord = np.sum(np.asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i, j] = (self.A[i, j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

    def printA(self):
        print('Here is the count matrix')
        print(self.A)

    def printSVD(self):
        print('Here are the singular values')
        print(self.S)
        print('Here are the first 3 columns of the U matrix')
        print(-1 * self.U[:, 0:3])
        print('Here are the first 3 rows of the Vt matrix')
        print(-1 * self.Vt[0:3, :])

# Example driver (not part of the original listing): parse, build, decompose, plot
if __name__ == '__main__':
    mylsa = LSA(stopwords, ignorechars)
    for t in titles:
        mylsa.parse(t)
    mylsa.build()
    mylsa.printA()
    mylsa.calc()
    mylsa.printSVD()
    mylsa.picture2()
    plt.show()

References

An excellent resource; most of the content here is drawn from it.

Originally published: 2015-11-30
