Python简单实现基于VSM的余弦相似度计算

周小董

发布于 2019-03-25 03:04:01

1.8K00

代码可运行

文章被收录于专栏：python前行者python前行者

运行总次数：0

代码可运行

在知识图谱构建阶段的实体对齐和属性值决策、判断一篇文章是否是你喜欢的文章、比较两篇文章的相似性等实例中，都涉及到了向量空间模型（Vector Space Model，简称VSM）和余弦相似度计算相关知识。

第一步，向量空间模型VSM

向量空间模型（Vector Space Model，简称VSM）表示通过向量的方式来表征文本。一个文档（Document）被描述为一系列关键词（Term）的向量。

简言之，判断一篇文章是否是你喜欢的文章，即将文章抽象成一个向量，该向量由n个词Term组成，每个词都有一个权重（Term Weight），不同的词根据自己在文档中的权重来影响文档相关性的重要程度。

第二步，TF-IDF

特征抽取完后，因为每个词语对实体的贡献度不同，所以需要对这些词语赋予不同的权重。计算词项在向量中的权重方法——TF-IDF。

它表示TF（词频）和IDF（倒文档频率）的乘积：

词频（Term Frequency，简称TF）表示特征词出现的次数除以文章总词数：

其中TF表示某个关键词出现的频率，IDF为所有文档的数目除以包含该词语的文档数目的对数值。

|D|表示所有文档的数目，|w∈d|表示包含词语w的文档数目。

由于“是”“的”“这”等词经常会出现，故需要IDF值来降低其权值。所谓降维，就是降低维度。具体到文档相似度计算，就是减少词语的数量。常见的可用于降维的词以功能词和停用词为主（如：”的”，”这”等），事实上，采取降维的策略在很多情况下不仅可以提高效率，还可以提高精度。

最后TF-IDF计算权重越大表示该词条对这个文本的重要性越大。

第三步，余弦相似度计算

这样，就需要一群你喜欢的文章，才可以计算IDF值。依次计算得到你喜欢的文章D=(w1, w2, …, wn)共n个关键词的权重。当你给出一篇文章E时，采用相同的方法计算出E=(q1, q2, …, qn)，然后计算D和E的相似度。计算两篇文章间的相似度就通过两个向量的余弦夹角cos来描述。文本D1和D2的相似性公式如下：

其中分子表示两个向量的点乘积，分母表示两个向量的模的积。

计算过后，就可以得到相似度了。我们也可以人工的选择两个相似度高的文档，计算其相似度，然后定义其阈值。同样，一篇文章和你喜欢的一类文章，可以取平均值或寻找一类文章向量的中心来计算。主要是将语言问题转换为数学问题进行解决。

缺点：计算量太大、添加新文本需要重新训练词的权值、词之间的关联性没考虑等。其中余弦定理为什么能表示文章相似度间参考资料。

实例解释

句子A：我喜欢看电视，不喜欢看电影。句子B：我不喜欢看电视，也不喜欢看电影。

请问怎样才能计算上面两句话的相似程度？

基本思路是：如果这两句话的用词越相似，它们的内容就应该越相似。因此，可以从词频入手，计算它们的相似程度。

第一步，分词。

句子A：我/喜欢/看/电视，不/喜欢/看/电影。句子B：我/不/喜欢/看/电视，也/不/喜欢/看/电影。

第二步，列出所有的词。

我，喜欢，看，电视，电影，不，也。

第三步，计算词频。

句子A：我 1，喜欢 2，看 2，电视 1，电影 1，不 1，也 0。句子B：我 1，喜欢 2，看 2，电视 1，电影 1，不 2，也 1。

第四步，写出词频向量。

句子A：[1, 2, 2, 1, 1, 1, 0] 句子B：[1, 2, 2, 1, 1, 2, 1]

到这里，问题就变成了如何计算这两个向量的相似程度。使用余弦这个公式，我们就可以得到，句子A与句子B的夹角的余弦。

余弦值越接近1，就表明夹角越接近0度，也就是两个向量越相似，这就叫”余弦相似性”。所以，上面的句子A和句子B是很相似的，事实上它们的夹角大约为20.3度。

由此，我们就得到了”找出相似文章”的一种算法：

（1）使用TF-IDF算法，找出两篇文章的关键词；（2）每篇文章各取出若干个关键词（比如20个），合并成一个集合，计算每篇文章对于这个集合中的词的词频（为了避免文章长度的差异，可以使用相对词频）；（3）生成两篇文章各自的词频向量；（4）计算两个向量的余弦相似度，值越大就表示越相似。

代码实现

# -*- coding: utf-8 -*-
import time,re,os,sys,math

# 统计关键词及个数
def CountKey(fileName):
    try:
        # 计算文件行数
        lineNums = len(open(fileName, 'r').readlines())
        # 统计格式 格式<Key:Value> <属性:出现个数>
        i = 0
        table = {}
        source = open(fileName, "r")
        while i < lineNums:
            line = source.readline()
            line = line.rstrip('\n')
            words = line.split(" ")  # 空格分隔
            # 字典插入与赋值
            for word in words:
                if word != "" and word in table:  # 如果存在次数加1
                    num = table[word]
                    table[word] = num + 1
                elif word != "":  # 否则初值为1
                    table[word] = 1
            i = i + 1
        # 键值从大到小排序 函数原型：sorted(dic,value,reverse)
        dic = sorted(table.items(), key=lambda asd: asd[1], reverse=True)
        return dic
    except Exception as e:
        print('Error:', e)
    finally:
        source.close()

# 统计关键词及个数 并计算相似度
def MergeKeys(dic1, dic2):
    # 合并关键词 采用三个数组实现
    arrayKey = []
    for i in range(len(dic1)):
        arrayKey.append(dic1[i][0])  # 向数组中添加元素
    for i in range(len(dic2)):
        if dic2[i][0] in arrayKey:
            # print('has_key', dic2[i][0])
            pass
        else:  # 合并
            arrayKey.append(dic2[i][0])
    # 计算词频 infobox可忽略TF-IDF
    arrayNum1 = [0] * len(arrayKey)
    arrayNum2 = [0] * len(arrayKey)
    # 赋值arrayNum1
    for i in range(len(dic1)):
        key = dic1[i][0]
        value = dic1[i][1]
        j = 0
        while j < len(arrayKey):
            if key == arrayKey[j]:
                arrayNum1[j] = value
                break
            else:
                j = j + 1
    # 赋值arrayNum2
    for i in range(len(dic2)):
        key = dic2[i][0]
        value = dic2[i][1]
        j = 0
        while j < len(arrayKey):
            if key == arrayKey[j]:
                arrayNum2[j] = value
                break
            else:
                j = j + 1
    print('arrayNum1:',arrayNum1)
    print('arrayNum2:',arrayNum2)
    # 计算两个向量的点积,计算两个向量的模
    x = 0
    i = 0
    sq1 = 0
    sq2 = 0
    while i < len(arrayKey):
        x = x + arrayNum1[i] * arrayNum2[i]
        sq1 = sq1 + arrayNum1[i] * arrayNum1[i]  # pow(a,2)
        sq2 = sq2 + arrayNum2[i] * arrayNum2[i]
        i = i + 1
    result = float(x) / (math.sqrt(sq1) * math.sqrt(sq2))
    return result

'''
------------------------------------------------------- 
基本步骤：
    1.分别统计两个文档的关键词
    2.两篇文章的关键词合并成一个集合,相同的合并,不同的添加
    3.计算每篇文章对于这个集合的词的词频 TF-IDF算法计算权重
    4.生成两篇文章各自的词频向量
    5.计算两个向量的余弦相似度,值越大表示越相似                             
------------------------------------------------------- 
'''
# 主函数
def main():
    # 计算文档1的关键词及个数
    fileName1 = "001.txt"
    dic1 = CountKey(fileName1)
    # 计算文档2的关键词及个数
    fileName2 = "002.txt"
    dic2 = CountKey(fileName2)
    # 合并两篇文章的关键词及相似度计算
    result = MergeKeys(dic1, dic2)
    print(result)

if __name__ == '__main__':
    main()

# -*- coding: utf-8 -*-
import time,re,math,jieba

class cos_similar():

    # 统计关键词及个数
    def CountKey(self,title):
        pattern = re.compile(u'[!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~！“”#￥%&‘’（），。\-/：；《》=？@【】、|……——·{}~ ]')
        title = re.sub(pattern, ' ', title)
        corpus_iter = jieba.cut(title, cut_all=False)
        words=[]
        for a in corpus_iter:
            words.append(a)
        # 统计格式 格式<Key:Value> <属性:出现个数>
        table = {}
        # 字典插入与赋值
        for word in words:
            if word != "" and word in table:  # 如果存在次数加1
                num = table[word]
                table[word] = num + 1
            elif word != "":  # 否则初值为1
                table[word] = 1
        # 键值从大到小排序 函数原型：sorted(dic,value,reverse)
        dic = sorted(table.items(), key=lambda asd: asd[1], reverse=True)
        return dic

    # 统计关键词及个数 并计算相似度
    def MergeKeys(self,dic1, dic2):
        # 合并关键词 采用三个数组实现
        arrayKey = []
        for i in range(len(dic1)):
            arrayKey.append(dic1[i][0])  # 向数组中添加元素
        for i in range(len(dic2)):
            if dic2[i][0] in arrayKey:
                # print('has_key', dic2[i][0])
                pass
            else:  # 合并
                arrayKey.append(dic2[i][0])
        # 计算词频 infobox可忽略TF-IDF
        arrayNum1 = [0] * len(arrayKey)
        arrayNum2 = [0] * len(arrayKey)
        # 赋值arrayNum1
        for i in range(len(dic1)):
            key = dic1[i][0]
            value = dic1[i][1]
            j = 0
            while j < len(arrayKey):
                if key == arrayKey[j]:
                    arrayNum1[j] = value
                    break
                else:
                    j = j + 1
        # 赋值arrayNum2
        for i in range(len(dic2)):
            key = dic2[i][0]
            value = dic2[i][1]
            j = 0
            while j < len(arrayKey):
                if key == arrayKey[j]:
                    arrayNum2[j] = value
                    break
                else:
                    j = j + 1
        # print('arrayNum1:',arrayNum1)
        # print('arrayNum2:',arrayNum2)
        # 计算两个向量的点积,两个向量的模
        x = 0
        i = 0
        sq1 = 0
        sq2 = 0
        while i < len(arrayKey):
            x = x + arrayNum1[i] * arrayNum2[i]
            sq1 = sq1 + arrayNum1[i] * arrayNum1[i]  # pow(a,2)
            sq2 = sq2 + arrayNum2[i] * arrayNum2[i]
            i = i + 1
        result = float(x) / (math.sqrt(sq1) * math.sqrt(sq2))
        return result

    # 主函数
    def main(self,texta,textb):
        dic1 = self.CountKey(texta)
        dic2 = self.CountKey(textb)
        # 合并关键词及相似度计算
        result = self.MergeKeys(dic1, dic2)
        print(result)
        return result
'''
------------------------------------------------------- 
基本步骤：
    1.分别统计两个文档的关键词
    2.两篇文章的关键词合并成一个集合,相同的合并,不同的添加
    3.计算每篇文章对于这个集合的词的词频 TF-IDF算法计算权重
    4.生成两篇文章各自的词频向量
    5.计算两个向量的余弦相似度,值越大表示越相似                             
------------------------------------------------------- 
'''
if __name__ == '__main__':
    a=time.time()
    texta = u'万科A(000002)销售稳定增长、拿地额/销售额56.35%'
    textb = u'万科A(000002)销售稳定增长、拿地额/销售额56.35％'
    cos_similar().main(texta,textb)
    b = time.time()
    print(b-a)

# -*- coding:utf-8 -*-
import math,re,datetime,time

text1 = "This game is one of the very best. games ive  played. the  ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive  played. the  ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."

def compute_cosine(text_a, text_b):
    # 找单词及词频
    words1 = text_a.split(' ')
    words2 = text_b.split(' ')
    words1_dict = {}
    words2_dict = {}
    for word in words1:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        if word != '' and word in words1_dict:
            num = words1_dict[word]
            words1_dict[word] = num + 1
        elif word != '':
            words1_dict[word] = 1
    for word in words2:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        if word != '' and word in words2_dict:
            num = words2_dict[word]
            words2_dict[word] = num + 1
        elif word != '':
            words2_dict[word] = 1

    dic1 = sorted(words1_dict.items(), key=lambda asd: asd[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda asd: asd[1], reverse=True)

    # 得到词向量
    words_key = []
    for i in range(len(dic1)):
        words_key.append(dic1[i][0])  # 向数组中添加元素
    for i in range(len(dic2)):
        if dic2[i][0] in words_key:
            # print 'has_key', dic2[i][0]
            pass
        else:  # 合并
            words_key.append(dic2[i][0])
    vect1 = []
    vect2 = []
    for word in words_key:
        if word in words1_dict:
            vect1.append(words1_dict[word])
        else:
            vect1.append(0)
        if word in words2_dict:
            vect2.append(words2_dict[word])
        else:
            vect2.append(0)
    # 计算余弦相似度
    sum = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(vect1)):
        sum += vect1[i] * vect2[i]
        sq1 += pow(vect1[i], 2)
        sq2 += pow(vect2[i], 2)
    try:
        result = round(float(sum) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        result = 0.0
    return result


if __name__ == '__main__':
    begin_time = datetime.datetime.now()
    begin = time.time()
    i = 0
    while i < 10:
        compute_cosine(text3, text4)
        i += 1
    end_time = datetime.datetime.now()
    end = time.time()
    print("datatime:", end_time - begin_time)
    print("time:", end - begin)
    print(round(0.955,2))
    print(float('%.2f' % 0.956))