Methods for computing the similarity of two strings (or sentences)

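The main methods are edit distance (Levenshtein distance), cosine similarity, and a fuzzy-match similarity percentage (FuzzyWuzzy).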

1 Edit distance

def levenshtein(first, second):
    '''Edit distance (Levenshtein distance).

    Args: two strings
    Returns: the edit distance between the two strings, as an int
    '''
    # Make `first` the shorter of the two strings
    if len(first) > len(second):
        first, second = second, first
    if len(first) == 0:
        return len(second)
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    # Row 0 holds 0..len(second): the cost of building a prefix of `second` from ''
    distance_matrix = [list(range(second_length)) for _ in range(first_length)]
    # Column 0 must hold 0..len(first): the cost of deleting a prefix of `first`
    for i in range(first_length):
        distance_matrix[i][0] = i
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i - 1][j] + 1
            insertion = distance_matrix[i][j - 1] + 1
            substitution = distance_matrix[i - 1][j - 1]
            if first[i - 1] != second[j - 1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length - 1][second_length - 1]
str1 = "hello,good morning"
str2 = "hi,good morning"
edit_distance = levenshtein(str1, str2)
print(edit_distance)

Output:
4
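
The raw distance depends on the string lengths, so it is often convenient to turn it into a score between 0 and 1. The following is only a sketch under assumptions that are not part of the original post: the names edit_distance and edit_similarity are made up for illustration, and dividing by the length of the longer string is just one common normalization choice. The distance itself is computed with two rows instead of the full matrix.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = list(range(len(b) + 1))  # distances from '' to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Hypothetical helper: map the distance to [0, 1], 1.0 meaning identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


if __name__ == '__main__':
    print(edit_distance("kitten", "sitting"))
    print(edit_similarity("kitten", "sitting"))
```

With this normalization, a score near 1 means the strings differ by only a few edits relative to their length.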

2 Cosine similarity

import math
import re

text1 = "This game is one of the very best. games ive  played. the  ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive  played. the  ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."

def compute_cosine(text_a, text_b):
    # Extract the words and count their frequencies
    words1 = text_a.split(' ')
    words2 = text_b.split(' ')
    # print(words1)
    words1_dict = {}
    words2_dict = {}
    for word in words1:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        # print(word)
        if word != '' and word in words1_dict:  # modified here: empty strings are ignored
            num = words1_dict[word]
            words1_dict[word] = num + 1
        elif word != '':
            words1_dict[word] = 1
        else:
            continue
    for word in words2:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        if word != '' and word in words2_dict:
            num = words2_dict[word]
            words2_dict[word] = num + 1
        elif word != '':
            words2_dict[word] = 1
        else:
            continue
    print(words1_dict)
    print(words2_dict)
    
    # Sort by frequency, descending
    dic1 = sorted(words1_dict.items(), key=lambda asd: asd[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda asd: asd[1], reverse=True)
    print(dic1)
    print(dic2)

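    # Build the shared vocabulary for the word vectors
    words_key = []
    for i in range(len(dic1)):
        words_key.append(dic1[i][0])  # append the word to the list
    for i in range(len(dic2)):
        if dic2[i][0] in words_key:
            # print 'has_key', dic2[i][0]
            pass
        else:  # merge in words that appear only in the second text
            words_key.append(dic2[i][0])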
    # print(words_key)
    vect1 = []
    vect2 = []
    for word in words_key:
        if word in words1_dict:
            vect1.append(words1_dict[word])
        else:
            vect1.append(0)
        if word in words2_dict:
            vect2.append(words2_dict[word])
        else:
            vect2.append(0)
    print(vect1)
    print(vect2)

    # Compute the cosine similarity
    dot_product = 0  # renamed from `sum` to avoid shadowing the built-in
    sq1 = 0
    sq2 = 0
    for i in range(len(vect1)):
        dot_product += vect1[i] * vect2[i]
        sq1 += pow(vect1[i], 2)
        sq2 += pow(vect2[i], 2)
    try:
        result = round(float(dot_product) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        result = 0.0
    # print(result)
    return result


if __name__ == '__main__':
    result = compute_cosine(text1, text2)
    print(result)

Output:
{'this': 1, 'game': 2, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'real': 1, 'graphics': 1, 'in': 1}
{'this': 1, 'game': 2, 'have': 1, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'now': 1, 'real': 1, 'graphics': 1, 'in': 1}
[('the', 4), ('game', 2), ('this', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[('the', 4), ('game', 2), ('this', 1), ('have', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('now', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
0.97
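
The same computation can be written more compactly with collections.Counter. This is only a sketch, not the original author's code: the function name cosine_similarity and the inner helper word_counts are made up for illustration, and the cleaning step (split on spaces, strip non-letters, lowercase) is copied from the function above. Unlike the version above, it returns the unrounded value and prints nothing.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of the word-frequency vectors of two texts."""

    def word_counts(text: str) -> Counter:
        # Split on spaces, keep only letters, lowercase, drop empty tokens
        cleaned = (re.sub(r'[^a-zA-Z]', '', w).lower() for w in text.split(' '))
        return Counter(w for w in cleaned if w)

    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)
    # Only words present in both texts contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot_product = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text has no usable words
    return dot_product / (norm_a * norm_b)


if __name__ == '__main__':
    print(cosine_similarity("the game is good", "the game is bad"))
```

Because only word counts are compared, word order has no effect on the score: "a b" and "b a" are treated as identical.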

3 FuzzyWuzzy

from fuzzywuzzy import fuzz
print(fuzz.ratio("this is a test", "this is a test!"))

Output:
97
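
If installing a third-party package is not an option, a comparable percentage can be obtained from difflib in the standard library. This is a sketch under assumptions: simple_ratio is a made-up name, and although FuzzyWuzzy's ratio is, to my understanding, built on a SequenceMatcher-style ratio, the two are not guaranteed to agree on every input.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity percentage (0 to 100) from difflib's ratio, 2*M/T.

    M is the number of matching characters and T is the total length
    of both strings.
    """
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


if __name__ == '__main__':
    print(simple_ratio("this is a test", "this is a test!"))
```

For the pair above, M = 14 and T = 29, so the ratio is 28/29 and the result rounds to 97.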
