计算两个字符串相(或句子)似度的方法1 编辑距离2 余弦相似度3 FuzzyWuzzy

主要方法有:编辑距离、余弦相似度、模糊相似度百分比

1 编辑距离

def levenshtein(first, second):
        ''' 编辑距离算法(LevD) 
            Args: 两个字符串
            returns: 两个字符串的编辑距离 int
        '''
        if len(first) > len(second):
            first, second = second, first
        if len(first) == 0:
            return len(second)
        if len(second) == 0:
            return len(first)
        first_length = len(first) + 1
        second_length = len(second) + 1
        distance_matrix = [list(range(second_length)) for x in range(first_length)]
        # print distance_matrix
        for i in range(1, first_length):
            for j in range(1, second_length):
                deletion = distance_matrix[i - 1][j] + 1
                insertion = distance_matrix[i][j - 1] + 1
                substitution = distance_matrix[i - 1][j - 1]
                if first[i - 1] != second[j - 1]:
                    substitution += 1
                distance_matrix[i][j] = min(insertion, deletion, substitution)
                # print distance_matrix
        return distance_matrix[first_length - 1][second_length - 1]
str1="hello,good moring"
str2="hi,good moring"
edit_distance=levenshtein(str1,str2)
edit_distance
4

2 余弦相似度

import math
import re
import datetime
import time

text1 = "This game is one of the very best. games ive  played. the  ;pictures? " \
        "cant descripe the real graphics in the game."
text2 = "this game have/ is3 one of the very best. games ive  played. the  ;pictures? " \
        "cant descriPe now the real graphics in the game."
text3 = "So in the picture i saw a nice size detailed metal puzzle. Eager to try since I enjoy 3d wood puzzles, i ordered it. Well to my disappointment I got in the mail a small square about 4 inches around. And to add more disappointment when I built it it was smaller than the palm of my hand. For the price it should of been much much larger. Don't be fooled. It's only worth $5.00.Update 4/15/2013I have bought and completed 13 of these MODELS from A.C. Moore for $5.99 a piece, so i stand by my comment that thiss one is overpriced. It was still fun to build just like all the others from the maker of this brand.Just be warned, They are small."
text4 = "I love it when an author can bring you into their made up world and make you feel like a friend, confidant, or family. Having a special child of my own I could relate to the teacher and her madcap class. I've also spent time in similar classrooms and enjoyed the uniqueness of each and every child. Her story drew me into their world and had me laughing so hard my family thought I had lost my mind, so I shared the passage so they could laugh with me. Read this book if you enjoy a book with strong women, you won't regret it."

def compute_cosine(text_a, text_b):
    # 找单词及词频
    words1 = text_a.split(' ')
    words2 = text_b.split(' ')
    # print(words1)
    words1_dict = {}
    words2_dict = {}
    for word in words1:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        # print(word)
        if word != '' and word in words1_dict: # 这里改动了
            num = words1_dict[word]
            words1_dict[word] = num + 1
        elif word != '':
            words1_dict[word] = 1
        else:
            continue
    for word in words2:
        # word = word.strip(",.?!;")
        word = re.sub('[^a-zA-Z]', '', word)
        word = word.lower()
        if word != '' and word in words2_dict:
            num = words2_dict[word]
            words2_dict[word] = num + 1
        elif word != '':
            words2_dict[word] = 1
        else:
            continue
    print(words1_dict)
    print(words2_dict)
    
    # 排序
    dic1 = sorted(words1_dict.items(), key=lambda asd: asd[1], reverse=True)
    dic2 = sorted(words2_dict.items(), key=lambda asd: asd[1], reverse=True)
    print(dic1)
    print(dic2)

    # 得到词向量
    words_key = []
    for i in range(len(dic1)):
        words_key.append(dic1[i][0])  # 向数组中添加元素
    for i in range(len(dic2)):
        if dic2[i][0] in words_key:
            # print 'has_key', dic2[i][0]
            pass
        else:  # 合并
            words_key.append(dic2[i][0])
    # print(words_key)
    vect1 = []
    vect2 = []
    for word in words_key:
        if word in words1_dict:
            vect1.append(words1_dict[word])
        else:
            vect1.append(0)
        if word in words2_dict:
            vect2.append(words2_dict[word])
        else:
            vect2.append(0)
    print(vect1)
    print(vect2)

    # 计算余弦相似度
    sum = 0
    sq1 = 0
    sq2 = 0
    for i in range(len(vect1)):
        sum += vect1[i] * vect2[i]
        sq1 += pow(vect1[i], 2)
        sq2 += pow(vect2[i], 2)
    try:
        result = round(float(sum) / (math.sqrt(sq1) * math.sqrt(sq2)), 2)
    except ZeroDivisionError:
        result = 0.0
    # print(result)
    return result


if __name__ == '__main__':
    result=compute_cosine(text1, text2)
    print(result)
{'this': 1, 'game': 2, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'real': 1, 'graphics': 1, 'in': 1}
{'this': 1, 'game': 2, 'have': 1, 'is': 1, 'one': 1, 'of': 1, 'the': 4, 'very': 1, 'best': 1, 'games': 1, 'ive': 1, 'played': 1, 'pictures': 1, 'cant': 1, 'descripe': 1, 'now': 1, 'real': 1, 'graphics': 1, 'in': 1}
[('the', 4), ('game', 2), ('this', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[('the', 4), ('game', 2), ('this', 1), ('have', 1), ('is', 1), ('one', 1), ('of', 1), ('very', 1), ('best', 1), ('games', 1), ('ive', 1), ('played', 1), ('pictures', 1), ('cant', 1), ('descripe', 1), ('now', 1), ('real', 1), ('graphics', 1), ('in', 1)]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
[4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
0.97

3 FuzzyWuzzy

from fuzzywuzzy import fuzz
fuzz.ratio("this is a test", "this is a test!")
97

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏我和未来有约会

简练的视图模型 ViewModel

patterns & practices Developer Center 发布了 Unity Application Block 1.2 for Silver...

2169
来自专栏Pulsar-V

Save Camera Document

#pragma once #include "HCCamera.h" #include <time.h> #include <cstdio> #incl...

2828
来自专栏ml

md5算法原理一窥(其一)

    首先,需要了解的事,md5并不是传说中的加密算法,只是一种散列算法。其加密的算法并不是我们说所的那样固定不变,只是一种映射的关系。 所以解密MD5没有现...

3887
来自专栏菩提树下的杨过

linq to sql取出随机记录/多表查询/将查询出的结果生成xml

在手写sql的年代,如果想从sqlserver数据库随机取几条数据,可以利用order by NewId()轻松实现,要实现多表查询也可以用select * f...

2196
来自专栏搞前端的李蚊子

Html5模拟通讯录人员排序(sen.js)

// JavaScript Document  var PY_Json_Str = ""; var PY_Str_1 = ""; var PY_Str_...

5896
来自专栏MelonTeam专栏

Bitmap 源码阅读笔记

导语: Android 系统上的图片的处理,跟Bitmap 这个类脱不了关系,我们有必要去深入阅读里面的源码,以便在工作中能更好的处理Bitmap相关的问题...

2488
来自专栏码匠的流水账

spring security reactive获取security context

本文主要研究下reactive模式下的spring security context的获取。

1762
来自专栏专知

2018年SCI期刊最新影响因子排行,最高244,人工智能TPAMI9.455

2018年6月26日,最新的SCI影响因子正式发布,涵盖1万2千篇期刊。CA-Cancer J Clin 依然拔得头筹,其影响因子今年再创新高,达244.585...

1272
来自专栏c#开发者

XML Encryption in .Net

XML Encryption in .Net One of the new features being introduced with the Whidbey...

4367
来自专栏一个会写诗的程序员的博客

java.base.jmod

/Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/jmods$ jmod list java....

1112

扫码关注云+社区