You might want to look into the <a href="http://wordnet.princeton.edu/" rel="noreferrer">WordNet</a> project at Princeton University. One possible approach is to first run each phrase through a stop-word list to remove common words such as "a", "to", and "the". Then, for each remaining word in one phrase, you could compute its semantic similarity to each word in the other phrase using a distance measure based on WordNet. The distance measure could be something like the number of arcs you have to traverse in WordNet to get from word1 to word2.

Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.
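
A minimal sketch of the stop-word plus arc-counting idea above. To keep it self-contained it uses a tiny hand-made word graph instead of real WordNet data; the stop-word list, the graph, the 1 / (1 + arcs) scoring and the best-match averaging are all my own assumptions, not something the answer specifies.

```python
from collections import deque

# Hypothetical stop-word list (an assumption; use a real list in practice).
STOP_WORDS = {"a", "an", "the", "to", "of", "is", "on"}

# Toy undirected arcs standing in for WordNet relations (illustrative only).
ARCS = [
    ("dog", "canine"), ("canine", "carnivore"),
    ("cat", "feline"), ("feline", "carnivore"),
    ("carnivore", "animal"),
    ("car", "vehicle"), ("truck", "vehicle"),
]

def build_graph(arcs):
    """Turn a list of arcs into an adjacency map."""
    graph = {}
    for a, b in arcs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

GRAPH = build_graph(ARCS)

def arc_distance(word1, word2):
    """Number of arcs on the shortest path, or None if no path exists."""
    if word1 == word2:
        return 0
    if word1 not in GRAPH or word2 not in GRAPH:
        return None
    seen = {word1}
    queue = deque([(word1, 0)])
    while queue:  # breadth-first search finds the shortest path first
        node, dist = queue.popleft()
        for neighbour in GRAPH[node]:
            if neighbour == word2:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def word_similarity(word1, word2):
    """Map an arc count to (0, 1]; unreachable pairs score 0."""
    dist = arc_distance(word1, word2)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)

def content_words(phrase):
    """Lower-case, split on whitespace, drop stop words."""
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]

def phrase_similarity(phrase1, phrase2):
    """Average each word's best match in the other phrase, in both directions."""
    words1, words2 = content_words(phrase1), content_words(phrase2)
    if not words1 or not words2:
        return 0.0

    def directed(src, dst):
        return sum(max(word_similarity(a, b) for b in dst) for a in src) / len(src)

    return (directed(words1, words2) + directed(words2, words1)) / 2

print(phrase_similarity("the dog", "a cat"))
print(phrase_similarity("the dog", "the car"))
```

With real WordNet data, I believe NLTK's `Synset.shortest_path_distance` and `Synset.path_similarity` could take the place of `arc_distance`, but I have not checked that against a specific NLTK version.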

I would look into latent semantic indexing for this. I believe you can create something similar to a vector space search index, but with semantically related terms being closer together, i.e. having a smaller angle between them. If I learn more I will post here.
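
For reference, the "smaller angle" idea can be written down using the usual latent semantic indexing formulation. The notation is mine, not the answer's: X is a term-document matrix, k is the number of latent dimensions kept, and p and q are the term-count vectors of the two phrases.

```latex
% Rank-k truncated SVD of the term-document matrix
X \approx U_k \Sigma_k V_k^{\top}

% Fold each phrase vector into the k-dimensional latent space
\hat{p} = \Sigma_k^{-1} U_k^{\top} p, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Similarity is the cosine of the angle between the folded-in vectors
\operatorname{sim}(p, q) = \cos\theta
  = \frac{\hat{p} \cdot \hat{q}}{\lVert \hat{p} \rVert \, \lVert \hat{q} \rVert}
```

A cosine close to 1 means a small angle, so the phrases use terms that tend to co-occur in the corpus even if they share no words directly.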

Sorry to dig up a 6-year-old question, but as I just came across this post today, I'll throw in an answer in case anyone else is looking for something similar.

cortical.io has developed a process for calculating the semantic similarity of two expressions and they have a <a href="http://www.cortical.io/demos/similarity-explorer/" rel="nofollow">demo of it up on their website</a>. They offer a <a href="http://www.cortical.io/developers.html" rel="nofollow">free API providing access to the functionality</a>, so you can use it in your own application without having to implement the algorithm yourself.

One simple solution is to use the dot product of character n-gram vectors. This is robust to changes in word order (which many edit distance metrics are not) and handles many of the issues that stemming is meant to address. It also sidesteps the AI-complete problem of full semantic understanding.

To compute the n-gram vector, just pick a value of n (say, 3), and hash every 3-character sequence in the phrase into a vector. Normalize the vector to unit length, then take the dot product of the two phrases' vectors to measure similarity.
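
The steps above can be sketched roughly as follows. This is only an illustration: it uses a sparse `Counter` keyed by n-gram rather than a fixed-size hashed vector, and the lower-casing, whitespace collapsing and space padding are my own assumptions.

```python
import math
from collections import Counter

def char_ngrams(phrase, n=3):
    """Count the character n-grams of a phrase (sparse vector as a Counter)."""
    text = " ".join(phrase.lower().split())  # normalise case and whitespace
    padded = f" {text} "                      # padding marks word boundaries
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def ngram_similarity(phrase1, phrase2, n=3):
    """Dot product of the two unit-length n-gram vectors (cosine similarity)."""
    v1, v2 = char_ngrams(phrase1, n), char_ngrams(phrase2, n)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    dot = sum(c * v2[gram] for gram, c in v1.items())
    return dot / (norm1 * norm2)

print(ngram_similarity("red car", "car red"))   # word order swapped
print(ngram_similarity("red car", "blue sky"))  # unrelated characters
```

Because the counts are non-negative, the result always lies between 0 and 1, which matches the output range asked for in the question. Note that it measures surface overlap, so synonyms with different spellings still score low.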

This approach has been described in 
<a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1551-6709.2010.01106.x/abstract" rel="nofollow noreferrer">J. Mitchell and M. Lapata, “Composition in Distributional Models of Semantics,” Cognitive Science, vol. 34, no. 8, pp. 1388–1429, Nov. 2010, DOI 10.1111/j.1551-6709.2010.01106.x</a>.

I would have a look at statistical techniques that take into account the probability of each word appearing in a sentence. This lets you give less weight to common words such as 'and', 'or', 'the' and more weight to words that appear less regularly and are therefore better discriminators. For example, if you have two sentences:

1) The smith-waterman algorithm gives you a similarity measure between two strings.
2) We have reviewed the smith-waterman algorithm and we found it to be good enough for our project.

The fact that the two sentences share the word "smith-waterman" and the word "algorithm" (which are not as common as 'and', 'or', etc.) will allow you to say that the two sentences might indeed be talking about the same topic.
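
One way to sketch this weighting is inverse document frequency (IDF) combined with cosine similarity. That is my choice of technique, not necessarily what the answer had in mind, and the small reference corpus and the smoothing formula below are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical reference corpus used only to estimate how common each word is.
CORPUS = [
    "The smith-waterman algorithm gives you a similarity measure between two strings.",
    "We have reviewed the smith-waterman algorithm and we found it to be good enough for our project.",
    "The weather was good and the food was good.",
    "We walked to the station and took the train.",
    "You can have tea or coffee and the cake.",
]

def tokenize(text):
    """Lower-case word tokens; hyphenated words stay in one piece."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

# Document frequency: in how many corpus sentences each word appears.
DOC_FREQ = Counter()
for sentence in CORPUS:
    DOC_FREQ.update(set(tokenize(sentence)))

def idf(word):
    """Smoothed IDF: 0 for words found in every sentence, larger for rare words."""
    return math.log((1 + len(CORPUS)) / (1 + DOC_FREQ.get(word, 0)))

def weighted_vector(text):
    """Term counts scaled by IDF, as a sparse dict."""
    return {w: c * idf(w) for w, c in Counter(tokenize(text)).items()}

def similarity(text1, text2):
    """Cosine similarity between the IDF-weighted vectors of two sentences."""
    u, v = weighted_vector(text1), weighted_vector(text2)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    return dot / (norm_u * norm_v)

print(similarity(CORPUS[0], CORPUS[1]))  # share "smith-waterman" and "algorithm"
print(similarity(CORPUS[0], CORPUS[2]))  # share only "the"
```

In this toy corpus 'the' appears in every sentence, so its weight is zero and it contributes nothing, while "smith-waterman" and "algorithm" carry most of the score for the first pair.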

Summarizing, I would suggest you have a look at:
 1) String similarity measures;
 2) Statistical methods.

Hope this helps.

Try <a href="http://swoogle.umbc.edu/SimService/" rel="nofollow">SimService</a>, which provides a service for computing top-n similar words and phrase similarity.

This requires that your algorithm actually know what you're talking about. It can be done in some rudimentary form by just comparing words and looking for synonyms, etc., but any sort of accurate result would require some form of intelligence.

Take a look at <a href="http://mkusner.github.io/publications/WMD.pdf" rel="nofollow noreferrer">http://mkusner.github.io/publications/WMD.pdf</a>. This paper describes an algorithm called Word Mover's Distance that tries to capture semantic similarity. It relies on the word-to-word similarities encoded by word2vec embeddings. Using it with the pretrained GoogleNews-vectors-negative300 model yields good results.
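
To give a feel for the idea, here is a rough sketch of the relaxed lower bound on Word Mover's Distance, where each word simply sends all its weight to the nearest word in the other phrase. It is not the full optimal-transport computation, and the 2-D vectors are made-up stand-ins for word2vec embeddings (an assumption purely for illustration).

```python
import math
from collections import Counter

# Hypothetical 2-D embeddings; real word2vec vectors have hundreds of dimensions.
EMBEDDINGS = {
    "obama": (1.0, 0.0),
    "president": (0.9, 0.1),
    "speaks": (0.0, 1.0),
    "greets": (0.1, 0.9),
    "media": (0.5, 0.5),
    "press": (0.6, 0.4),
    "banana": (-1.0, -1.0),
}

def nbow(words):
    """Normalised bag of words: each known word's share of the phrase."""
    counts = Counter(w for w in words if w in EMBEDDINGS)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_distance(word1, word2):
    """Euclidean distance between two word embeddings."""
    a, b = EMBEDDINGS[word1], EMBEDDINGS[word2]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_sided_cost(src, dst):
    """Each source word moves all of its weight to its nearest target word."""
    return sum(mass * min(word_distance(w, v) for v in dst)
               for w, mass in src.items())

def relaxed_wmd(words1, words2):
    """Larger of the two one-sided costs; a lower bound on the full distance."""
    d1, d2 = nbow(words1), nbow(words2)
    if not d1 or not d2:
        raise ValueError("each phrase needs at least one word with an embedding")
    return max(one_sided_cost(d1, d2), one_sided_cost(d2, d1))

phrase = ["obama", "speaks", "media"]
print(relaxed_wmd(phrase, ["president", "greets", "press"]))  # small distance
print(relaxed_wmd(phrase, ["banana"]))                        # large distance
```

Note that this is a distance, so lower means more similar; it would still need to be mapped to the 0 to 1 range asked for in the question. As far as I know, gensim's `KeyedVectors` exposes a `wmdistance` method that computes the full distance on real embeddings, but check its documentation before relying on that.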

input: phrase 1, phrase 2

output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing

Is there an algorithm that tells the semantic similarity of two phrases?
