Q: How do I write a program to find out whether certain words are similar?

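Stack Overflow user

Asked 2013-01-04 07:29:38

5 answers, 5.4K views, 0 followers, 5 votes

For example: "college", "schoolwork", and "academy" belong in the same cluster, and "essays", "scholarships", and "money" also belong in the same cluster. Is this an ML problem or an NLP problem?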

machine-learning

nlp

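5 Answers

Stack Overflow user

Answered 2013-02-01 09:26:30

It depends on how strict your definition of "similar" is.

Machine learning techniques

As others have pointed out, you can use latent semantic analysis or the related latent Dirichlet allocation.

Semantic similarity and WordNet

As was pointed out, you may want to use an existing resource for something like this.

Many research papers (example) use the term semantic similarity. It is usually computed by finding the distance between two words on a graph in which a word is a child of another word if it is a type of that word. For example, "songbird" is a child of "bird". If you wish, semantic similarity can be used as the distance metric for creating clusters.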
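To make the graph-distance idea concrete, here is a minimal sketch in plain Python. The taxonomy, the `PARENT` table, and the function names are hypothetical illustrations rather than WordNet data or NLTK's API; the score mirrors the common "one over one plus path length" definition of path similarity.

```python
# Toy "is-a" taxonomy mapping child -> parent. Hypothetical data, not WordNet.
PARENT = {
    "songbird": "bird",
    "parrot": "bird",
    "bird": "vertebrate",
    "dog": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "animal",
}


def ancestors(word):
    """Return [word, parent, grandparent, ...] up to the root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_distance(word1, word2):
    """Count the edges on the path through the lowest common ancestor."""
    depth_in_chain1 = {node: i for i, node in enumerate(ancestors(word1))}
    for j, node in enumerate(ancestors(word2)):
        if node in depth_in_chain1:
            return depth_in_chain1[node] + j
    return None  # the two words share no ancestor


def path_similarity(word1, word2):
    """Map the distance into (0, 1]; identical words score 1.0."""
    distance = path_distance(word1, word2)
    return None if distance is None else 1.0 / (1 + distance)


print(path_similarity("songbird", "bird"))
print(path_similarity("songbird", "dog"))
```

In this toy tree "songbird" sits one edge from "bird" and four edges from "dog", so the first score is higher than the second. A real resource such as WordNet has many more intermediate nodes, so its scores will differ.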

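Example implementation

In addition, if you put a threshold on the value of some semantic similarity measure, you can get a boolean True or False. Here is a gist I created (word_similarity.py) that uses NLTK's corpus reader for WordNet. Hopefully it points you in the right direction and gives you a few more search terms.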

def sim(word1, word2, lch_threshold=2.15, verbose=False):
    """Determine if two (already lemmatized) words are similar or not.

    Call with verbose=True to print the WordNet senses from each word
    that are considered similar.

    The documentation for the NLTK WordNet Interface is available here:
    http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
    """
    from nltk.corpus import wordnet as wn
    from nltk.corpus.reader.wordnet import WordNetError
    results = []
    # Compare every sense (synset) of word1 with every sense of word2.
    for net1 in wn.synsets(word1):
        for net2 in wn.synsets(word2):
            try:
                lch = net1.lch_similarity(net2)
            except WordNetError:
                # LCH is only defined for senses with the same part of speech.
                continue
            if lch is None:
                # Some NLTK versions return None instead of raising.
                continue
            # The value to compare the LCH to was found empirically.
            # (The value is very application dependent. Experiment!)
            if lch >= lch_threshold:
                results.append((net1, net2))
    if not results:
        return False
    if verbose:
        for net1, net2 in results:
            print(net1)
            print(net1.definition())
            print(net2)
            print(net2.definition())
            print('path similarity:')
            print(net1.path_similarity(net2))
            print('lch similarity:')
            print(net1.lch_similarity(net2))
            print('wup similarity:')
            print(net1.wup_similarity(net2))
            print('-' * 79)
    return True

Example output

>>> sim('college', 'academy')
True

>>> sim('essay', 'schoolwork')
False

>>> sim('essay', 'schoolwork', lch_threshold=1.5)
True

>>> sim('human', 'man')
True

>>> sim('human', 'car')
False

>>> sim('fare', 'food')
True

>>> sim('fare', 'food', verbose=True)
Synset('fare.n.04')
the food and drink that are regularly served or consumed
Synset('food.n.01')
any substance that can be metabolized by an animal to give energy and build tissue
path similarity:
0.5
lch similarity:
2.94443897917
wup similarity:
0.909090909091
-------------------------------------------------------------------------------
True

>>> sim('bird', 'songbird', verbose=True)
Synset('bird.n.01')
warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings
Synset('songbird.n.01')
any bird having a musical call
path similarity:
0.25
lch similarity:
2.25129179861
wup similarity:
0.869565217391
-------------------------------------------------------------------------------
True

>>> sim('happen', 'cause', verbose=True)
Synset('happen.v.01')
come to pass
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
Synset('find.v.01')
come upon, as if by accident; meet with
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
True
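Since the question asks for clusters rather than pairwise answers, one possible next step is to treat a boolean predicate such as `sim` as an edge between two words and take connected components. The sketch below assumes nothing about WordNet: `toy_sim` and `SIMILAR_PAIRS` are hypothetical stand-ins so the example runs without NLTK, and `cluster` is an illustrative helper, not part of the gist.

```python
def cluster(words, is_similar):
    """Group words into connected components of the is_similar relation."""
    clusters = []
    for word in words:
        merged, rest = [word], []
        for existing in clusters:
            # Merge every cluster that holds a word similar to this one.
            if any(is_similar(word, other) for other in existing):
                merged.extend(existing)
            else:
                rest.append(existing)
        clusters = rest + [merged]
    return clusters


# Hypothetical stand-in for sim(); swap in the WordNet-based function.
SIMILAR_PAIRS = {
    frozenset(("college", "academy")),
    frozenset(("academy", "schoolwork")),
    frozenset(("scholarship", "money")),
}


def toy_sim(word1, word2):
    return frozenset((word1, word2)) in SIMILAR_PAIRS


words = ["college", "money", "academy", "scholarship", "schoolwork", "essay"]
print(cluster(words, toy_sim))
```

Note that this links words transitively: "college" and "schoolwork" land in the same group because both are similar to "academy", even though they are not listed as a direct pair. Whether that is desirable depends on the application, much like the threshold itself.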

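15 votes

Stack Overflow user

Answered 2013-01-04 07:32:50

I suppose you could build your own database of such associations using ML and NLP techniques, but you could also consider querying an existing resource such as WordNet to get the job done.

3 votes

Stack Overflow user

Answered 2013-01-09 07:30:09

If you have a sizable collection of documents related to the topic of interest, you might want to look at Latent Dirichlet Allocation. LDA is a fairly standard NLP technique that automatically clusters words into topics, where the similarity between words is determined by their co-occurrence in the same document (you can treat a single sentence as a document if that suits your needs better).

You will find a number of LDA toolkits available. We would need more details about your exact problem before recommending one over another. I am not enough of an expert to make that recommendation anyway, but I can at least suggest that you look at LDA.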
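The following is not LDA, which is best left to a toolkit. It is only a small sketch of the underlying signal described above, words counting as related when they appear in the same documents, using cosine similarity over binary document-incidence vectors. The corpus `DOCS` and both function names are made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical mini-corpus; each string is treated as one document.
DOCS = [
    "college essay deadline",
    "college scholarship essay",
    "scholarship money award",
    "money award bank",
]


def document_sets(docs):
    """Map each word to the set of document indices it appears in."""
    occurrences = defaultdict(set)
    for index, text in enumerate(docs):
        for word in text.lower().split():
            occurrences[word].add(index)
    return occurrences


def cooccurrence_similarity(word1, word2, occurrences):
    """Cosine similarity of the words' binary document-incidence vectors."""
    docs1 = occurrences.get(word1, set())
    docs2 = occurrences.get(word2, set())
    if not docs1 or not docs2:
        return 0.0
    return len(docs1 & docs2) / math.sqrt(len(docs1) * len(docs2))


occurrences = document_sets(DOCS)
print(cooccurrence_similarity("college", "essay", occurrences))
print(cooccurrence_similarity("scholarship", "money", occurrences))
print(cooccurrence_similarity("college", "bank", occurrences))
```

With this corpus, "college" and "essay" always appear together and score highest, "scholarship" and "money" share only one of their documents, and "college" and "bank" never co-occur. A topic model goes further by inferring latent topics instead of comparing raw incidence, so treat this as intuition only.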

2 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/14148986
