I have a list of sentences like this:
errList = [ 'Ragu ate lunch but didnt have Water for drinks',
'Rams ate lunch but didnt have Gatorade for drinks',
'Saya ate lunch but didnt have :water for drinks',
'Raghu ate lunch but didnt have water for drinks',
'Hanu ate lunch but didnt have -water for drinks',
'Wayu ate lunch but didnt have water for drinks',
'Viru ate lunch but didnt have .water 4or drinks',
'kk ate lunch & icecream but did have Water for drinks',
'M ate lunch &and icecream but did have Gatorade for drinks',
'Parker ate lunch icecream but didnt have :water for drinks',
'Sassy ate lunch and icecream but didnt have water for drinks',
'John ate lunch and icecream but didnt have -water for drinks',
'Pokey ate lunch and icecream but didnt have Water for drinks',
'Laila ate lunch and icecream but did have water 4or drinks',
]
I want to find the counts of the longest phrases/parts of the sentences across the elements of the list (a phrase must be more than 2 words). For the example above, the output would look roughly like this (longest phrase as key, count as value):
{ 'ate lunch but didnt have': 7,
'water for drinks': 7,
'ate lunch and icecream': 4,
'didnt have water': 3,
'didnt have Water': 2 # case sensitive
}
Using the re module seems out of the question, since the problem is closer to sequence matching; maybe nltk, or maybe scikit-learn? I know a bit of NLP and scikit, but not enough to solve this. If I do solve it, I'll post it here.
Posted on 2018-05-31 16:48:17
Using scikit-learn with a little numpy foo isn't too painful either. One caveat, though: this only uses the preprocessing defaults, so if the punctuation in your dataset matters to you, you'll need to adjust them.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Find all the phrases >2 up to the max length
cv = CountVectorizer(ngram_range=(3, max([len(x.split(' ')) for x in errList])))
# Get the counts of the phrases
err_counts = cv.fit_transform(errList)
# Get the sum of each of the phrases
err_counts = err_counts.sum(axis=0)
# Mess about with the types, sparsity is annoying
err_counts = np.squeeze(np.asarray(err_counts))
# Retrieve the actual phrases that we're working with
feat_names = np.array(cv.get_feature_names())
# We don't have to sort here, but it's nice to if you want to print anything
err_counts_sorted = err_counts.argsort()[::-1]
feat_names = feat_names[err_counts_sorted]
err_counts = err_counts[err_counts_sorted]
# This is the dictionary that you were after
err_dict = dict(zip(feat_names, err_counts))
Here are the top few results:
11 but didnt have
10 have water for drinks
10 have water for
10 water for drinks
10 but didnt have water
10 didnt have water
9 but didnt have water for drinks
9 but didnt have water for
9 didnt have water for drinks
9 didnt have water for
Posted on 2018-06-02 04:17:07
If you don't want to bother with external libraries, you can do this with just the stdlib (though it may be much slower than some of the alternatives):
import collections
import itertools
def gen_ngrams(sentence):
    words = sentence.split()  # or re.findall(r'\b\w+\b'), or whatever
    n_words = len(words)
    for i in range(n_words - 2):
        for j in range(i + 3, n_words + 1):  # + 1 so ngrams reach the last word
            yield ' '.join(words[i:j])  # Assume normalization of spaces
def count_ngrams(sentences):
    return collections.Counter(
        itertools.chain.from_iterable(
            gen_ngrams(sentence) for sentence in sentences
        )
    )
counts = count_ngrams(errList)
dict(counts.most_common(10))
That gets you:
{'but didnt have': 11,
'ate lunch but': 7,
'ate lunch but didnt': 7,
'ate lunch but didnt have': 7,
'lunch but didnt': 7,
'lunch but didnt have': 7,
'icecream but didnt': 4,
'icecream but didnt have': 4,
'ate lunch and': 4,
'ate lunch and icecream': 4}
Posted on 2018-05-24 18:39:43
Not a complete solution, but to help you along, the following will give you a dictionary of ngrams and their counts. The next step, as Bill Bell points out in the comments, is to filter out the shorter subsequences. That (as also noted in the comments) means deciding on your maximum length, or indeed on what defines a phrase...
from nltk import ngrams, word_tokenize
from collections import defaultdict

min_ngram_length = 1
# Longest possible ngram is the word count of the longest sentence
max_ngram_length = max(len(word_tokenize(x)) for x in errList)

d = defaultdict(int)
for item in errList:
    tokens = word_tokenize(item)
    for i in range(min_ngram_length, max_ngram_length + 1):
        for ngram in ngrams(tokens, i):
            d[ngram] += 1

for pair in sorted(d.items(), key=lambda x: x[1], reverse=True):
    print(pair)
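A minimal sketch of that filtering step (the toy `d` below stands in for the real counts built above, with token tuples as keys; the rule used here, dropping any ngram that occurs inside a longer ngram with the same count, is just one possible definition of "longest phrase"):

```python
from collections import Counter

# Toy counts standing in for the dict d built above (keys are token tuples)
d = Counter({
    ('ate', 'lunch', 'but'): 7,
    ('ate', 'lunch', 'but', 'didnt', 'have'): 7,
    ('water', 'for', 'drinks'): 7,
    ('didnt', 'have'): 11,
})

def is_subseq(short, long_):
    """True if `short` appears as a contiguous run inside `long_`."""
    n = len(short)
    return any(long_[i:i + n] == short for i in range(len(long_) - n + 1))

longest = {
    gram: count
    for gram, count in d.items()
    if len(gram) > 2  # phrases must be more than 2 words
    and not any(
        other != gram and count == c and is_subseq(gram, other)
        for other, c in d.items()
    )
}
# ('ate', 'lunch', 'but') is dropped because it sits inside the
# equally frequent 5-gram; ('didnt', 'have') is too short.
```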
https://stackoverflow.com/questions/50494956