
Sequence matching algorithm in Python

Stack Overflow user
Asked on 2018-05-24 02:17:55
3 answers · 1.3K views · 0 followers · Score: 11

I have a list of sentences like this:

errList = [ 'Ragu ate lunch but didnt have Water for drinks',
            'Rams ate lunch but didnt have Gatorade for drinks',
            'Saya ate lunch but didnt have :water for drinks',
            'Raghu ate lunch but didnt have water for drinks',
            'Hanu ate lunch but didnt have -water for drinks',
            'Wayu ate lunch but didnt have water for drinks',
            'Viru ate lunch but didnt have .water 4or drinks',

            'kk ate lunch & icecream but did have Water for drinks',
            'M ate lunch &and icecream but did have Gatorade for drinks',
            'Parker ate lunch icecream but didnt have :water for drinks',
            'Sassy ate lunch and icecream but didnt have water for drinks',
            'John ate lunch and icecream but didnt have -water for drinks',
            'Pokey ate lunch and icecream but didnt have Water for drinks',
            'Laila ate lunch and icecream but did have water 4or drinks',
        ]

I want to find the counts of the longest phrases/parts (a phrase must be more than 2 words) shared across the elements of the list. For the example above, the output would look roughly like this (longest phrase as key, count as value):

{ 'ate lunch but didnt have': 7,
  'water for drinks': 7,
  'ate lunch and icecream': 4,
  'didnt have water': 3,
  'didnt have Water': 2    # case sensitive
}

This doesn't seem possible with the re module, since the problem is closer to sequence matching. Maybe with nltk, or perhaps scikit-learn? I know some NLP and scikit, but not enough to solve this. If I solve it, I'll post it here.

3 Answers

Stack Overflow user

Accepted answer

Posted 2018-05-31 16:48:17

This isn't too painful with scikit-learn and a little numpy foo. A word of warning though: I've only used the preprocessing defaults here; if you care about the punctuation in your dataset, you'll need to tweak them.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Find all the phrases >2 up to the max length 
cv = CountVectorizer(ngram_range=(3, max([len(x.split(' ')) for x in errList])))
# Get the counts of the phrases
err_counts = cv.fit_transform(errList)
# Get the sum of each of the phrases
err_counts = err_counts.sum(axis=0)
# Mess about with the types, sparsity is annoying
err_counts = np.squeeze(np.asarray(err_counts))
# Retrieve the actual phrases that we're working with
feat_names = np.array(cv.get_feature_names())  # on scikit-learn >= 1.0, use cv.get_feature_names_out()

# We don't have to sort here, but it's nice to if you want to print anything
err_counts_sorted = err_counts.argsort()[::-1]
feat_names = feat_names[err_counts_sorted]
err_counts = err_counts[err_counts_sorted]

# This is the dictionary that you were after
err_dict = dict(zip(feat_names, err_counts))

Here are the top few results:

11 but didnt have
10 have water for drinks
10 have water for
10 water for drinks
10 but didnt have water
10 didnt have water
9 but didnt have water for drinks
9 but didnt have water for
9 didnt have water for drinks
9 didnt have water for
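As the answer warns, CountVectorizer lowercases and strips punctuation by default, so 'Water' and ':water' collapse into 'water'. A minimal sketch of overriding both defaults (the token_pattern and the two sample sentences here are illustrative assumptions, not the answer's code):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sketch: override the preprocessing defaults so case and punctuation survive.
cv = CountVectorizer(
    lowercase=False,         # keep 'Water' distinct from 'water'
    token_pattern=r'\S+',    # whitespace-only tokenization keeps ':water', '-water'
    ngram_range=(3, 8),      # phrases of 3 up to 8 words
)
cv.fit(['Ragu ate lunch but didnt have Water for drinks',
        'Saya ate lunch but didnt have :water for drinks'])
```

After fitting, cv.vocabulary_ contains case-sensitive n-grams such as 'have Water for' and 'have :water for' as separate keys.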
Score: 6

Stack Overflow user

Posted 2018-06-02 04:17:07

If you don't want to bother with external libraries, you can do this with just the stdlib (although it may be quite a bit slower than some of the alternatives):

import collections
import itertools

def gen_ngrams(sentence):
    words = sentence.split()  # or re.findall(r'\b\w+\b', sentence), or whatever
    n_words = len(words)
    for i in range(n_words - 2):
        for j in range(i + 3, n_words + 1):  # n_words + 1 so phrases ending at the last word are included
            yield ' '.join(words[i:j])  # assumes spaces are already normalized


def count_ngrams(sentences):
    return collections.Counter(
        itertools.chain.from_iterable(
            gen_ngrams(sentence) for sentence in sentences
        )
    )

counts = count_ngrams(errList)
dict(counts.most_common(10))

Which gets you:

{'but didnt have': 11,
 'ate lunch but': 7,
 'ate lunch but didnt': 7,
 'ate lunch but didnt have': 7,
 'lunch but didnt': 7,
 'lunch but didnt have': 7,
 'icecream but didnt': 4,
 'icecream but didnt have': 4,
 'ate lunch and': 4,
 'ate lunch and icecream': 4}
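The Counter above counts every sub-phrase, while the question asks for the longest ones. One way to post-process it (a sketch of my own, not part of the answer) is to drop any phrase contained in a longer phrase that occurs at least as often:

```python
def longest_only(counts):
    # Keep only "maximal" phrases: drop a phrase if some longer kept
    # phrase contains it and occurs at least as often.
    kept = {}
    for phrase, n in sorted(counts.items(),
                            key=lambda kv: len(kv[0]), reverse=True):
        if not any(phrase in longer and kept[longer] >= n for longer in kept):
            kept[phrase] = n
    return kept

pruned = longest_only({'but didnt have': 11,
                       'lunch but didnt have': 7,
                       'ate lunch but didnt have': 7})
```

Here 'lunch but didnt have' is dropped as a sub-phrase of an equally frequent longer phrase, while 'but didnt have' survives because it is strictly more frequent (11 vs 7) than any phrase containing it.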
Score: 5

Stack Overflow user

Posted 2018-05-24 18:39:43

Not a full solution, but to help you along, the following will get you a dict of ngrams and counts. The next step, as Bill Bell points out in the comments, is to filter out the shorter subsequences. That (as also noted in the comments) means deciding on your max length, or indeed on what defines a phrase...

from nltk import ngrams, word_tokenize
from collections import defaultdict
min_ngram_length = 1
max_ngram_length = max(len(word_tokenize(x)) for x in errList)  # longest sentence in words, not characters
d = defaultdict(int)
for item in errList:
    for i in range(min_ngram_length, max_ngram_length):
        for ngram in ngrams(word_tokenize(item), i):
            d[ngram] += 1
for pair in sorted(d.items(), key = lambda x: x[1], reverse=True):
    print(pair)
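Building on this, a stdlib-only sketch of the filtering step (using split() instead of word_tokenize to avoid the NLTK data download, and starting at 3 words per the question's definition of a phrase):

```python
from collections import Counter

def phrase_counts(sentences, min_len=3):
    # Count every phrase of at least min_len words; keys are joined
    # back into strings to match the question's expected dict format.
    c = Counter()
    for s in sentences:
        toks = s.split()
        for n in range(min_len, len(toks) + 1):
            for i in range(len(toks) - n + 1):
                c[' '.join(toks[i:i + n])] += 1
    return c

counts = phrase_counts(['a b c d', 'a b c e'])
```

On these two toy sentences, 'a b c' is counted twice, 'a b c d' once, and two-word phrases like 'a b' are excluded.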
Score: 2
Original content provided by Stack Overflow; translation supported by Tencent Cloud's Xiaowei IT-domain engine.
Source: https://stackoverflow.com/questions/50494956
