I have a list of sentences like this:
errList = [ 'Ragu ate lunch but didnt have Water for drinks',
'Rams ate lunch but didnt have Gatorade for drinks',
'Saya ate lunch but didnt have :water for drinks',
'Raghu ate lunch but didnt have water for drinks',
'Hanu ate lunch but didnt have -water for drinks',
'Wayu ate lunch but didnt have water for drinks',
'Viru ate lunch but didnt have .water 4or drinks',
'kk ate lunch & icecream but did have Water for drinks',
'M ate lunch &and icecream but did have Gatorade for drinks',
'Parker ate lunch icecream but didnt have :water for drinks',
'Sassy ate lunch and icecream but didnt have water for drinks',
'John ate lunch and icecream but didnt have -water for drinks',
'Pokey ate lunch and icecream but didnt have Water for drinks',
'Laila ate lunch and icecream but did have water 4or drinks',
]
I want to find the counts of the longest phrases/parts of the sentences across the elements of the list (a phrase must be more than 2 words). For the example above, the output would look roughly like this (longest phrase as key, count as value):
{ 'ate lunch but didnt have': 7,
'water for drinks': 7,
'ate lunch and icecream': 4,
'didnt have water': 3,
'didnt have Water': 2 # case sensitive
}
Using the re module seems out of the question, since the problem is closer to sequence matching; maybe nltk, or maybe scikit-learn? I know a bit of NLP and scikit, but not enough to solve this. If I do solve it, I'll post it here.
Posted on 2018-05-31 16:48:17
Using scikit-learn with a little numpy foo isn't too painful either. One caveat, though: this only uses the preprocessing defaults, so if the punctuation in your dataset matters to you, you'll need to adjust them.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Find all the phrases >2 up to the max length
cv = CountVectorizer(ngram_range=(3, max([len(x.split(' ')) for x in errList])))
# Get the counts of the phrases
err_counts = cv.fit_transform(errList)
# Get the sum of each of the phrases
err_counts = err_counts.sum(axis=0)
# Mess about with the types, sparsity is annoying
err_counts = np.squeeze(np.asarray(err_counts))
# Retrieve the actual phrases that we're working with
feat_names = np.array(cv.get_feature_names())
# We don't have to sort here, but it's nice to if you want to print anything
err_counts_sorted = err_counts.argsort()[::-1]
feat_names = feat_names[err_counts_sorted]
err_counts = err_counts[err_counts_sorted]
# This is the dictionary that you were after
err_dict = dict(zip(feat_names, err_counts))
Here are the top few results:
11 but didnt have
10 have water for drinks
10 have water for
10 water for drinks
10 but didnt have water
10 didnt have water
9 but didnt have water for drinks
9 but didnt have water for
9 didnt have water for drinks
9 didnt have water for
Posted on 2018-06-02 04:17:07
If you don't want to bother with external libraries, you can do this with just the stdlib (though it may be much slower than some of the alternatives):
import collections
import itertools
def gen_ngrams(sentence):
    words = sentence.split()  # or re.findall(r'\b\w+\b'), or whatever
    n_words = len(words)
    for i in range(n_words - 2):
        for j in range(i + 3, n_words + 1):  # + 1 so ngrams reach the last word
            yield ' '.join(words[i:j])  # Assume normalization of spaces
def count_ngrams(sentences):
    return collections.Counter(
        itertools.chain.from_iterable(
            gen_ngrams(sentence) for sentence in sentences
        )
    )
counts = count_ngrams(errList)
dict(counts.most_common(10))
That gets you:
{'but didnt have': 11,
'ate lunch but': 7,
'ate lunch but didnt': 7,
'ate lunch but didnt have': 7,
'lunch but didnt': 7,
'lunch but didnt have': 7,
'icecream but didnt': 4,
'icecream but didnt have': 4,
'ate lunch and': 4,
'ate lunch and icecream': 4}
Posted on 2018-05-24 18:39:43
Not a complete solution, but to help you along, the following will give you a dictionary of ngrams and their counts. The next step, as Bill Bell points out in the comments, is to filter out the shorter subsequences. That (as also noted in the comments) means deciding on your maximum length, or indeed on what defines a phrase...
from nltk import ngrams, word_tokenize
from collections import defaultdict

min_ngram_length = 1
# Longest possible ngram is the word count of the longest sentence
max_ngram_length = max(len(word_tokenize(x)) for x in errList)

d = defaultdict(int)
for item in errList:
    tokens = word_tokenize(item)
    for i in range(min_ngram_length, max_ngram_length + 1):
        for ngram in ngrams(tokens, i):
            d[ngram] += 1

for pair in sorted(d.items(), key=lambda x: x[1], reverse=True):
    print(pair)
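A minimal sketch of that filtering step (the toy `d` below stands in for the real counts built above, with token tuples as keys; the rule used here, dropping any ngram that occurs inside a longer ngram with the same count, is just one possible definition of "longest phrase"):

```python
from collections import Counter

# Toy counts standing in for the dict d built above (keys are token tuples)
d = Counter({
    ('ate', 'lunch', 'but'): 7,
    ('ate', 'lunch', 'but', 'didnt', 'have'): 7,
    ('water', 'for', 'drinks'): 7,
    ('didnt', 'have'): 11,
})

def is_subseq(short, long_):
    """True if `short` appears as a contiguous run inside `long_`."""
    n = len(short)
    return any(long_[i:i + n] == short for i in range(len(long_) - n + 1))

longest = {
    gram: count
    for gram, count in d.items()
    if len(gram) > 2  # phrases must be more than 2 words
    and not any(
        other != gram and count == c and is_subseq(gram, other)
        for other, c in d.items()
    )
}
# ('ate', 'lunch', 'but') is dropped because it sits inside the
# equally frequent 5-gram; ('didnt', 'have') is too short.
```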
https://stackoverflow.com/questions/50494956