
Update

Since scikit-learn 0.14 the format has changed to:

<pre><code>n_grams = CountVectorizer(ngram_range=(1, 5))
</code></pre>

Full example:

<pre><code>test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

from sklearn.feature_extraction.text import CountVectorizer

c_vec = CountVectorizer(ngram_range=(1, 5))

# input to fit_transform() should be an iterable with strings
ngrams = c_vec.fit_transform([test_str1, test_str2])

# needs to happen after fit_transform()
vocab = c_vec.vocabulary_

# total count of each n-gram across both documents
count_values = ngrams.toarray().sum(axis=0)

# output n-grams, most frequent first
for ng_count, ng_text in sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True):
    print(ng_count, ng_text)
</code></pre>

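This outputs the following, shown here as printed by Python 2 (Python 3 prints each count and n-gram without the tuple syntax). Note that the word <code>I</code> is dropped not because it is a stop word (it isn't) but because of its length: by default, <code>CountVectorizer</code> only keeps tokens of two or more characters (see <a href="https://stackoverflow.com/a/20743758/">https://stackoverflow.com/a/20743758/</a>):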

<pre><code>&gt; (3, u'to')
&gt; (3, u'from')
&gt; (2, u'ngrams')
&gt; (2, u'need')
&gt; (1, u'words')
&gt; (1, u'trigrams but need better solutions')
&gt; (1, u'trigrams but need better')
...
</code></pre>
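To see why <code>I</code> disappears, here is a minimal sketch that applies the tokenizing regex directly with the standard <code>re</code> module. It assumes the pattern below is the documented default <code>token_pattern</code> of <code>CountVectorizer</code>; the second pattern and the variable names are only illustrative.

```python
import re

# Assumption: this is the default token_pattern documented for CountVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Illustrative alternative that also matches one-character tokens.
KEEP_SHORT_PATTERN = r"(?u)\b\w+\b"

text = "I know how to exclude bigrams from trigrams, but i need better solutions."
lowered = text.lower()  # CountVectorizer lowercases by default

# Tokens the default pattern keeps: every word of two or more characters.
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, lowered)
# Tokens the relaxed pattern keeps: every word, including "i".
all_tokens = re.findall(KEEP_SHORT_PATTERN, lowered)

print(default_tokens)
print(all_tokens)
```

If the single-character words matter, passing <code>token_pattern=r"(?u)\b\w+\b"</code> to <code>CountVectorizer</code> should keep them in the n-grams.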

In my opinion, this should be much simpler these days. You can try libraries like <a href="https://textacy.readthedocs.io/en/latest/api_reference.html#textacy.doc.Doc.to_bag_of_terms" rel="noreferrer"><code>textacy</code></a>, but they sometimes come with their own complications, such as initializing a Doc, which currently does not work in v0.6.2 <a href="https://textacy.readthedocs.io/en/latest/api_reference.html#textacy.doc.Doc" rel="noreferrer">as shown in their docs</a>. <a href="https://stackoverflow.com/q/51431112/">If Doc initialization worked as promised</a>, the following would work in theory (but it doesn't):

<pre><code>test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

import textacy

# some version of the following line
doc = textacy.Doc([test_str1, test_str2])

# the ngrams argument lists each n to include, so 1 to 5 is spelled out
ngrams = doc.to_bag_of_terms(ngrams={1, 2, 3, 4, 5}, as_strings=True)
print(ngrams)
</code></pre>

Old answer

<code>WordNGramAnalyzer</code> has indeed been deprecated since scikit-learn 0.11. Creating n-grams and getting term frequencies are now combined in <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer" rel="noreferrer">sklearn.feature_extraction.text.CountVectorizer</a>. You can create all n-grams of length 1 to 5 as follows:

<pre><code>n_grams = CountVectorizer(min_n=1, max_n=5)
</code></pre>

More examples and information can be found in scikit-learn's documentation about <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction" rel="noreferrer">text feature extraction</a>.


If you want to generate the raw ngrams (and perhaps count them yourself), there's also <code>nltk.util.ngrams(sequence, n)</code>. It generates a sequence of ngrams for any value of n. It also has options for padding; see the documentation.
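As a rough illustration of counting raw ngrams yourself, here is a standard-library sketch. It is not nltk's implementation: <code>word_ngrams</code> and <code>ngram_frequencies</code> are hypothetical helper names, the tokenization is a plain <code>split()</code>, and there is no padding. Like <code>nltk.util.ngrams</code>, it yields each ngram as a tuple of tokens.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple (no padding)."""
    return zip(*(tokens[i:] for i in range(n)))


def ngram_frequencies(tokens, min_n=1, max_n=5):
    """Count all n-grams with min_n <= n <= max_n in a list of tokens."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        counts.update(word_ngrams(tokens, n))
    return counts


# Simplified tokenization: lowercase text with punctuation already removed.
tokens = "i need to get most popular ngrams from text ngrams length must be from 1 to 5 words".split()
freqs = ngram_frequencies(tokens)

# Print the five most frequent n-grams with their counts.
for ngram, count in freqs.most_common(5):
    print(count, " ".join(ngram))
```

<code>Counter.most_common(N)</code> gives the N best ngrams together with their frequencies, and <code>freqs</code> itself serves as the frequency list.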


Looking at <a href="http://nltk.org/_modules/nltk/util.html" rel="nofollow">http://nltk.org/_modules/nltk/util.html</a>, I think that under the hood <code>nltk.util.bigrams()</code> and <code>nltk.util.trigrams()</code> are implemented using <code>nltk.util.ngrams()</code>.

I need to get the most popular ngrams from a text. Ngram length must be from 1 to 5 words.

I know how to get bigrams and trigrams. For example:

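<pre><code>import nltk

# Assumes `words` is a list of tokens and `filter_stops` is a function
# that returns True for tokens to drop (e.g. stopwords); both are defined elsewhere.
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # keep only bigrams seen at least 3 times
finder.apply_word_filter(filter_stops)
matches1 = finder.nbest(bigram_measures.pmi, 20)  # top 20 bigrams by PMI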
</code></pre>

However, I found out that scikit-learn can get ngrams of various lengths. For example, I can get ngrams with lengths from 1 to 5.

<pre><code>v = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=5))
</code></pre>

But WordNGramAnalyzer is now deprecated. My question is: how can I get the N best word collocations from my text, with collocation lengths from 1 to 5? I also need to get a FreqList of these collocations/ngrams.

Can I do that with nltk/scikit? I need to get combinations of ngrams with various lengths from one text.

For example, when using NLTK bigrams and trigrams, there are many situations in which my trigrams include my bigrams, or my trigrams are part of bigger 4-grams. For example:

bigrams: hello my
trigrams: hello my name

I know how to exclude bigrams from trigrams, but I need better solutions.

Python List of Ngrams with frequencies
