n-grams in Python: four-, five-, six-grams?

Stack Overflow user

Asked 2013-07-09 00:35:31

15 answers, 207.3K views, 0 followers, 156 votes

I'm looking for a way to split a text into n-grams. Normally I would do something like this:

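import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
# Split into words first; passing the raw string would yield character bigrams
string_bigrams = bigrams(string.split())
# bigrams() returns a generator in NLTK 3, so materialize it before printing
print(list(string_bigrams))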

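I know that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams, or even hundred-grams?

Thanks!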

python

string

nltk

n-gram

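15 Answers

Stack Overflow user

Accepted answer

Posted 2013-07-09 20:10:39

Other users have given great answers based on native Python. But here is the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

nltk has an ngrams function that people seldom use. That is not because n-grams are hard to read, but because training a model on n-grams with n > 3 leads to severe data sparsity.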

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

# Each item is a tuple of n consecutive tokens
for grams in sixgrams:
    print(grams)
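
The data-sparsity remark above can be made concrete with a small standard-library sketch. It is only an illustration, not part of the original answer: the helper names sliding_ngrams and singleton_ratio and the toy sentence are assumptions made up for this example. It measures what share of the distinct n-grams occur exactly once as n grows.

```python
from collections import Counter


def sliding_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def singleton_ratio(tokens, n):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = Counter(sliding_ngrams(tokens, n))
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy corpus (hypothetical), chosen so that some n-grams repeat
text = "the cat sat on the mat and the cat sat on the rug"
tokens = text.split()

for n in (1, 2, 3, 4, 6):
    print(n, len(sliding_ngrams(tokens, n)), round(singleton_ratio(tokens, n), 2))
```

On this toy sentence the share of one-off n-grams climbs with n, and at n = 6 every six-gram is unique, which is the sparsity problem in miniature.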

249 votes

Stack Overflow user

Posted 2013-07-09 00:54:06

I'm surprised that this hasn't shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

In [37]: for gram in grams: print(gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
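
A common variant of the slicing idea above builds the windows by zipping n staggered copies of the token list. This is a minimal sketch rather than anything from the original answer, and the function name zip_ngrams is made up for illustration. Because zip stops at the shortest input, asking for an n larger than the text simply gives no n-grams.

```python
def zip_ngrams(tokens, n):
    """Return an iterator of n-grams (tuples) built from n staggered slices."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] line up so that zipping them
    # yields each run of n consecutive tokens; zip stops at the shortest slice.
    return zip(*(tokens[i:] for i in range(n)))


words = "I really like python, it's pretty awesome.".split()

four_grams = list(zip_ngrams(words, 4))
for gram in four_grams:
    print(gram)

# A window longer than the text produces no n-grams at all
too_long = list(zip_ngrams(words, 100))
print(too_long)
```

The result is a list of tuples rather than lists, which makes the n-grams hashable and so usable as dictionary keys or Counter entries.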

74 votes

Stack Overflow user

Posted 2015-08-31 17:28:46

Using only nltk tools:

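from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Note: word_tokenize needs NLTK's Punkt tokenizer data, installed via nltk.download()
def get_ngrams(text, n):
    # Tokenize, slide a window of n tokens, and join each window into one string
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]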

Example output:

get_ngrams('This is the simplest text i could think of', 3)

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep each n-gram as a sequence of tokens instead of a single string, just remove the ' '.join call.
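
The difference between the two output shapes can be seen in a standard-library sketch. Two things here are assumptions made for illustration: the function simple_ngrams and its join flag are made up, and a plain regular expression stands in for word_tokenize, so punctuation handling will differ from nltk's.

```python
import re


def simple_ngrams(text, n, join=True):
    """Return n-grams as joined strings (join=True) or as token lists."""
    # r"\w+" keeps runs of letters, digits and underscores and drops
    # punctuation; a crude stand-in for nltk's word_tokenize.
    tokens = re.findall(r"\w+", text)
    windows = [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
    return [" ".join(w) for w in windows] if join else windows


sample = 'This is the simplest text i could think of'

joined = simple_ngrams(sample, 3)
as_lists = simple_ngrams(sample, 3, join=False)

print(joined[0])
print(as_lists[0])
```

The joined form is convenient for display or for use as a lookup key, while the list form keeps the individual tokens available for further processing.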

18 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/17531684
