Question: Combining CountVectorizer with ngrams in Python

Stack Overflow user

Asked 2017-12-19 04:37:54

1 answer, 6.7K views, 0 followers, 1 vote

The task is to classify male and female names using ngrams. There is a data file such as:

name        is_male
Dorian      1
Jerzy       1
Deane       1
Doti        0
Betteann    0
Donella     0

The specific requirement is to use

from nltk.util import ngrams

for this task, creating ngrams with n = 2, 3, 4.

I made a list of the names and then applied ngrams to it:

from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

test_ngrams = []
for name in name_list:
    test_ngrams.append(list(ngrams(name,3)))

Now I need to vectorize all of this somehow so it can be used for classification. I tried

X_train = count_vect.fit_transform(test_ngrams)

and received:

AttributeError: 'list' object has no attribute 'lower'

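I understand that a list is the wrong input type. Could someone explain how I should do this so that I can use MultinomialNB afterwards? Am I approaching this the right way at all? Thanks in advance!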

scikit-learn

nltk

countvectorizer

python


1 Answer

Stack Overflow user

Accepted answer

Answered 2017-12-19 11:54:53

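You are passing a sequence of lists to the vectorizer, which is why you receive the AttributeError: by default the vectorizer lowercases each document by calling .lower() on it, and a list has no such method. Instead, you should pass an iterable of strings. From the CountVectorizer documentation: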

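fit_transform(raw_documents, y=None): Learn the vocabulary dictionary and return the term-document matrix. This is equivalent to fit followed by transform, but more efficiently implemented. Parameters: raw_documents: iterable. An iterable which yields either str, unicode or file objects.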
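The difference between the two input types can be checked directly. The following is a minimal sketch, assuming a default CountVectorizer; the variable names `docs_as_lists`, `docs_as_strings` and `error_message` are made up for illustration, and the list input only mimics the shape of the question's `test_ngrams`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each "document" is a list of tuples, mirroring the shape of test_ngrams.
docs_as_lists = [
    [("D", "o", "r"), ("o", "r", "i")],
    [("D", "o", "t"), ("o", "t", "i")],
]

try:
    CountVectorizer().fit_transform(docs_as_lists)
    error_message = None
except AttributeError as exc:
    # The default preprocessing tries to lowercase the list itself.
    error_message = str(exc)

print(error_message)

# Plain strings are what the default analyzer expects.
docs_as_strings = ["Dorian", "Doti"]
X = CountVectorizer().fit_transform(docs_as_strings)
print(X.shape)
```

With string input the default analyzer treats each name as a single word token, so this only demonstrates the accepted input type, not a useful feature set for names.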

To answer your question, CountVectorizer can create N-grams itself through ngram_range (the following generates bigrams):

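from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(ngram_range=(2,2))

corpus = [
    'This is the first document.',
    'This is the second second document.',
]
X = count_vect.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement.
print(count_vect.get_feature_names_out().tolist())
['first document', 'is the', 'second document', 'second second', 'the first', 'the second', 'this is']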
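For the names in the question, the ngrams would be taken over characters rather than words. A minimal sketch, assuming scikit-learn's built-in character analyzer is acceptable (it does not use NLTK); the two names are taken from the question's sample data and `char_vect` is just an illustrative variable name:

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["Dorian", "Doti"]

# analyzer="char" counts character ngrams instead of word ngrams;
# ngram_range=(2, 4) covers n = 2, 3 and 4 in a single vectorizer.
char_vect = CountVectorizer(analyzer="char", ngram_range=(2, 4))
X = char_vect.fit_transform(names)

# The keys of vocabulary_ are the ngrams that were found (lowercased by default).
print(sorted(char_vect.vocabulary_))
```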

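Update:

Since you mentioned that you have to generate the ngrams with NLTK, we need to override part of CountVectorizer's default behavior, namely the analyzer, which converts raw strings into features:

analyzer: string, {'word', 'char', 'char_wb'} or callable ... If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input.

Since we already supply the ngrams ourselves, an identity function is enough: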

count_vect = CountVectorizer(
    analyzer=lambda x:x
)

Full example combining NLTK ngrams and CountVectorizer:

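from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
]

def build_ngrams(text, n=2):
    # Split on whitespace and return the word ngrams as tuples.
    tokens = text.lower().split()
    return list(ngrams(tokens, n))

corpus = [build_ngrams(document) for document in corpus]

# Each document is already a list of ngram tuples, so the analyzer passes it through unchanged.
count_vect = CountVectorizer(
    analyzer=lambda x:x
)

X = count_vect.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; the keys of vocabulary_ are the features.
print(sorted(count_vect.vocabulary_))
[('first', 'document.'), ('is', 'the'), ('second', 'document.'), ('second', 'second'), ('the', 'first'), ('the', 'second'), ('this', 'is')]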
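Applied to the names from the question, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the six rows are the sample data shown in the question, `char_ngrams` is a hypothetical helper (not part of NLTK or scikit-learn), and the name "Dorothy" is an invented test input. Six names are far too few to train a meaningful classifier, so no prediction result is claimed here.

```python
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample rows from the question; a real run would load the whole data file.
names = ["Dorian", "Jerzy", "Deane", "Doti", "Betteann", "Donella"]
is_male = [1, 1, 1, 0, 0, 0]

def char_ngrams(name, sizes=(2, 3, 4)):
    """Hypothetical helper: character ngrams of a name for every n in sizes."""
    letters = name.lower()
    return ["".join(gram) for n in sizes for gram in ngrams(letters, n)]

# Passing the function as the analyzer lets the vectorizer accept raw name strings.
count_vect = CountVectorizer(analyzer=char_ngrams)
X_train = count_vect.fit_transform(names)

clf = MultinomialNB()
clf.fit(X_train, is_male)

# New names go through transform (not fit_transform) so the fitted vocabulary is reused.
X_new = count_vect.transform(["Dorothy"])
print(clf.predict(X_new))
```

Using a named function instead of a lambda as the analyzer also keeps the fitted vectorizer picklable, which helps if the model is saved for later use.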