Background:
1) I have the following code, which removes stopwords using the nltk package:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in stopwords.words('english')]
2) This code removes stopwords such as "the", giving the result below:
['dog', 'bark', 'tree', 'sees', 'squirrel']
3) I modified the stopwords with the code below so that the word "not" is kept:
to_remove = ['not']
new_stopwords = set(stopwords.words('english')).difference(to_remove)
Problem:
4) But when I use new_stopwords in the following code:
your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in new_stopwords.words('english')]
5) because new_stopwords is a set, I get the following error:
AttributeError: 'set' object has no attribute 'words'
Question:
6) How can I use the newly defined new_stopwords to get the desired output:
['dog', 'not', 'bark', 'tree', 'sees', 'squirrel']
Posted on 2019-05-27 03:59:31
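You are very close, but you have misread the error message: the problem is not, as you put it, that "new_stopwords is a set", but that "a set has no words attribute". The .words('english') call belongs to the nltk stopwords corpus object, not to the set you built from its result.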
new_stopwords is a set, which means you can test membership against it directly in the list comprehension:
filtered_words = [word for word in lower_tokens if word not in new_stopwords]
You can also skip building a modified stopword list and simply use two conditions:
keep_list = ['not']
filtered_words = [word for word in lower_tokens if (word not in stopwords.words("english")) or (word in keep_list)]
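Both approaches can be wrapped in one small helper. The following is only a minimal sketch: remove_stopwords is a hypothetical helper name, ENGLISH_STOPWORDS is a tiny hand-picked stand-in for stopwords.words('english'), and str.split() stands in for word_tokenize, so that the example runs without downloading any NLTK data. With NLTK installed, you would pass stopwords.words('english') and the word_tokenize output instead.

```python
# ASSUMPTION: a tiny stand-in for stopwords.words('english'), used only so
# this sketch runs without the NLTK corpus. The real list is much longer.
ENGLISH_STOPWORDS = ["the", "does", "not", "at", "when", "it", "a"]


def remove_stopwords(tokens, stop_words, keep=()):
    """Return lowercased tokens with stopwords removed, except words in keep."""
    # Build the effective stopword set once, outside the comprehension,
    # so the stopword list is not rebuilt for every token.
    effective = set(stop_words).difference(keep)
    lower_tokens = [t.lower() for t in tokens]
    return [word for word in lower_tokens if word not in effective]


your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = your_string.split()  # stand-in for word_tokenize(your_string)

filtered_words = remove_stopwords(tokens, ENGLISH_STOPWORDS, keep=["not"])
print(filtered_words)
```

Building the set once also avoids calling stopwords.words("english") again for every token, which the two-condition version above does.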
https://stackoverflow.com/questions/56316811