Q: Removing stopwords with a modified stopword list

Stack Overflow user

Asked on 2019-05-27 03:42:38

Answers: 1, Views: 139, Followers: 0, Votes: 2

Background:

1) I have the following code, which removes stopwords using the nltk package:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in stopwords.words('english')]

2) This code removes stopwords such as "the", giving the following result:

['dog', 'bark', 'tree', 'sees', 'squirrel']

3) I modified the stopwords with the code below so that the word "not" is kept:

to_remove = ['not']
new_stopwords = set(stopwords.words('english')).difference(to_remove)

Problem:

4) However, when I use new_stopwords in the following code:

your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in new_stopwords.words('english')]

5) I get the following error, because new_stopwords is a set:

AttributeError: 'set' object has no attribute 'words'

Question:

6) How can I use the newly defined new_stopwords to get the desired output:

['dog', 'not', 'bark', 'tree', 'sees', 'squirrel']

python-3.x

set

nltk

list-comprehension

stop-words

1 answer

Stack Overflow user

Accepted answer

Posted on 2019-05-27 03:59:31

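You are very close, but you have misread the error message: the problem is not, as you put it, that "new_stopwords is a set", but that a set has no attribute "words". The .words('english') call belongs to the NLTK stopwords corpus object, not to the set you built from its output.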
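To see where the error comes from, here is a small sketch that reproduces it without NLTK. The two-word set is only an illustrative stand-in for the real new_stopwords.

```python
# Stand-in for new_stopwords; any plain Python set behaves the same way.
new_stopwords = {"the", "a"}

# A set supports membership tests, but it has no .words() method.
has_words = hasattr(new_stopwords, "words")

try:
    new_stopwords.words("english")
except AttributeError as exc:
    message = str(exc)

print(has_words)
print(message)
```

The set itself is fine; only the extra .words('english') call has to go.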

new_stopwords is a set, which means you can use it directly in the list comprehension:

filtered_words = [word for word in lower_tokens if word not in new_stopwords]
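Put together, the fix might look like the sketch below. So that it runs without downloading any NLTK data, it assumes a short hand-written list (base_stopwords) in place of stopwords.words('english') and uses str.split() in place of word_tokenize; both substitutions are for illustration only.

```python
# Assumption: a tiny stand-in for stopwords.words('english').
# The real NLTK list is much longer.
base_stopwords = ["the", "does", "not", "at", "when", "it", "a"]

to_remove = ["not"]
# set.difference accepts any iterable, so passing a list works.
new_stopwords = set(base_stopwords).difference(to_remove)

your_string = "The dog does not bark at the tree when it sees a squirrel"
# Assumption: str.split() stands in for word_tokenize in this sketch.
lower_tokens = [t.lower() for t in your_string.split()]

# new_stopwords is already a set, so test membership on it directly.
filtered_words = [word for word in lower_tokens if word not in new_stopwords]
print(filtered_words)
```

With the real NLTK list the only changes would be the source of base_stopwords and the tokenizer.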

You can also skip building a modified stopword list altogether and just use two conditions:

keep_list = ['not']
filtered_words = [word for word in lower_tokens if (word not in stopwords.words("english")) or (word in keep_list)]
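One thing to be aware of with the two-condition version: the stopword expression is evaluated again for every token, and membership tests on a list are linear. A possible variant converts both collections to sets once. The helper name filter_tokens and the short base_stopwords list below are hypothetical, chosen only so the sketch runs without NLTK data.

```python
def filter_tokens(tokens, stop_words, keep=()):
    """Drop tokens found in stop_words unless they also appear in keep."""
    stop_set = set(stop_words)  # built once, not once per token
    keep_set = set(keep)
    return [t for t in tokens if t not in stop_set or t in keep_set]


# Assumption: a tiny stand-in for stopwords.words("english").
base_stopwords = ["the", "does", "not", "at", "when", "it", "a"]

your_string = "The dog does not bark at the tree when it sees a squirrel"
# Assumption: str.split() stands in for word_tokenize in this sketch.
lower_tokens = [t.lower() for t in your_string.split()]

filtered_words = filter_tokens(lower_tokens, base_stopwords, keep=["not"])
print(filtered_words)
```

The result matches the set-difference approach; which one to use is mostly a matter of whether the keep list changes from call to call.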

Votes: 0

Page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/56316811
