Q: Custom stop words in spaCy are not working

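Stack Overflow user

Asked 2018-04-05 00:58:43

Answers: 1 · Views: 423 · Followers: 0 · Votes: 0

I'm having trouble with spaCy stop words. Any help would be appreciated. I'm loading TED talk transcripts into a pandas DataFrame: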

df['parsed_transcript'] = df['transcript'].apply(nlp)

#making a list of stop words to add
my_stop_words = ["thing", "people", "way", "year", " year " "time",  "lot", "day"]

#adding the list to the stop words
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

#filtering out stop words and all non noun words
def preprocess_texts(texts_as_csv_column):
    # Takes a column from a pandas DataFrame and converts it into a list of noun-lemma lists.
    lemmas = []
    for doc in texts_as_csv_column:
        # Append the lemmas of all nouns that are not stop words
        lemma = [token.lemma_ for token in doc if token.pos_ == 'NOUN' and not token.is_stop]
        lemmas.append(lemma)

    return lemmas

Now, if I count the word "year", its count has dropped by about 4,000, but it still appears more than 8,000 times.

count = 0
for row in df['list_of_words']:
    for word in row:
        if word == "year":
            count +=1

print(count)
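
The same count can be obtained with `collections.Counter`, which also shows how often the other words survive the filter. This is only a minimal sketch: the sample `rows` stand in for `df['list_of_words']` and are invented for illustration.

```python
from collections import Counter

# Invented sample standing in for df['list_of_words'] (one list of lemmas per talk).
rows = [["year", "idea", "year"], ["school", "year"], ["idea"]]

# Count every word across all rows in one pass.
counts = Counter(word for row in rows for word in row)

print(counts["year"])         # 3
print(counts.most_common(2))  # [('year', 3), ('idea', 2)]
```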

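Some tokens are removed completely, some only partially, and some not at all. I have tried adding trailing and leading whitespace, but it did not help. Any idea what I might be doing wrong? Thanks.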
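
One possible cause, which this thread does not confirm: `is_stop` is a flag on the lexeme for the token's surface text, while the filter emits `token.lemma_`. Under that assumption, a token such as "years" is not marked as a stop word yet still contributes the lemma "year". The sketch below illustrates the idea in plain Python, with hypothetical (surface form, lemma) pairs standing in for spaCy tokens; `CUSTOM_STOP_LEMMAS` and `keep_lemmas` are names made up for this example.

```python
# Hypothetical stand-ins for spaCy tokens: (surface text, lemma) pairs.
tokens = [("years", "year"), ("year", "year"), ("talks", "talk"), ("Time", "time")]

# Assumed custom stop list, stored as lowercase lemmas.
CUSTOM_STOP_LEMMAS = {"year", "time"}

def keep_lemmas(pairs, stop_words, check_lemma):
    """Return lemmas whose surface form (or lemma, if check_lemma) is not a stop word."""
    kept = []
    for text, lemma in pairs:
        key = lemma if check_lemma else text
        if key not in stop_words:
            kept.append(lemma)
    return kept

# Checking only the surface form lets "years" and "Time" through.
print(keep_lemmas(tokens, CUSTOM_STOP_LEMMAS, check_lemma=False))  # ['year', 'talk', 'time']
# Checking the lemma removes every inflected or capitalized variant.
print(keep_lemmas(tokens, CUSTOM_STOP_LEMMAS, check_lemma=True))   # ['talk']
```

In spaCy terms this would correspond to adding a condition such as `token.lemma_ not in my_stop_words` to the list comprehension. Treat it as a suggestion to test, not as a fix confirmed in this thread.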

spacy

stop-words

1 Answer

Stack Overflow user

Answered 2018-04-07 07:00:39

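The code looks correct, except that `year` appears twice in `my_stop_words` and there is no comma between the second instance and `time`, so the two adjacent strings are interpreted as the single entry `" year time"`.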
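
The missing comma matters because Python joins adjacent string literals into one string. A minimal sketch of the effect, using the list from the question (the name `fixed_stop_words` is introduced here purely for illustration):

```python
# Adjacent string literals are concatenated, so the missing comma
# silently merges two entries into one.
my_stop_words = ["thing", "people", "way", "year", " year " "time", "lot", "day"]

print(len(my_stop_words))        # 7 entries, not 8
print(repr(my_stop_words[4]))    # ' year time'
print("time" in my_stop_words)   # False

# With the comma restored and the padded duplicate dropped
# (fixed_stop_words is a hypothetical name, not from the original post):
fixed_stop_words = ["thing", "people", "way", "year", "time", "lot", "day"]
print("time" in fixed_stop_words)  # True
```

As a result, `"time"` is never registered as a stop word in the original loop, while the lexeme `" year time"` is flagged instead.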

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/49656529
