I'm running into trouble with spaCy stop words; any help would be appreciated. I'm loading TED talk transcripts into a pandas DataFrame:
df['parsed_transcript'] = df['transcript'].apply(nlp)
# making a list of stop words to add
my_stop_words = ["thing", "people", "way", "year", " year " "time", "lot", "day"]

# adding the list to the stop words
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True
# filtering out stop words and all non-noun words
def preprocess_texts(texts_as_csv_column):
    """Takes a column from a pandas DataFrame and converts it into a list of nouns."""
    lemmas = []
    for doc in texts_as_csv_column:
        # Append the lemmas of all nouns that are not stop words
        lemma = [token.lemma_ for token in doc if token.pos_ == 'NOUN' and not token.is_stop]
        lemmas.append(lemma)
    return lemmas

Now, if I count the word "year", it drops by about 4,000 occurrences, but it still appears more than 8,000 times.
count = 0
for row in df['list_of_words']:
    for word in row:
        if word == "year":
            count += 1
print(count)

Some tokens are removed completely, some only partially, and some not at all. I've tried adding leading and trailing whitespace, but that didn't help. Do you know what I might be doing wrong? Thanks.
Posted on 2018-04-07 07:00:39
The code looks correct, except that you have year in my_stop_words twice, and there is no comma between the second instance and time. Python concatenates the adjacent string literals, so they are registered as the single stop word " year time", which never matches anything in the documents.
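For illustration, here is a minimal sketch of what the missing comma does (adjacent string literals are joined at parse time), along with the corrected list:

```python
# Adjacent string literals are concatenated by Python at parse time,
# so " year " "time" silently becomes the single element " year time".
buggy = ["thing", "people", "way", "year", " year " "time", "lot", "day"]
print(buggy)
# ['thing', 'people', 'way', 'year', ' year time', 'lot', 'day']

# The intended list: each word is its own element, with no duplicate "year"
# and no stray whitespace.
my_stop_words = ["thing", "people", "way", "year", "time", "lot", "day"]
```

With the corrected list, every entry maps to a real lexeme in nlp.vocab, so is_stop is set on words that can actually occur as tokens.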
https://stackoverflow.com/questions/49656529