I have a DataFrame with an object column and more than 100,000 rows, like this:
    df['words']
 0 the
 1 to
 2 of
 3 a
 4 with
 5 as
 6 job
 7 mobil
 8 market
 9 think
 10 ...

Desired output, without the stopwords:
   df['words']
 0 way
 1 http
 2 internet
 3 car
 4 do
 5 want
 6 work
 7 uber
 8 ...

Is there a way to strip the common stopwords from a single column using gensim, spacy, or nltk?
I tried:
from gensim.parsing.preprocessing import remove_stopwords
stopwords.words('english')
df['words'] = df['words'].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(" ".join(x)))
but this raises:
TypeError: can only join an iterable
Posted on 2021-05-31 17:19:01
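(For reference: gensim's remove_stopwords operates on a plain string, so the " ".join(x) is unnecessary when each cell already holds one string, and the TypeError typically means some cells are not iterable at all, e.g. NaN floats. A minimal sketch of the corrected gensim approach, using a hypothetical sample that mirrors the question's one-word-per-row column:)

import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords

# Hypothetical sample mirroring the question's column.
df = pd.DataFrame({'words': ['the', 'way', 'to', 'internet']})

# remove_stopwords expects a plain string, so no " ".join is needed.
df['words'] = df['words'].apply(remove_stopwords)

# Stopword-only cells come back as empty strings; drop those rows.
df = df[df['words'] != ''].reset_index(drop=True)
print(df)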
Remove the stopwords with nltk. Import the packages (the stopword corpus must be downloaded once with nltk.download('stopwords')):
import pandas as pd
from nltk.corpus import stopwords
Create the stopword list:
stop_words = stopwords.words('english')
stop_words[:10]
Then,
# Keep, for each row, only the words that are not in stop_words.
df['newword'] = list(map(lambda line: list(filter(lambda word: word not in stop_words, line)), df.words))
df

Source: https://stackoverflow.com/questions/67770734
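One caveat, not in the original answer: the map/filter line assumes each row of df.words already holds a list of tokens; with the question's one-word-per-row column it would iterate over the characters of each string instead. For that shape, a boolean mask over the column is a simpler sketch:

import pandas as pd
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # set membership is O(1) per lookup

# Hypothetical sample mirroring the question's column.
df = pd.DataFrame({'words': ['the', 'way', 'to', 'internet']})

# Keep only the rows whose word is not a stopword.
df = df[~df['words'].isin(stop_words)].reset_index(drop=True)
print(df)

The same mask works with spaCy's stopword set (from spacy.lang.en.stop_words import STOP_WORDS) in place of the nltk list.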