<pre><code>from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
</code></pre>
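In the comprehension above, `stopwords.words('english')` is called again for every word, and it returns a list, so each membership test is a linear scan. A minimal sketch of a common remedy is to build a set once and reuse it. The small `STOP_WORDS` set below is a hand-written stand-in (an assumption, not the NLTK list) so the example runs without the corpus; in real code it would be `set(stopwords.words('english'))`. The function name `remove_stop_words` is hypothetical.

```python
# Stand-in for set(stopwords.words('english')); assumed here so the
# example runs without downloading the NLTK corpus.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "this", "to"}


def remove_stop_words(word_list, stop_words=STOP_WORDS):
    """Return the words not in stop_words, keeping order and duplicates."""
    return [word for word in word_list if word not in stop_words]


word_list = ["this", "is", "a", "list", "of", "sample", "words"]
filtered_words = remove_stop_words(word_list)
print(filtered_words)  # ['list', 'sample', 'words']
```

Set membership is a hash lookup, so the cost no longer grows with the size of the stop word list.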

You could also do a set diff, for example:

<pre><code>list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
</code></pre>
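One thing to keep in mind with the set approach: a set difference drops duplicate words and does not preserve word order. A small sketch of the difference, using a made-up word list and a two-word stand-in for the NLTK stop words (both are assumptions for illustration):

```python
# Hypothetical input and a tiny stand-in for the NLTK stop word list.
words = ["the", "cat", "and", "the", "other", "cat"]
STOP_WORDS = {"the", "and"}

# Set difference: each remaining word appears once, in no guaranteed order.
unique_words = set(words) - STOP_WORDS

# List comprehension: order and repeated words are kept.
ordered_words = [w for w in words if w not in STOP_WORDS]

print(sorted(unique_words))  # ['cat', 'other']
print(ordered_words)         # ['cat', 'other', 'cat']
```

The set version is fine for building a vocabulary; the list version is the safer choice when the result is fed to anything that depends on word frequency or position.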

To exclude all types of stop words, including the NLTK ones, you could do something like this:

<pre><code>from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = set(get_stop_words('en'))        # About 900 stopwords
nltk_words = set(stopwords.words('english'))  # About 150 stopwords
stop_words |= nltk_words                      # merge both lists, dropping duplicates

output = [w for w in word_list if w not in stop_words]
</code></pre>

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

<pre><code>from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_word_list = word_list[:]  # make a copy of the word_list
for word in word_list:  # iterate over the original word_list
    if word in stop_words:
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword
</code></pre>

There is a very simple, lightweight Python package, <code>stop-words</code>, made for exactly this purpose.

First install the package using:
<code>pip install stop-words</code>

Then you can remove the stop words in one line using a list comprehension:

<pre><code>from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

</code></pre>

The package is a very small download (unlike NLTK), works on both <code>Python 2</code> and <code>Python 3</code>, and includes stop words for many other languages, such as:

<pre><code> Arabic
 Bulgarian
 Catalan
 Czech
 Danish
 Dutch
 English
 Finnish
 French
 German
 Hungarian
 Indonesian
 Italian
 Norwegian
 Polish
 Portuguese
 Romanian
 Russian
 Spanish
 Swedish
 Turkish
 Ukrainian
</code></pre>

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

<pre><code>from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stopwords from text
</code></pre>
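Note that this comparison is case-sensitive and that `split()` leaves punctuation attached, so tokens such as `The` or `the,` would not match a lowercase stop word. A hedged sketch of one way around that, normalising each token only for the comparison and keeping the original token in the output. `STOP_WORDS` is a tiny stand-in for the NLTK list and `strip_stop_words` is a hypothetical helper name:

```python
import string

# Tiny stand-in for set(stopwords.words('english')), assumed for illustration.
STOP_WORDS = {"a", "is", "of", "the", "this"}


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words from text, ignoring case and surrounding punctuation."""
    kept = []
    for token in text.split():
        # Normalise only for the lookup; the original token is what we keep.
        key = token.strip(string.punctuation).lower()
        if key not in stop_words:
            kept.append(token)
    return " ".join(kept)


print(strip_stop_words("This is the end of the road."))  # end road.
```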

You can use this function. Note that the words need to be lowercased before they are compared against the stop word list:

<pre class="lang-python prettyprint-override"><code>from nltk.corpus import stopwords

def remove_stopwords(word_list):
    stop_words = set(stopwords.words("english"))  # build the lookup set once
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stop_words:
            processed_word_list.append(word)
    return processed_word_list
</code></pre>

Using <a href="http://book.pythontips.com/en/latest/map_filter.html" rel="nofollow noreferrer">filter</a>:

<pre><code>from nltk.corpus import stopwords
# ... 
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
</code></pre>
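If the lambda feels heavy, the standard library's `itertools.filterfalse` can express the same filter by passing the set's membership test directly. This is only a sketch: `STOP_WORDS` is a small stand-in for the NLTK list and `word_list` is made-up input.

```python
from itertools import filterfalse

# Small stand-in for frozenset(stopwords.words('english')), assumed for illustration.
STOP_WORDS = frozenset({"a", "in", "the"})

word_list = ["a", "crow", "in", "the", "window"]

# filterfalse keeps the items for which the predicate is False,
# i.e. the words that are NOT contained in STOP_WORDS.
filtered_words = list(filterfalse(STOP_WORDS.__contains__, word_list))
print(filtered_words)  # ['crow', 'window']
```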

Although the question is a bit old, here is a newer library worth mentioning that can do some extra tasks.
In some cases you don't just want to remove stop words. Instead, you may want to find the stop words in the text data and store them in a list, so that you can inspect the noise in the data and work with it interactively.
The library is called <code>textfeatures</code>. You can use it as follows:
<pre><code>! pip install textfeatures
import textfeatures as tf
import pandas as pd
</code></pre>
For example, suppose you have the following set of strings:
<pre><code>texts = [
 &quot;blue car and blue window&quot;,
 &quot;black crow in the window&quot;,
 &quot;i see my reflection in the window&quot;]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df
</code></pre>
Now, call the stopwords() function and pass the parameters you want:
<pre><code>tf.stopwords(df,&quot;text&quot;,&quot;stopwords&quot;) # extract stop words
df[[&quot;text&quot;,&quot;stopwords&quot;]].head() # show the text and stopwords columns
</code></pre>
The result is going to be:
<pre><code>    text                                 stopwords
0   blue car and blue window             [and]
1   black crow in the window             [in, the]
2   i see my reflection in the window    [i, my, in, the]
</code></pre>
As you can see, the last column lists the stop words found in each document (record).
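The same idea (collecting the stop words instead of discarding them) can be sketched without any library. This is not how textfeatures is implemented; it is only an illustration, and both the `STOP_WORDS` stand-in and the `find_stop_words` helper are hypothetical names:

```python
# Small stand-in for the NLTK English stop word list, assumed for illustration.
STOP_WORDS = {"and", "i", "in", "my", "the"}

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window",
]


def find_stop_words(text, stop_words=STOP_WORDS):
    """Return the stop words that occur in text, in order of appearance."""
    return [word for word in text.split() if word.lower() in stop_words]


# Pair every text with the stop words found in it.
rows = [(text, find_stop_words(text)) for text in texts]
for text, found in rows:
    print(text, found)
```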

In case your data are stored as a <code>Pandas DataFrame</code>, you can use <code>remove_stopwords</code> from texthero, which uses the NLTK stop word list by <a href="https://texthero.org/docs/api/texthero.preprocessing.remove_stopwords" rel="nofollow noreferrer">default</a>.

<pre class="lang-py prettyprint-override"><code>import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
</code></pre>

<pre><code>from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = &quot;This is a sample sentence, showing off the stop words filtration.&quot;

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# One-line version using a list comprehension
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
</code></pre>

Here is an example.
First, I extract the text data from the data frame (<code>twitter_df</code>) for further processing:
<pre><code>tweetText = twitter_df['text']
</code></pre>
Then I tokenize it as follows:
<pre><code>from nltk.tokenize import word_tokenize
tweetText = tweetText.apply(word_tokenize)
</code></pre>
Then, to remove the stop words:
<pre><code>import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
tweetText = tweetText.apply(lambda x: [word for word in x if word not in stop_words])
tweetText.head()
</code></pre>
I think this will help you.

<pre><code># 1) Build a new list that leaves out the stop words
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
another_list = []
for x in userstring:
    if x not in stop_list:      # compare against the stop list and keep the rest
        another_list.append(x)  # it is also possible to use .remove
for x in another_list:
    print(x, end=' ')

# 2) If you prefer to use .remove
print("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in userstring[:]:         # iterate over a copy so that removing does not skip items
    if x in stop_list:
        userstring.remove(x)
for x in userstring:
    print(x, end=' ')
</code></pre>
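A caveat when using `.remove`: calling it on a list while looping over that same list shifts the remaining items left, so the element right after each removed one is skipped. The sketch below contrasts that with looping over a copy; the sample sentence is made up for illustration.

```python
stop_list = ["a", "an", "the", "in"]

# Removing from the list being iterated: some stop words survive.
buggy = "the the cat in the hat".split(" ")
for x in buggy:
    if x in stop_list:
        buggy.remove(x)
print(buggy)  # ['the', 'cat', 'the', 'hat']

# Iterating over a copy (safe[:]) while removing from the original.
safe = "the the cat in the hat".split(" ")
for x in safe[:]:
    if x in stop_list:
        safe.remove(x)
print(safe)  # ['cat', 'hat']
```

Building a new list, as in the first variant, avoids the problem entirely and is usually the simpler choice.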

So I have a dataset that I would like to remove stop words from using 

<pre><code>stopwords.words('english')
</code></pre>

I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words.
Any help is appreciated.

How to remove stop words using nltk or python
