Take a look at the other tokenizing options that nltk provides <a href="http://www.nltk.org/api/nltk.tokenize.html">here</a>. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

<pre><code>from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
</code></pre>

Output:

<pre><code>['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
</code></pre>
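For a quick check without NLTK, the same pattern can be tried with the standard `re` module. This is a minimal sketch, assuming that `RegexpTokenizer` in its default (non-gap) mode returns what `re.findall` returns for the pattern. The contraction-aware pattern `CONTRACTION_AWARE` is a hypothetical illustration, not something NLTK ships.

```python
import re

text = "Eighty-seven miles to go, yet.  Onward!"

# Same pattern as RegexpTokenizer(r'\w+'): runs of word characters only.
words = re.findall(r"\w+", text)

# Hypothetical variant: allow internal apostrophes so contractions stay whole.
CONTRACTION_AWARE = re.compile(r"\w+(?:'\w+)*")
kept = CONTRACTION_AWARE.findall("I can't do this now, because I'm so tired.")

print(words)
print(kept)
```

The second pattern is one way to avoid splitting `can't` into `can` and `t`, at the cost of a slightly more complex expression.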

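The code below removes all punctuation as well as any token that contains non-alphabetic characters. It is adapted from the NLTK book. Note that contraction pieces such as <code>n't</code> and <code>'m</code> are dropped too, so <code>can't</code> is reduced to <code>ca</code>.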

<a href="http://www.nltk.org/book/ch01.html" rel="noreferrer">http://www.nltk.org/book/ch01.html</a> 

<pre><code>import nltk

s = "I can't do this now, because I'm so tired. Please give me some time. @ sd 4 232"

words = nltk.word_tokenize(s)

# Keep only purely alphabetic tokens, lowercased.
words = [word.lower() for word in words if word.isalpha()]

print(words)
</code></pre>

Output:

<pre><code>['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
</code></pre>
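The loss of `n't` and `'m` comes from `isalpha()`, which rejects any token containing an apostrophe or a digit. Here is a small sketch of the difference between that test and a looser one. The token list is written by hand to resemble a `word_tokenize` result (an assumption), so the snippet runs without NLTK.

```python
# Hand-written tokens, assumed to resemble word_tokenize output.
tokens = ["I", "ca", "n't", "do", ",", "I", "'m", "tired", ".", "@", "4", "232"]

# isalpha() keeps a token only if every character is a letter.
alpha_only = [t.lower() for t in tokens if t.isalpha()]

# Looser test: keep a token if it contains at least one letter or digit.
has_alnum = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

print(alpha_only)
print(has_alnum)
```

The looser test keeps contraction pieces and numbers while still discarding pure punctuation tokens. Which one is appropriate depends on whether numbers count as words for your task.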

As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can then filter out punctuation with filter(). On Python 2, if you have Unicode text, make sure it is a unicode object (not a 'str' encoded with some encoding like 'utf-8').

<pre><code>from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
# Note: `word not in ',-'` is a substring test, so only ',' and '-' tokens are dropped.
print(list(filter(lambda word: word not in ',-', tokens)))
</code></pre>
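One detail worth checking is what `word not in ',-'` actually tests: it is a substring check against the two-character string `',-'`, so a `.` token passes through. The sketch below compares it with set membership against `string.punctuation`. The token list is written by hand and only assumed to resemble the output of `sent_tokenize` plus `word_tokenize`.

```python
import string

# Hand-written token list, assumed to resemble the tokenizer output.
tokens = ["It", "is", "a", "blue", ",", "small", ",", "and",
          "extraordinary", "ball", ".", "Like", "no", "other"]

# Substring test against ',-': drops ',' and '-' tokens, keeps '.'.
substring_filtered = [t for t in tokens if t not in ",-"]

# Set membership against all ASCII punctuation: drops '.' as well.
PUNCT = set(string.punctuation)
set_filtered = [t for t in tokens if t not in PUNCT]

print(substring_filtered)
print(set_filtered)
```

Set membership only removes single-character punctuation tokens. Multi-character tokens such as `...` would need a test over every character of the token.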

I just used the following code, which removed all the punctuation:

<pre><code>import nltk

# `raw` is assumed to hold the input text as a string.
tokens = nltk.wordpunct_tokenize(raw)

type(tokens)

text = nltk.Text(tokens)

type(text) 

words = [w.lower() for w in text if w.isalpha()]
</code></pre>

I think you need some sort of regular expression matching (the following code is in Python 3):

<pre class="lang-py prettyprint-override"><code>import string
import re
import nltk

s = "I can't do this now, because I'm so tired. Please give me some time."
l = nltk.word_tokenize(s)
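# re.escape keeps characters such as ']', '\' and '^' literal inside the character class.
ll = [x for x in l if not re.fullmatch('[' + re.escape(string.punctuation) + ']+', x)]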
print(l)
print(ll)
</code></pre>

Output:

<pre><code>['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']
</code></pre>

This should work well in most cases: it removes tokens that consist only of punctuation while preserving tokens like "n't", which regex tokenizers such as <code>wordpunct_tokenize</code> cannot produce.
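To see why a regex tokenizer cannot yield `n't`, the splitting can be reproduced with `re.findall`. This is a sketch that assumes `wordpunct_tokenize` uses the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which is my understanding of NLTK's `WordPunctTokenizer`.

```python
import re

# Pattern assumed to match NLTK's WordPunctTokenizer.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORDPUNCT.findall("I can't do this now, because I'm so tired.")

# Filtering with isalpha() afterwards leaves stray 't' and 'm' fragments.
words = [t for t in tokens if t.isalpha()]

print(tokens)
print(words)
```

The apostrophe becomes its own token, so the contraction is split into three pieces before any filtering happens.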

I use this code to remove punctuation:

<pre><code>import nltk
def getTerms(sentences):
 tokens = nltk.word_tokenize(sentences)
 words = [w.lower() for w in tokens if w.isalnum()]
 print(tokens)
 print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&amp;&amp;B ")
</code></pre>

If you also want to check whether a token is a valid English word, you may need <a href="http://pythonhosted.org/pyenchant/" rel="noreferrer">PyEnchant</a>.

Tutorial:

<pre><code>import enchant

d = enchant.Dict("en_US")  # load the US English dictionary
d.check("Hello")    # is the word in the dictionary?
d.check("Helo")
d.suggest("Helo")   # list of suggested corrections
</code></pre>

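Adding to the solution by @rmalouf: the pattern below leaves out digits and underscores, because <code>\w</code> is equivalent to <code>[a-zA-Z0-9_]</code> for ASCII text, whereas <code>[a-zA-Z]+</code> matches runs of letters only. The trailing <code>+</code> matters: without it, every letter becomes its own token.

<pre><code>from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')
</code></pre>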
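The effect of each pattern can be checked with plain `re.findall`. This is a small sketch with a made-up sample string, again assuming the tokenizer behaves like `findall` on its pattern.

```python
import re

text = "Flat 3B costs 4_20 or 389"  # made-up sample

word_chars = re.findall(r"\w+", text)          # letters, digits and underscore
letters_only = re.findall(r"[a-zA-Z]+", text)  # runs of ASCII letters
single_letters = re.findall(r"[a-zA-Z]", "to go")  # no '+': one letter per token

print(word_chars)
print(letters_only)
print(single_letters)
```

Note that the letters-only pattern also cuts mixed tokens apart: `3B` yields just `B`.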

You can do this in one line without NLTK (Python 3.x):

<pre><code>import string
# Delete every ASCII punctuation character from string_text.
string_text = string_text.translate(str.maketrans('', '', string.punctuation))
</code></pre>
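Deleting punctuation outright glues neighbouring characters together, which may or may not be what you want. The sketch below contrasts deletion with mapping each punctuation character to a space. The sample sentence is made up for illustration.

```python
import string

text = "Eighty-seven miles to go, can't stop at 3.89!"

# Deleting punctuation joins the pieces on either side of it.
deleted = text.translate(str.maketrans("", "", string.punctuation))

# Alternative: map each punctuation character to a space, then split.
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = text.translate(to_space).split()

print(deleted)
print(spaced)
```

Neither variant is right for every case: deletion turns `3.89` into `389`, while the space mapping breaks `can't` into `can` and `t`.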

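Remove the punctuation before tokenizing. The code below deletes every Unicode punctuation character, including <code>.</code>, so decimal numbers such as 3.89 become 389. The sample output further down also appears to have had stop words removed, a step that is not shown here.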

<pre><code>import sys
import unicodedata
from nltk.tokenize import word_tokenize
tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))  # punctuation code points
text_string = text_string.translate(tbl)  # text_string no longer contains punctuation
w = word_tokenize(text_string)  # now tokenize the string
</code></pre>

Sample Input/Output:

<pre><code>direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni
</code></pre>

<code>['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']
</code>
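Building a table over `range(sys.maxunicode)` visits more than a million code points. A per-character check gives the same result for a given string and is sketched below. The helper name `strip_unicode_punct` and the sample text are hypothetical. Note that Unicode category `P` covers punctuation only, so symbols such as `+` (category `Sm`) and `$` (category `Sc`) are kept.

```python
import unicodedata

def strip_unicode_punct(text):
    """Drop every character whose Unicode category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

# Curly quotes, an em dash, an ellipsis and a percent sign are all category P*.
sample = "\u201cHello\u201d \u2014 world\u2026 50% + $5"
cleaned = strip_unicode_punct(sample)

print(cleaned.split())
```

Unlike `string.punctuation`, this approach also handles non-ASCII punctuation such as curly quotes and dashes.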

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use <code>nltk.word_tokenize()</code>, I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also <code>word_tokenize</code> doesn't work with multiple sentences: dots are added to the last word.

How to get rid of punctuation using NLTK tokenizer?

Python
