I am using Python 3.7 64-bit with NLTK 3.4.5.
When I try to convert text6 from nltk.book into tokens using word_tokenize, I get an error.
import nltk
from nltk.tokenize import word_tokenize
from nltk.book import *
tokens=word_tokenize(text6)
The code was run in IDLE 3.7.
Below is the error I get when executing the last statement.
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
tokens=word_tokenize(text6)
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\__init__.py", line 144, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
return tokenizer.tokenize(text)
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1277, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1331, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1331, in <listcomp>
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1321, in span_tokenize
for sl in slices:
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1362, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 318, in _pair_iter
prev = next(it)
File "C:\Users\admin\AppData\Local\Programs\Python\Python37\lib\site-packages\nltk\tokenize\punkt.py", line 1335, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
Please help. Thanks in advance.
While troubleshooting, I created a sample nltk.text.Text object and tried to tokenize it with nltk.word_tokenize, but I got the same error. See the session below.
However, nltk.word_tokenize() works fine when called on a plain string.
>>> tt="Python is a programming language"
>>> tokens2=nltk.word_tokenize(tt) #Not throwing error
>>> type(tt)
<class 'str'>
>>> type(text6)
<class 'nltk.text.Text'>
>>>
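The type check above points at the cause: nltk.text.Text is a wrapper around a list of tokens, while word_tokenize() expects a string (hence the "expected string or bytes-like object" TypeError). A minimal sketch of both ways around this, using a small hand-built Text object in place of text6 (which requires the downloaded book corpora):

```python
import nltk
from nltk.text import Text

# Hypothetical stand-in for text6: a Text built from a token list.
t = Text(["Python", "is", "a", "programming", "language"])

# A Text is already a sequence of tokens, so no tokenization is needed:
tokens = list(t)

# word_tokenize() expects a str; passing a Text raises the TypeError above.
# To re-tokenize, first join the tokens back into a single string:
sentence = " ".join(t)  # then: nltk.word_tokenize(sentence)
```

In other words, text6 is already tokenized; `list(text6)` (or `text6.tokens`) gives you the tokens directly, and calling word_tokenize on it is unnecessary.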
Posted on 2021-10-06 16:03:23
Check the nltk data folder, and the location where nltk expects to find it.
Posted on 2020-07-07 11:30:02
Try running:
nltk.download('punkt')
https://stackoverflow.com/questions/61041217