This is the code I am trying, but it is producing an error.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
file_content = open("Dictionary.txt").read()
tokens = nltk.word_tokenize(file_content)
# sent_tokenize uses an instance of PunktSentenceTokenizer
# from the nltk.tokenize.punkt module
tokenized = sent_tokenize(tokens)
for i in tokenized:
    # word_tokenize finds the words and punctuation in a string
    wordsList = nltk.word_tokenize(i)
    # removing stop words from wordsList
    wordsList = [w for w in wordsList if not w in stop_words]
    # Using a tagger (a part-of-speech tagger, or POS-tagger)
    tagged = nltk.pos_tag(wordsList)
    print(tagged)
Error:
Traceback (most recent call last):
  File "tag.py", line 12, in <module>
    tokenized = sent_tokenize(tokens)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/__init__.py", line 105, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1269, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1323, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1313, in span_tokenize
    for sl in slices:
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1354, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 317, in _pair_iter
    prev = next(it)
  File "/home/mahadev/anaconda3/lib/python3.7/site-packages/nltk/tokenize/punkt.py", line 1327, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object
Posted on 2019-02-28 09:19:48
Not sure what your code is supposed to do, but the error you are getting comes from the data type of the tokens variable: sent_tokenize expects a string, and it received a list instead.
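For example (a minimal sketch, not from the original answer), passing any list to sent_tokenize reproduces the same error:

from nltk.tokenize import sent_tokenize

# A list instead of a string triggers the TypeError seen in the traceback,
# because the Punkt tokenizer runs a regular expression over its input and
# Python's re module only accepts str or bytes.
sent_tokenize(["some", "tokens"])  # TypeError: expected string or bytes-like object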
You should change that line to:
tokens = str(nltk.word_tokenize(file_content))
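Note that str(...) only makes the call run because it turns the token list into one long string (the list's repr). A commonly used alternative, shown here as a sketch and not part of the original answer, is to sentence-tokenize the raw file contents and then word-tokenize each sentence:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))
file_content = open("Dictionary.txt").read()

# Split the raw text into sentences first, then each sentence into words,
# drop stop words, and POS-tag what remains.
for sentence in sent_tokenize(file_content):
    words = [w for w in word_tokenize(sentence) if w not in stop_words]
    print(nltk.pos_tag(words))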
https://stackoverflow.com/questions/54920358