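I need to normalize text from the Italian wiki using python3 and nltk, and I have a problem. Most words are fine, but some are mapped incorrectly, more precisely, some characters.
For example:
"水果\xE3", "n\xE2\xba", "citt\xe3"
I am sure the problem lies in characters such as à and è.
Code: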
# coding: utf8
import os
from nltk import corpus, word_tokenize, ConditionalFreqDist

it_sw_plus = corpus.stopwords.words('italian') + ['doc', 'https']
#it_folder_names = ['AA', 'AB', 'AC', 'AD', 'AE', 'AF']
it_path = os.listdir('C:\\Users\\1\\projects\\i')
it_corpora = []

def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw_plus and token.isalpha():
            token = token.lower().encode('utf8')
            norm_tokens.append(token)
    return norm_tokens

for folder_name in it_path:
    path_to_files = 'C:\\Users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path)
        raw_text = text_file.read().decode('utf8')
        norm_tokens = normalize(raw_text)
        it_corpora.append(norm_tokens)

print(it_corpora)
How can I solve this problem? I am running Win7 (rus).
When I try this code:
import io
with open('C:\\Users\\1\\projects\\i\\AA\\wiki_00', 'r', encoding='utf8') as fin:
    for line in fin:
        print (line)
In PowerShell:
<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">
Armonium
Traceback (most recent call last):
File "i.py", line 5, in <module>
print (line)
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to <undefined>
In the Python command line:
<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">
Armonium
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\1\projects\i.py", line 5, in <module>
print (line)
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
3: character maps to <undefined>
When I try the request:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
90: character maps to <undefined>
Posted on 2016-01-04 08:46:25
If you know the encoding, try specifying it when reading the file. In python2:
import io
with io.open(filename, 'r', encoding='latin-1') as fin:
    for line in fin:
        print line # line is decoded from latin-1
But in your case, the file you posted is not a latin1 file but a utf8 file. In python3:
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/GiteItAwayNow/TrueTry/master/it'
>>> response = urllib.request.urlopen(url)
>>> data = response.read()
>>> text = data.decode('utf8')
>>> print (text) # this prints the file perfectly.
To read a utf8 file in python2:
import io
with io.open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is decoded from utf8
To read a utf8 file in python3:
with open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is decoded from utf8
As good practice, use unicode and python3 whenever possible when processing text data.
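Applied to the normalize function from the question, this means keeping tokens as str and never calling .encode('utf8') on them; in python3 that call returns bytes, which print as escapes such as b'citt\xc3\xa0'. The following is only a minimal sketch under assumptions: re.findall stands in for nltk's word_tokenize, STOP_WORDS is a small hypothetical stand-in for the Italian stop word list, and tokens are lowercased before the stop word check (the question's code checks first).

```python
import re

# Hypothetical stand-in for corpus.stopwords.words('italian') + ['doc', 'https'].
STOP_WORDS = {"la", "è", "una", "doc", "https"}

def normalize(raw_text, stop_words=STOP_WORDS):
    """Return lowercased alphabetic tokens that are not stop words, as str."""
    # re.findall is an assumption standing in for nltk.word_tokenize.
    tokens = re.findall(r"\w+", raw_text)
    norm_tokens = []
    for token in tokens:
        token = token.lower()
        if token.isalpha() and token not in stop_words:
            norm_tokens.append(token)  # no .encode(): the token stays text
    return norm_tokens

# ascii() escapes non-ASCII characters, so this prints on any console.
print(ascii(normalize("La città è una Repubblica")))

# Encoding turns the text into bytes, which is what the question's code stored.
print("città".encode("utf8"))
```

With the real tokenizer and stop word list swapped back in, the only change from the question's function would be dropping .encode('utf8') and reading the files with open(file_path, encoding='utf8') instead of calling .decode('utf8') afterwards.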
Also, if you have not yet installed this module for printing utf8 on the Windows console, you should try it:
pip install win-unicode-console
Or download this: console-0.4.zip, and then run python setup.py install
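If installing a module is not an option, another way to avoid the UnicodeEncodeError shown in the tracebacks is to replace the characters the console code page cannot represent before printing. This is a rough sketch, not part of the original answer: console_safe is a hypothetical helper, and cp866 is passed explicitly only because it is the code page named in the tracebacks.

```python
import sys

def console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    # Fall back to the console's own encoding, or ASCII if it is unknown.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

# '\u2013' (en dash) and '\u00e0' (à) are not in cp866, so both become '?'.
print(console_safe("Armonium \u2013 citt\u00e0", encoding="cp866"))
```

This loses the original characters in the printed output, so it is only useful for inspecting data on the console; the text in memory and in any utf8 output file stays intact.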
https://stackoverflow.com/questions/34594768