Currently I am using praw to pull comments from various subreddits on Reddit, compute their sentiment, and add them to a database. It works by reading from a file that contains subreddit names, so it knows which subreddit to pull comments from.
with open('subs.txt') as f:
    for line in f:
        string = line.strip()
        for submission in reddit.subreddit(string).hot(limit=10):
            subreddit = reddit.subreddit(line.strip())
            name = str(subreddit.display_name)
            comments = submission.comments.list()
            for c in comments:
                if isinstance(c, MoreComments):
                    continue
                #print c.body
                author = c.author
                score = c.score
                created_at = c.created_utc
                upvotes = c.ups
                #print c.score
                comment_sentiment = getSentiment(c.body)
                subreddit_sentiment += comment_sentiment
                num_comments += 1

What I have implemented so far works fine until it reaches a certain comment, at which point it throws the following error message:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: unexpected end of data

I have looked at plenty of other questions where people ran into the same problem, but the solutions given there don't seem to help in my case.
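For context: under Python 2, calling .encode() on a byte string first implicitly decodes it with the default codec, and errors='ignore' only applies to the encode step, not to that hidden decode. A minimal sketch of that pitfall (illustrative data, not my real comments), assuming the same sys.setdefaultencoding('utf8') hack used in sentiment_analysis.py (see the edit below):

# -*- coding: utf-8 -*-
# Minimal Python 2 sketch of the implicit-decode pitfall.
import sys
reload(sys)
sys.setdefaultencoding('utf8')  # same hack as in sentiment_analysis.py

truncated = 'caf\xc3\xa9 \xf0\x9f\x98'  # bytes of "café " plus a cut-off 4-byte emoji
# .encode() on a byte string decodes it first with the default codec ('utf8' here);
# errors='ignore' only applies to the later encode step, so the decode still fails.
truncated.encode('utf-8', errors='ignore')
# -> UnicodeDecodeError: 'utf8' codec can't decode bytes ...: unexpected end of data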
The full stack trace is below:
Traceback (most recent call last):
  File "extract.py", line 48, in <module>
    comment_sentiment = getSentiment(c.body)
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 93, in getSentiment
    tagged_sentences = makeTag(pos_tag_text, max_key_size, dictionary)
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 106, in makeTag
    return [addTag(sentence, max_key_size, dictionary) for sentence in postagged_sentences]
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 119, in addTag
    expression_word = ' '.join([word[0] for word in sentence[i:j]]).lower().encode('utf-8',errors='ignore')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: unexpected end of data

I have been racking my brain trying to figure out how to fix this and, unfortunately, I'm lost. Does this have to do with reading the file that contains the subreddits, or is it some limitation of pulling data with praw? I have tried to isolate the problem, but I can't seem to get rid of the error.
Can anyone help me figure this out? I would appreciate any insight. Many thanks.
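For reference, a small sketch of one way to pin down the offending comment (the try/except wrapper is hypothetical, not part of the code above):

# Hypothetical debugging wrapper around the sentiment call from the first snippet,
# to log exactly which comment body triggers the decode failure.
try:
    comment_sentiment = getSentiment(c.body)
except UnicodeDecodeError:
    # repr() shows the raw characters/bytes without re-triggering the error
    print 'failed on comment %s in r/%s: %r' % (c.id, name, c.body)
    raise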
EDIT: sentiment_analysis.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import pandas as pd
import nltk
import yaml
import sys
import os
import re
# splitting the text initially
def splitString(text):
    nltk_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
    nltk_tokenizer = nltk.tokenize.TreebankWordTokenizer()
    sentences = nltk_splitter.tokenize(text)
    tokenized_sentences = [nltk_tokenizer.tokenize(sentence) for sentence in sentences]
    return tokenized_sentences
def tagWords(sentence, max_key_size, dictionary, tag_stem=False):
    # Tag all possible sentences
    tagged_sentence = []
    length = len(sentence)
    if max_key_size == 0:
        max_key_size = length
    i = 0
    while (i < length):
        j = min(i + max_key_size, length)
        tagged = False
        while (j > i):
            expression_word = ' '.join([word[0] for word in sentence[i:j]]).lower().encode('utf-8', errors='ignore')  # here is where it gets caught
            expression_stem = ' '.join([word[1] for word in sentence[i:j]]).lower().encode('utf-8', errors='ignore')
            if tag_stem == True:
                word = expression_word
            else:
                word = expression_word
            ....

Posted on 2018-04-29 00:14:36
Try encoding the string explicitly:
c.body.encode('utf-8')

https://stackoverflow.com/questions/50082157
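A slightly fuller sketch of that idea (illustrative only; it assumes c.body may arrive as either unicode or raw bytes under Python 2):

# Normalise the comment body to unicode before it reaches getSentiment,
# dropping any malformed or truncated byte sequences along the way.
body = c.body
if isinstance(body, str):  # raw bytes in Python 2
    body = body.decode('utf-8', errors='ignore')
comment_sentiment = getSentiment(body)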