Currently I am using praw to pull comments from various subreddits on Reddit, compute their sentiment, and add them to a database. It works by reading from a file that contains subreddit names, so it knows which subreddit to pull comments from.
with open('subs.txt') as f:
    for line in f:
        string = line.strip()
        for submission in reddit.subreddit(string).hot(limit=10):
            subreddit = reddit.subreddit(line.strip())
            name = str(subreddit.display_name)
            comments = submission.comments.list()
            for c in comments:
                if isinstance(c, MoreComments):
                    continue
                #print c.body
                author = c.author
                score = c.score
                created_at = c.created_utc
                upvotes = c.ups
                #print c.score
                comment_sentiment = getSentiment(c.body)
                subreddit_sentiment += comment_sentiment
                num_comments += 1

What I have implemented so far works fine until it reaches a certain comment, at which point it throws the following error message:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: unexpected end of data

I have looked at plenty of other questions where people ran into the same problem, but the solutions given there don't seem to help in my case.
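For context: under Python 2, calling .encode() on a byte string first implicitly decodes it with the default codec, and errors='ignore' only applies to the encode step, not to that hidden decode. A minimal sketch of that pitfall (illustrative data, not my real comments), assuming the same sys.setdefaultencoding('utf8') hack used in sentiment_analysis.py (see the edit below):

# -*- coding: utf-8 -*-
# Minimal Python 2 sketch of the implicit-decode pitfall.
import sys
reload(sys)
sys.setdefaultencoding('utf8')  # same hack as in sentiment_analysis.py

truncated = 'caf\xc3\xa9 \xf0\x9f\x98'  # bytes of "café " plus a cut-off 4-byte emoji
# .encode() on a byte string decodes it first with the default codec ('utf8' here);
# errors='ignore' only applies to the later encode step, so the decode still fails.
truncated.encode('utf-8', errors='ignore')
# -> UnicodeDecodeError: 'utf8' codec can't decode bytes ...: unexpected end of data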
The full stack trace is below:
Traceback (most recent call last):
  File "extract.py", line 48, in <module>
    comment_sentiment = getSentiment(c.body)
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 93, in getSentiment
    tagged_sentences = makeTag(pos_tag_text, max_key_size, dictionary)
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 106, in makeTag
    return [addTag(sentence, max_key_size, dictionary) for sentence in postagged_sentences]
  File "/Users/b38/Desktop/FlaskApp/sentiment_analysis.py", line 119, in addTag
    expression_word = ' '.join([word[0] for word in sentence[i:j]]).lower().encode('utf-8',errors='ignore')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: unexpected end of data

I have been racking my brain trying to figure out how to fix this and, unfortunately, I'm lost. Does this have to do with reading the file that contains the subreddits, or is it some limitation of pulling data with praw? I have tried to isolate the problem, but I can't seem to get rid of the error.
Can anyone help me figure this out? I would appreciate any insight. Many thanks.
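For reference, a small sketch of one way to pin down the offending comment (the try/except wrapper is hypothetical, not part of the code above):

# Hypothetical debugging wrapper around the sentiment call from the first snippet,
# to log exactly which comment body triggers the decode failure.
try:
    comment_sentiment = getSentiment(c.body)
except UnicodeDecodeError:
    # repr() shows the raw characters/bytes without re-triggering the error
    print 'failed on comment %s in r/%s: %r' % (c.id, name, c.body)
    raise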
EDIT: sentiment_analysis.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import pandas as pd
import nltk
import yaml
import sys
import os
import re
# splitting the text initially
def splitString(text):
    nltk_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
    nltk_tokenizer = nltk.tokenize.TreebankWordTokenizer()
    sentences = nltk_splitter.tokenize(text)
    tokenized_sentences = [nltk_tokenizer.tokenize(sentence) for sentence in sentences]
    return tokenized_sentences
def tagWords(sentence, max_key_size, dictionary, tag_stem=False):
    # Tag all possible sentences
    tagged_sentence = []
    length = len(sentence)
    if max_key_size == 0:
        max_key_size = length
    i = 0
    while (i < length):
        j = min(i + max_key_size, length)
        tagged = False
        while (j > i):
            expression_word = ' '.join([word[0] for word in sentence[i:j]]).lower().encode('utf-8', errors='ignore')  # here is where it gets caught
            expression_stem = ' '.join([word[1] for word in sentence[i:j]]).lower().encode('utf-8', errors='ignore')
            if tag_stem == True:
                word = expression_word
            else:
                word = expression_word
            ....

Posted on 2018-04-29 00:14:36
Try encoding the string explicitly:
c.body.encode('utf-8')

https://stackoverflow.com/questions/50082157
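A slightly fuller sketch of that idea (illustrative only; it assumes c.body may arrive as either unicode or raw bytes under Python 2):

# Normalise the comment body to unicode before it reaches getSentiment,
# dropping any malformed or truncated byte sequences along the way.
body = c.body
if isinstance(body, str):  # raw bytes in Python 2
    body = body.decode('utf-8', errors='ignore')
comment_sentiment = getSentiment(body)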