Q: How to catch a UnicodeDecodeError caused by an invalid continuation byte in MySQL data

Stack Overflow user

Asked 2018-07-15 19:36:11

1 answer, 1.5K views, 0 followers, 1 vote

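I am moving tens of millions of rows of text data from MySQL to a search engine, but I cannot handle a Unicode error in one of the retrieved strings. I tried explicitly encoding and decoding the retrieved strings so that Python would throw a Unicode exception and show me where the problem lies.

The exception is thrown after tens of millions of rows have been processed on my laptop (sigh...), but I cannot catch it, skip that row and move on, which is what I want. All text in the MySQL database should be UTF-8.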

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte

Here is the connection I set up using MySQL Connector/Python:

cnx = mysql.connector.connect(user='root', password='<redacted>',
                          host='127.0.0.1',
                          database='bloggz',
                          charset='utf-8')

Here are the database character settings:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR 
Variable_name LIKE 'collation%';

+--------------------------+-----------------+
| Variable_name            | Value           |
+--------------------------+-----------------+
| character_set_client     | utf8            |
| character_set_connection | utf8            |
| character_set_database   | utf8            |
| character_set_filesystem | binary          |
| character_set_results    | utf8            |
| character_set_server     | utf8            |
| character_set_system     | utf8            |
| collation_connection     | utf8_general_ci |
| collation_database       | utf8_general_ci |
| collation_server         | utf8_general_ci |
+--------------------------+-----------------+

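What is wrong with my exception handling below? Note that the variable "last_feeds_id" is not printed either, but that probably just shows that the except clauses are not being triggered.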

last_feeds_id = 0
for feedsid, ts, url, bid, title, html in cursor:

  try:
    # to catch UnicodeErrors and see where the problem lies
    # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
    # also see https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error

    # feeds.URL is varchar(255) in mysql
    enc_url = url.encode(encoding = 'UTF-8',errors = 'strict')
    dec_url = enc_url.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.title is varchar(600) in mysql
    enc_title = title.encode(encoding = 'UTF-8',errors = 'strict')
    dec_title = enc_title.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.html is text in mysql
    enc_html = html.encode(encoding = 'UTF-8',errors = 'strict')
    dec_html = enc_html.decode(encoding = 'UTF-8',errors = 'strict')

    data = {"timestamp":ts,
            "url":dec_url,
           "bid":bid,
           "title":dec_title,
           "html":dec_html}
    es.index(index="blogposts",
            doc_type="blogpost",
            body=data)
  except UnicodeDecodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeEncodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

Tags: mysql, python-3.x, utf-8, mysql-python, unicode-string

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-07-16 00:56:35

It is complaining about hex ED. Were you expecting an i with an acute accent: í? If so, the text you have is not encoded as UTF-8 but as one of cp1250, dec8, latin1, latin2, latin5.
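The point about hex ED can be checked directly in Python. The following is a minimal sketch, assuming a made-up sample value (`b"Garc\xeda"`, not taken from the asker's data): in UTF-8, 0xED announces a three-byte sequence, so a following ASCII byte is rejected as an invalid continuation byte, while several single-byte encodings map 0xED straight to í. dec8 is left out because Python does not appear to ship a codec under that name.

```python
# Hypothetical sample: "García" with 'í' stored as the single byte 0xED.
bad = b"Garc\xeda"

# Strict UTF-8 decoding fails: 0xED starts a 3-byte sequence,
# but the next byte (0x61, 'a') is not a continuation byte.
try:
    bad.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# The same bytes decode cleanly under these single-byte encodings.
single_byte = ("latin1", "cp1250", "latin2", "latin5")
decoded = {enc: bad.decode(enc) for enc in single_byte}

# For comparison, this is how 'í' looks when it really is UTF-8.
proper_utf8 = "í".encode("utf-8")

print(utf8_error)
print(decoded)
print(proper_utf8)
```

If a value decodes sensibly under one of these encodings but not under UTF-8, that is a hint (not proof) that it was written to the table in that encoding.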

Does your Python source code start with the following line?

# -*- coding: utf-8 -*-

See also: "Best practice".

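You have charset='utf-8'; I'm not sure, but perhaps that should be charset='utf8'. UTF-8 is what the rest of the world calls the character set. MySQL calls its 3-byte subset utf8. Note the absence of a dash.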
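On the question's original goal of skipping the offending row, here is one possible explanation, offered as an assumption rather than a confirmed diagnosis: if the driver decodes text columns while a row is being fetched, the UnicodeDecodeError is raised by the `for` statement itself, outside the `try` block, so none of the `except` clauses ever sees it. The sketch below uses a hypothetical `FakeCursor` class as a stand-in for a real cursor and shows the pattern of calling `next()` inside the `try`. The names `FakeCursor` and `fetch_skipping_bad_rows` are illustrative only.

```python
class FakeCursor:
    """Hypothetical stand-in for a DB cursor that decodes rows while fetching."""

    def __init__(self, raw_rows):
        self._rows = iter(raw_rows)

    def __iter__(self):
        return self

    def __next__(self):
        # The raw row is consumed first, so a failed decode does not
        # block the following rows. StopIteration ends the iteration.
        row_id, raw_title = next(self._rows)
        # Decoding happens here, inside the iteration step itself.
        return row_id, raw_title.decode("utf-8")


def fetch_skipping_bad_rows(cursor):
    """Return (good_rows, errors), skipping rows that fail to decode."""
    good, errors = [], []
    last_good_id = 0
    rows = iter(cursor)
    while True:
        try:
            row = next(rows)  # the fetch now happens inside the try
        except StopIteration:
            break
        except UnicodeDecodeError as exc:
            # Record the last id that was read successfully and move on.
            errors.append((last_good_id, str(exc)))
            continue
        last_good_id = row[0]
        good.append(row)
    return good, errors


# Made-up sample rows: the second one contains a bare 0xED byte.
raw = [(1, b"ok"), (2, b"Garc\xeda"), (3, "título".encode("utf-8"))]
good, errors = fetch_skipping_bad_rows(FakeCursor(raw))
print(good)
print(errors)
```

Whether a real cursor can keep iterating after such an error depends on the driver. An alternative worth testing is to fetch undecoded bytes (MySQL Connector/Python accepts `cnx.cursor(raw=True)` for this) and decode each column yourself inside the `try`.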

Votes: -1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/51347997
