Question: Python 3.4: removing or ignoring emoji when writing to a file

Stack Overflow user

Asked 2014-05-19 17:37:24

Answers: 2 | Views: 3.5K | Followers: 0 | Votes: 0

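I am trying to parse an XML file and write its contents to a plain text file. The program works until it reaches an emoji character, at which point Python raises the following error: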

UnicodeEncodeError: 'charmap' codec can't encode characters in position 177-181: character maps to <undefined>

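I went to the position reported in the error and found emoji characters at that point in the XML file.

My question is how to encode them as unicode, or how to remove or ignore them entirely when writing to the file.

When I print() the text to the console it comes out perfectly, but writing it to a file raises the error.

I have searched Google and this site, but the only answer I find is that the characters are already encoded as unicode. Is what I am seeing the literal text? I am not sure I am putting that correctly.

For reference, the XML file I am working with has the following format: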

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<?xml-stylesheet type="text/xsl" href="sms.xsl"?>
<smses count="1">
  <sms protocol="0" address="+00000000000" date="1346772606199" type="1" subject="null" body="Lorem ipsum dolor sit amet, consectetur adipisicing elit," toa="null" sc_toa="null" service_center="+00000000000" read="1" status="-1" locked="0" date_sent="1346772343000" readable_date="Sep 4, 2012 10:30:06 AM" contact_name="John Doe" />
</smses>

python

xml

unicode

emoji

2 Answers

Stack Overflow user

Accepted answer

Posted 2014-05-19 17:54:11

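You have two options:

1. Pick an encoding that can handle the Emoji codepoints. You have either opened your file for writing with the default codec (which depends on your system), or picked an explicit encoding that does not support these codepoints. Any UTF encoding handles them just fine; I picked UTF-8 here:

    with open(filename, 'w', encoding='utf8') as outfile:
        outfile.write(yourdata)

2. Set an error handling mode that replaces codepoints the codec cannot handle with a replacement character or an escape sequence, or ignores them altogether. See the errors parameter of the open() function: errors is an optional string that specifies how encoding and decoding errors are to be handled; this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are: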

- `'strict'` to raise a `ValueError` exception if there is an encoding error. The default value of `None` has the same effect.
- `'ignore'` ignores errors. Note that ignoring encoding errors can lead to data loss.
- `'replace'` causes a replacement marker (such as `'?'`) to be inserted where there is malformed data.
- `'surrogateescape'` will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the `surrogateescape` error handler is used when writing data. This is useful for processing files in an unknown encoding.
- `'xmlcharrefreplace'` is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference `&#nnn;`.
- `'backslashreplace'` (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.

So opening the file with errors='ignore' silently skips the Emoji codepoints instead of raising an error:

    with open(filename, 'w', errors='ignore') as outfile:
        outfile.write(yourdata)

Demo:

>>> a_ok = 'The U+1F44C OK HAND SIGN codepoint: \U0001F44C'
>>> print(a_ok)
The U+1F44C OK HAND SIGN codepoint: 👌
>>> a_ok.encode('utf8')
b'The U+1F44C OK HAND SIGN codepoint: \xf0\x9f\x91\x8c'
>>> a_ok.encode('cp1251', errors='ignore')
b'The U+1F44C OK HAND SIGN codepoint: '
>>> a_ok.encode('cp1251', errors='replace')
b'The U+1F44C OK HAND SIGN codepoint: ?'
>>> a_ok.encode('cp1251', errors='xmlcharrefreplace')
b'The U+1F44C OK HAND SIGN codepoint: &#128076;'
>>> a_ok.encode('cp1251', errors='backslashreplace')
b'The U+1F44C OK HAND SIGN codepoint: \\U0001f44c'

Note that the 'surrogateescape' option has limited space and is really only useful for decoding a file whose encoding is unknown; it cannot handle Emoji in any case.
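As a rough end-to-end illustration of both options, the sketch below writes the same string to temporary files and reads the raw bytes back. The helper names `write_text` and `read_bytes`, the file names, and the choice of `cp1251` as the "limited" codec are assumptions made for the example, not part of the original answer.

```python
import os
import tempfile

text = 'OK hand: \U0001F44C'


def write_text(path, data, encoding, errors='strict'):
    """Write data to path in text mode with the given codec and error handler."""
    with open(path, 'w', encoding=encoding, errors=errors) as outfile:
        outfile.write(data)


def read_bytes(path):
    """Return the raw bytes stored in path."""
    with open(path, 'rb') as infile:
        return infile.read()


tmpdir = tempfile.mkdtemp()
utf8_path = os.path.join(tmpdir, 'utf8.txt')
ignored_path = os.path.join(tmpdir, 'ignored.txt')
escaped_path = os.path.join(tmpdir, 'escaped.txt')
strict_path = os.path.join(tmpdir, 'strict.txt')

# Option 1: a codec that can represent the emoji.
write_text(utf8_path, text, 'utf8')

# Option 2: a limited codec plus an error handler.
write_text(ignored_path, text, 'cp1251', errors='ignore')
write_text(escaped_path, text, 'cp1251', errors='xmlcharrefreplace')

# With the default 'strict' handler the limited codec fails, as in the question.
try:
    write_text(strict_path, text, 'cp1251')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

utf8_bytes = read_bytes(utf8_path)
ignored_bytes = read_bytes(ignored_path)
escaped_bytes = read_bytes(escaped_path)
```

If the emoji should survive in the output, the UTF-8 variant is the one to use; the error handlers only decide how the unencodable characters are dropped or escaped.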

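Votes: 4

Stack Overflow user

Posted 2014-05-19 17:44:33

(Edit: this answer applies to Python 2.x, not Python 3.x.)

At the moment you are writing a unicode string to the file using the default encoding, which does not support emoji (or, for that matter, a great many other characters you might actually want). You can write with the UTF-8 encoding instead, which supports all unicode characters.

Instead of doing file.write( data ), try file.write( data.encode("utf-8") ).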
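For readers on Python 3, a small sketch of why this suggestion does not carry over unchanged: a file opened in text mode accepts only str, so manually encoded bytes need a file opened in binary mode. The variable names and the temporary file are assumptions made for the example.

```python
import os
import tempfile

data = 'OK hand: \U0001F44C'
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# In Python 3, writing bytes to a text-mode file raises TypeError.
try:
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(data.encode('utf-8'))
    text_mode_accepts_bytes = True
except TypeError:
    text_mode_accepts_bytes = False

# Manual encoding works when the file is opened in binary mode.
with open(path, 'wb') as outfile:
    outfile.write(data.encode('utf-8'))

# Reading back as UTF-8 text recovers the original string.
with open(path, encoding='utf-8') as infile:
    round_tripped = infile.read()
```

Passing encoding='utf-8' to open() in text mode, as in the accepted answer, avoids the manual encode step entirely.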

Votes: -1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/23743878
