Q: Remove non-Unicode characters from a file

Stack Overflow user

Asked 2018-03-24 16:55:01

Answers: 1 · Views: 241 · Followers: 0 · Votes: 0

I know this is a duplicate question, but I have tried hard with every solution I could find so far. Can anyone help me strip suckers like \xc3\xa2\xc2\x84\xc2\xa2 out of a file?

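The content of the file I am currently struggling to clean is: b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving']"""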

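I tried using re.sub('^\x00-\x7F+', '', whatevertext), but it does not seem to change anything. I suspect the backslash sequence here is not being treated as a special character.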
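One likely reason a non-ASCII filter has no effect can be sketched as follows. This assumes the file holds the text of the escape sequences (a backslash, an `x`, and two hex digits) rather than the raw bytes, and that the intended pattern was the common `[^\x00-\x7F]+` form. The sketch is written for Python 3, and `sample` is an illustrative excerpt, not the whole file.

```python
import re

# Assumed sample: the file holds the *text* of escape sequences (a backslash,
# an "x", and two hex digits), not the raw bytes they would represent.
sample = r"Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips"

# Every character in the sample is plain ASCII ...
all_ascii = all(ord(ch) < 128 for ch in sample)

# ... so a "strip everything outside ASCII" pattern finds nothing to remove.
stripped = re.sub(r"[^\x00-\x7F]+", "", sample)

print(all_ascii)           # True
print(stripped == sample)  # True
```

Under that assumption, the text has to be matched as literal backslash sequences instead of as non-ASCII characters.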

non-unicode

python-2.7

ascii

non-ascii-characters

python-unicode

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-03-24 17:12:53

You can do it like this:

>>> f = open("test.txt", "r")
>>> whatevertext = f.read()
>>> f.close()
>>> print whatevertext
b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving']"""

>>> import re
>>> # Remove each literal "\x" followed by hex digits (this is text, not real bytes)
>>> result = re.sub('\\\\x[a-f0-9]+', '', whatevertext)
>>> print result
b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves Potato Chips for serving']"""

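In this regular expression, each backslash is escaped with another backslash: the string literal '\\\\x' reaches the regex engine as \\x, which matches a literal backslash followed by x. After the x, we know the characters are hex digits, which can be the digits 0-9 or the letters a-f.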
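A slightly stricter variant of the same idea is sketched below for Python 3. It uses a raw string so the backslash only needs escaping once, and it matches exactly two hex digits in either case. The helper name `strip_hex_escapes` and the `tricky` input are hypothetical and only illustrate the difference from the `+` quantifier.

```python
import re

# A literal backslash, an "x", then exactly two hex digits (either case).
HEX_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")


def strip_hex_escapes(text):
    # Hypothetical helper: delete every literal \xNN sequence from text.
    return HEX_ESCAPE.sub("", text)


line = r"Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving"
print(strip_hex_escapes(line))  # Cape Cod Waves Potato Chips for serving

# With "+", hex-looking letters right after the escape are swallowed as well.
tricky = r"Waves\xa2deal"
greedy = re.sub(r"\\x[a-f0-9]+", "", tricky)
print(greedy)                     # Wavesl
print(strip_hex_escapes(tricky))  # Wavesdeal
```

Limiting the match to two digits keeps ordinary words that happen to start with a-f intact when they directly follow an escape.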

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/49467359
