Fixing Unicode character errors in Python scripts

Author: 华科云商小徐

Published: 2024-03-12 10:16:22

Included in the column: 小徐学爬虫

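Python uses Unicode to represent characters. Unicode is a character set that assigns a unique number, called a code point, to almost every character in the world's writing systems. So how do you fix the errors that come up when a Python script handles Unicode characters?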
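As a quick illustration of code points, here is a minimal sketch. It assumes Python 3, and the variable names are only illustrative. It takes the degree sign that appears in a file name later in this article and shows its code point and its UTF-8 bytes.

```python
# Python 3 sketch: a character, its code point, and its UTF-8 encoding.
degree = "°"                          # the character that shows up later in a file name

code_point = ord(degree)              # integer code point, U+00B0
utf8_bytes = degree.encode("utf-8")   # the same character as UTF-8 bytes

print(hex(code_point))                # 0xb0
print(utf8_bytes)                     # b'\xc2\xb0'
print(chr(code_point) == degree)      # True
```

The first UTF-8 byte, 0xc2, is the byte named in the error message shown in the next section.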

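1. Problem background

While writing a Python script that recursively walks a directory tree, lists every .flac file, extracts the artist, album, and title from the corresponding directory/subdirectory/file name, and writes them to a file, the code raised an error as soon as it hit a Unicode character. The script (Python 2) is: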

import os, glob, re

def scandirs(path):
    for currentFile in glob.glob(os.path.join(path, '*')):
        if os.path.isdir(currentFile):
            scandirs(currentFile)
        if os.path.splitext(currentFile)[1] == ".flac":
            rpath = os.path.relpath(currentFile)
            print "**DEBUG** rpath =", rpath
            title = os.path.basename(currentFile)
            title = re.findall(u'\d\d\s(.*).flac', title, re.U)
            title = title[0].decode("utf8")
            print "**DEBUG** title =", title
            fpath = os.path.split(os.path.dirname(currentFile))
            artist = fpath[0][2:]
            print "**DEBUG** artist =", artist
            album = fpath[1]
            print "**DEBUG** album =", album
            out = "%s | %s | %s | %s\n" % (rpath, artist, album, title)
            flist = open('filelist.tmp', 'a')
            flist.write(out)
            flist.close()

scandirs('./')

# Result

**DEBUG** rpath = Thriftworks/Fader/Thriftworks - Fader - 01 180°.flac
**DEBUG** title = 180°
**DEBUG** artist = Thriftworks
**DEBUG** album = Fader
Traceback (most recent call last):
  File "decflac.py", line 25, in <module>
    scandirs('./')
  File "decflac.py", line 7, in scandirs
    scandirs(currentFile)
  File "decflac.py", line 7, in scandirs
    scandirs(currentFile)
  File "decflac.py", line 20, in scandirs
    out = "%s | %s | %s | %s\n" % (rpath, artist, album, title)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 46: ordinal not in range(128)
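
To see where "byte 0xc2 in position 46" comes from, the decode step can be reproduced explicitly. The sketch below assumes Python 3, which never performs this decode implicitly, so the ASCII decode is called by hand. The variable names are illustrative.

```python
# Python 3 sketch: decode the UTF-8 bytes of the path with the ASCII codec,
# which is what Python 2 does implicitly when str and unicode are mixed.
rpath_bytes = "Thriftworks/Fader/Thriftworks - Fader - 01 180°.flac".encode("utf-8")

try:
    rpath_bytes.decode("ascii")
    failure = None
except UnicodeDecodeError as err:
    failure = err

print(failure.start)                    # 46
print(hex(rpath_bytes[failure.start]))  # 0xc2
print(rpath_bytes.decode("utf-8"))      # decoding as UTF-8 succeeds
```

Position 46 is the degree sign in the file name, so the failing value is rpath and not title.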

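2. Solutions

The error comes from mixing types: title has been decoded to unicode while rpath is still a byte string, so the % formatting makes Python 2 decode rpath implicitly with the ASCII codec, which fails on the non-ASCII byte 0xc2. There are three ways to deal with it:

Upgrade from Python 2.x to Python 3.x. Python 3 strings are Unicode by default, so most of this manual conversion goes away, although files should still be opened with an explicit encoding.
In Python 2.x, use unicode() (or str.decode()) to convert byte strings to unicode as they come in, work with unicode throughout, and use encode() to convert the result to UTF-8 when writing it out.
In Python 2.x, call sys.setdefaultencoding('utf-8') so that implicit conversions use UTF-8. This is generally discouraged: the function is removed from the sys module at startup, so it only works after reload(sys), and it can hide encoding bugs elsewhere in the program.

The script below follows the second approach in Python 2: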

import os, glob, re

def scandirs(path):
    for currentFile in glob.glob(os.path.join(path, '*')):
        if os.path.isdir(currentFile):
            scandirs(currentFile)
        if os.path.splitext(currentFile)[1] == ".flac":
            rpath = os.path.relpath(currentFile)
            print "**DEBUG** rpath =", rpath
            title = os.path.basename(currentFile)
            title = re.findall(u'\d\d\s(.*).flac', title, re.U)
            title = title[0].decode("utf8")
            print "**DEBUG** title =", title
            fpath = os.path.split(os.path.dirname(currentFile))
            artist = fpath[0][2:]
            print "**DEBUG** artist =", artist
            album = fpath[1]
            print "**DEBUG** album =", album
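            # Decode the byte-string parts (assumes UTF-8 file names) so every value is
            # unicode before formatting; mixing str and unicode triggers the ASCII decode
            out = u"%s | %s | %s | %s\n" % (rpath.decode("utf8"), artist.decode("utf8"), album.decode("utf8"), title)
            flist = open('filelist.tmp', 'a')
            flist.write(out.encode('utf-8'))  # encode back to UTF-8 bytes before writing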
            flist.close()

scandirs('./')
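
The second approach, decoding byte strings to text on the way in and encoding once on the way out, carries over to Python 3, where bytes.decode() and str.encode() play the roles of unicode() and encode(). A minimal sketch, assuming the byte strings are UTF-8 and using a hypothetical helper named format_entry:

```python
def format_entry(rpath, artist, album, title, encoding="utf-8"):
    """Build one output line from byte-string parts (hypothetical helper).

    Every part is decoded to text first, so bytes and text are never mixed,
    and the finished line is encoded once for writing.
    """
    parts = [p.decode(encoding) for p in (rpath, artist, album, title)]
    line = "%s | %s | %s | %s\n" % tuple(parts)
    return line.encode(encoding)


entry = format_entry(
    "Thriftworks/Fader/Thriftworks - Fader - 01 180°.flac".encode("utf-8"),
    b"Thriftworks",
    b"Fader",
    "180°".encode("utf-8"),
)
print(entry.decode("utf-8"), end="")
```

Keeping the decode and encode calls at the edges of the program means the formatting code in the middle only ever sees one string type.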

Strings in Python 3 are Unicode by default, so Unicode characters can be used directly. For example, '你好' is a string containing Chinese characters.
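
For comparison, the scanner could be rewritten for Python 3 roughly as follows. This is a sketch under assumptions, not the author's code: the function name scan_flac, the use of os.walk in place of the recursive glob, and the artist/album/file directory layout are all choices made for illustration.

```python
import os
import re

# Two digits, whitespace, then the title, as in the original pattern
# (with the dot before "flac" escaped).
TITLE_RE = re.compile(r"\d\d\s(.*)\.flac$")


def scan_flac(root, out_path):
    """Walk root and append 'rpath | artist | album | title' lines to out_path."""
    # An explicit encoding keeps the output independent of the locale default.
    with open(out_path, "a", encoding="utf-8") as flist:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                match = TITLE_RE.search(name)
                if match is None:
                    continue
                rpath = os.path.relpath(os.path.join(dirpath, name), root)
                # Assumes the layout artist/album/file, as in the example output.
                artist, album = os.path.split(os.path.dirname(rpath))
                flist.write("%s | %s | %s | %s\n" % (rpath, artist, album, match.group(1)))
```

Calling scan_flac('.', 'filelist.tmp') would mirror the original script. No decode() or encode() calls are needed here because file names arrive as str and the file object handles the UTF-8 encoding.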

That is all for today. If anything is still unclear, leave a comment and we can discuss it.

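Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com for removal.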
