I have a question about writing a file.
When I call data.to_csv('/home/bio_kang/Learning/Python/film_project/top250_film_info.csv', index=None, encoding='gbk'), it raises an error: 'gbk' codec can't encode character '\u2022' in position 32: illegal multibyte sequence.
The data comes from https://movie.douban.com/top250. I extracted it from the site with requests, BeautifulSoup and re.
Here is part of my code:
import re
import pandas as pd

# Compile the patterns once, outside the loop.
pattern_uni_num = re.compile(r'<span class="pl">IMDb:</span> (.*?)<br/>')  # unique number
pattern_year = re.compile(r'<span class="year">\((.*?)\)</span>')  # year
pattern_country = re.compile(r'<span class="pl">制片国家/地区:</span>(.*?)<br/>')  # country
pattern_director = re.compile(r'<meta content=(.*?) property="video:director"/>')  # director
pattern_actor = re.compile(r'<meta content="(.*?)" property="video:actor"/>')  # actors
pattern_description = re.compile(r'<meta content="(.*?)property="og:description">')  # description

uni_num = []
years = []
countries = []
directors = []
actors = []
descriptions = []

for i in range(250):
    # Read each saved page as bytes and decode it, dropping invalid UTF-8 sequences.
    with open('/home/bio_kang/Learning/Python/film_project/film_info/film_{}.html'.format(i), 'rb') as f:
        film_info = f.read().decode('utf-8', 'ignore')
    uni_num.append(str(re.findall(pattern_uni_num, film_info)).strip("[]").strip("'"))
    years.append(str(re.findall(pattern_year, film_info)).strip("[]").strip("'"))
    countries.append(str(re.findall(pattern_country, film_info)).strip("[]").strip("'").split('/')[0])
    directors.append(re.findall(pattern_director, film_info))
    actors.append(re.findall(pattern_actor, film_info))
    descriptions.append(str(re.findall(pattern_description, film_info)).strip('[]').strip('\''))

# names, new_director, new_actor, new_votes, scores and urls come from code not shown here.
raw_data = {'encoding':uni_num, 'name':names, 'description':descriptions, 'country':countries, 'director':new_director, 'actor':new_actor, 'vote':new_votes, 'score':scores, 'year':years, 'link':urls }
data = pd.DataFrame(raw_data)
data.to_csv('/home/bio_kang/Learning/Python/film_project/top250_film_info.csv', index=None, encoding='gbk')  # this call raises the error

Posted on 2022-11-23 19:28:40
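Try opening the file in text mode with an explicit encoding (binary mode 'rb' does not accept an encoding argument):
open('...', 'r', encoding='utf-8'), or encoding='utf-16'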
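Since the error is raised by to_csv rather than while reading the HTML files, the encoding chosen for the output file is probably what matters here. Below is a minimal sketch using only the standard library. The sample string and the helper names can_encode and write_rows are made up for illustration and are not from the question. It shows that GBK has no mapping for U+2022, while gb18030 and utf-8-sig can represent it.

```python
import csv
import os
import tempfile


def can_encode(text, encoding):
    """Return True if every character of `text` can be encoded in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_rows(path, rows, encoding="utf-8-sig"):
    """Write rows to a CSV file; utf-8-sig prepends a BOM, which helps Excel detect UTF-8."""
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


# Hypothetical description text containing the bullet character U+2022.
sample = "Leon \u2022 1994"

gbk_ok = can_encode(sample, "gbk")          # GBK cannot represent U+2022
gb18030_ok = can_encode(sample, "gb18030")  # GB18030 covers all of Unicode

# Lossy fallback: characters missing from GBK are replaced with "?".
lossy = sample.encode("gbk", errors="replace").decode("gbk")

# Write with utf-8-sig and read the file back to confirm nothing was lost.
out_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_rows(out_path, [["description"], [sample]])
with open(out_path, encoding="utf-8-sig", newline="") as f:
    round_trip = list(csv.reader(f))
```

With pandas, the same encoding names can be passed to to_csv, for example data.to_csv(path, index=False, encoding='utf-8-sig'), where path stands for the output file. If a GB-family encoding is required, encoding='gb18030' is worth trying instead of 'gbk'.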
https://stackoverflow.com/questions/73540470