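The first pitfall I hit while writing a crawler:
These few lines of code are only meant to fetch the source of the Baidu homepage, yet they kept raising the error in the title. After a lot of searching online, I finally got it fixed.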
import urllib.request
import time
import platform
import os
import sys
import io

def clear():
    '''Clear the screen.'''
    print('A lot of content; paging in 3 seconds')
    time.sleep(3)
    OS = platform.system()
    if (OS == 'Windows'):
        os.system('cls')
    else:
        os.system('clear')

def linkBaidu():
    url = 'http://www.baidu.com'
    try:
        response = urllib.request.urlopen(url, timeout=3)
        result = response.read().decode('utf-8', 'ignore')
        #result = result.encode('GBK','ignore')
    except Exception as e:
        print("Invalid network address")
        sys.exit()
    # open() without encoding= uses the locale default; on a GBK locale
    # fp.write() is where the UnicodeEncodeError is raised
    with open('baidu.txt', 'w') as fp:
        fp.write(result)
    print("Fetched URL : response.geturl() : %s" % response.geturl())
    print("Status code : response.getcode() : %s" % response.getcode())
    print("Response headers : response.info() : %s" % response.info())
    print("The page content has been saved to baidu.txt in the current directory; open it to take a look")

if __name__ == '__main__':
    linkBaidu()
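A minimal sketch of what seems to be going on, runnable without any network access. It assumes the failure comes from the text file being opened with a GBK default encoding (as on a Chinese-language Windows locale). The helper `try_write` and the sample string are hypothetical and only serve the illustration; the copyright sign is used because GBK cannot encode it.

```python
import os
import tempfile

def try_write(text, encoding):
    """Write text to a temporary file; return the error message, or None on success."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'demo.txt')
        try:
            with open(path, 'w', encoding=encoding) as fp:
                fp.write(text)
        except UnicodeEncodeError as e:
            return str(e)
    return None

# Hypothetical stand-in for the decoded page; '\xa9' is the copyright sign
sample = 'copyright \xa9 baidu'

gbk_error = try_write(sample, 'gbk')      # fails: GBK has no mapping for '\xa9'
utf8_error = try_write(sample, 'utf-8')   # succeeds, so None is returned

print(gbk_error)
print(utf8_error)
```

Passing `encoding='gbk'` explicitly just makes the locale default visible, so the same behaviour can be reproduced on any machine.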
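The fix that worked for me: after decoding, re-encode the text as GBK (ignoring characters GBK cannot represent), then convert the resulting bytes to a string before writing:
#Step 1
result = result.encode('GBK','ignore')
#Step 2
fp.write(str(result))  # convert the bytes to a string (str() of bytes is its repr, b'...')
After that, it works.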
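An alternative that is not part of the original fix: passing `encoding='utf-8'` to `open()` should avoid the error without dropping any characters. The sketch below compares the two routes on a short sample string. `save_text`, `load_text`, and the sample text are hypothetical names introduced only for illustration.

```python
import os
import tempfile

def save_text(text, path):
    """Write text as UTF-8, independent of the locale's default encoding."""
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(text)

def load_text(path):
    """Read the file back as UTF-8."""
    with open(path, 'r', encoding='utf-8') as fp:
        return fp.read()

# Hypothetical stand-in for the decoded page; '\xa9' is not encodable in GBK
sample = 'copyright \xa9 baidu'

# The encode-then-str() route: the unencodable character is dropped, and
# str() of a bytes object is its repr, including the b'...' wrapper
as_repr = str(sample.encode('GBK', 'ignore'))

# The explicit-encoding route: the text survives a write/read round trip
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'baidu.txt')
    save_text(sample, path)
    round_trip = load_text(path)

print(as_repr)
print(round_trip == sample)
```

The trade-off is that the file is then UTF-8, so it has to be opened as UTF-8 in whatever editor is used to view it.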
The output is as follows:
Fetched URL : response.geturl() : http://www.baidu.com
Status code : response.getcode() : 200
Response headers : response.info() : Bdpagetype: 1
Bdqid: 0xec36b3870004e4e7
Cache-Control: private
Content-Type: text/html
Cxy_all: baidu+2a11a08485d2a15f9348ad46d5b91ff9
Date: Thu, 28 Mar 2019 13:26:04 GMT
Expires: Thu, 28 Mar 2019 13:25:05 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=FF75FD7466E5ADC2E8EBAB609E90671E:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=FF75FD7466E5ADC2E8EBAB609E90671E; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1553779564; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: delPer=0; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=1429_28777_21084_28771_28724_28557_28697_28584_26350_28519_28626_22158; path=/; domain=.baidu.com
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked

The page content has been saved to baidu.txt in the current directory; open it to take a look

For more answers to this problem, see: https://www.crifan.com/unicodeencodeerror_gbk_codec_can_not_encode_character_in_position_illegal_multibyte_sequence/
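A related side note, as a sketch only: the Content-Type header in the output above carries no charset, so decoding as UTF-8 is a guess. The object returned by `response.info()` is an `email.message.Message` subclass, so its `get_content_charset()` method can be asked first, with a fallback when nothing is declared. `pick_charset` is a hypothetical helper, and the headers below are built by hand so the example runs offline.

```python
from email.message import Message

def pick_charset(headers, default='utf-8'):
    """Return the charset declared in Content-Type, or the default if none is declared."""
    return headers.get_content_charset() or default

# Hand-built headers standing in for response.info()
headers = Message()
headers['Content-Type'] = 'text/html'
without_charset = pick_charset(headers)

headers.replace_header('Content-Type', 'text/html; charset=GBK')
with_charset = pick_charset(headers)

print(without_charset)
print(with_charset)
```

With a live response, the decode step could then read `response.read().decode(pick_charset(response.info()), 'ignore')`.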