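My spider works, but I can't save the body of the page I crawl to an .html file. If I write a test string instead (self.html_file.write('test')), it works fine. I don't know how to convert the tuple to a string.
I am using Python 3.6.
Spider: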
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'
        self.html_file = open(self.path_to_html, 'w')

    def parse(self, response):
        url = response.url
        self.html_file.write(response.body)
        self.html_file.close()
        yield {
            'url': url
        }
Traceback:
Traceback (most recent call last):
  File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders\example.py", line 35, in parse
    self.html_file.write(response.body)
TypeError: write() argument must be str, not bytes
Answered 2017-09-06 13:28:31
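The actual problem is that response.body is bytes, not a string, and a file opened in text mode ('w') only accepts str. You need to convert the bytes to a string first. There are several ways to do this. You can use
self.html_file.write(response.body.decode("utf-8"))
instead of
self.html_file.write(response.body)
You can also use response.text, which Scrapy decodes using the encoding it detected for the response:
self.html_file.write(response.text)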
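The difference between the two file modes can be reproduced without Scrapy. The following is a minimal sketch, not the spider itself: the `body` variable is a made-up stand-in for `response.body`, and the helper names `save_as_text`, `save_as_bytes`, and `text_mode_rejects_bytes` are hypothetical. It shows that a text-mode file rejects bytes, and that either decoding first or opening the file in binary mode ('wb') avoids the error.

```python
import os
import tempfile

# Stand-in for response.body (an assumption for this sketch): raw bytes.
body = "<html><body>héllo</body></html>".encode("utf-8")


def save_as_text(path, raw, encoding="utf-8"):
    """Decode the bytes first, then write them through a text-mode file."""
    with open(path, "w", encoding=encoding) as f:
        f.write(raw.decode(encoding))


def save_as_bytes(path, raw):
    """Write the bytes unchanged through a binary-mode file."""
    with open(path, "wb") as f:
        f.write(raw)


def text_mode_rejects_bytes(path, raw):
    """Return True if writing bytes to a text-mode file raises TypeError."""
    try:
        with open(path, "w") as f:
            f.write(raw)
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        text_path = os.path.join(tmp, "text.html")
        bytes_path = os.path.join(tmp, "bytes.html")
        print("text mode rejects bytes:", text_mode_rejects_bytes(text_path, body))
        save_as_text(text_path, body)
        save_as_bytes(bytes_path, body)
        print("files written:", sorted(os.listdir(tmp)))
```

Writing in binary mode keeps the page exactly as the server sent it, which may be preferable when the page's encoding is not known to be UTF-8; decoding is the better fit when the text will be processed further as a string.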
Answered 2018-07-05 06:05:28
Taking the answer above into account, and using the with statement wherever possible, the example should be rewritten as follows:
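import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding attributes.
        super().__init__(*args, **kwargs)
        # html_path and header_path are defined elsewhere, as in the question.
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        # response.text is already a str, so a text-mode file accepts it.
        with open(self.path_to_html, 'w') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }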
Note, however, that html_file is then only accessible inside the parse method.
https://stackoverflow.com/questions/46067258