I'm trying to scrape data from howlongtobeat.com.
Everything works so far, but I'm running into a problem with the URL.
Here is my code so far:
import csv, re
from bs4 import BeautifulSoup as soup
import requests

flag = False
with open('filename.csv', 'w', newline='') as f:
    write = csv.writer(f)
    for i in range(1, 100):
        s = soup(requests.get(f'https://howlongtobeat.com/game.php?id={i}').text, 'html.parser')
        if not flag:  # write header to file once
            write.writerow(['Name', 'Length'] + [re.sub('[:\n]+', '', d.find('strong').text) for d in s.find_all('div', {'class': 'profile_info'})])
            flag = True
        content = s.find('div', {'class': 'profile_header shadow_text'})
        if content:
            name = content.text
            length = [[li.find('h5').text, li.find('div').text] for li in s.find_all('li', {'class': 'time_100'})]
            stats = [re.sub(r'\n+[\w\s]+:\n+', '', d.text) for d in s.find_all('div', {'class': 'profile_info'})]
My CSV isn't getting filled.
How can I make this work?
Posted on 2018-12-17 10:41:25
Some pages may not contain the expected tags, which is why s.find('div', {'class': 'profile_header shadow_text'}) returns None. Check id=3, for example.
You should check whether find() actually returned something before extracting its text:
content = s.find('div', {'class': 'profile_header shadow_text'})
if content:
    name = content.text
    length = [[li.find('h5').text, li.find('div').text] for li in s.find_all('li', {'class': 'time_100'})]
    stats = [re.sub(r'\n+[\w\s]+:\n+', '', d.text) for d in s.find_all('div', {'class': 'profile_info'})]
Another workaround is to use try/except to skip the pages where something goes wrong:
try:
    name = s.find('div', {'class': 'profile_header shadow_text'}).text
    length = [[li.find('h5').text, li.find('div').text] for li in s.find_all('li', {'class': 'time_100'})]
    stats = [re.sub(r'\n+[\w\s]+:\n+', '', d.text) for d in s.find_all('div', {'class': 'profile_info'})]
except AttributeError:  # find() returned None on this page
    continue
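Either way, the point is that find() returning None must be handled before .text is accessed. A minimal, self-contained sketch of that check, testable without hitting the network (the extract_profile helper and the inline HTML strings are hypothetical, written only to mirror the selectors used above, not real howlongtobeat.com markup):

```python
import re
from bs4 import BeautifulSoup

def extract_profile(html):
    """Parse one game page; return (name, length, stats), or None when
    the expected profile header div is missing (as on some ids)."""
    s = BeautifulSoup(html, 'html.parser')
    content = s.find('div', {'class': 'profile_header shadow_text'})
    if content is None:  # page lacks the expected tag -> caller can skip it
        return None
    name = content.text.strip()
    length = [[li.find('h5').text, li.find('div').text]
              for li in s.find_all('li', {'class': 'time_100'})]
    stats = [re.sub(r'\n+[\w\s]+:\n+', '', d.text)
             for d in s.find_all('div', {'class': 'profile_info'})]
    return name, length, stats

# Hypothetical pages: one with the header div, one without.
good = '<div class="profile_header shadow_text">Some Game</div>'
bad = '<div class="other"></div>'
print(extract_profile(good))  # ('Some Game', [], [])
print(extract_profile(bad))   # None
```

In the loop, a None result plays the same role as the if content / except AttributeError guards above: skip the page and move on instead of crashing on .text.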