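I'm trying to scrape some information from a website, but the output differs from the page's HTML. The content I want from the page is inside
<div class="page-content">, but in my BeautifulSoup object it shows up as:
<div class="page-content loading"></div> with nothing inside the div. I tried searching for the elements I want, but found nothing. I also tried the html5lib and lxml parsers, but that did not change the output. Is the browser running some JavaScript that keeps me from getting the full page HTML, or is it something else? I'm new to this, so any advice would be appreciated.
Here is my script: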
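import requests
from bs4 import BeautifulSoup

URL = 'https://zone4.ca/race/2020-11-08/c91ec8f6/results'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# Finds nothing: the racer rows are not present in the static HTML.
results = soup.find_all("div", class_="racer-row")
print(results)
print(soup)
Answered 2021-03-14 00:43:17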
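Yes, the content is definitely loaded through JavaScript requests. You can either replicate those requests (headers, payload, ...) and send them yourself with the requests library, or (preferably, in my opinion) use a browser automation driver such as Selenium to scrape the dynamic page.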
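One way to confirm that diagnosis before reaching for Selenium is to check whether the placeholder div in the raw response really is empty. The sketch below uses only the standard library; the names DivTextCollector and placeholder_is_empty are hypothetical helpers made up for illustration, and the sample markup is taken from the question rather than from a live request.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects text found inside <div> elements carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside a matching div
        self.found = False  # True once a matching div has been seen
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # Nested div inside the matching one: track it so the
            # closing tags are balanced correctly.
            self.depth += 1
        elif self.css_class in classes:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def placeholder_is_empty(html_doc, css_class="page-content"):
    """Return True if a div with css_class exists but contains no text."""
    parser = DivTextCollector(css_class)
    parser.feed(html_doc)
    parser.close()
    return bool(parser.found and not "".join(parser.text).strip())


if __name__ == "__main__":
    # Markup as reported in the question (static response).
    print(placeholder_is_empty('<div class="page-content loading"></div>'))
```

If this returns True for the text of requests.get(...) while the browser shows content in the same div, the data is most likely filled in by a script after the page loads.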
Answered 2021-03-14 01:06:04
The data is loaded dynamically via JavaScript. However, you can use this script to construct the Ajax request and parse some of the data:
import re
import json
import requests
from datetime import datetime, timezone

url = 'https://zone4.ca/race/2020-11-08/c91ec8f6/results/'
html_doc = requests.get(url).text

# The page embeds its parameters as a JavaScript object passed to callback(...).
# Swap single quotes for double quotes, then quote the bare keys to get valid JSON.
data = re.search(r'callback\((\{.*\})\)', html_doc, flags=re.S).group(1).replace("'", '"')
data = json.loads(re.sub(r'([^\s]+):', r'"\1":', data))

# Fill the extracted parameters into the JSON endpoint the page itself calls.
data_url = "https://zone4.ca/public/data/race.json?url={url}&page={page}&channel_id={channelID}&channel_class=StandardRace&entity_id={entityID}"
feed = requests.get(data_url.format(**data)).json()

# uncomment this to print all data:
# print(json.dumps(feed, indent=4))

for racer in feed['tree']['_child_racers']:
    print(racer['first_name'][0], racer['last_name'][0])
    for t in racer['_child_timedentitys']:
        for i in range(1, 12):
            time = t.get('time_{}_list'.format(i))
            if not time:
                continue
            # Lap times are stored as microseconds since the Unix epoch.
            dtobj = datetime.fromtimestamp(time[0][0] / 1_000_000, timezone.utc)
            print('\tLap {}: {}'.format(i, dtobj))
Prints:
Tim Shea
    Lap 1: 2020-11-08 14:40:54.611000+00:00
    Lap 2: 2020-11-08 14:45:17.259000+00:00
    Lap 3: 2020-11-08 14:49:48.259000+00:00
    Lap 4: 2020-11-08 14:54:18.778000+00:00
    Lap 5: 2020-11-08 14:58:52.099000+00:00
    Lap 6: 2020-11-08 15:03:17.700000+00:00
    Lap 7: 2020-11-08 15:07:44.818000+00:00
    Lap 8: 2020-11-08 15:12:18.896000+00:00
    Lap 9: 2020-11-08 15:16:52.010000+00:00
    Lap 10: 2020-11-08 15:21:18.897000+00:00
    Lap 11: 2020-11-08 15:25:55.058000+00:00
Zachary Steinman
    Lap 1: 2020-11-08 14:41:32.912000+00:00
    Lap 2: 2020-11-08 14:46:29.458000+00:00
    Lap 3: 2020-11-08 14:51:29.970000+00:00
    Lap 4: 2020-11-08 14:56:30.875000+00:00
    Lap 5: 2020-11-08 15:01:40.057000+00:00
    Lap 6: 2020-11-08 15:06:47.620000+00:00
    Lap 7: 2020-11-08 15:11:58.790000+00:00
    Lap 8: 2020-11-08 15:17:09.099000+00:00
    Lap 9: 2020-11-08 15:22:14.819000+00:00
    Lap 10: 2020-11-08 15:27:19.859000+00:00
Kent Williams
    Lap 1: 2020-11-08 14:42:40.399000+00:00
    Lap 2: 2020-11-08 14:48:33.714000+00:00
...and so on.
https://stackoverflow.com/questions/66615780
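The two least obvious steps in the script above are turning the JavaScript object literal into JSON and converting the raw lap values into datetimes. The sketch below isolates both so they can be tried without a network request. SAMPLE_HTML is a hypothetical stand-in for the page source, and its values are made up; extract_params and lap_time are illustrative names. Note that the key-quoting regex only works when each key is preceded by whitespace (so the opening brace must be followed by a newline or space) and when no value contains a colon. lap_time uses integer timedelta arithmetic instead of the float division in the script, which avoids any rounding in the microsecond field.

```python
import json
import re
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for the inline script in the page source;
# the real page's values and layout may differ.
SAMPLE_HTML = """
<script>
callback({
    url: 'race/2020-11-08/c91ec8f6',
    page: 'results',
    channelID: 123,
    entityID: 456
})
</script>
"""


def extract_params(html_doc):
    """Pull the object passed to callback(...) out of the page as a dict."""
    match = re.search(r'callback\((\{.*\})\)', html_doc, flags=re.S)
    if match is None:
        raise ValueError("no callback({...}) call found in the page")
    # JSON requires double-quoted strings.
    raw = match.group(1).replace("'", '"')
    # Wrap each bare key (a run of non-whitespace before a colon) in quotes.
    quoted = re.sub(r'([^\s]+):', r'"\1":', raw)
    return json.loads(quoted)


EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def lap_time(microseconds):
    """Convert microseconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(microseconds=microseconds)


if __name__ == "__main__":
    print(extract_params(SAMPLE_HTML))
    print(lap_time(1604846454611000))
```

If the embedded object ever contains URLs or other values with colons, a dedicated JavaScript-object parser would be a safer choice than the regex.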