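I want to scrape multiple tables from a dynamic web page, https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18 . I have tried the code below but get the following error. I would like to get the output shown at the bottom.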
df = pd.DataFrame()
driver = webdriver.Chrome('/Users/alau/Downloads/chromedriver')
driver.get('https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18')
res = driver.execute_script('return document.documentElement.outerHTML')
time.sleep(3)
driver.quit()
soup = BeautifulSoup(res, 'lxml')
tables = soup.find_all('table', {'class':'bigborder'})
subheads = soup.find_all('td', {'class':'subheader'}).text.replace('\n','!')

def tableDataText(tables):
    rows = []
    trs = tables.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')] # header row
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')]) # data row
    return rows

result_table = tableDataText(bt_table)
df = pd.DataFrame(result_table[1:], columns=result_table[0])
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Output
Answered 2020-09-24 16:18:18
You have to send a POST request with the anti-bot cookie to get the HTML in the response.
Here is how to do that with BeautifulSoup:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Anti-bot cookie sent along with the POST request
cookies = {
    "BotMitigationCookie_9518109003995423458": "381951001600933518cRI6X6LoZp9tUD7Ls04ETZpx41s=",
}
url = "https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18"
response = requests.post(url, cookies=cookies).text
# Keep the parsed document itself; the tables are selected inside get_data().
# (Calling find_all() here as well would leave a ResultSet, which has no find_all().)
soup = BeautifulSoup(response, "html.parser")

columns = [
    "Horse", "Jockey", "Trainer", "Draw", "Gear", "LBW",
    "Running Position", "Time", "Result", "Comment",
]

def get_data():
    # Yield one list of cell texts per data row, across every results table
    for table in soup.find_all("table", {"class": "bigborder"}):
        for tr in table.find_all("tr", {"bgcolor": "#eeeeee"}):
            yield [
                i.find("font").getText().strip().replace(";", "")
                for i in tr.find_all("td")
            ]

df = pd.DataFrame(list(get_data()), columns=columns)
df.to_csv("data.csv", index=False)
This writes the extracted rows to data.csv.
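As for the AttributeError in the question: find_all() returns a ResultSet, which behaves like a list of Tag objects, so find_all() cannot be called on it directly. Below is a minimal sketch of the question's helper applied to one table at a time. It assumes a small inline HTML sample in place of the real page; SAMPLE_HTML, the jockey names and the function name table_data_text are illustrative and do not come from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two small tables sharing one class.
SAMPLE_HTML = """
<table class="bigborder">
  <tr><th>Horse</th><th>Jockey</th></tr>
  <tr><td>LARSON (D199)</td><td>Rider A</td></tr>
</table>
<table class="bigborder">
  <tr><th>Horse</th><th>Jockey</th></tr>
  <tr><td>GOOD DAYS (A333)</td><td>Rider B</td></tr>
</table>
"""

def table_data_text(table):
    """Return the rows of ONE <table> Tag as lists of cell text."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])  # header and data cells alike
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tables = soup.find_all("table", {"class": "bigborder"})  # a list-like ResultSet

# tables.find_all("tr") would raise the AttributeError from the question;
# iterate over the ResultSet and call the helper once per table instead.
all_rows = [table_data_text(table) for table in tables]
```

Each entry of all_rows then holds one table's header row followed by its data rows, so it could be turned into a DataFrame per table in the way the question intended.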
Answered 2020-09-24 17:49:27
import pandas as pd
import requests

# Anti-bot cookie sent along with the POST request
cookies = {
    'BotMitigationCookie_9518109003995423458': '343775001600940465b2KTzJpwY5pXpiVNIRRi97Z3ELk='
}

def main(url):
    r = requests.post(url, cookies=cookies)
    # One DataFrame per table with class "bigborder"; first row used as header
    df = pd.read_html(r.content, header=0, attrs={'class': 'bigborder'})
    # Stack all tables into a single frame with a fresh running index
    new = pd.concat(df, ignore_index=True)
    print(new)
    new.to_csv("data.csv", index=False)

main("https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18")
Output: view-online
Horse ... Comment
0 LARSON (D199) ... Being freshened up; led all the way to score.
1 PRIVATE ROCKET (C367) ... Sat behind the leader; ran on comfortably.
2 WIND N GRASS (D197) ... Slightly slow to begin; made progress under a ...
3 VOYAGE WARRIOR (C247) ... In 2nd position; slightly weakened late.
4 BEAUTY RUSH (C475) ... Bounded on jumping; settled midfield.
.. ... ... ...
59 BUNDLE OF DELIGHT (D236) ... Raced along the rail; ran on OK when persuaded.
60 GOOD DAYS (A333) ... Hit the line well when clear at 300m.
61 YOU HAVE MY WORD (V149) ... Well tested in the Straight; moved better than...
62 PLIKCLONE (D003) ... Average to begin; raced under his own steam.
63 REEVE'S MUNTJAC (C174) ... The stayer raced under his own steam to stretc...
[64 rows x 10 columns]
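pd.read_html returns a list of DataFrames, one per matching table, and pd.concat with ignore_index=True stacks them under a single running index, which is why the rows above are numbered 0 to 63. Here is a small sketch of that step, with two hand-built frames standing in for the parsed tables. The Draw values are placeholders, not data from the site.

```python
import pandas as pd

# Hypothetical stand-ins for two tables as pd.read_html might return them.
columns = ["Horse", "Draw"]
trial_1 = pd.DataFrame(
    [["LARSON (D199)", 1], ["PRIVATE ROCKET (C367)", 2]], columns=columns
)
trial_2 = pd.DataFrame([["GOOD DAYS (A333)", 1]], columns=columns)

# Without ignore_index=True each frame would keep its own 0-based index,
# so the combined frame would contain duplicate index labels.
combined = pd.concat([trial_1, trial_2], ignore_index=True)
print(combined)
```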
https://stackoverflow.com/questions/64041582