I tried the script below to scrape a table from a web page.
import requests  # missing from the original snippet; required for requests.get
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue'
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, 'html.parser')
soup.find('tbody')
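However, the table cannot be pulled. Even the pd.read_html line does not work. Is there a reason for this?
Answered 2022-08-22 22:26:06
The required table data is inside an HTML comment. Once the comment markers are removed, the table data can be extracted with pandas alone.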
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue'
# Strip the comment markers so the hidden table becomes ordinary markup
res = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(res, 'lxml')
table = soup.select_one('#div_results')
df = pd.read_html(str(table))[0]
d = df.droplevel(0, axis=1)
print(d)
Output:
G Date Day School Unnamed: 4_level_1 Opponent ... Diff W L T Streak Notes
0 19 2019-10-05 Sat Penn State (12) NaN Purdue ... 28 15 3 1 W 9 NaN
1 18 2016-10-29 Sat Penn State (24) @ Purdue ... 38 14 3 1 W 8 NaN
2 17 2013-11-16 Sat Penn State NaN Purdue ... 24 13 3 1 W 7 NaN
3 16 2012-11-03 Sat Penn State @ Purdue ... 25 12 3 1 W 6 NaN
4 15 2011-10-15 Sat Penn State NaN Purdue ... 5 11 3 1 W 5 NaN
5 14 2008-10-04 Sat Penn State (6) @ Purdue ... 14 10 3 1 W 4 NaN
6 13 2007-11-03 Sat Penn State NaN Purdue ... 7 9 3 1 W 3 NaN
7 12 2006-10-28 Sat Penn State @ Purdue ... 12 8 3 1 W 2 NaN
8 11 2005-10-29 Sat Penn State (11) NaN Purdue ... 18 7 3 1 W 1 NaN
9 10 2004-10-09 Sat Penn State NaN Purdue (9) ... -7 6 3 1 L 2 NaN
10 9 2003-10-11 Sat Penn State @ Purdue (18) ... -14 6 2 1 L 1 NaN
11 8 2000-09-30 Sat Penn State NaN Purdue (22) ... 2 6 1 1 W 6 NaN
12 7 1999-10-23 Sat Penn State (2) @ Purdue (16) ... 6 5 1 1 W 5 NaN
13 6 1998-10-17 Sat Penn State (12) NaN Purdue ... 18 4 1 1 W 4 NaN
14 5 1997-11-15 Sat Penn State (6) @ Purdue (19) ... 25 3 1 1 W 3 NaN
15 4 1996-10-12 Sat Penn State (10) NaN Purdue ... 17 2 1 1 W 2 NaN
16 3 1995-10-14 Sat Penn State (20) @ Purdue ... 3 1 1 1 W 1 NaN
17 2 1952-09-27 Sat Penn State NaN Purdue ... 0 0 1 1 T 1 NaN
18 1 1951-11-03 Sat Penn State @ Purdue ... -28 0 1 0 L 1 NaN
[19 rows x 16 columns]
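To see why stripping the comment markers helps, here is a minimal offline sketch. It uses only Python's standard-library `html.parser` rather than BeautifulSoup or pandas, and the `SAMPLE` fragment, the `TagCounter` class and the `count` helper are made up for illustration; they are not taken from the real page. The sketch counts how many `<table>` start tags the parser reports before and after the markers are removed.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the table is wrapped in an HTML comment,
# mimicking the structure described in the answer above.
SAMPLE = """
<div id="div_results">
<!--
<table><tbody><tr><td>Penn State</td><td>Purdue</td></tr></tbody></table>
-->
</div>
"""


class TagCounter(HTMLParser):
    """Counts <table> start tags and comment nodes reported by the parser."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.comments = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1

    def handle_comment(self, data):
        # The whole commented-out table arrives here as one opaque string.
        self.comments += 1


def count(html):
    """Return (number of <table> tags, number of comments) found in html."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.tables, counter.comments


before = count(SAMPLE)
after = count(SAMPLE.replace("<!--", "").replace("-->", ""))
print(before)  # (0, 1): no table tag, one comment
print(after)   # (1, 0): the table is now ordinary markup
```

A blanket `replace` also uncomments every other comment on the page, which is fine for a quick scrape but may expose markup the site author meant to keep disabled.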
Answered 2022-08-22 22:08:39
The <table> is stored inside an HTML comment (<!-- -->), so BeautifulSoup normally does not see it. To parse it, you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment
url = "https://www.sports-reference.com/cfb/play-index/rivals.cgi?request=1&school_id=penn-state&opp_id=purdue"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
}
pageTree = requests.get(url, headers=headers)
soup = BeautifulSoup(pageTree.content, "html.parser")
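# Collect only the comment nodes and let pandas parse the HTML they contain
df = pd.read_html("\n".join(soup.find_all(string=lambda t: isinstance(t, Comment))))[0]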
df = df.droplevel(0, axis=1)
print(df)
Prints:
G Date Day School Unnamed: 4_level_1 Opponent Conf Unnamed: 7_level_1 Pts Opp Diff W L T Streak Notes
0 19 2019-10-05 Sat Penn State (12) NaN Purdue Big Ten W 35 7 28 15 3 1 W 9 NaN
1 18 2016-10-29 Sat Penn State (24) @ Purdue Big Ten W 62 24 38 14 3 1 W 8 NaN
2 17 2013-11-16 Sat Penn State NaN Purdue Big Ten W 45 21 24 13 3 1 W 7 NaN
3 16 2012-11-03 Sat Penn State @ Purdue Big Ten W 34 9 25 12 3 1 W 6 NaN
4 15 2011-10-15 Sat Penn State NaN Purdue Big Ten W 23 18 5 11 3 1 W 5 NaN
5 14 2008-10-04 Sat Penn State (6) @ Purdue Big Ten W 20 6 14 10 3 1 W 4 NaN
6 13 2007-11-03 Sat Penn State NaN Purdue Big Ten W 26 19 7 9 3 1 W 3 NaN
7 12 2006-10-28 Sat Penn State @ Purdue Big Ten W 12 0 12 8 3 1 W 2 NaN
8 11 2005-10-29 Sat Penn State (11) NaN Purdue Big Ten W 33 15 18 7 3 1 W 1 NaN
9 10 2004-10-09 Sat Penn State NaN Purdue (9) Big Ten L 13 20 -7 6 3 1 L 2 NaN
10 9 2003-10-11 Sat Penn State @ Purdue (18) Big Ten L 14 28 -14 6 2 1 L 1 NaN
11 8 2000-09-30 Sat Penn State NaN Purdue (22) Big Ten W 22 20 2 6 1 1 W 6 NaN
12 7 1999-10-23 Sat Penn State (2) @ Purdue (16) Big Ten W 31 25 6 5 1 1 W 5 NaN
13 6 1998-10-17 Sat Penn State (12) NaN Purdue Big Ten W 31 13 18 4 1 1 W 4 NaN
14 5 1997-11-15 Sat Penn State (6) @ Purdue (19) Big Ten W 42 17 25 3 1 1 W 3 NaN
15 4 1996-10-12 Sat Penn State (10) NaN Purdue Big Ten W 31 14 17 2 1 1 W 2 NaN
16 3 1995-10-14 Sat Penn State (20) @ Purdue Big Ten W 26 23 3 1 1 1 W 1 NaN
17 2 1952-09-27 Sat Penn State NaN Purdue Western T 20 20 0 0 1 1 T 1 NaN
18 1 1951-11-03 Sat Penn State @ Purdue Western L 0 28 -28 0 1 0 L 1 NaN
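The same idea can be tried without network access: collect the comment nodes, keep only those that contain a table, and parse their contents as HTML. The sketch below is a stand-in that uses the standard-library `html.parser` instead of BeautifulSoup and pandas. `SAMPLE`, `CommentCollector`, `RowCollector` and `tables_hidden_in_comments` are hypothetical names, and the two data rows are copied from the printed output above rather than fetched from the site.

```python
from html.parser import HTMLParser

# Hypothetical fragment: one irrelevant comment and one commented-out table.
SAMPLE = """
<!-- ad slot -->
<div id="all_results">
<!--
<table id="results">
<tr><th>Date</th><th>Pts</th><th>Opp</th></tr>
<tr><td>2019-10-05</td><td>35</td><td>7</td></tr>
<tr><td>2016-10-29</td><td>62</td><td>24</td></tr>
</table>
-->
</div>
"""


class CommentCollector(HTMLParser):
    """Stores the raw text of every HTML comment in the document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class RowCollector(HTMLParser):
    """Collects the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def tables_hidden_in_comments(html):
    """Return one list of rows per comment that contains a <table>."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    results = []
    for body in collector.comments:
        if "<table" not in body:
            continue  # skip comments that are not commented-out tables
        parser = RowCollector()
        parser.feed(body)
        parser.close()
        results.append(parser.rows)
    return results


hidden = tables_hidden_in_comments(SAMPLE)
print(hidden[0])
```

Filtering on `"<table"` before parsing avoids feeding unrelated comments to the table parser, which matters when a page has many comments and only some of them hide data.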
https://stackoverflow.com/questions/73451375