
Parsing sports data reference with Beautiful Soup

Stack Overflow user
Asked on 2021-09-08 12:17:53
2 answers · viewed 79 times · 0 followers · score 0

I'm new to Python. For this project, I want to create a loop that parses the standings data for every NFL team from https://www.pro-football-reference.com/teams/. I started by creating a DataFrame to serve as a directory, as shown below.

my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atalanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
                  ['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
                  ['dal','Dalls_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
                  ['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
                  ['sgd','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
                  ['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
                  ['phi','Philidophia_Eagles'],['pt','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
                  ['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])

team_list = pd.DataFrame(my_array, columns=['code','teams'])

Here is the loop I used to parse all 32 pages:

url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [url_base+str(i) for i in team_list['code']]
for url in url_list:
    page = requests.get(url).text
    soup = bs(page)

for table in soup.find_all('table'):
    headers = []
    for i in table.find_all('th', scope = "col"):
        title=i.text.strip()
        headers.append(title)

    table_data = []
    for tr in table.find_all("tr"): 
        t_row = {}
        for td, th in zip(tr.find_all("td"), headers): 
            t_row[th] = td.text.replace('\n', '').strip()
    table_data.append(t_row)

However, the result is an empty list. Is there something wrong with my code? Thanks!


2 Answers

Stack Overflow user

Accepted answer

Posted on 2021-09-10 09:10:20

Here is the logic without using pandas.read_html():

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atlanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
                  ['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
                  ['dal','Dallas_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
                  ['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
                  ['sdg','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
                  ['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
                  ['phi','Philadelphia_Eagles'],['pit','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
                  ['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])



url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [(url_base+str(i[0]), i[1]) for i in my_array]
rows = []
for url, team in url_list:
    print('Gathering: %s' %team)
    response = requests.get(url)
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # The franchise history table has id="team_index"
    table = soup.find('table', {'id':'team_index'})
    # Column names live in the second <tr>; the first is a spanner row
    headers = [x.text.strip() for x in table.find_all('tr')[1].find_all('th')]
    
    # Data rows start after the two header rows
    trs = table.find_all('tr')[2:]
    
    for tr in trs:
        # The year sits in a <th>; the remaining cells are <td>s
        year = tr.find('th').text.strip()
        if year == 'Year' or year == '':
            continue  # skip repeated sub-header rows and blank rows
        data = [year] + [x.text.strip() for x in tr.find_all('td')]
        
        rows.append(data)
        
final_table = pd.DataFrame(rows, columns=headers)
Score: 0

Stack Overflow user

Posted on 2021-09-09 15:21:35

A few issues with the code:

  1. (As mentioned) your indentation is off, so you need to fix it: you want to parse the tables inside the first loop, where the soup object is created. As written, only the last page's soup survives the loop.
  2. zip returns an iterator here, which is consumed after a single pass; materialize it with something like list(zip(x, y)) if you need to iterate over it more than once.
  3. Even with that fixed, when you use zip to build your dictionary you want the headers to be the keys, not the values.
  4. The headers are multi-indexed, so they will not line up when you zip them with the tds.
  5. (Also mentioned) you need to initialize table_data before the loop, otherwise it just overwrites itself on each iteration.

Finally, consider using pandas.read_html(). It uses Beautiful Soup under the hood and can parse the tables for you, leaving you only minimal work to clean up the table.
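For illustration, here is a minimal, self-contained sketch of the dictionary-building fixes described above, with sample strings standing in for the parsed cells (the sample values are made up, not scraped):

```python
# Sketch of the zip/initialization fixes: headers as keys, table_data
# initialized once before the loop, append inside the loop.
headers = ['Year', 'Lg', 'Tm']           # column headers (the dict keys)
parsed_rows = [                          # stand-ins for the <td> text of each <tr>
    ['2020', 'NFL', 'Arizona Cardinals'],
    ['2019', 'NFL', 'Arizona Cardinals'],
]

table_data = []                          # initialize ONCE, before the loop
for cells in parsed_rows:
    # zip headers with cells so the headers become the dictionary keys
    t_row = {th: td for th, td in zip(headers, cells)}
    table_data.append(t_row)             # append INSIDE the loop

print(table_data[0]['Tm'])               # -> Arizona Cardinals
```

With the append indented into the loop, table_data keeps one dict per row instead of being overwritten each iteration.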

I also fixed a few errors in your array (you could instead grab those hrefs and team names from the table at https://www.pro-football-reference.com/teams/, but hard-coding them as you did should be fine, since those links won't change any time soon, if ever):

  1. 'Atalanta_Falcons' -> 'Atlanta_Falcons'
  2. 'Dalls_Cowboys' -> 'Dallas_Cowboys'
  3. 'Philidophia_Eagles' -> 'Philadelphia_Eagles'
  4. 'sgd' -> 'sdg'
  5. 'pt' -> 'pit'

Code:

import pandas as pd
import numpy as np

my_array = np.array([['crd','Arizona_Cardinals'],['atl','Atlanta_Falcons'],['rav','Baltimore_Ravens'],['buf','Buffalo_Bills'],
                  ['car','Carolina_Panthers'],['chi','Chicago_Bears'],['cin','Cincinnati_Bengals'],['cle','Cleveland_Browns'],
                  ['dal','Dallas_Cowboys'],['den','Denver_Broncos'],['det','Detroit_Lions'],['gnb','Green_Bay_Packers'],['htx','Houston_Texans'],
                  ['clt','Indianapolis_Colts'],['jax','Jacksonville_Jaguars'],['kan','Kansas_City_Chiefs'],['rai','Las_Vegas_Raiders'],
                  ['sdg','Los_Angeles_Chargers'],['ram','Los_Angeles_Rams'],['mia','Miami_Dolphins'],['min','Minnesota_Vikings'],
                  ['nwe','New_England_Patriots'],['nor','New_Orleans_Saints'],['nyg','New_York_Giants'],['nyj','New_York_Jets'],
                  ['phi','Philadelphia_Eagles'],['pit','Pittsburgh_Steelers'],['sfo','San_Francisco_49ers'],['sea','Seattle_Seahawks'],
                  ['tam','Tampa_Bay_Buccaneers'],['oti','Tennessee_Titans'],['was','Washington_Football_Team']])


final_table = pd.DataFrame()
url_base = 'https://www.pro-football-reference.com/teams/'
url_list = [(url_base+str(i[0]), i[1]) for i in my_array]
for url, team in url_list:
    print('Gathering: %s' %team)
    
    # Gets full unfiltered table
    table = pd.read_html(url, header=1)[0]
    
    #Drop those sub header rows
    table = table[table['Year'].ne('Year')]
    
    #Drop the null rows
    table = table.dropna(subset = ['Year'])
    
    # Append to your final dataframe
    final_table = final_table.append(table, sort=False).reset_index(drop=True)

Output:

print(final_table)
      Year   Lg                 Tm  W   L  ...    MoV   SoS    SRS  OSRS  DSRS
0     2021  NFL  Arizona Cardinals  0   0  ...    NaN   NaN    NaN   NaN   NaN
1     2020  NFL  Arizona Cardinals  8   8  ...    2.7  -0.1    2.6   1.5   1.0
2     2019  NFL  Arizona Cardinals  5  10  ...   -5.1   1.8   -3.2  -0.3  -2.9
3     2018  NFL  Arizona Cardinals  3  13  ...  -12.5   1.0  -11.5  -9.6  -1.9
4     2017  NFL  Arizona Cardinals  8   8  ...   -4.1   0.4   -3.7  -4.0   0.2
   ...  ...                ... ..  ..  ...    ...   ...    ...   ...   ...
2089  1936  NFL   Boston Redskins*  7   5  ...    3.3  -3.0    0.3  -1.0   1.3
2090  1935  NFL    Boston Redskins  2   8  ...   -5.3  -0.8   -6.1  -6.1   0.0
2091  1934  NFL    Boston Redskins  6   6  ...    1.1  -0.8    0.2  -1.7   2.0
2092  1933  NFL    Boston Redskins  5   5  ...    0.5   1.4    1.9  -0.8   2.7
2093  1932  NFL      Boston Braves  4   4  ...   -2.4  -1.6   -4.0  -4.0  -0.1

[2094 rows x 29 columns]
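One editorial note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the accumulation step in the answer's loop can be rewritten by collecting the tables in a list and concatenating once at the end. A sketch (with small made-up tables in place of the scraped ones):

```python
import pandas as pd

# Collect per-team tables in a list, then concatenate once at the end;
# this replaces the removed DataFrame.append pattern and avoids
# quadratic copying inside the loop.
tables = []
for team in ['Arizona_Cardinals', 'Atlanta_Falcons']:  # illustrative subset
    table = pd.DataFrame({'Year': ['2020', '2019'], 'Tm': [team, team]})
    tables.append(table)

final_table = pd.concat(tables, sort=False).reset_index(drop=True)
print(len(final_table))  # -> 4
```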
Score: 0
Original content provided by Stack Overflow (this page was a machine translation supplied by Tencent Cloud's translation engine).
Original link: https://stackoverflow.com/questions/69102977