Q: Python site parsing

Stack Overflow user

Asked 2014-02-01 18:37:09

3 answers · 1.1K views · 0 followers · 3 votes

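I'm trying to scrape some sports data from a website using BeautifulSoup4, but I'm having trouble figuring out how to proceed. I'm not very good with HTML and can't work out the last bit of syntax I need. Once the data is parsed, I'll insert it into a Pandas DataFrame. I'm trying to extract the home team, the away team, and the score. Here is my code so far: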

from bs4 import BeautifulSoup
import urllib2
import csv

url = 'http://www.bbc.com/sport/football/premier-league/results'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

def has_class_but_no_id(tag):
    return tag.has_attr('score')

writer = csv.writer(open("webScraper.csv", "w"))

for tag in soup.find_all('span', {'class':['team-away', 'team-home', 'score']}):
    print(tag)

Here is a sample of the output:

<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>

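I need to get the home team (Newcastle), the score (0-3), and the away team (Sunderland) into three separate fields. Essentially, I'm stuck trying to extract the "value" from each tag, and I can't figure out the syntax in bs4. I need something like a tag.value attribute, but all I've found in the documentation is tag.name and tag.attrs. Any help or pointers would be greatly appreciated!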

web-scraping

beautifulsoup

python

3 answers

Stack Overflow user

Accepted answer

Answered 2014-02-01 20:13:55

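Each score unit sits in a <td class='match-details'> element; loop over those elements to extract the match details.

From there, you can extract the text of child elements with the .stripped_strings generator; just pass it to ''.join() to get all the strings contained in a tag. To make parsing easier, select team-home, score and team-away separately: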

for match in soup.find_all('td', class_='match-details'):
    home_tag = match.find('span', class_='team-home')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='team-away')
    away = away_tag and ''.join(away_tag.stripped_strings)

With an extra print added, this gives:

>>> for match in soup.find_all('td', class_='match-details'):
...     home_tag = match.find('span', class_='team-home')
...     home = home_tag and ''.join(home_tag.stripped_strings)
...     score_tag = match.find('span', class_='score')
...     score = score_tag and ''.join(score_tag.stripped_strings)
...     away_tag = match.find('span', class_='team-away')
...     away = away_tag and ''.join(away_tag.stripped_strings)
...     if home and score and away:
...         print home, score, away
... 
Newcastle 0-3 Sunderland
West Ham 2-0 Swansea
Cardiff 2-1 Norwich
Everton 2-1 Aston Villa
Fulham 0-3 Southampton
Hull 1-1 Tottenham
Stoke 2-1 Man Utd
Aston Villa 4-3 West Brom
Chelsea 0-0 West Ham
Sunderland 1-0 Stoke
Tottenham 1-5 Man City
Man Utd 2-0 Cardiff
# etc. etc. etc.
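
Since the question mentions writing the results out, here is a minimal Python 3 sketch of the same loop that collects the three fields into rows. It runs offline against a sample snippet modeled on the question's output; the <table>/<tr> wrapper, the helper names text_of, extract_matches and to_csv, and the column names are assumptions made for illustration, not part of the original answer.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup modeled on the question's output. The <table>/<tr> wrapper
# is an assumption so that the snippet runs without fetching the page.
SAMPLE_HTML = """
<table><tr><td class="match-details">
<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>
</td></tr></table>
"""


def text_of(tag):
    """Return the tag's text with whitespace stripped, or None for a missing tag."""
    return "".join(tag.stripped_strings) if tag is not None else None


def extract_matches(html):
    """Collect (home, score, away) tuples from every td.match-details cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for match in soup.find_all("td", class_="match-details"):
        home = text_of(match.find("span", class_="team-home"))
        score = text_of(match.find("span", class_="score"))
        away = text_of(match.find("span", class_="team-away"))
        # Skip cells where any of the three pieces is missing or empty.
        if home and score and away:
            rows.append((home, score, away))
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["home", "score", "away"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_matches(SAMPLE_HTML)
print(to_csv(rows))
```

The same list of tuples can be handed to pandas.DataFrame(rows, columns=["home", "score", "away"]) if a DataFrame is the end goal.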

Votes: 3

Stack Overflow user

Answered 2014-02-01 18:48:58

You can use the tag.string property to get the value of a tag.

See the documentation for details: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
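
One caveat worth illustrating: .string is only defined when a tag has a single child, so it may come back as None on the markup from the question. The sketch below uses the score span from the question's sample output and the standard-library "html.parser" backend (the parser choice is an assumption; other parsers may treat whitespace differently).

```python
from bs4 import BeautifulSoup

html = '<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", class_="score")
abbr = span.find("abbr")

# The span has three children (whitespace, <abbr>, whitespace),
# so .string is ambiguous and bs4 returns None.
span_string = span.string

# The <abbr> has exactly one text child, so .string works here,
# but the surrounding whitespace is kept.
abbr_string = abbr.string

# get_text(strip=True) walks all descendants and strips each string.
span_text = span.get_text(strip=True)

print(span_string is None)
print(abbr_string.strip())
print(span_text)
```

For markup like this, get_text(strip=True) or ''.join(tag.stripped_strings) is usually the safer choice.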

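Votes: 1

Stack Overflow user

Answered 2019-03-25 10:30:34

The URL now redirects here: https://www.bbc.com/sport/football/premier-league/scores-fixtures

This is an update to the accepted answer, which is still correct. If you edit your answer, I will delete this one.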

for match in soup.find_all('article', class_='sp-c-fixture'):
    home_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home').find('span').find('span')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='sp-c-fixture__number sp-c-fixture__number--time')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away').find('span').find('span')
    away = away_tag and ''.join(away_tag.stripped_strings)
    if home and score and away:
        print(home, score, away)
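
Note that chaining .find('span').find('span') raises AttributeError as soon as one of the lookups returns None. A hedged alternative is to match on a single class with CSS selectors and guard against missing tags. In the sketch below only the class names come from the answer above; the sample markup, its nesting, and the helper names text_of and extract_fixtures are hypothetical, so the selectors would need checking against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names are taken from the answer above,
# but the nesting is a guess made so the example runs offline.
SAMPLE_HTML = """
<article class="sp-c-fixture">
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home">
    <span><span>Newcastle</span></span>
  </span>
  <span class="sp-c-fixture__number sp-c-fixture__number--time">0-3</span>
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away">
    <span><span>Sunderland</span></span>
  </span>
</article>
<article class="sp-c-fixture">
  <span class="sp-c-fixture__number">P-P</span>
</article>
"""


def text_of(tag):
    """Return stripped text for a tag, or None when the lookup found nothing."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_fixtures(html):
    """Collect (home, score, away) tuples, skipping incomplete fixtures."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for match in soup.select("article.sp-c-fixture"):
        # select_one returns None instead of raising when nothing matches.
        home = text_of(match.select_one(".sp-c-fixture__team--time-home"))
        score = text_of(match.select_one(".sp-c-fixture__number--time"))
        away = text_of(match.select_one(".sp-c-fixture__team--time-away"))
        if home and score and away:
            results.append((home, score, away))
    return results


fixtures = extract_fixtures(SAMPLE_HTML)
for home, score, away in fixtures:
    print(home, score, away)
```

Matching on one distinguishing class rather than the full class string also keeps the lookup working if the site adds or reorders classes on the element.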

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/21501949
