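I'm trying to scrape some sports data from a website using BeautifulSoup4, but I'm having trouble figuring out how to proceed. I'm not very good with HTML and can't seem to work out the last bit of syntax I need. Once the data is parsed, I'll load it into a Pandas DataFrame. I'm trying to extract the home team, the away team, and the score. Here is my code so far: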

from bs4 import BeautifulSoup
import urllib2  # Python 2 only; on Python 3 the equivalent module is urllib.request
import csv

url = 'http://www.bbc.com/sport/football/premier-league/results'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

def has_class_but_no_id(tag):
    return tag.has_attr('score')

writer = csv.writer(open("webScraper.csv", "w"))

# Print every <span> whose class is team-away, team-home, or score
for tag in soup.find_all('span', {'class':['team-away', 'team-home', 'score']}):
    print(tag)

Here is a sample of the output:

<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
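<a href="/sport/football/teams/sunderland">Sunderland</a> </span>

I need to get the home team (Newcastle), the score (0-3), and the away team (Sunderland) into three separate fields. Essentially, I'm stuck trying to extract the "value" from each tag, and I can't seem to work out the bs4 syntax for it. I need something like a tag.value attribute, but all I can find in the documentation is tag.name or tag.attrs. Any help or pointers would be greatly appreciated!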
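
For reference, here is a minimal sketch of one way to pull the text out of each tag, run against the sample output above rather than the live page. It uses `get_text(strip=True)`, which returns the text inside a tag and its children. The helper name `extract_matches` and the grouping logic (home, then score, then away) are assumptions made for illustration, not something from the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question.
SAMPLE_HTML = """
<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>
"""


def extract_matches(html):
    """Return a list of {'home', 'score', 'away'} dicts, one per match.

    Hypothetical helper: assumes the spans appear in the order
    home team, score, away team, as in the sample output.
    """
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    current = {}
    # A list passed to class_ matches tags that carry any of these classes.
    for tag in soup.find_all("span", class_=["team-home", "score", "team-away"]):
        classes = tag.get("class", [])
        # get_text(strip=True) joins the stripped text of the tag and its children.
        text = tag.get_text(strip=True)
        if "team-home" in classes:
            current = {"home": text}
        elif "score" in classes:
            current["score"] = text
        elif "team-away" in classes:
            current["away"] = text
            matches.append(current)
            current = {}
    return matches


rows = extract_matches(SAMPLE_HTML)
print(rows)
```

A list of dicts like `rows` can be passed to `pandas.DataFrame(rows)`, which should give one row per match with `home`, `score`, and `away` columns.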
Posted on 2019-03-25 10:30:34
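Because the original URL now redirects here: https://www.bbc.com/sport/football/premier-league/scores-fixtures
This is an update to the accepted answer, which is still correct. If you edit your answer, I will delete this one.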

# Assumes soup was built from the redirected scores-fixtures page.
for match in soup.find_all('article', class_='sp-c-fixture'):
    # The team name sits two <span> levels below the team container.
    home_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home').find('span').find('span')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='sp-c-fixture__number sp-c-fixture__number--time')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away').find('span').find('span')
    away = away_tag and ''.join(away_tag.stripped_strings)
    # Print only fixtures where all three values were found
    if home and score and away:
        print(home, score, away)

https://stackoverflow.com/questions/21501949
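
One caveat with the loop in this answer: when `match.find(...)` returns `None` for a team container, the chained `.find('span')` call raises `AttributeError` before the `home_tag and ...` guard can help. Below is a rough sketch of a more defensive variant. The sample markup is hypothetical: it only mirrors the class names and nesting that the answer's code expects and is not copied from the live BBC page. The helpers `text_of` and `parse_fixtures` are likewise made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it only mirrors the class names and nesting that the
# answer's code expects, and is not copied from the live BBC page.
SAMPLE_HTML = """
<article class="sp-c-fixture">
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home">
    <span><span>Newcastle</span></span>
  </span>
  <span class="sp-c-fixture__number sp-c-fixture__number--time">0-3</span>
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away">
    <span><span>Sunderland</span></span>
  </span>
</article>
<article class="sp-c-fixture">
  <span class="sp-c-fixture__number sp-c-fixture__number--time">1-1</span>
</article>
"""


def text_of(parent, css_class):
    """Return the stripped text of the first <span> carrying css_class, or None."""
    # A single class name passed to class_ matches tags that have it among others.
    tag = parent.find("span", class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


def parse_fixtures(html):
    """Return (home, score, away) tuples, skipping incomplete fixtures."""
    soup = BeautifulSoup(html, "html.parser")
    fixtures = []
    for match in soup.find_all("article", class_="sp-c-fixture"):
        home = text_of(match, "sp-c-fixture__team--time-home")
        score = text_of(match, "sp-c-fixture__number--time")
        away = text_of(match, "sp-c-fixture__team--time-away")
        if home and score and away:
            fixtures.append((home, score, away))
    return fixtures


fixtures = parse_fixtures(SAMPLE_HTML)
print(fixtures)
```

The second `<article>` in the sample has no team spans, so it is skipped instead of raising an exception.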