
BeautifulSoup throws "IndexError"

Stack Overflow user
Asked 2013-09-24 18:54:16
2 answers, 133 views, 0 followers, 1 vote

I'm scraping a website using Python 2.7 and BeautifulSoup 3.2. I'm new to both, but the documentation got me started.

I'm working from the following documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#contents http://thepcspy.com/read/scraping-websites-with-python/

What I have so far (partly failing):

# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup

# URL to scrape and open it with the urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)

# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)

# From the source HTML page, search and store all <td class="home">..</td> and its content
hometeamsTd = soup.findAll('td', { "class" : "home" })
# Loop through the tag and store only the needed information, being the home team
hometeams = [tag.contents[1] for tag in hometeamsTd]

# From the source HTML page, search and store all <td class="away">..</td> and its content
awayteamsTd = soup.findAll('td', { "class" : "away" })
# Loop through the tag and store only the needed information, being the away team
awayteams = [tag.contents[1] for tag in awayteamsTd]

The tag.contents of each entry in hometeamsTd looks like this:

[
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Harkemase Boys', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6077" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'RKC Waalwijk', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-427" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'PSV', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Ajax', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-2" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'SC Heerenveen', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-14" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Feyenoord', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-9" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />]
]

The tag.contents of each entry in awayteamsTd looks like this:

[
    [u'Away-team'], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-13" />, u'NEC', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-11" />, u'Heracles', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428" />, u'Stormvogels Telstar', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-419" />, u'FC Volendam', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-7" />, u'FC Twente', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-415" />, u'FC Dordrecht', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />]
]

What I'm trying to solve, but haven't fully managed yet:

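  • The line awayteams = [tag.contents[1] for tag in awayteamsTd] raises an error: IndexError: list index out of range. That makes sense, because as you can see in the tag.contents output for awayteamsTd, the first entry is [u'Away-team'], which has only one item. That is why it fails. But how do I remove or skip it?
  • For the home teams all output works fine, but I want to exclude the entries where the text is Dutch KNVB Beker.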

2 Answers

Stack Overflow user

Accepted answer

Answered 2013-09-24 19:08:16

The problem is that the "Away-team" cell (the column header) is itself a td with the "away" class:

<thead class="title">
    ...
    <tr class="sub">
      ...
      <td>Home-team</td>
      <td></td>
      <td class="away">Away-team</td>
      <td class="broadcast">Broadcast</td>
    </tr>
</thead>

Just skip it with a slice:

awayteamsTd = soup.findAll('td', { "class" : "away" })[1:]
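
To see why the slice helps, here is a minimal sketch that uses plain Python lists as stand-ins for each tag.contents. The strings '<img fav>' and '<img flag>' are placeholders, not real Tag objects, and the sketch assumes the header cell is always the first match, as it is in the output shown in the question.

```python
# Stand-ins for tag.contents of each <td class="away"> match
# (placeholder strings, not real BeautifulSoup Tag objects)
away_contents = [
    ['Away-team'],                            # header cell: a single child
    ['<img fav>', 'NEC', '<img flag>'],
    ['<img fav>', 'Heracles', '<img flag>'],
]

# Indexing [1] on the header's one-item list is what raises IndexError
try:
    [contents[1] for contents in away_contents]
    failed = False
except IndexError:
    failed = True

# Dropping the first match with a slice skips the header cell
awayteams = [contents[1] for contents in away_contents[1:]]
print(failed, awayteams)
```

If the header were ever not the first match, the slice would drop a real team instead, so this relies on the page layout staying as shown.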

Also, if you want to exclude Dutch KNVB Beker from the home-team list, add a condition to the list comprehension:

hometeams = [tag.contents[1] for tag in hometeamsTd if tag.contents[1] != 'Dutch KNVB Beker']
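
If more names need to be excluded later, the same condition could be written against a set. This is only a sketch on plain lists standing in for tag.contents; the name EXCLUDED is hypothetical, and only 'Dutch KNVB Beker' comes from the question.

```python
# Hypothetical set of names to drop; only 'Dutch KNVB Beker' is from the question
EXCLUDED = {'Dutch KNVB Beker'}

# Stand-ins for tag.contents of each <td class="home"> match
# (placeholder strings, not real BeautifulSoup Tag objects)
home_contents = [
    ['<img flag>', 'Harkemase Boys', '<img fav>'],
    ['<img flag>', 'RKC Waalwijk', '<img fav>'],
    ['<img flag>', 'Dutch KNVB Beker', '<img fav>'],
    ['<img flag>', 'PSV', '<img fav>'],
]

# Keep the team name (index 1) unless it is in the excluded set
hometeams = [contents[1] for contents in home_contents
             if contents[1] not in EXCLUDED]
print(hometeams)
```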
Votes: 1

Stack Overflow user

Answered 2013-09-24 19:06:54

awayteams = []
for tag in awayteamsTd:
    if len(tag.contents) > 1:
        awayteams.append(tag.contents[1])
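
The loop above can also be written as a list comprehension with the same length guard. The sketch below runs on plain lists standing in for tag.contents (placeholder strings, not real Tag objects). Unlike a slice, the guard does not depend on where the header cell appears in the results.

```python
# Stand-ins for tag.contents of each <td class="away"> match
away_contents = [
    ['Away-team'],                            # header cell: a single child
    ['<img fav>', 'NEC', '<img flag>'],
    ['<img fav>', 'Heracles', '<img flag>'],
]

# Only index [1] when the cell actually has more than one child
awayteams = [contents[1] for contents in away_contents if len(contents) > 1]
print(awayteams)
```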
Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/18989742
