Q: Error opening a page with Python requests

Stack Overflow user

Asked 2013-12-13 18:52:22

1 answer · 286 views · 0 followers · 1 vote

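I'm having trouble opening a page with the Python requests package. The page opens fine in a browser, but when I try to fetch the site from Python, the program just hangs.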

import bs4, requests

link = "http://s6.mediastreaming.it:8080/"
r = requests.get(link)
data=r.text

soup = bs4.BeautifulSoup(data)
# prettify() returns a formatted string; it does not modify soup in place
print(soup.prettify())

Any help understanding why this site hangs would be greatly appreciated. The code works fine with http://google.com as the link.

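Edit: adding some information that may help. This is a streaming music site. I only want to scrape the part of the page that says what the current song is. That's all. But perhaps the fact that this page belongs to a music streaming site is causing the problem? I just want the text of the page source, nothing else.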
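One way to test this suspicion without the program hanging is to cap how much of the response is read. The sketch below is only an illustration: `read_limited` is a hypothetical helper (not part of requests) that accepts any iterable of byte chunks and stops once a byte budget is reached, and the endless generator merely stands in for a live audio stream.

```python
import itertools


def read_limited(chunks, max_bytes):
    """Collect at most max_bytes from an iterable of byte chunks, then stop."""
    if max_bytes <= 0:
        return b""
    collected = bytearray()
    for chunk in chunks:
        # Take only as much of this chunk as the budget still allows.
        collected.extend(chunk[:max_bytes - len(collected)])
        if len(collected) >= max_bytes:
            break  # stop pulling chunks so an endless stream cannot block us
    return bytes(collected)


# Stand-in for an endless audio stream: the same 100-byte chunk forever.
endless_stream = itertools.repeat(b"\x00" * 100)
sample = read_limited(endless_stream, 250)
print(len(sample))
```

With requests, the same helper could be fed `r.iter_content(100)` from a response opened with `requests.get(link, stream=True, timeout=10)`; the timeout value here is an arbitrary choice.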

Edit 2: I tried the following to see whether I am actually getting a stream rather than the page I want.

import bs4, requests

link = "http://s6.mediastreaming.it:8080/"
r = requests.get(link,stream=True)

filename = "testfile.txt"

with open(filename,'wb') as fd:
    for chunk in r.iter_content(100):
        fd.write(chunk)

Here is the output I get:

ICY 200 OK
icy-notice1:<BR>This stream requires <a href="http://www.winamp.com/">Winamp</a><BR>
icy-notice2:SHOUTcast Distributed Network Audio Server/Linux v1.9.8<BR>
icy-name:Pig Radio - The Best Electronic & Indie Pop/Rock 24/7
icy-genre:Eclectic
icy-url:http://www.pigradio.com
content-type:audio/mpeg
icy-pub:1
icy-br:128

&Oç)goiYQŠ < 6Ê !‡À¡ö³ ‡/OFÌ)8…¨ÐU!ðiÁP¡¢ãÅ..........
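The dump above starts with an `ICY 200 OK` status line and `icy-*` headers, followed by raw audio bytes that never end. That would explain why a plain `requests.get` without `stream=True` appears to hang: it waits for a body that never finishes. A minimal sketch for splitting such a capture into status line, headers, and audio payload is shown below; `parse_icy_response` is a hypothetical helper, not part of requests, and it assumes the header block ends at the first blank line.

```python
def parse_icy_response(raw):
    """Split raw bytes into (status_line, headers_dict, body_bytes)."""
    # Locate the earliest blank line, accepting CRLF or bare LF line endings.
    hits = [(raw.find(t), t) for t in (b"\r\n\r\n", b"\n\n")]
    hits = [(pos, t) for pos, t in hits if pos != -1]
    if hits:
        pos, terminator = min(hits)
        head, body = raw[:pos], raw[pos + len(terminator):]
    else:
        head, body = raw, b""  # no blank line found: treat everything as headers

    lines = head.decode("latin-1").splitlines()
    status_line = lines[0] if lines else ""

    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values such as URLs stay intact.
        name, colon, value = line.partition(":")
        if colon:
            headers[name.strip().lower()] = value.strip()
    return status_line, headers, body
```

Applied to the first few hundred bytes of the capture, the returned headers would include entries such as `content-type` and `icy-br`, which is enough to confirm that the server is answering with audio rather than HTML.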

When I open the page in a browser and view the source, I get the following:

<HTML><HEAD><meta http-equiv="Content-Language" content="en-us"><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"><meta http-equiv="Pragma" content="no-cache"><meta http-equiv="Expires" content="Mon, 01 Jan 1990 12:00:00 GMT"><title>SHOUTcast Administrator</title><style type="text/css"><!--a:link {color: blue; font-family:Arial, Helvetica; font-size:9pt;}a:visited {color: blue; font-family:Arial, Helvetica; font-size:9pt;}a:hover {color: red; font-family:Arial, Helvetica; font-size:9pt; }.default {color: White; font-family:Arial, Helvetica; font-size:9pt; font-weight: normal}.ST {color: White; font-family:Arial, Helvetica; font-size:8pt; font-weight: normal}.logoText {color: red; font-family: Arial Black, Helvetica, sans-serif; font-size: 25pt; font-weight: normal; letter-spacing : -2.5px;}.flagText {color: blue; font-family: webdings; font-size: 36pt; font-weight: normal; }.ltv {color: blue; font-family: Arial, Helvetica, sans-serif; font-size: 9pt; font-weight: normal;}.tnl {color: black; font-family: Arial, Helvetica, sans-serif; font-size: 10pt; font-weight: bold; text-decoration: none;}--></style></HEAD><BODY topmargin=0 leftmargin=0 marginheight=0 marginwidth=0 bgcolor=#000000 text=#EEEEEE link=#001155 vlink=#001155 alink=#FF0000><font class=default><table width=100% border=0 cellpadding=0 cellspacing=0><tr><td height=50><font class=flagText>U</font><font class=logoText>&nbsp;SHOUTcast D.N.A.S. 
Status</font></td></tr><tr><td height=14 align=right><font class=ltv><a id=ltv href="http://www.shoutcast.com/">SHOUTcast Server Version 1.9.8/Linux</a></font></td></tr><tr><td bgcolor=#DDDDDD height=20 align=center><table width=100% border=0 cellpadding=0 cellspacing=0><tr><td align=center><font class=tnl><a id=tnl href="index.html">Status</a></font></td><td align=center><font class=tnl>&nbsp;|&nbsp;</font></td><td align=center><font class=tnl><a id=tnl href="played.html">Song History</a></font></td><td align=center><font class=tnl>&nbsp;|&nbsp;</font></td><td align=center><font class=tnl><a id=tnl href="listen.pls">Listen</a></font></td><td align=center><font class=tnl>&nbsp;|&nbsp;</font></td><td align=center><font class=tnl><a id=tnl href="home.html">Stream URL</a></font></td><td align=center><font class=tnl>&nbsp;|&nbsp;</font></td><td align=center><font class=tnl><a id=tnl href="admin.cgi">Admin Login</a></font></td></tr></table></td></tr></table><br><table cellpadding=5 cellspacing=0 border=0 width=100%><tr><td bgcolor=#000025 colspan=2 align=center><font class=ST>Current Stream Information</font></td></tr></table><table cellpadding=2 cellspacing=0 border=0 align=center><tr><td width=100 nowrap><font class=default>Server Status: </font></td><td><font class=default><b>Server is currently up and public.</b></td></tr><tr><td width=100 nowrap><font class=default>Stream Status: </font></td><td><font class=default><b>Stream is up at 128 kbps with <B>45 of 300 listeners (45 unique)</b></b></td></tr><tr><td width=100 nowrap><font class=default>Listener Peak: </font></td><td><font class=default><b>200</b></td></tr><tr><td width=100 nowrap><font class=default>Average Listen Time: </font></td><td><font class=default><b>7h&nbsp;13m&nbsp;12s</b></td></tr><tr><td width=100 nowrap><font class=default>Stream Title: </font></td><td><font class=default><b>Pig Radio - The Best Electronic & Indie Pop/Rock 24/7</b></td></tr><tr><td width=100 nowrap><font class=default>Content 
Type: </font></td><td><font class=default><b>audio/mpeg</b></td></tr><tr><td width=100 nowrap><font class=default>Stream Genre: </font></td><td><font class=default><b>Eclectic</b></td></tr><tr><td width=100 nowrap><font class=default>Stream URL: </font></td><td><font class=default><b><a href="http://www.pigradio.com">http://www.pigradio.com</a></b></td></tr><tr><td width=100 nowrap><font class=default>Stream AIM: </font></td><td><font class=default><b><a href="aim:goim?screenname=N/A">N/A</a></b></td></tr><tr><td width=100 nowrap><font class=default>Stream IRC: </font></td><td><font class=default><b><a href="http://www.shoutcast.com/chat.phtml?dc=N%2FA">N/A</a></b></td></tr><tr><td width=100 nowrap><font class=default>**Current Song: </font></td><td><font class=default><b>Midlake - Young Bride (Cassettes Won't Listen Remix)**</b></td></tr></table><br><table cellpadding=0 cellspacing=0 border=0 width=100%> <tr><td bgcolor=#DDDDDD  nowrap colspan=5 align=center><table cellspacing=0 cellpadding=0 border=0><tr><td><font class=ltv>Written by Stephen 'Tag Loomis, Tom Pepper and Justin Frankel</font></td></tr></table></td></tr><tr><td nowrap colspan=5 align=center><font class=ST><b><a href="http://www.shoutcast.com/disclaimer.phtml">Copyright Nullsoft Inc</a><a href="/llamacookie">.</a> 1998-2004</b></font></td></tr></table></font></body></html>

Obviously, the last bit of that HTML, the "Current Song" part, is what I want to scrape. How can I get this HTML?

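Edit 3: Solved it! I used Wireshark to capture the GET that the browser sends and added all the parameters I saw there to the request headers in Python. It looks like this: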

import bs4, requests

link = "http://173.255.137.244:8080/"

# Request headers copied from the browser's GET as captured in Wireshark
headers = {
    'Host': '173.255.137.244:8080',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:26.0) Gecko/20100101 Firefox/26.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
}
r = requests.get(link, headers=headers)
data = r.text

soup = bs4.BeautifulSoup(data)
# prettify() returns a formatted string; it does not modify soup in place
print(soup.prettify())

Moral of the story: Wireshark is cool.
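Once the status page comes back as HTML, the "Current Song" value still has to be pulled out of it. The sketch below is one possible way to do that, using only the standard library's `html.parser` so that it runs without extra packages; BeautifulSoup could be used in a similar way. It assumes the label text is exactly `Current Song:` and that the song title is the next piece of text after it, as in the page source pasted above. `current_song` and `SAMPLE_ROW` are illustrative names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text fragment of a document, in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def current_song(html):
    """Return the text that follows the 'Current Song:' label, or None."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Pair each text fragment with the one that follows it.
    for label, value in zip(collector.chunks, collector.chunks[1:]):
        if label == "Current Song:":
            return value
    return None


# One table row in the style of the SHOUTcast status page shown earlier.
SAMPLE_ROW = (
    "<tr><td width=100 nowrap><font class=default>Current Song: </font></td>"
    "<td><font class=default><b>Midlake - Young Bride "
    "(Cassettes Won't Listen Remix)</b></td></tr>"
)
print(current_song(SAMPLE_ROW))
```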

python-2.7

beautifulsoup

python-requests

1 Answer

Stack Overflow user

Accepted answer

Answered 2013-12-14 00:11:49

The site seems to have some mechanism for keeping spiders from crawling it.

If crawling the page does not require authentication and a spider still has trouble with it, there are usually two things to try:

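Add HTTP request headers when making the request, so that it looks as if the page is being accessed from a web browser. Some sites check the User-Agent field to decide whether the client is a spider.
Add any required cookies when making the request. Some sites need information from a cookie before they respond.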
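The two approaches above can be sketched as follows. This version uses only the standard library's `urllib.request` so that it runs without extra packages; with requests, the same values would be passed through the `headers=` and `cookies=` arguments. The header values are taken from the question, while `build_request`, `fetch`, and the `session` cookie name are assumptions made for illustration.

```python
import urllib.request

# Browser-like headers; the values are the ones quoted in the question.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:26.0) Gecko/20100101 Firefox/26.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url, cookies=None):
    """Build a GET request that carries browser-like headers and optional cookies."""
    headers = dict(BROWSER_HEADERS)
    if cookies:
        # A Cookie header is a "; "-separated list of name=value pairs.
        headers["Cookie"] = "; ".join(
            "{}={}".format(name, value) for name, value in cookies.items()
        )
    return urllib.request.Request(url, headers=headers)


def fetch(url, cookies=None, timeout=10):
    """Send the request and return the raw body; the timeout avoids hanging forever."""
    with urllib.request.urlopen(build_request(url, cookies), timeout=timeout) as resp:
        return resp.read()


# Building the request does not touch the network, so it can be inspected safely.
req = build_request("http://example.com/", cookies={"session": "abc123"})
print(req.get_header("Cookie"))
```

Which headers or cookies a given site actually checks has to be found by comparing against a real browser request, so the dictionary above is only a starting point.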

Wireshark is a bit complicated for this; the developer tools built into Chrome are a good alternative for observing network requests and their headers.

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/20573890
