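I'm new to BeautifulSoup and am trying to extract data from the following website: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city
I'm trying to extract the summary percentage for each category (food, housing, clothes, transportation, personal care, and entertainment). So, for the link above, the percentages I'd like to extract are: 48%, 129%, 63%, 43%, 42%, 42%, and 72%.
Unfortunately, my current Python code using BeautifulSoup extracts the following percentages instead: 12%, 85%, 63%, 21%, 42%, and 48%. I don't know why this happens. Any help here would be greatly appreciated! Here is my code: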
import urllib2
from bs4 import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
page = urllib2.urlopen(url)
soup_expatistan = BeautifulSoup(page)
page.close()

# Find the comparison table, then every row marked class="expandable".
expatistan_table = soup_expatistan.find("table", class_="comparison")
expatistan_titles = expatistan_table.find_all("tr", class_="expandable")
for expatistan_title in expatistan_titles:
    # Print the text of the <span> inside each row's <th class="percent">.
    published_date = expatistan_title.find("th", class_="percent")
    print(published_date.span.string)
Posted on 2014-05-03 18:14:16
I couldn't pin down the exact cause, but this looks like a problem related to urllib2. Simply switching to requests made it work. Here is the code:
import requests
from bs4 import BeautifulSoup

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
# Fetch the page with requests instead of urllib2.
page = requests.get(url).text
soup_expatistan = BeautifulSoup(page)

expatistan_table = soup_expatistan.find("table", class_="comparison")
expatistan_titles = expatistan_table.find_all("tr", class_="expandable")
for expatistan_title in expatistan_titles:
    published_date = expatistan_title.find("th", class_="percent")
    print(published_date.span.string)
You can install requests with pip:
$ pip install requests
Edit
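The problem is indeed related to urllib2. The www.expatistan.com server appears to respond differently depending on the User-Agent set in the request. To get the same response with urllib2, you have to do the following: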
import urllib2

url = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
request = urllib2.Request(url)
opener = urllib2.build_opener()
# Replace urllib2's default User-Agent with a browser-like one.
request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0')
page = opener.open(request).read()
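For anyone on Python 3, where urllib2 no longer exists, the same idea might look roughly like the sketch below. It uses urllib.request from the standard library; build_request and fetch_html are hypothetical helper names introduced only for illustration, and the timeout and charset fallback are my own assumptions rather than anything from the original answer. Whether the site still reacts to the User-Agent in the same way is not something I have checked.

```python
# Python 3 sketch: urllib2's functionality now lives in urllib.request.
from urllib.request import Request, urlopen

URL = "http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city"
# Same browser-like User-Agent string used in the urllib2 snippet above.
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request carrying a browser-like User-Agent (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page as text (hypothetical helper; makes a network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


request = build_request(URL)
# urllib stores header names capitalized, so the key is "User-agent" here.
print(request.get_header("User-agent"))
```

The string returned by fetch_html(URL) can then be passed to BeautifulSoup in place of page, and the rest of the parsing code stays the same.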
https://stackoverflow.com/questions/23446975