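I want to scrape data from the Bloomberg website. Specifically, I need the data shown under "IBVC:IND stock market index".
So far, my code is as follows: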
import requests
from bs4 import BeautifulSoup as bs
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/58.0.3029.110 Safari/537.36 '
}
res = requests.get("https://www.bloomberg.com/quote/IBVC:IND", headers=headers)
soup = bs(res.content, 'html.parser')
# print(soup)
itmes = soup.find("div", {"class": "snapshot__0569338b snapshot"})
open_ = itmes.find("span", {"class": "priceText__1853e8a5"}).text
print(open_)
prev_close = itmes.find("span", {"class": "priceText__1853e8a5"}).text
I can't find the values I need in the HTML. Which library should I use to handle this? I'm currently using BeautifulSoup and requests.
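Answered 2019-09-23 21:29:27
As the other answers point out, the content is generated by JavaScript and is therefore not present in the plain HTML. Two different angles of attack have been proposed for this problem:
Selenium, a.k.a. "the big guns": this lets you automate almost any task in a browser, though at a cost in speed.
API request, a.k.a. "the thoughtful approach": this is not always feasible, but when it is, it is far more efficient.
I will elaborate on the second. @ViniciusDAvila has already laid out the typical blueprint for this kind of solution: navigate to the site, inspect the network traffic, and identify which request is responsible for fetching the data.
Once that is done, the rest is a matter of implementation:
Scraper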
import requests
import json
from urllib.parse import quote
# Constants
HEADERS = {
'Host': 'www.bloomberg.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
'Accept': '*/*',
'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://www.bloomberg.com/quote/',
'DNT': '1',
'Connection': 'keep-alive',
'TE': 'Trailers'
}
URL_ROOT = 'https://www.bloomberg.com/markets2/api/datastrip'
URL_PARAMS = 'locale=en&customTickerList=true'
VALID_TYPE = {'currency', 'index'}
# Scraper
def scraper(object_id: str = None, object_type: str = None, timeout: int = 5) -> list:
    """
    Get the Bloomberg data for the given object.
    :param object_id: The Bloomberg identifier of the object.
    :param object_type: The type of the object. (Currency or Index)
    :param timeout: Maximal number of seconds to wait for a response.
    :return: The data as a list of dictionaries (empty on failure).
    """
    # Tolerate a missing object_type instead of failing on None.lower()
    object_type = (object_type or '').lower()
    if object_type not in VALID_TYPE:
        return list()
    # Build headers and url
    object_append = '%s:%s' % (object_id, 'IND' if object_type == 'index' else 'CUR')
    # Copy the constant so repeated calls do not keep appending to the shared Referer
    headers = dict(HEADERS)
    headers['Referer'] += object_append
    url = '%s/%s?%s' % (URL_ROOT, quote(object_append), URL_PARAMS)
    # Make the request (honouring the timeout) and check response status code
    response = requests.get(url=url, headers=headers, timeout=timeout)
    if response.status_code in range(200, 230):
        return response.json()
    return list()
Test
# Index
object_id, object_type = 'IBVC', 'index'
data = scraper(object_id=object_id, object_type=object_type)
print('The open price for %s %s is: %d' % (object_type, object_id, data[0]['openPrice']))
# The open price for index IBVC is: 50094
# Exchange rate
object_id, object_type = 'EUR', 'currency'
data = scraper(object_id=object_id, object_type=object_type)
print('The open exchange rate for USD per {} is: {}'.format(object_id, data[0]['openPrice']))
# The open exchange rate for USD per EUR is: 1.0993
Answered 2019-09-23 14:35:47
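Because this is not a static page, you need to make a request to the Bloomberg API. To find out how, go to the page, inspect an element and select the "Network" tab, then filter by "XHR" and look for responses of type JSON. Reload the page. I did this and I believe this is what you want: link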
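Once the XHR request has been identified, handling its response is mostly a matter of building the URL and reading a field out of the returned JSON. The following is only a rough sketch of that step, not a confirmed description of Bloomberg's API: the endpoint and query string are the ones used in the other answer in this thread (it is not known whether they match the link above), the helper names `build_url` and `extract_field` are made up for illustration, and the sample payload is hypothetical apart from the `openPrice` key, which appears in that other answer.

```python
import json
from urllib.parse import quote

# Endpoint and query string taken from the other answer in this thread;
# they may have changed since and are assumptions here.
URL_ROOT = 'https://www.bloomberg.com/markets2/api/datastrip'
URL_PARAMS = 'locale=en&customTickerList=true'


def build_url(ticker: str) -> str:
    """Build the request URL for a ticker such as 'IBVC:IND' (hypothetical helper)."""
    # quote() percent-encodes the colon in the ticker
    return '%s/%s?%s' % (URL_ROOT, quote(ticker), URL_PARAMS)


def extract_field(raw_json: str, field: str):
    """Return `field` from the first record of a JSON array, or None if it is missing."""
    records = json.loads(raw_json)
    if not isinstance(records, list) or not records:
        return None
    first = records[0]
    return first.get(field) if isinstance(first, dict) else None


# Hypothetical sample payload, NOT a real Bloomberg response.
SAMPLE = '[{"id": "IBVC:IND", "openPrice": 50094.0}]'

print(build_url('IBVC:IND'))
print(extract_field(SAMPLE, 'openPrice'))
```

In a real script the string passed to `extract_field` would be the body of the HTTP response; returning None for an empty or unexpected payload avoids an exception when the site changes its format.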
Answered 2019-09-23 15:40:15
Because the required values are loaded dynamically, you can try using Selenium together with BeautifulSoup. Here is sample code for reference:
import time
import os
from selenium import webdriver
from bs4 import BeautifulSoup
# put the driver in the folder of this code
driver = webdriver.Chrome(os.getcwd() + '/chromedriver')
driver.get("https://www.bloomberg.com/quote/IBVC:IND")
time.sleep(3)
real_soup = BeautifulSoup(driver.page_source, 'html.parser')
open_ = real_soup.find("span", {"class": "priceText__1853e8a5"}).text
print(f"Price: {open_}")
time.sleep(3)
driver.quit()
Output:
Price: 50,083.00
You can search for chromedriver and download the one that matches your Chrome version.
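Note that the value scraped this way is a string such as "50,083.00". If you need a number, a small helper along these lines may do; `parse_price` is a hypothetical name, and the sketch assumes US-style formatting (comma as thousands separator, dot as decimal point).

```python
def parse_price(text: str) -> float:
    """Convert a price string like '50,083.00' to a float (hypothetical helper)."""
    # Assumes commas are thousands separators and the dot is the decimal point
    cleaned = text.strip().replace(',', '')
    return float(cleaned)


print(parse_price('50,083.00'))
```

The helper raises ValueError if the text is not numeric, which is usually preferable to silently returning a wrong value.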
https://stackoverflow.com/questions/58064494