How do I web scrape filtered results using Python requests?

Stack Overflow user
Asked on 2020-06-18 20:08:39
1 answer · 870 views · 0 following · 0 votes

I am trying to scrape the filtered results from this website: https://www.gurufocus.com/insider/summary. Right now I can only get the information from the first page. What I really want to do is filter by a few industries and get the related data (you can see "Industry" in the filter area). But when I select an industry, the site's URL does not change, so I can't scrape it directly from the URL. I have seen people say you can use requests.post to get the data, but I don't really understand how that works.

Here is some of my current code:

import requests
from bs4 import BeautifulSoup

TradeUrl = "https://www.gurufocus.com/insider/summary"
r = requests.get(TradeUrl)
data = r.content
soup = BeautifulSoup(data, 'html.parser')

# Collect the ticker symbols from the first page of the table
ticker = []
for tk in soup.find_all('td', {'class': 'table-stock-info', 'data-column': 'Ticker'}):
    ticker.append(tk.text)

What should I do if I only want the stocks in the Financial Services industry?


1 Answer

Stack Overflow user

Answer accepted

Posted on 2020-06-18 22:08:45

The problem with using a POST request as suggested is that the request needs an authorization token, and that token has an expiry time. You can see the POST request in Chrome or Firefox: right-click the page -> select Inspect -> select Network, then select an Industry, click the POST request, then click Cookies, and you will see the cookie password_grant_custom.client.expires, whose timestamp is when the authorization stops working.
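
For reference, the requests.post route would look roughly like the sketch below; the endpoint URL, payload shape, and Authorization header here are placeholders/assumptions that you would have to copy from the POST request shown in the Network tab, and they stop working once the token expires.

import requests

# Everything below is a placeholder: copy the real endpoint, payload and
# Authorization header from the POST request visible in the browser's
# Network tab. The token expires at the timestamp stored in the
# password_grant_custom.client.expires cookie, so this is short-lived.
url = "https://www.gurufocus.com/..."                 # real POST endpoint goes here
headers = {"Authorization": "Bearer <token-from-devtools>"}
payload = {"industry": "Financial Services"}          # hypothetical payload shape

r = requests.post(url, json=payload, headers=headers)
r.raise_for_status()
print(r.json())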

However, you can use Selenium to scrape the data from all of the pages instead.

First, install Selenium:

`sudo pip3 install selenium` on Linux or `pip install selenium` on Windows

Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads, find the one that matches your version of Chrome, and extract it from the zip file.

Note: on Windows you need to add the path to chromedriver in

driver = webdriver.Chrome(options=options)

On Linux, copy chromedriver to /usr/local/bin/chromedriver.
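
A minimal sketch of the two setups (the Windows path is a placeholder, and the executable_path argument assumes Selenium 3.x, which matches the find_element_by_* calls used in the script below):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")

# Windows: point Selenium at the extracted chromedriver.exe (placeholder path)
driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe", options=options)

# Linux: after copying chromedriver to /usr/local/bin it is on PATH,
# so the plain constructor is enough:
# driver = webdriver.Chrome(options=options)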

With the driver set up, here is the full script:
from selenium import webdriver
from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import time

# Start with the driver maximised to see the drop down menus properly
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=options)
driver.get('https://www.gurufocus.com/insider/summary')

# Set the page size to 100 to reduce page loads
driver.find_element_by_xpath("//span[contains(text(),'40 / Page')]").click()
wait = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//div[contains(text(),'100')]"))
)
element = driver.find_element_by_xpath("//div[contains(text(),'100')]").click()

# Wait for the page to load and don't overload the server
time.sleep(2)

# select Industry
driver.find_element_by_xpath("//span[contains(text(),'Industry')]").click()

# Select Financial Services
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//span[contains(text(),'Financial Services')]"))
)
element.click()

ticker = []

while True:
    # Wait for the page to load and don't overload the server
    time.sleep(6)
    # Parse the HTML
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for tk in soup.find_all('td', {'class': 'table-stock-info', 'data-column': 'Ticker'}):
        ticker.append(tk.text)
    try:
        # Move to the next page
        element = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'btn-next')))
        element.click()
    except TimeoutException as ex:
        # No more pages so break
        break
driver.quit()

print(len(ticker))
print(ticker)

Output:

4604
['PUB   ', 'ARES   ', 'EIM   ', 'CZNC   ', 'SSB   ', 'CNA   ', 'TURN   ', 'FNF   ', 'EGIF   ', 'NWPP  etc...

Update

If you want to scrape all the data from all of the pages and/or write it to a csv, use pandas:

from selenium import webdriver
from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
import pandas as pd
import time

# Start with the driver maximised to see the drop down menus properly
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=options)
driver.get('https://www.gurufocus.com/insider/summary')

# Set the page size to 100 to reduce page loads
driver.find_element_by_xpath("//span[contains(text(),'40 / Page')]").click()
wait = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//div[contains(text(),'100')]"))
)
driver.find_element_by_xpath("//div[contains(text(),'100')]").click()

# Wait for the page to load and don't overload the server
time.sleep(2)

# select Industry
driver.find_element_by_xpath("//span[contains(text(),'Industry')]").click()

# Select Financial Services
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//span[contains(text(),'Financial Services')]"))
)
element.click()


columns = [
    'Ticker', 'Links', 'Company', 'Price1', 'Insider Name', 'Insider Position',
    'Date', 'Buy/Sell', 'Insider Trading Shares', 'Shares Change', 'Price2',
    'Cost(000)', 'Final Share', 'Price Change Since Insider Trade (%)',
    'Dividend Yield %', 'PE Ratio', 'Market Cap ($M)', 'None'
]
df = pd.DataFrame(columns=columns)


while True:
    # Wait for the page to load and don't overload the server
    time.sleep(6)
    # Parse the HTML
    df = df.append(pd.read_html(driver.page_source, attrs={'class': 'data-table'})[0], ignore_index=True)

    try:
        # Move to the next page
        element = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'btn-next')))
        element.click()
    except TimeoutException as ex:
        # No more pages so break
        break
driver.quit()

# Write to csv
df.to_csv("Financial_Services.csv", encoding='utf-8', index=False)
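
Side note: DataFrame.append was removed in pandas 2.0, so on a recent pandas the accumulation line inside the loop above needs pd.concat instead. A minimal sketch of that one change, assuming the same data-table class:

import pandas as pd

# Replacement for the df = df.append(...) line inside the while loop
page_df = pd.read_html(driver.page_source, attrs={'class': 'data-table'})[0]
df = pd.concat([df, page_df], ignore_index=True)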

Updated in response to the comments: first download the Firefox driver (geckodriver) from https://github.com/mozilla/geckodriver/releases and extract it. Again, on Windows you need to add the driver path to driver = webdriver.Firefox(), or on Linux copy geckodriver to /usr/local/bin/geckodriver.
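
On Windows that line would look roughly like the one-line sketch below (placeholder path; executable_path again assumes Selenium 3.x):

from selenium import webdriver

# Windows only: placeholder path to the extracted geckodriver.exe
driver = webdriver.Firefox(executable_path=r"C:\path\to\geckodriver.exe")

The full Firefox version of the script: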

from selenium import webdriver
from selenium.webdriver.common.by import By
import selenium.webdriver.support.expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
import pandas as pd
import time

# Start with the driver maximised to see the drop down menus properly
driver = webdriver.Firefox()
driver.maximize_window()
driver.get('https://www.gurufocus.com/insider/summary')

# Set the page size to 100 to reduce page loads
driver.find_element_by_xpath("//span[contains(text(),'40 / Page')]").click()
wait = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//div[contains(text(),'100')]"))
)
driver.find_element_by_xpath("//div[contains(text(),'100')]").click()

# Wait for the page to load and don't overload the server
time.sleep(2)

# select Industry
driver.find_element_by_xpath("//span[contains(text(),'Industry')]").click()

# Select Financial Services
element = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((
        By.XPATH,
        "//span[contains(text(),'Financial Services')]"))
)
element.click()

columns = [
    'Ticker', 'Links', 'Company', 'Price1', 'Insider Name', 'Insider Position',
    'Date', 'Buy/Sell', 'Insider Trading Shares', 'Shares Change', 'Price2',
    'Cost(000)', 'Final Share', 'Price Change Since Insider Trade (%)',
    'Dividend Yield %', 'PE Ratio', 'Market Cap ($M)', 'None'
]
df = pd.DataFrame(columns=columns)
page_limit = 5
page = 0

while True:
    # Wait for the page to load and don't overload the server
    time.sleep(6)
    # Parse the HTML
    df = df.append(pd.read_html(driver.page_source, attrs={'class': 'data-table'})[0], ignore_index=True)

    # Stop after page limit is reached.
    page = page + 1
    if page >= page_limit:
        break

    try:
        # Move to the next page
        element = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CLASS_NAME, 'btn-next')))
        element.click()
    except TimeoutException as ex:
        # No more pages so break
        break

driver.quit()

# Write to csv
df.to_csv("Financial_Services.csv", encoding='utf-8', index=False)
Votes: 0
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/62458558
