When I open the URL I want to scrape, the browser shows all of the HTML content. But when I fetch the page's HTML with my scraper, it only returns part of it, and it doesn't even match what I see in the browser. The site does show a loading screen when it opens in my browser, but I'm not sure that's the problem. Maybe they block people from scraping it? The HTML I get back:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title></title>
<base href="/app"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="favicon.ico" rel="icon" type="image/x-icon"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="styles.css" rel="stylesheet"/></head>
<body class="cl">
<app-root>
<div class="loader-wrapper">
<div class="loader"></div>
</div>
</app-root>
<script src="runtime.js" type="text/javascript"></script><script src="polyfills.js" type="text/javascript"></script><script src="scripts.js" type="text/javascript"></script><script src="main.js" type="text/javascript"></script></body>
<script src="https://www.google.com/recaptcha/api.js"></script>
<noscript>
<meta content="0; URL=assets/javascript-warning.html" http-equiv="refresh"/>
</noscript>
</html>
The code I'm using:
from twill.commands import *
import time
import requests
from bs4 import BeautifulSoup

go('url')
time.sleep(4)
showforms()
try:
    fv("1", "username", "username")
    fv("1", "password", "*********")
    submit('0')
except Exception:
    pass
time.sleep(2.5)
url = "url_after_login"
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)
# name_box = soup.find('h1', attrs={'class': 'trend-and-value'})
Posted on 2020-10-25 16:57:50
It looks like the page content is generated dynamically by JavaScript. You can combine Selenium with BeautifulSoup to parse such pages. The advantage of Selenium is that it can reproduce user behaviour in the browser: clicking buttons or links, typing text into input fields, and so on.
Here is a short example:
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# maximum wait of 30 seconds
DELAY = 30
# target URL
url = '<<WEBSITE_URL>>'
# options for the Selenium driver
chrome_options = webdriver.ChromeOptions()
# this one makes the browser "invisible";
# comment it out to watch all actions performed by Selenium
chrome_options.add_argument('--headless')
# create the Selenium web driver
driver = webdriver.Chrome("<PATH_TO_CHROME_DRIVER>", options=chrome_options)
# open the web page
driver.get(url)
# wait up to 30 seconds for the h1 element to appear
h1_element = WebDriverWait(driver, DELAY).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h1.trend-and-value'))
)
# parse the rendered page content with bs4
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
# close the browser when done
driver.quit()
An alternative solution is to analyze how the JavaScript-rendered page gets its data. Typically, such pages retrieve their data in JSON format from backend endpoints, which your scraper can call directly as well.
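For instance, if the browser's developer tools (Network tab) show the app loading its data from a JSON endpoint, that endpoint can be called directly. This is only a sketch: the helper below is hypothetical, and the payload shape shown is made up for illustration:

```python
import json
import requests

def fetch_trend_data(session, api_url):
    """Call a backend JSON endpoint directly (e.g. with a requests.Session
    that already holds the login cookies) and return the parsed payload."""
    response = session.get(api_url, headers={"Accept": "application/json"})
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return response.json()      # parsed Python dict/list instead of HTML

# Hypothetical example of what such an endpoint might return:
sample_payload = json.loads('{"trend": "up", "value": 42}')
print(sample_payload["value"])  # -> 42
```

Using a `requests.Session` (rather than bare `requests.get`) matters here: it keeps the cookies set during login, which the endpoint will usually require.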
https://stackoverflow.com/questions/64522671