I wrote a Python script that uses Selenium to grab the business summary (inside a p tag) under the Company profile heading at the bottom-right of a web page. The page is highly dynamic, so I went with a browser simulator. I created a CSS selector that parses the summary fine when I copy the HTML elements straight from the page and try it locally. For some reason, the same selector does not work in the script below: it throws a TimeoutException. How can I get the summary?
This is my attempt:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
link = "https://in.finance.yahoo.com/quote/AAPL?p=AAPL"
def get_information(driver, url):
    driver.get(url)
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    driver.execute_script("arguments[0].scrollIntoView();", item)
    print(item.text)

if __name__ == "__main__":
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 20)
    try:
        get_information(driver, link)
    finally:
        driver.quit()
Posted on 2018-07-08 23:26:10
It seems the business summary block is not present initially; it is only generated after you scroll down the page. Try the following solution:
from selenium.webdriver.common.keys import Keys

def get_information(driver, url):
    driver.get(url)
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)  # jump to the bottom of the page
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    print(item.text)
Posted on 2018-07-08 23:38:50
You have to scroll the page down twice before the element appears:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time

link = "https://in.finance.yahoo.com/quote/AAPL?p=AAPL"

def get_information(driver, url):
    driver.get(url)
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)  # scroll page
    time.sleep(1)  # small pause between
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)  # one more time
    item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "[id$='-QuoteModule'] p[class^='businessSummary']")))
    driver.execute_script("arguments[0].scrollIntoView();", item)
    print(item.text)

if __name__ == "__main__":
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 20)
    try:
        get_information(driver, link)
    finally:
        driver.quit()
If you scroll only once it will not work properly, for some reason (at least for me). I think it depends on the window size: a small window scrolls farther than a large one.
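An alternative that does not depend on the window size is to keep scrolling until document.body.scrollHeight stops growing. A minimal sketch; the helper name scroll_to_bottom is mine, not from the answers above, and driver is any Selenium WebDriver:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll until the document height stops growing, so the result
    does not depend on how tall the browser window is."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded; we are at the real bottom
        last_height = new_height
```

You would call scroll_to_bottom(driver) right after driver.get(url), then do the wait.until(...) lookup as before.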
Posted on 2018-07-09 00:29:39
Here is a much simpler approach using requests and the JSON data already embedded in the page. If at all possible, I would also recommend always using requests: it may take a bit of extra work, but the end result is more reliable and cleaner. You could also take my example further and parse the JSON so you can work with it directly (you would need to clean the text to make it valid JSON). In my example I just used split, which is quicker to write, but it can cause problems when you do anything more complex.
import requests
from lxml import html
url = 'https://in.finance.yahoo.com/quote/AAPL?p=AAPL'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(url, headers=headers)
tree = html.fromstring(r.text)
data = [e.text_content() for e in tree.iter('script') if 'root.App.main = ' in e.text_content()][0]
data = data.split('longBusinessSummary":"')[1]
data = data.split('","city')[0]
print(data)
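To parse the JSON properly instead of relying on split, you can cut the root.App.main assignment out of the script text and feed it to json.loads. A sketch on a synthetic script body; the key path context → dispatcher → stores → QuoteSummaryStore → assetProfile → longBusinessSummary is an assumption about the page's structure at the time, so verify it against the live response:

```python
import json

# Synthetic stand-in for the <script> body captured from the page.
script_text = (
    'root.App.main = {"context": {"dispatcher": {"stores": '
    '{"QuoteSummaryStore": {"assetProfile": {"longBusinessSummary": '
    '"Apple Inc. designs smartphones.", "city": "Cupertino"}}}}}};'
)

# Take everything after the assignment and drop the trailing semicolon,
# leaving a valid JSON object.
blob = script_text.split('root.App.main =', 1)[1].strip().rstrip(';')
data = json.loads(blob)

# Assumed key path into the store (check it against the real payload).
summary = (data['context']['dispatcher']['stores']
               ['QuoteSummaryStore']['assetProfile']['longBusinessSummary'])
print(summary)  # → Apple Inc. designs smartphones.
```

With json.loads you get a normal dict, so a change in neighbouring keys (like "city") no longer breaks the extraction the way a hard-coded split marker would.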
https://stackoverflow.com/questions/51233325