首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >使用javascript使用python从网页中抓取数据

使用javascript使用python从网页中抓取数据
EN

Stack Overflow用户
提问于 2019-06-25 07:24:48
回答 4查看 5.4K关注 0票数 0

我在试着把网页的标题刮掉。最初,我尝试使用BeautifulSoup,但发现页面本身没有Javascript就无法加载。所以我使用了一些我在Google上找到的代码,这些代码使用了request-html库:

代码语言:javascript
运行
复制
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render()
soup = BeautifulSoup(resp.html.html, "lxml")

soup.find_all('h1')

但总会有这样的错误:

代码语言:javascript
运行
复制
D:\Python\TitleSraping\venv\Scripts\python.exe "D:/Python/TitleSraping/venv/Text Scraping.py"
Traceback (most recent call last):
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 106, in evaluateHandle
    'userGesture': True,
pyppeteer.errors.NetworkError: Protocol error (Runtime.callFunctionOn): Cannot find context with specified id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Python/TitleSraping/venv/Text Scraping.py", line 5, in <module>
    resp.html.render()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "D:\Program Files (x86)\Python\lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "D:\Python\TitleSraping\venv\lib\site-packages\requests_html.py", line 531, in _async_render
    content = await page.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\page.py", line 780, in content
    return await frame.content()
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 379, in content
    '''.strip())
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\frame_manager.py", line 295, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 55, in evaluate
    pageFunction, *args, force_expr=force_expr)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 109, in evaluateHandle
    _rewriteError(e)
  File "D:\Python\TitleSraping\venv\lib\site-packages\pyppeteer\execution_context.py", line 238, in _rewriteError
    raise type(error)(msg)
pyppeteer.errors.NetworkError: Execution context was destroyed, most likely because of a navigation.

Process finished with exit code 1

有人知道这是什么意思吗?我是个新手,所以如果我用错了术语,我向你道歉。

EN

回答 4

Stack Overflow用户

发布于 2019-06-25 07:47:58

正如伊万所说,这里有完整的代码: sleep=1,keep_page=True make the trick

代码语言:javascript
运行
复制
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
resp = session.get("https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210001601")
resp.html.render(sleep=1, keep_page=True)
soup = BeautifulSoup(resp.html.html, "lxml")
print(soup.find_all('title'))

响应:

代码语言:javascript
运行
复制
[<title>
    Milled wheat and wheat flour produced</title>]
票数 1
EN

Stack Overflow用户

发布于 2019-06-25 07:39:21

看起来像是底层库puppeteer中的一个错误,是由处理一些javascript引起的。这里有一个来自https://github.com/kennethreitz/requests-html/issues/251的变通方法,也许它会有所帮助。

resp.html.render(sleep=1, keep_page=True)

票数 0
EN

Stack Overflow用户

发布于 2019-06-25 07:45:56

您需要加载JS,因为如果不加载它,HTML代码就不会加载。您可以使用Selenium

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/56745062

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档