Q: Basic Python web scraping script

Stack Overflow user

Asked 2018-04-24 00:41:48

Answers: 1 · Views: 176 · Followers: 0 · Votes: 0

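This is my first attempt at web scraping with Python.

I have an IP camera that exposes all of its saved files in an HTML document over HTTP. Essentially, the camera is its own server and can be reached over HTTP. The HTML it serves is very basic: a single body tag that contains all of the clips. The files look like this: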

MP_2018-04-23_11-14-04_60.mov

I want to list/print these files without all of the other HTML that comes with them.
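
Because the clip names follow a regular shape, one option is to pull them straight out of the page text with a regular expression, regardless of the surrounding tags. The following is only a sketch: the pattern is inferred from the single example name above, reading the two middle fields as a recording date and time is an assumption, and the `sample` markup is hypothetical rather than the camera's real output.

```python
import re
from datetime import datetime

# Assumed pattern, inferred from the one example name MP_2018-04-23_11-14-04_60.mov:
# "MP_" + date + "_" + time + "_" + a number + ".mov".
CLIP_NAME = re.compile(r"MP_(\d{4}-\d{2}-\d{2})_(\d{2}-\d{2}-\d{2})_(\d+)\.mov")


def find_clips(page_text):
    """Return (file name, timestamp) pairs for every clip name found in page_text."""
    clips = []
    for match in CLIP_NAME.finditer(page_text):
        date_part, time_part, _ = match.groups()
        # Assumption: the two middle fields encode when the clip was recorded.
        stamp = datetime.strptime(date_part + " " + time_part, "%Y-%m-%d %H-%M-%S")
        clips.append((match.group(0), stamp))
    return clips


# Hypothetical markup; the real camera page may look different.
sample = (
    "<html><body><b>MP_2018-04-23_11-14-04_60.mov</b><br>"
    "<b>MP_2018-04-23_11-20-10_60.mov</b></body></html>"
)
for name, stamp in find_clips(sample):
    print(name, stamp.isoformat())
```

With the real camera, `page_text` would be the decoded response body from `http://192.168.1.99/form/getStorageFileList`.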

import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://192.168.1.99/form/getStorageFileList').read()
soup = bs.BeautifulSoup(sauce,'lxml')
body = soup.body
for paragraph in body.find_all('b'):
    print(body.text)

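I have included some screenshots below because the error I get is very long. Essentially, I get:

AttributeError: module 'html5lib.treebuilders' has no attribute '_base'

Could someone clarify this for me and possibly point me in the right direction?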

usr/lib/python3/dist-packages/bs4/builder/_html5lib.py in <module>()
     68 
     69 
---> 70 class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
     71 
     72     def __init__(self, soup, namespaceHTMLElements):

AttributeError: module 'html5lib.treebuilders' has no attribute '_base'
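
The traceback above is raised inside `bs4/builder/_html5lib.py`, that is, while `bs4` itself is being imported and before any line of the script runs. That usually points to a mismatch between the installed `beautifulsoup4` and `html5lib` versions rather than to the scraping code, and upgrading `beautifulsoup4` is a commonly reported remedy. A minimal sketch for checking which versions are installed without importing the packages, assuming Python 3.8 or newer (the asker's interpreter may be older); `installed_version` is a hypothetical helper name:

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on PyPI.
for package in ("beautifulsoup4", "html5lib", "lxml"):
    print(package, installed_version(package))
```

Reading the metadata this way avoids `import bs4`, which is the statement that fails in the traceback.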

[Screenshots: CameraHTML, Jupyterscript, JupyterscriptOutput]

Tags: python, html, web-scraping, html-parsing, jupyter-notebook

1 Answer

Stack Overflow user

Answered 2018-08-09 02:52:23

There are a few errors in your script, but nothing major. You may also get more out of the Requests library. How about something like this?

from bs4 import BeautifulSoup as bs
import requests

sauce = requests.get('http://192.168.1.99/form/getStorageFileList')
page = sauce.text  #Converted page to text
soup = bs(page,'html.parser')  #Changed to 'html.parser'
body = soup.body  #Use the <body> tag directly; soup.body('body') would return a ResultSet, which has no find_all
for paragraph in body.find_all('b'):
    print(paragraph.text)  #Grabbed the iterated items & converted them to text
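
If `bs4` cannot be imported at all, as in the traceback in the question, the same extraction can be sketched with only the standard library. This is an illustration rather than a drop-in replacement: it assumes, as both scripts do, that each file name sits inside a `<b>` element, it does not handle nested `<b>` tags, and `BoldTextCollector` and `bold_texts` are hypothetical names.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Collect the text found inside each <b>...</b> element (no nesting support)."""

    def __init__(self):
        super().__init__()
        self._in_bold = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <B> and <b> both arrive as "b".
        if tag == "b":
            self._in_bold = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current item.
        if self._in_bold:
            self.items[-1] += data


def bold_texts(html):
    """Return the stripped, non-empty text of every <b> element in html."""
    collector = BoldTextCollector()
    collector.feed(html)
    collector.close()
    return [item.strip() for item in collector.items if item.strip()]


# Hypothetical markup standing in for the camera's page.
sample = "<html><body><b>MP_2018-04-23_11-14-04_60.mov</b><br><p>other</p></body></html>"
for name in bold_texts(sample):
    print(name)
```

For the real page, `html` would be the text returned by `requests.get(...)` or `urllib.request.urlopen(...).read().decode()`.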

Let me know if this is what you were looking for.

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/49986038
