I am using the code below to try web scraping.
import sys , os
import requests, webbrowser,bs4
from PIL import Image
import pyautogui
p = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
n = open("exml.txt" , 'wb')
for i in p.iter_content(1000) :
    n.write(i)
n.close()
n = open("exml.txt" , 'r')
soupy= bs4.BeautifulSoup(n,"html.parser")
elems = soupy.select('img[src]')
for u in elems :
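    print (u)
What I intend to do is extract all the image links from the xml response obtained from the page. (Please correct me if I am wrong in thinking that requests.get returns the whole static html file of the webpage that opens when the URL is entered.)
However, at this line: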
soupy= bs4.BeautifulSoup(n,"html.parser")
I get the following error:
Traceback (most recent call last):
  File "../../perl/webscratcher.txt", line 24, in <module>
    soupy= bs4.BeautifulSoup(n,"html.parser")
  File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
    markup = markup.read()
  File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 24662: character maps to <undefined>
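I have no idea what this error means, and the "Appdata" folder is empty.
How do I proceed from here?
After trying the suggestions:
I changed the file's extension to py and that error went away. However, at the following line: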
soupy= bs4.BeautifulSoup(n,"lxml")
I get the following error:
Traceback (most recent call last):
  File "C:\perl\webscratcher.py", line 23, in <module>
    soupy= bs4.BeautifulSoup(p,"lxml")
  File "C:\Users\PREMRAJ\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 192, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()
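How do I solve this?
Answered 2018-06-05 07:19:11
You are overcomplicating things. Pass the bytes content of the Response object directly to the BeautifulSoup constructor instead of writing it to a file.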
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
soup = BeautifulSoup(response.content, 'lxml')
for element in soup.select('img[src]'):
    print(element)
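As for why the file round-trip failed in the first place: the traceback goes through cp1252.py, which suggests the file was written as raw bytes but read back in text mode with the Windows default encoding. Below is a minimal sketch of that failure, using only the standard library. The HTML fragment is made up for illustration, and it assumes the page bytes are UTF-8, which the question does not confirm.

```python
import os
import tempfile

# Hypothetical page fragment; U+00C1 encodes to the bytes C3 81 in UTF-8.
html_bytes = '<img src="a.png" alt="\u00c1">'.encode("utf-8")

# Write the raw bytes to a file, as the question's code does.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(html_bytes)
    path = f.name

try:
    # Reading back as cp1252 text fails, because byte 0x81 is undefined there.
    try:
        with open(path, "r", encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Reading back with the encoding the bytes were written in works.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print(cp1252_failed)
print(text)
```

Handing the bytes straight to BeautifulSoup sidesteps the problem, since it then works out the encoding itself instead of relying on the platform default.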
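Answered 2018-06-05 07:18:46
Okay, you may want to review how to work with BeautifulSoup. I referred to an old project of mine, and this is all that is needed to print them. Check the BS docs to find the exact syntax you need for the select method.
This will print all the img tags in the html.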
import requests, bs4
site = 'http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1'
p = requests.get(site).text
soupy = bs4.BeautifulSoup(p,"html.parser")
elems = soupy.select('img[src]')
for u in elems :
    print (u)
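Since the question asks for the image links rather than the whole tags, one possible extension is to read each tag's src attribute and resolve it against the page URL. This is only a sketch: the helper name image_urls, the sample HTML, and the example.com URLs are made up for illustration, and it assumes bs4 is installed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def image_urls(html, base_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # 'img[src]' skips <img> tags without a src attribute.
    return [urljoin(base_url, img["src"]) for img in soup.select("img[src]")]


# Hypothetical markup standing in for response.content.
sample_html = (
    b'<p><img src="/img/a.png"><img alt="no source">'
    b'<img src="https://cdn.example.com/b.jpg"></p>'
)
urls = image_urls(sample_html, "http://www.example.com/news/")
print(urls)
```

With a real page, the same helper could be called with response.content and the URL that was requested.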
https://stackoverflow.com/questions/50689857