Url = https://letterboxd.com/film/dogville/
I want to get the film's name and release year with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://letterboxd.com/film/dogville/'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
soup.find_all("script")[10]
Output:
<script>
var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };
</script>
I managed to get the block, but I don't know how to get name and releaseYear out of it. How can I extract them?
Posted on 2021-09-22 13:48:13
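The problem is that bs4 is not a JavaScript parser. You have reached its limits and need something else (a JavaScript parser). A weaker solution could use the json module from the standard library to convert the dictionary-like string into a Python dictionary.
Once you have the string containing the JS code, either use a regex to extract the dictionary, or process it with plain string operations or some other way.
Here is the other direction, using plain string operations: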
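The regex route mentioned above could look roughly like the following. This is only a minimal sketch: it assumes the script text has the same shape as the sample in the question (unquoted keys, string values in double quotes), and the helper name `extract_field` is made up for illustration.

```python
import re

# Sample script text copied from the question; in practice this would be
# the text of the <script> tag found with BeautifulSoup.
script_text = (
    'var filmData = { id: 51565, name: "Dogville", gwiId: 39220, '
    'releaseYear: "2003", posterURL: "/film/dogville/image-150/", '
    'path: "/film/dogville/" };'
)


def extract_field(text, key):
    """Return the double-quoted value that follows `key:` in text, or None.

    Hypothetical helper: it only handles values written as "..." and
    does not understand escaped quotes inside the value.
    """
    pattern = r'\b' + re.escape(key) + r'\s*:\s*"([^"]*)"'
    match = re.search(pattern, text)
    return match.group(1) if match else None


name = extract_field(script_text, "name")
release_year = extract_field(script_text, "releaseYear")
print(name, release_year)
```

This avoids eval entirely, at the cost of only reading the fields you ask for by name.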
...
script_text = str(soup.find('script', string=True).string) # or however you select the right <script> tag
# sample text from the question, used here in place of the scraped string
script_text = ' var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };'
script_text = script_text.strip()[:-1]
# substring starting after the 1st {
script_text = script_text[script_text.find('{')+1:]
script_text=script_text.replace(':', '=')
# find index of the closing }
i_close = len(script_text) - script_text[::-1].find('}')
# wrap the key=value pairs in a dict(...) call
script_text_d = 'dict(' + script_text[:i_close-1] + ')'
# evaluate the string (eval is unsafe on untrusted input)
script_text_d = eval(script_text_d)
print(script_text_d)
print(script_text_d['name'])
Output:
{'id': 51565, 'name': 'Dogville', 'gwiId': 39220, 'releaseYear': '2003', 'posterURL': '/film/dogville/image-150/', 'path': '/film/dogville/'}
Dogville
Notes:
With json.loads, I think you can keep the text in {} form, but you need double quotes around every key-like string.
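The json.loads idea from the note can be sketched as follows. It is an assumption-laden sketch, not a general JavaScript parser: it assumes the object is flat, that keys are bare identifiers, and that no string value contains text that looks like `, key:`. The variable names are chosen only for illustration.

```python
import json
import re

# Sample script text copied from the question.
script_text = (
    'var filmData = { id: 51565, name: "Dogville", gwiId: 39220, '
    'releaseYear: "2003", posterURL: "/film/dogville/image-150/", '
    'path: "/film/dogville/" };'
)

# Cut out everything from the first "{" to the last "}".
object_text = re.search(r'\{.*\}', script_text, re.DOTALL).group(0)

# Put double quotes around each bare key that follows "{" or ",",
# which turns the JS object literal into valid JSON for this sample.
json_text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', object_text)

film_data = json.loads(json_text)
print(film_data["name"], film_data["releaseYear"])
```

Unlike the eval approach, json.loads never executes the scraped text, so a malformed or unexpected script raises an error instead of running code.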
https://stackoverflow.com/questions/69284422