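I am trying to extract the text of the b tag inside a particular class (which has several instances) into an array. I am doing this with BeautifulSoup 4 and Python 3.
I am trying to scrape this page. This is what my code currently looks like.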
def cattest():
    subcat = soup.find_all('span', {"class": "zg_hrsr_ladder"})[x].findChildren()
    for i, child in enumerate(subcat):
        categories = child.text
        print(categories)

for x in range(0, len(cat)):
    cattest()
This produces the following output:
Beauty & Personal Care
Hair Care
Hair Care Products
Conditioners
Conditioners
Beauty & Personal Care
Personal Care
Personal Care
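What I want is to take the text from the b tag of each zg_hrsr_ladder element and put it into an array. The expected result would then be:
[Conditioners, Personal Care]
Any help on how to achieve this would be much appreciated.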
Posted on 2018-08-02 08:41:42
You can use a list comprehension and pass 'b' as the argument to findChildren, so that only the b tags inside each span are returned:
In [59]: [element.text for s in soup.find_all('span', {"class": "zg_hrsr_ladder"}) for element in s.findChildren('b')]
Out[59]: ['Conditioners', 'Personal Care']
This is equivalent to:
In [63]: res = []

In [64]: for s in soup.find_all('span', {"class": "zg_hrsr_ladder"}):
    ...:     for element in s.findChildren('b'):
    ...:         res.append(element.text)
    ...:

In [65]: res
Out[65]: ['Conditioners', 'Personal Care']
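For anyone who wants to try the list comprehension above without hitting the live site, here is a minimal offline sketch. The markup in SAMPLE_HTML is an assumption reconstructed from the output shown in the question (the real page may be structured differently), and bold_texts is a hypothetical helper name. It uses find_all, the current spelling of findChildren, and the built-in "html.parser" so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the output in the question;
# the real page's HTML may differ.
SAMPLE_HTML = """
<ul>
  <li><span class="zg_hrsr_ladder">in <a>Beauty &amp; Personal Care</a> &gt;
      <a>Hair Care</a> &gt; <a>Hair Care Products</a> &gt;
      <b><a>Conditioners</a></b></span></li>
  <li><span class="zg_hrsr_ladder">in <a>Beauty &amp; Personal Care</a> &gt;
      <b><a>Personal Care</a></b></span></li>
</ul>
"""


def bold_texts(soup):
    """Collect the text of every <b> found inside a zg_hrsr_ladder span."""
    return [
        b.get_text(strip=True)
        for span in soup.find_all("span", class_="zg_hrsr_ladder")
        for b in span.find_all("b")  # only <b> descendants, not the <a> links
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
categories = bold_texts(soup)
print(categories)  # ['Conditioners', 'Personal Care']
```

Restricting the inner search to 'b' is what removes the duplicates seen in the question: a bare findChildren() returns every descendant, so both the b tag and the a tag nested inside it contribute the same text.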
Posted on 2018-08-02 11:42:02
There are many ways to do this. Here are two of them; pick either one:
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.ca/Abba-Moisture-Conditioner-Unisex-33-8-Ounce/dp/B000VZS3VW/ref=sr_1_1/145-7226897-1893421?ie=UTF8&qid=1532712550&sr=8-1&keywords=B000VZS3VW"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
#using .find_next()
subcat = [item.find_next("b").text for item in soup.find_all('span', class_='zg_hrsr_ladder')]
print(subcat)
#using selector
subcat = [item.text for item in soup.select('span.zg_hrsr_ladder > b')]
print(subcat)
Both produce the same result:
['Conditioners', 'Personal Care']
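The two approaches agree on this page, but they are not interchangeable in general. find_next("b") looks for the next b anywhere later in the document, not only inside the span, while the selector 'span.zg_hrsr_ladder > b' matches only b tags that are direct children of the span. The sketch below illustrates the difference on made-up markup (HTML here is a hypothetical sample, not the real page), where the second span contains no b tag at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second span has no <b>, and an unrelated <b>
# appears later in the document.
HTML = """
<div>
  <span class="zg_hrsr_ladder">in <a>Hair Care</a> &gt; <b><a>Conditioners</a></b></span>
  <span class="zg_hrsr_ladder">in <a>Personal Care</a></span>
  <p><b>Unrelated bold text</b></p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
spans = soup.find_all("span", class_="zg_hrsr_ladder")

# find_next() walks forward through the whole document, so the second span
# picks up the <b> that sits outside of it.
via_find_next = [span.find_next("b").text for span in spans]

# The child combinator only matches a <b> that is a direct child of the span.
via_selector = [b.text for b in soup.select("span.zg_hrsr_ladder > b")]

print(via_find_next)  # ['Conditioners', 'Unrelated bold text']
print(via_selector)   # ['Conditioners']
```

If every span is known to contain a b tag, either version works; otherwise the selector is probably the safer choice, since it never reaches outside the span.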
https://stackoverflow.com/questions/51643947