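The website I'm trying to scrape: "https://www.moglix.com/automotive/car-accessories/216110000?page=101". Note: 101 is the page number, and this site has 783 pages.
I wrote this code, using BeautifulSoup, to get the URLs of all the products listed on the pages: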
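import requests
from bs4 import BeautifulSoup

prod_url = []
for i in range(1, 400):
    r = requests.get(f'https://www.moglix.com/automotive/car-accessories/216110000?page={i}')
    soup = BeautifulSoup(r.content, 'lxml')
    # Collect the href of every <a> tag that carries this class
    for link in soup.find_all('a', {"class": "ng-tns-c100-0"}):
        prod_url.append(link.get('href'))

Each page has 40 products, so this should give me roughly 16,000 product URLs, but I only get about 7,600.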
After inspecting the pages, I can see that the class of the a tag changes from page to page. For example:



How can I get the href of every product on all pages?
Posted on 2022-09-16 10:31:46
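You can use the find_all method with the attrs shown below to get all the a tags, and then filter them further with split and startswith to keep only the product link URLs. Matching on the target attribute avoids depending on the class name, which changes between pages.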
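# i is the page number, e.g. the loop variable from the question's code
res = requests.get(f"https://www.moglix.com/automotive/car-accessories/216110000?page={i}")
soup = BeautifulSoup(res.text, "html.parser")
x = soup.find_all("a", attrs={"target": "_blank"})
# Keep only site-relative links that have more than one path segment
lst = [a['href'] for a in x if (len(a['href'].split("/")) > 2 and a['href'].startswith("/"))]

Output: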
['/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56',
'/allextreme-exh4hl2-2-pcs-36w-9000lm-h4-led-headlight-bulb-conversion-kit/mp/msnekpqpm0zw52',
'/love4ride-2-pcs-35-inch-fog-angel-eye-drl-led-light-set-for-car/mp/msne5n8l6q1ykl',..........]
https://stackoverflow.com/questions/73742582
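
To run the filter above over every page, something like the following sketch might help. It keeps the same split/startswith test as the answer, but the helper names (is_product_link, collect_product_links, hrefs_from_moglix) are made up for illustration. Skipping a tags without an href, dropping duplicate links, and the 30-second timeout are my own assumptions, not something the answer specifies.

```python
from typing import Callable, Iterable, List, Optional


def is_product_link(href: Optional[str]) -> bool:
    """Same test as the list comprehension above, but tolerant of a missing href."""
    if not href:
        return False
    return href.startswith("/") and len(href.split("/")) > 2


def collect_product_links(
    pages: Iterable[int],
    hrefs_for_page: Callable[[int], Iterable[Optional[str]]],
) -> List[str]:
    """Gather product links across pages, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for page in pages:
        for href in hrefs_for_page(page):
            if is_product_link(href) and href not in seen:
                seen.add(href)
                result.append(href)
    return result


def hrefs_from_moglix(page: int) -> List[Optional[str]]:
    """Hypothetical fetcher: download one listing page and return its raw hrefs."""
    # Imported here so the two helpers above can be used without these packages.
    import requests
    from bs4 import BeautifulSoup

    url = f"https://www.moglix.com/automotive/car-accessories/216110000?page={page}"
    res = requests.get(url, timeout=30)
    soup = BeautifulSoup(res.text, "html.parser")
    # .get("href") returns None instead of raising when the attribute is absent
    return [a.get("href") for a in soup.find_all("a", attrs={"target": "_blank"})]


# Example call for the 783 pages mentioned in the question:
# links = collect_product_links(range(1, 784), hrefs_from_moglix)
```

Passing the fetcher in as an argument means you can try the filtering on a few saved pages first, before sending hundreds of requests to the site.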