I'm trying to scrape this website, but I don't get the HTML that I see in "Inspect Element". I think the HTML content is hidden or something:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://groceries.asda.com/aisle/price-match/view-all-price-match/view-all-price-match/1215686354045-1215686354052-1215686354053")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
print(soup)
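This is what I see in Inspect Element, and what I want:
But when I print the soup, I get something else (please try running the code yourself, because the output is too long to paste here).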
Posted on 2022-07-27 22:34:05
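The page is loaded dynamically by JavaScript, so the HTML that requests downloads and bs4 parses does not contain the content you see in the browser. If your end goal is to scrape the data, you can call the API that the page itself uses. This is robust, and it is also the simplest way to get the data using only the requests module.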
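One quick way to confirm that a page is rendered client-side is to count the elements you care about in the raw HTML and compare that with what the browser shows. The sketch below uses only the standard library. The two HTML strings are hypothetical stand-ins (not real responses from the site), and `co-product__price` is a class name taken from the rendered markup shown later in this answer.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in html carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-in for what requests.get(...).text might return:
# an empty application shell plus a script tag.
static_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

# Hypothetical stand-in for the DOM after JavaScript has run.
rendered_html = (
    '<div id="root">'
    '<strong class="co-product__price">1.99</strong>'
    '<strong class="co-product__price">2.50</strong>'
    '</div>'
)

print(count_class(static_html, "co-product__price"))    # 0
print(count_class(rendered_html, "co-product__price"))  # 2
```

If the count for the downloaded HTML is zero while the browser clearly shows the elements, the content is most likely injected by JavaScript, and an API call or a real browser is needed.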
Example using the API:
import requests
api_url = "https://groceries.asda.com/api/bff/graphql"
payload= {"requestorigin":"gi","contract":"web/cms/get-items","variables":{"user_segments":["1259","1194","1140","1141","1182","1130","1128","1124","1126","1119","1123","1117","1112","1116","1109","1111","1102","1110","1097","1105","1100","1107","1098","1038","1087","1099","1070","1082","1067","1047","1059","1057","1055","1053","1043","1041","1042","1027","1023","1024","1020","1019","1007","1242","1241","1262","1239","1256","1245","1237","1263","1264","1233","1249","1260","1247","1238","1236","1227","1208","1220","1210","1172","1178","1222","1231","1217","1179","1225","1207","1167","1221","1219","1160","1180","1152","1213","1206","1176","1224","1165","1159","1209","1169","1144","1214","1177","1216","1196","1173","1186","1147","1183","1204","1174","1191","1201","1202","1190","1157","1198","1189","1166","1197","1150","1170","1184","1271","1278","1279","1269","1283","1284","1285","rmp_enabled_user","dp-False","wapp","store_4565","vp_M","anonymous","clothing_store_enabled","checkoutOptimization","NAV_UI","T003","T014"],"store_id":"4565","page":2,"page_size":60,"request_origin":"gi","type":"content","ship_date":1658880000000,"payload":{"cacheable":True,"hierarchy_id":"1215686354045-1215686354052-1215686354053","filter_query":[]}}}
headers={
'content-type': 'application/json',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
'request-origin': 'gi'
}
data = requests.post(api_url,headers=headers,json=payload).json()
# Each entry wraps the product details under the 'item' key
for item in data['data']['tempo_items']['products']['items']:
    print(item['item']['name'])
Output:
Fixodent Complete Denture Adhesive Original
Surf Tropical Lily Concentrated Liquid Laundry Detergent 24 Washes
Always Maxi Profresh Night Sanitary Towels Without Wings
Pantene 3 Minute Miracle Repair&Protect Hair Conditioner
Garnier Ultimate Blends Coconut Oil Frizzy Hair Shampoo
Pedigree Schmackos Strips Adult Dog Treats Fish Mix
TRESemme Replenish & Cleanse Conditioner
Herbal Essences Hello Hydration Shampoo For Dry Hair
Blistex Relief Cream
Garnier Skin Active Micellar Cleansing Water Sensitive Skin
TRESemme Rich Moisture Conditioner
Lemsip Max Day & Night Cold & Flu Relief Capsules
Lenor In-Wash Scent Booster Spring Awakening
Sudafed Congestion Headache Relief Day & Night Capsules
Halls Mentholyptus Extra Strong Lozenges 10 pack
Panadol Advance Paracetamol Tablets x16
Always Dailies Extra Protect Large Panty Liners
Simple Kind To Skin Purifying Cleansing Lotion
Nivea Gentle Exfoliating Face Scrub
Simple Kind to Skin Refreshing Facial Wash Gel
Pantene 3 Minute Miracle Smooth&Sleek Hair Conditioner
Olbas Oil Inhalant Decongestant
Johnson's Bedtime Shampoo
Huggies DryNites Pyjama Pants Girl 8-15 Years
Garnier Belle Color 6 Natural Light Brown Permanent Hair Dye
Westlab Pure Mineral Bathing Epsom Salt
Herbal Essences Ignite My Colour Hair Conditioner For Coloured Hair
Poligrip Denture Adhesive Ultra Fixative Cream
Garnier Ultimate Blends Argan Oil & Almond Cream Dry Hair Conditioner
Halls Original Sugar Free Lozenges 10 pack
Huggies DryNites Pyjama Pants Boy 8-15 Years
Westlab Sleep Epsom & Dead Sea Salts with Lavender & Jasmine
Herbal Essences Ignite My Colour Shampoo For Coloured Hair
Westlab Mindful Epsom & Himalayan Salts with Frankincense & Bergamot
Jolen Creme Bleach
Garnier Belle Color 7.1 Natural Dark Ash Blonde Permanent Hair Dye
Herbal Essences Dazzling Shine Hair Conditioner For All Hair Type
Dettol Antibacterial Disinfectant Multi Surface Spray Lemon & Lime
Lemsip Cold & Flu Lemon Flavour Sachets
Toplife Puppy Formula Milk
Westlab Pure Mineral Bathing Dead Sea Salt
Misfits Nasher Sticks Adult Medium Dog Treats with Chicken and Beef
Dove Deeply Nourishing Body Wash
Dreamies Cat Treat Biscuits with Chicken Mega Pack
Deep Freeze Cold Spray
Tena Lady Discreet Mini Pads
Pantene Pro-V Smooth & Sleek 3in1 Shampoo
Garnier Nutrisse 4.3 Dark Golden Brown Permanent Hair Dye
Fixodent Plus Dual Power Denture Adhesive
Beechams All In One Oral Solution 8 Doses 160ML
Panadol Extra Advance 500mg/65mg Tablets x14
Duck Fresh Brush Toilet Cleaning System Holder
Oral-B Allrounder Black Manual Toothbrush x 3
Dove Indulging Cream Bath Soak
Garnier Ultimate Blends Honey Treasures Strengthening Conditioner
Sudafed Sinus Max Strength Capsules
Johnson's Baby Shampoo
Halls Soothers Cherry Lozenges
Rennie Spearmint Heartburn & Indigestion Relief Tablets
Huggies DryNites Pyjama Pants Boy 4-7 Years
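The request above fetches a single page (`"page":2` with `"page_size":60`). To collect more than one page, the same request can be repeated with a different page number. The following is a minimal sketch, not the site's documented behaviour: it assumes that a page past the end returns no items, and `dig`, `extract_names`, `iter_names`, and `fake_fetch` are hypothetical helper names. The fetch function is passed in so the logic can be tried without network access.

```python
def dig(obj, *keys):
    """Follow nested dict keys, returning None as soon as one is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def extract_names(data):
    """Return product names from one decoded response, skipping malformed entries."""
    items = dig(data, "data", "tempo_items", "products", "items") or []
    names = []
    for entry in items:
        name = dig(entry, "item", "name")
        if name:
            names.append(name)
    return names


def iter_names(fetch_page, first_page=1, max_pages=50):
    """Yield names page by page until a page yields no names.

    fetch_page is a caller-supplied function: page number -> decoded JSON.
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    for page in range(first_page, first_page + max_pages):
        names = extract_names(fetch_page(page))
        if not names:
            break
        yield from names


# Offline stand-in for the API, shaped like the response used above.
fake_pages = {
    1: {"data": {"tempo_items": {"products": {"items": [
        {"item": {"name": "Blistex Relief Cream"}},
        {"item": {}},  # malformed entry: no name
    ]}}}},
    2: {"data": {"tempo_items": {"products": {"items": [
        {"item": {"name": "Jolen Creme Bleach"}},
    ]}}}},
}


def fake_fetch(page):
    return fake_pages.get(page, {"data": None})


print(list(iter_names(fake_fetch)))
# ['Blistex Relief Cream', 'Jolen Creme Bleach']
```

With the real API, `fetch_page` could be a small function that copies `payload`, sets `payload["variables"]["page"]` to the requested number, and returns `requests.post(api_url, headers=headers, json=payload).json()`. A short pause between requests is a sensible courtesy to the server.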
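Selenium with bs4:
The API returns JSON, not the page's HTML, so the HTML content cannot be obtained through the API. The page is dynamic, and bs4 cannot execute JavaScript. To get the rendered HTML, you can use selenium together with bs4: selenium drives a real browser that runs the JavaScript, and bs4 then parses the resulting page source. The code below produces the correct HTML content from the page.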
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_experimental_option("detach", True)
# chrome_options.add_argument("--headless")
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url='https://groceries.asda.com/aisle/price-match/view-all-price-match/view-all-price-match/1215686354045-1215686354052-1215686354053'
driver.get(url)
driver.maximize_window()
time.sleep(5)
#accept cookie
driver.find_element(By.XPATH,'//*[@id="onetrust-button-group-parent"]/div/button[1]').click()
time.sleep(2)
soup=BeautifulSoup(driver.page_source,'lxml')
html=soup.select_one('div.co-product-list > ul:nth-child(1)')
print(html.prettify())
Output:
<li class="co-product__promo-icon-item">
<div class="co-product__promo-icon-image-cntr">
<button aria-label="show information on Smooth & Frizz Free" class="asda-btn asda-btn--plain co-product__promo-icon-button" data-auto-id="btnPromo" type="button">
<picture class="asda-image picture">
<source srcset="https://ui.assets-asda.com/dm/_103_frizzfree?$icon-wapp$=&$Icon-wapp$=">
<img alt="Smooth & Frizz Free" class="asda-img asda-image co-product__promo-icon-img" data-auto-id="" loading="lazy" src="https://ui.assets-asda.com/dm/_103_frizzfree?$icon-wapp$=&$Icon-wapp$=" title="Smooth & Frizz Free"/>
</source>
</picture>
</button>
</div>
</li>
</ul>
</div>
</div>
<div class="co-item__col3">
<div class="co-item__price-container">
<span class="co-item__price-per-uom">
<strong class="co-product__price">
<span class="co-product__hidden-label">
now
</span>
£1.99
</strong>
<p class="co-item__price-per-uom-msg">
<span class="co-product__price-per-uom">
(55.3p/100ml)
</span>
</p>
</span>
</div>
<div class="co-item__quantity-container">
<div class="unavailable-banner">
<span class="asda-pill asda-pill--warning unavailable-banner__product-status" data-auto-id="">
OUT OF STOCK
</span>
<button aria-disabled="false" class="asda-link asda-link--primary asda-link--standalone
asda-link--button unavailable-banner__see-alternatives" data-auto-id="linkSeeAlternatives" type="button">
See alternatives
</button>
</div>
</div>
</div>
</div>
</li>
... and so on
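Once the rendered HTML is available, the values can be pulled out of it. As a rough sketch, the parser below collects the text of each `strong.co-product__price` element while ignoring nested elements such as the hidden "now" label. It uses only the standard library so that it runs without extra dependencies; `PriceParser` and `extract_prices` are hypothetical names, and the sample string is copied from the output above. It assumes the price element contains no void tags such as `<br>`.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the direct text of <strong class="co-product__price"> elements."""

    def __init__(self):
        super().__init__()
        self.prices = []
        self._depth = None  # None: outside a price element; 0: directly inside it
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._depth is not None:
            self._depth += 1  # entering a nested element (e.g. the hidden label)
        elif tag == "strong" and "co-product__price" in (dict(attrs).get("class") or "").split():
            self._depth = 0
            self._chunks = []

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags carry no text and must not change the depth

    def handle_endtag(self, tag):
        if self._depth is None:
            return
        if self._depth == 0:
            # Closing the price element itself: store the collected text.
            self.prices.append("".join(self._chunks).strip())
            self._depth = None
        else:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0:  # keep only text that is a direct child
            self._chunks.append(data)


def extract_prices(html):
    """Return the price strings found in html, in document order."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


sample_html = """
<strong class="co-product__price">
 <span class="co-product__hidden-label">
  now
 </span>
 £1.99
</strong>
"""

prices = extract_prices(sample_html)  # one entry: the text "£1.99"
```

In the Selenium script above, `driver.page_source` could be passed to `extract_prices` directly. With bs4 already imported, `soup.select('strong.co-product__price')` is an alternative, although the text of each match would then also include the hidden label.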
https://stackoverflow.com/questions/73144880