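I'm new to .NET and Python, but I want to write a program that scrapes an .aspx site and processes its content (the HTML source is enough). I tried a few Python libraries, but all I ever get is the first page of the site. It seems I'm building the POST data incorrectly: I don't know what form the data should take, or what should and shouldn't be included.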
http://nastenka.lesy.sk/EZOZV/Publish/ObjednavkyZverejnenie.aspx?YR=2018
import requests, urllib, urllib2
r = requests.get("http://nastenka.lesy.sk/EZOZV/Publish/ObjednavkyZverejnenie.aspx?YR=2018")
content = r.text
print content
start_index = content.find('id="__VIEWSTATE"') + 24
sliced_vs = content[start_index:content.find('"',start_index)]
start_index = content.find('id="__VIEWSTATEGENERATOR"') + 33
sliced_vsg = content[start_index:content.find('"',start_index)]
start_index = content.find('id="__VIEWSTATEENCRYPTED"') + 33
sliced_vse = content[start_index:content.find('"',start_index)]
start_index = content.find('id="__EVENTVALIDATION"') + 30
sliced_EV = content[start_index:content.find('"',start_index)]
form_data = {'__EVENTTARGET': 'gvZverejnenie',
'__EVENTARGUMENT': 'Page$2',
'__VIEWSTATE': sliced_vs,
'__VIEWSTATEGENERATOR': sliced_vsg,
'__VIEWSTATEENCRYPTED': sliced_vse,
'__EVENTVALIDATION': sliced_EV}
data_encoded = urllib.urlencode(form_data)
r = requests.post('http://nastenka.lesy.sk/EZOZV/Publish/ObjednavkyZverejnenie.aspx?YR=2018',data=data_encoded)
content = r.text
print content
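For example, in the code above I want to get the second page ('Page$2'). I always get the same result, only with different ViewState and EventValidation values. Where is the problem?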
Answered 2018-07-27 06:20:31
This code requires selenium and chromedriver to control Google Chrome. It finds 476 pages in total (for the URL you provided).
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# Run Chrome without opening a visible window.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

driver.get('http://nastenka.lesy.sk/EZOZV/Publish/ObjednavkyZverejnenie.aspx?YR=2018')
with open('page_1.html', 'w') as f:
    f.write(driver.page_source)

page_num = 2
while True:
    try:
        # The pager shows each page as a link whose text is the page number.
        element = driver.find_element(By.LINK_TEXT, str(page_num))
    except NoSuchElementException:
        # The number is not visible; a '...' link moves to the next block of pages.
        elements = driver.find_elements(By.LINK_TEXT, '...')
        if len(elements) == 0:
            break  # less than 11 pages total
        elif len(elements) == 1 and page_num > 12:
            break  # last page
        element = elements[-1]
    element.click()
    with open('page_{}.html'.format(page_num), 'w') as f:
        f.write(driver.page_source)
    page_num += 1

driver.quit()
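If you would rather stay with plain HTTP requests, here is a minimal sketch of the postback approach from the question, not a tested solution for this particular site. Instead of slicing the HTML at fixed offsets, it collects every hidden input with the standard-library parser and then sets `__EVENTTARGET` and `__EVENTARGUMENT`. The names `HiddenFieldParser`, `build_postback_data` and `fetch_grid_page` are made up for illustration. The target `gvZverejnenie` is taken from the question and is an assumption; the real value is whatever appears as the first argument of `__doPostBack(...)` in the pager links. Also note that, as far as I know, passing a pre-encoded string as `data=` makes requests send the body without a form `Content-Type` header, which may be why the server ignored the postback; passing the dict itself avoids that.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_data(html, event_target, event_argument):
    """Return form data for an ASP.NET postback based on the page's hidden fields."""
    parser = HiddenFieldParser()
    parser.feed(html)
    data = dict(parser.fields)
    data["__EVENTTARGET"] = event_target
    data["__EVENTARGUMENT"] = event_argument
    return data


def fetch_grid_page(session, url, page_number, grid_id="gvZverejnenie"):
    """GET the page, then POST the pager postback; grid_id is an assumption.

    `session` is expected to behave like requests.Session (get/post, .text).
    """
    first = session.get(url)
    data = build_postback_data(first.text, grid_id, "Page${}".format(page_number))
    # Pass the dict itself so the client encodes it as a form body.
    return session.post(url, data=data).text
```

Usage would look like `fetch_grid_page(requests.Session(), url, 2)`. Pages that are not listed in the currently visible pager block may still have to be reached step by step, as in the selenium loop above.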
https://stackoverflow.com/questions/51546975