Q: How to extend a scraping operation to more than one page using Python

Stack Overflow user

Asked 2018-08-02 05:21:17

Answers: 1, Views: 232, Followers: 0, Votes: 0

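Hi, I'm looking at a piece of Python code (pasted below). The code scrapes the first page of results fine (25 listings per page). However, I'd like to extend it to scrape results from at least 10 more pages.

For example, I want to generate results for zip code 98021, which has 80 listings in total (through page 4). But when I run the code below with python zillow.py 98021 newest, it shows only 25 listings.

Since I'm new to Python, I'd appreciate help achieving this.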

from lxml import html
import requests
import unicodecsv as csv
import argparse

def parse(zipcode,filter=None):

    if filter=="newest":
        url = "https://www.zillow.com/homes/for_sale/{0}/0_singlestory/days_sort".format(zipcode)
    elif filter == "cheapest":
        url = "https://www.zillow.com/homes/for_sale/{0}/0_singlestory/pricea_sort/".format(zipcode)
    else:
        url = "https://www.zillow.com/homes/for_sale/{0}_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy".format(zipcode)

    for i in range(10):
        # try:
        headers= {
                    'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                    'accept-encoding':'gzip, deflate, sdch, br',
                    'accept-language':'en-GB,en;q=0.8,en-US;q=0.6,ml;q=0.4',
                    'cache-control':'max-age=0',
                    'upgrade-insecure-requests':'1',
                    'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
        }
        response = requests.get(url,headers=headers)
        print(response.status_code)
        parser = html.fromstring(response.text)
        search_results = parser.xpath("//div[@id='search-results']//article")
        properties_list = []

        for properties in search_results:
            raw_address = properties.xpath(".//span[@itemprop='address']//span[@itemprop='streetAddress']//text()")
            raw_city = properties.xpath(".//span[@itemprop='address']//span[@itemprop='addressLocality']//text()")
            raw_state= properties.xpath(".//span[@itemprop='address']//span[@itemprop='addressRegion']//text()")
            raw_postal_code= properties.xpath(".//span[@itemprop='address']//span[@itemprop='postalCode']//text()")
            raw_price = properties.xpath(".//span[@class='zsg-photo-card-price']//text()")
            raw_info = properties.xpath(".//span[@class='zsg-photo-card-info']//text()")
            raw_broker_name = properties.xpath(".//span[@class='zsg-photo-card-broker-name']//text()")
            url = properties.xpath(".//a[contains(@class,'overlay-link')]/@href")
            raw_title = properties.xpath(".//h4//text()")

            address = ' '.join(' '.join(raw_address).split()) if raw_address else None
            city = ''.join(raw_city).strip() if raw_city else None
            state = ''.join(raw_state).strip() if raw_state else None
            postal_code = ''.join(raw_postal_code).strip() if raw_postal_code else None
            price = ''.join(raw_price).strip() if raw_price else None
            info = ' '.join(' '.join(raw_info).split()).replace(u"\xb7",',')
            broker = ''.join(raw_broker_name).strip() if raw_broker_name else None
            title = ''.join(raw_title) if raw_title else None
            property_url = "https://www.zillow.com"+url[0] if url else None
            is_forsale = properties.xpath('.//span[@class="zsg-icon-for-sale"]')
            properties = {
                            'address':address,
                            'city':city,
                            'state':state,
                            'postal_code':postal_code,
                            'price':price,
                            'facts and features':info,
                            'real estate provider':broker,
                            'url':property_url,
                            'title':title
            }
            if is_forsale:
                properties_list.append(properties)
        return properties_list
        # except:
        #   print ("Failed to process the page",url)

if __name__=="__main__":
    argparser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
    argparser.add_argument('zipcode',help = '')
    sortorder_help = """
available sort orders are :
newest : Latest property details,
cheapest : Properties with cheapest price
"""
    argparser.add_argument('sort',nargs='?',help = sortorder_help,default ='Homes For You')
    args = argparser.parse_args()
    zipcode = args.zipcode
    sort = args.sort
    print ("Fetching data for %s"%(zipcode))
    scraped_data = parse(zipcode,sort)
    print ("Writing data to output file")
    with open("properties-%s.csv"%(zipcode),'wb')as csvfile:
        fieldnames = ['title','address','city','state','postal_code','price','facts and features','real estate provider','url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in  scraped_data:
            writer.writerow(row)

python

python-3.x

web-scraping

lxml

1 Answer

Stack Overflow user

Answered 2018-08-02 05:29:37

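You need to scrape the link to the next page from the current page, then update the url used for scraping.

Here is a rough example of how that could work: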

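def parse(zipcode, url, filter=None):
    # Fetch `url` and extract the listings the same way the existing code does.
    # Also read the href of the "next page" button (None if there is none).
    return results, next_page_url

# initial_page_url is the first search URL, built from the zipcode and filter.
full_results = []
results, next_page_url = parse(zipcode, initial_page_url, filter=filter)

full_results += results

# Keep going while the page was full (25 listings) and a next page exists.
while len(results) >= 25 and next_page_url:
    results, next_page_url = parse(zipcode, next_page_url, filter=filter)
    full_results += results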

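So in this example, parse takes the url to scrape as its second positional argument and returns both the results and the url of the next page to scrape.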
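The "get url from next page button" step might look roughly like the sketch below. It uses only the standard library and assumes the next-page link is an anchor tag marked rel="next"; that markup is an assumption and not confirmed for Zillow, so the selector would need adjusting against the real page. The names NextLinkFinder and find_next_page_url, and the sample HTML, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> tag whose rel attribute contains "next"."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.next_href is not None:
            return
        attr_map = dict(attrs)
        # Attribute values can be None (e.g. a bare `rel`), so fall back to "".
        rel_values = (attr_map.get("rel") or "").split()
        if "next" in rel_values and attr_map.get("href"):
            self.next_href = attr_map["href"]


def find_next_page_url(page_html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(page_html)
    if finder.next_href is None:
        return None
    # Links are often relative, so resolve them against the current page URL.
    return urljoin(current_url, finder.next_href)


# Hypothetical markup, used only to exercise the helper.
sample_html = '<ul><li><a href="/homes/2_p/" rel="next">Next</a></li></ul>'
print(find_next_page_url(sample_html, "https://www.example.com/homes/1_p/"))
# prints https://www.example.com/homes/2_p/
```

Since the question's code already parses the page with lxml, the same lookup could be done there with an XPath on the parsed tree, such as parser.xpath("//a[@rel='next']/@href"), again assuming that markup.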

Scraping continues as long as a page contains the maximum number of results (25) and a next-page url is returned.
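The stopping rule described above can be sketched as a small, testable loop. This is a minimal sketch, not the answer author's code: scrape_all_pages, the max_pages safety cap, and the fake page data are all assumptions added for illustration, and parse_page stands in for any function that returns (results, next_page_url) for a given url.

```python
def scrape_all_pages(first_url, parse_page, page_size=25, max_pages=20):
    """Follow next-page links and collect results until the listings run out.

    parse_page is any callable that takes a URL and returns
    (results, next_page_url). page_size and max_pages are assumed defaults.
    """
    all_results = []
    url = first_url
    seen = set()  # guards against a next link that points back to a visited page
    pages = 0
    while url and url not in seen and pages < max_pages:
        seen.add(url)
        results, next_url = parse_page(url)
        all_results.extend(results)
        pages += 1
        # A short page means this was the last one, so stop early.
        if len(results) < page_size:
            break
        url = next_url
    return all_results


# Stand-in for a real site: four fake pages holding 80 listings in total.
FAKE_SITE = {
    "page1": (list(range(0, 25)), "page2"),
    "page2": (list(range(25, 50)), "page3"),
    "page3": (list(range(50, 75)), "page4"),
    "page4": (list(range(75, 80)), None),
}


def fake_parse_page(url):
    """Hypothetical parser that looks pages up instead of fetching them."""
    return FAKE_SITE[url]


listings = scrape_all_pages("page1", fake_parse_page)
print(len(listings))  # prints 80
```

Passing the parser in as an argument keeps the loop independent of the HTML details, so the real parse function can be swapped in once it returns the same two values.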

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/51642426
