Python Dynamic Web Crawling: Scraping JD.com

Original

AnieaLanie

Last modified 2021-12-13 09:54:39

Included in the column: 铁子的专栏

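1. Static and dynamic web pages

A static web page is stored on the server as a finished .html or .htm document and is sent to the client unchanged.

A dynamic web page is assembled by scripts, on the client, on the server, or both, before the final document is displayed.

Client-side scripts:

Mainly JavaScript, which runs in the browser and lets the page respond to user events and modify the document after it has been delivered.

Server-side scripts:

There are many server-side scripting languages, including PHP, ASP, ASP.NET, JSP, ColdFusion, and Perl. They generate the page on the server in response to requests such as form submissions.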
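To make the distinction concrete, the sketch below counts quote blocks in a small page whose visible content is written by a client-side script. The HTML string and the class name quote are made-up stand-ins, and the parser is Python's standard html.parser, which, like any plain HTTP client, never executes JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical page: the visible content is produced by a client-side script.
RAW_HTML = """
<html><body>
<div class="container"></div>
<script>
  document.write("<div class='quote'>Hello</div>");
</script>
</body></html>
"""


class QuoteCounter(HTMLParser):
    """Counts <div class="quote"> start tags found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.quotes = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.quotes += 1


counter = QuoteCounter()
counter.feed(RAW_HTML)
# The markup inside <script> is treated as script text, not as elements,
# so no quote block is found until a browser actually runs the script.
print(counter.quotes)
```

A crawler that only downloads the raw document sees the same thing, which is why a browser-driving tool is used for pages like this.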

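2. Tools for crawling dynamic pages: Selenium and PhantomJS

2.1 Introduction to Selenium

Selenium is a web automation and testing tool. It controls real browsers through their drivers and can also run headless browsers (browsers without a graphical user interface), such as PhantomJS.

Install Selenium: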

pip install selenium

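Selenium also needs a browser driver in order to run. Download the driver for your browser (I use the Chrome driver):

Chrome: https://sites.google.com/chromium.org/driver/
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Note that the chromedriver version must match the version of Chrome installed on your machine.

Then place the driver in a directory listed in the system Path variable.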
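A quick way to confirm that the driver really is reachable through Path is to look it up from Python. This is a minimal sketch using only the standard library; the helper name find_driver is hypothetical, and the executable name chromedriver assumes the Chrome driver chosen above.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver was not found on PATH")
else:
    print("chromedriver found at", path)
```

Selenium 4.6 and later also ship Selenium Manager, which can locate or download a matching driver automatically, so on those releases the manual step may not be needed.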

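2.2 PhantomJS

PhantomJS is a headless browser that can be scripted with JavaScript. Note that PhantomJS is no longer actively developed and Selenium 4 no longer includes a PhantomJS driver, so the PhantomJS lines in the code below only apply to older Selenium releases.

Download PhantomJS: https://phantomjs.org/download.html

After downloading, copy the .exe file from the bin directory into the Windows/System32 directory.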

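3. Preparation before coding

3.1 Page analysis

Page URL: http://quotes.toscrape.com/js/

This is a tidy-looking page, and my goal is to scrape the quotes from its first few pages.

Next, look at its source code: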


<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    
<script src="/static/jquery.js"></script>
<script>
    var data = [
    {
        "tags": [
            "change",
            "deep-thoughts",
            "thinking",
            "world"
        ],
        "author": {
            "name": "Albert Einstein",
            "goodreads_link": "/author/show/9810.Albert_Einstein",
            "slug": "Albert-Einstein"
        },
        "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
    },
    {
        "tags": [
            "abilities",
            "choices"
        ],
        "author": {
            "name": "J.K. Rowling",
            "goodreads_link": "/author/show/1077326.J_K_Rowling",
            "slug": "J-K-Rowling"
        },
        "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
    },
    {
        "tags": [
            "inspirational",
            "life",
            "live",
            "miracle",
            "miracles"
        ],
        "author": {
            "name": "Albert Einstein",
            "goodreads_link": "/author/show/9810.Albert_Einstein",
            "slug": "Albert-Einstein"
        },
        "text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
    },
    {
        "tags": [
            "aliteracy",
            "books",
            "classic",
            "humor"
        ],
        "author": {
            "name": "Jane Austen",
            "goodreads_link": "/author/show/1265.Jane_Austen",
            "slug": "Jane-Austen"
        },
        "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
    },
    {
        "tags": [
            "be-yourself",
            "inspirational"
        ],
        "author": {
            "name": "Marilyn Monroe",
            "goodreads_link": "/author/show/82952.Marilyn_Monroe",
            "slug": "Marilyn-Monroe"
        },
        "text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
    },
    {
        "tags": [
            "adulthood",
            "success",
            "value"
        ],
        "author": {
            "name": "Albert Einstein",
            "goodreads_link": "/author/show/9810.Albert_Einstein",
            "slug": "Albert-Einstein"
        },
        "text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d"
    },
    {
        "tags": [
            "life",
            "love"
        ],
        "author": {
            "name": "Andr\u00e9 Gide",
            "goodreads_link": "/author/show/7617.Andr_Gide",
            "slug": "Andre-Gide"
        },
        "text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
    },
    {
        "tags": [
            "edison",
            "failure",
            "inspirational",
            "paraphrased"
        ],
        "author": {
            "name": "Thomas A. Edison",
            "goodreads_link": "/author/show/3091287.Thomas_A_Edison",
            "slug": "Thomas-A-Edison"
        },
        "text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
    },
    {
        "tags": [
            "misattributed-eleanor-roosevelt"
        ],
        "author": {
            "name": "Eleanor Roosevelt",
            "goodreads_link": "/author/show/44566.Eleanor_Roosevelt",
            "slug": "Eleanor-Roosevelt"
        },
        "text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
    },
    {
        "tags": [
            "humor",
            "obvious",
            "simile"
        ],
        "author": {
            "name": "Steve Martin",
            "goodreads_link": "/author/show/7103.Steve_Martin",
            "slug": "Steve-Martin"
        },
        "text": "\u201cA day without sunshine is like, you know, night.\u201d"
    }
];
    for (var i in data) {
        var d = data[i];
        var tags = $.map(d['tags'], function(t) {
            return "<a class='tag'>" + t + "</a>";
        }).join(" ");
        document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
        }
</script>
<nav>
    <ul class="pager">
        
        
        <li class="next">
            <a href="/js/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
        </li>
        
    </ul>
</nav>

    </div>
    <footer class="footer">
        <div class="container">
            <p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
            </p>
            <p class="copyright">
                Made with <span class='sh-red'>❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a>
            </p>
        </div>
    </footer>
</body>
</html>

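The quotes on this page are rendered by client-side JavaScript, and the quote data itself is embedded in the HTML file as a JavaScript array.

The HTML uses the following JavaScript loop to write the quotes into the page: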

for (var i in data) {
    var d = data[i];
    var tags = $.map(d['tags'], function(t) {
        return "<a class='tag'>" + t + "</a>";
    }).join(" ");
    document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
}
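Because the data array sits in the page source as a literal, it can also be read without a browser at all. The sketch below is an alternative to the Selenium approach used in this article, not a replacement for it: it pulls the array out with a regular expression and parses it as JSON. PAGE_SOURCE is a shortened stand-in for the source shown above, and the pattern assumes the array starts with "var data = [" and ends at the first "];", which holds for this page but is not guaranteed elsewhere.

```python
import json
import re

# Shortened stand-in for the page source shown above.
PAGE_SOURCE = r"""
<script>
    var data = [
    {
        "tags": ["change", "world"],
        "author": {"name": "Albert Einstein"},
        "text": "\u201cThe world as we have created it is a process of our thinking.\u201d"
    }
];
    for (var i in data) {
        var d = data[i];
    }
</script>
"""

# Capture everything from the opening "[" up to the first "]" followed by ";".
match = re.search(r"var data = (\[.*?\]);", PAGE_SOURCE, re.DOTALL)
if match is None:
    raise ValueError("data array not found in page source")

# The array literal on this page happens to be valid JSON.
quotes = json.loads(match.group(1))
for quote in quotes:
    print(quote["author"]["name"], quote["tags"])
```

This only works when the data is embedded in the document; pages that fetch their data later still need a real browser or a direct call to the underlying request.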

The markup for the next-page link is:

<nav>
<ul class="pager">
    <li class="next">
        <a href="/js/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>
</nav>
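The href in this markup is relative, so it has to be joined with the current URL, and the last page has no next item at all. A minimal sketch of both points, using only the standard library; the class NextLinkFinder and the function next_page_url are hypothetical helper names, and the pager strings are trimmed copies of the markup above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self.in_next = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self.in_next = True
        elif tag == "a" and self.in_next and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_next = False


def next_page_url(current_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    return urljoin(current_url, finder.href) if finder.href else None


PAGER = '<ul class="pager"><li class="next"><a href="/js/page/2/">Next</a></li></ul>'
print(next_page_url("http://quotes.toscrape.com/js/", PAGER))
print(next_page_url("http://quotes.toscrape.com/js/page/10/", '<ul class="pager"></ul>'))
```

Returning None instead of raising lets the crawl loop stop cleanly when it runs out of pages.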

3.2 Program code

# Import the required modules
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as soup
# Chrome options; these flags are typically needed when running inside Docker
chrome_options = Options()
chrome_options.add_argument('--headless')  # run without a visible browser window
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Start Chrome with the options above
driver = webdriver.Chrome(options=chrome_options)
# PhantomJS alternative (only available in older Selenium releases)
#driver = webdriver.PhantomJS()

Fetch the page source after the browser has loaded the page:

driver.get('http://quotes.toscrape.com/js/')
content=driver.page_source

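Pagination code:

host = 'http://quotes.toscrape.com'
biaoyus = []  # one list of quote elements per page
next_url = 'http://quotes.toscrape.com/js/'
for i in range(4):
    # Load the page with the driver
    driver.get(next_url)
    content = driver.page_source
    # Parse the rendered HTML and collect the quote blocks
    eles = soup(content, 'html.parser')
    biaoyus.append(eles.find_all('div', {'class': 'quote'}))
    print(len(biaoyus))  # number of pages scraped so far
    # Build the URL of the next page; stop if there is none
    next_li = eles.find('li', {'class': 'next'})
    if next_li is None:
        break
    next_url = host + next_li.find('a')['href']
    print(next_url)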

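Complete code:

# Import the required modules
from selenium import webdriver
from bs4 import BeautifulSoup as soup
# Start Chrome (PhantomJS is only available in older Selenium releases)
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()
# Host name, used to build absolute URLs
host = 'http://quotes.toscrape.com'
biaoyus = []  # one list of quote elements per page
next_url = 'http://quotes.toscrape.com/js/'
for i in range(4):
    # Load the page with the driver
    driver.get(next_url)
    content = driver.page_source
    # Parse the rendered HTML and collect the quote blocks
    eles = soup(content, 'html.parser')
    biaoyus.append(eles.find_all('div', {'class': 'quote'}))
    print(len(biaoyus))  # number of pages scraped so far
    # Build the URL of the next page; stop if there is none
    next_li = eles.find('li', {'class': 'next'})
    if next_li is None:
        break
    next_url = host + next_li.find('a')['href']
    print(next_url)
# Close the browser once all pages have been fetched
driver.quit()

# Print the text, author, and tags of every quote
for biaoyu in biaoyus:
    for quote in biaoyu:
        print(quote.find(class_='text').getText())
        print(quote.find(class_='author').getText())
        print(quote.find(class_='tags').getText())
        print('\n')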

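4. Scraping books from JD.com

I want to scrape the first 200 books returned by a search for the keyword "python" on JD.com.

Page URL: https://search.jd.com/Search?keyword=python&enc=utf-8&wq=python&pvid=3e6f853b03a64d86b17638dc2de70fdf

In the page source, each book appears as an li list item.

The page fills in books as the user scrolls. At first only some of the books are shown; the rest are loaded only when the user scrolls down. The markup for the loading placeholder is: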

<span class="clr"></span>
<div id="J_scroll_loading" class="notice-loading-more"><span>正在加载中，请稍后~~</span></div>
<div class="page clearfix"><div id="J_bottomPage" class="p-wrap"></div></div>
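One way to handle this kind of lazy loading is to keep scrolling until the document height stops growing, rather than scrolling a fixed number of times. The sketch below is an assumption about how that could be done, not the method used in the complete code later on: scroll_to_bottom is a hypothetical helper, and FakeDriver is a stand-in that imitates a page growing twice so the logic can be run without a browser. The only driver method it relies on is execute_script, which Selenium drivers provide.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=10):
    """Scroll down until the document height stops growing; return rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded items time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[min(self.index, len(self.heights) - 1)]
        self.index += 1  # a scroll triggers the next batch to load
        return None


print(scroll_to_bottom(FakeDriver(), pause=0))
```

With a real driver, pause should stay large enough for the network request to finish; max_rounds keeps the loop from running forever on pages that never stop growing.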

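4.1 Locating the "next page" element with Selenium and simulating a click

Information on some 200 books cannot be read from a single results page, so we use Selenium's simulated click to move through several pages and scrape each one.

# find_element_by_class_name was removed in Selenium 4.3; use find_element with By
from selenium.webdriver.common.by import By
# Locate the "next page" button by its class name
next_button = driver.find_element(By.CLASS_NAME, 'pn-next')
# Simulate a click
next_button.click()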

4.2 Complete code

# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
import time
import json
# Start Chrome (PhantomJS is only available in older Selenium releases)
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()
# Search results URL
start_url = 'https://search.jd.com/Search?keyword=python'
# Load the first results page
driver.get(start_url)
booksstore = []
# Append the results to books.txt, one JSON object per line
fi = open("books.txt", "a", encoding='utf-8')
for j in range(4):
    # Scroll to the bottom twice so the lazily loaded books appear
    for i in range(2):
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        # Wait for the page to finish loading
        time.sleep(4)
    content = driver.page_source
    # Parse the page and find every book item
    eles = soup(content, 'html.parser')
    books = eles.find_all('li', {'class': 'gl-item'})
    print(len(books))
    for book in books:
        name = book.find('div', {'class': 'p-name'}).find('a').find('em').getText()
        price = book.find('div', {'class': 'p-price'}).find('i').getText()
        commit = 'https:' + book.find('div', {'class': 'p-commit'}).find('a')['href']
        shop = book.find('div', {'class': 'p-shopnum'}).find_all('a')
        print(name)
        print(price)
        print(commit)
        # Use a separate dict so the parsed element in "book" is not overwritten
        record = {'name': name, 'price': price, 'url': commit}
        # Some items have no shop link
        if len(shop) != 0:
            shopaddress = shop[0]['href']
            shopname = shop[0]['title']
            print("http:" + shopaddress)
            print(shopname)
            record['shop_url'] = "http:" + shopaddress
            record['shop_name'] = shopname

        booksstore.append(record)
        # ensure_ascii=False keeps Chinese titles readable in the file
        fi.write(json.dumps(record, ensure_ascii=False))
        fi.write("\n")
    # Go to the next page
    next_button = driver.find_element(By.CLASS_NAME, 'pn-next')
    print(next_button.text)
    next_button.click()
    time.sleep(4)

print(len(booksstore))
print(booksstore)
fi.close()
driver.quit()
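Since the script writes one JSON object per line, the file can be read back line by line for later processing. The following is a minimal sketch of that; load_records is a hypothetical helper, and the sample records and their keys are made up for the demo, which runs on a temporary file so that a real books.txt is left untouched.

```python
import json
import os
import tempfile


def load_records(path):
    """Read a file with one JSON object per line and return a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Demo on a temporary file with made-up records, written the same way as above.
sample = [{"name": "Book A", "price": "59.00"}, {"name": "Book B", "price": "89.50"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "books.txt")
    with open(path, "a", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record, ensure_ascii=False))
            f.write("\n")
    loaded = load_records(path)

print(len(loaded))
```

Because the scraper opens books.txt in append mode, repeated runs accumulate records in the same file, so duplicates may need to be filtered after loading.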

Result: each scraped book is appended to books.txt as one JSON object per line.