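A static web page is one that the server produces as a finished .html or .htm document and sends to the client unchanged.
A dynamic web page, by contrast, depends on client-side scripts, server-side scripts, or both to render the document that is finally displayed.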
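Client-side scripts:
These are mainly JavaScript. They run in the browser and let the page respond to events, including data that arrives from the server.
Server-side scripts:
There are many server-side scripting languages, including PHP, ASP, ASP.NET, JSP, ColdFusion, and Perl. They run on the server and respond to events such as a page or form submission.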
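To make the distinction concrete, here is a minimal sketch that uses only Python's standard library. The HTML string is a hypothetical, shortened page (not taken from any real site): its `item` elements are written by a script, so a parser that does not execute JavaScript sees the static element but none of the script-generated ones.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div class='item'> elements only exist after a
# browser runs the script; the raw source contains none of them.
RAW_HTML = """
<html><body>
<div id="static">served as-is</div>
<script>
var items = ["a", "b"];
for (var i in items) {
    document.write("<div class='item'>" + items[i] + "</div>");
}
</script>
</body></html>
"""


class DivCounter(HTMLParser):
    """Counts <div> start tags by class (or id), without running scripts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attributes = dict(attrs)
            key = attributes.get("class") or attributes.get("id")
            self.counts[key] = self.counts.get(key, 0) + 1


counter = DivCounter()
counter.feed(RAW_HTML)
static_divs = counter.counts.get("static", 0)
rendered_divs = counter.counts.get("item", 0)
print(static_divs, rendered_divs)
```

The parser treats the contents of the script element as plain text, which is why the script-generated elements are invisible to it. A tool that actually runs the script, such as a browser, is needed to see them.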
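Selenium is a web automation testing tool. It controls real browsers through their browser drivers, and it can also work with headless browsers (browsers without a graphical user interface) such as PhantomJS.
Install Selenium: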
pip install selenium
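Selenium also needs a browser driver in order to run. Download the driver for your browser (I use the Chrome driver):
Chrome: https://sites.google.com/chromium.org/driver/
Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Note that the chromedriver version must match the version of Chrome installed on your machine.
Then place the driver in a directory listed in the system Path variable.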
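A quick way to check both points, driver on the Path and matching versions, is sketched below with the standard library only. The two version strings are made up for illustration; in practice they would come from running the driver and the browser with their version flags. The helper names are hypothetical, and comparing the first three version components reflects my understanding of ChromeDriver's version-selection guidance (the last component may differ).

```python
import re
import shutil


def find_on_path(executable):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(executable)


def version_prefix(version_text):
    """Extract (major, minor, build) from a version string, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


# Illustrative strings only; the numbers are invented for this example.
driver_banner = "ChromeDriver 120.0.6099.109"
browser_banner = "Google Chrome 120.0.6099.130"

versions_match = version_prefix(driver_banner) == version_prefix(browser_banner)
print("chromedriver on PATH:", find_on_path("chromedriver"))
print("versions match:", versions_match)
```

To my knowledge, Selenium 4.6 and later also bundle Selenium Manager, which can locate or download a suitable driver automatically, so the manual download may not be needed on recent installations.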
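PhantomJS is a headless browser that can be scripted with JavaScript. Its development has been suspended and Selenium 4 no longer supports it, so headless Chrome is the safer choice on a current setup.
Download PhantomJS: https://phantomjs.org/download.html
After downloading, simply copy the .exe file from the bin directory into the Windows/System32 directory.
Page URL: http://quotes.toscrape.com/js/
This is a tidy-looking page, and my goal is to scrape its first few quotes.
Next, look at its source code: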
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<div class="container">
<div class="row header-box">
<div class="col-md-8">
<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>
</div>
<div class="col-md-4">
<p>
<a href="/login">Login</a>
</p>
</div>
</div>
<script src="/static/jquery.js"></script>
<script>
var data = [
{
"tags": [
"change",
"deep-thoughts",
"thinking",
"world"
],
"author": {
"name": "Albert Einstein",
"goodreads_link": "/author/show/9810.Albert_Einstein",
"slug": "Albert-Einstein"
},
"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"
},
{
"tags": [
"abilities",
"choices"
],
"author": {
"name": "J.K. Rowling",
"goodreads_link": "/author/show/1077326.J_K_Rowling",
"slug": "J-K-Rowling"
},
"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"
},
{
"tags": [
"inspirational",
"life",
"live",
"miracle",
"miracles"
],
"author": {
"name": "Albert Einstein",
"goodreads_link": "/author/show/9810.Albert_Einstein",
"slug": "Albert-Einstein"
},
"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"
},
{
"tags": [
"aliteracy",
"books",
"classic",
"humor"
],
"author": {
"name": "Jane Austen",
"goodreads_link": "/author/show/1265.Jane_Austen",
"slug": "Jane-Austen"
},
"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
{
"tags": [
"be-yourself",
"inspirational"
],
"author": {
"name": "Marilyn Monroe",
"goodreads_link": "/author/show/82952.Marilyn_Monroe",
"slug": "Marilyn-Monroe"
},
"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"
},
{
"tags": [
"adulthood",
"success",
"value"
],
"author": {
"name": "Albert Einstein",
"goodreads_link": "/author/show/9810.Albert_Einstein",
"slug": "Albert-Einstein"
},
"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d"
},
{
"tags": [
"life",
"love"
],
"author": {
"name": "Andr\u00e9 Gide",
"goodreads_link": "/author/show/7617.Andr_Gide",
"slug": "Andre-Gide"
},
"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"
},
{
"tags": [
"edison",
"failure",
"inspirational",
"paraphrased"
],
"author": {
"name": "Thomas A. Edison",
"goodreads_link": "/author/show/3091287.Thomas_A_Edison",
"slug": "Thomas-A-Edison"
},
"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"
},
{
"tags": [
"misattributed-eleanor-roosevelt"
],
"author": {
"name": "Eleanor Roosevelt",
"goodreads_link": "/author/show/44566.Eleanor_Roosevelt",
"slug": "Eleanor-Roosevelt"
},
"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"
},
{
"tags": [
"humor",
"obvious",
"simile"
],
"author": {
"name": "Steve Martin",
"goodreads_link": "/author/show/7103.Steve_Martin",
"slug": "Steve-Martin"
},
"text": "\u201cA day without sunshine is like, you know, night.\u201d"
}
];
for (var i in data) {
var d = data[i];
var tags = $.map(d['tags'], function(t) {
return "<a class='tag'>" + t + "</a>";
}).join(" ");
document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
}
</script>
<nav>
<ul class="pager">
<li class="next">
<a href="/js/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
</nav>
</div>
<footer class="footer">
<div class="container">
<p class="text-muted">
Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
Made with <span class='sh-red'>❤</span> by <a href="https://scrapinghub.com">Scrapinghub</a>
</p>
</div>
</footer>
</body>
</html>
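The quotes on this page are rendered by front-end JavaScript, and the quote data itself is stored in the HTML file that is sent to the browser.
The HTML uses the following JavaScript snippet to write the quotes into the page: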
for (var i in data) {
var d = data[i];
var tags = $.map(d['tags'], function(t) {
return "<a class='tag'>" + t + "</a>";
}).join(" ");
document.write("<div class='quote'><span class='text'>" + d['text'] + "</span><span>by <small class='author'>" + d['author']['name'] + "</small></span><div class='tags'>Tags: " + tags + "</div></div>");
}
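Because the quote data sits in the page source as a JavaScript array literal, it can also be read without a browser. This is an alternative to the Selenium approach used in this article, not the author's method. The sketch below assumes the array is valid JSON and is terminated by `];`, as in the source shown above; `PAGE_SOURCE` is a shortened copy with a single quote, and `extract_embedded_quotes` is a hypothetical helper name.

```python
import json
import re

# Shortened copy of the page source shown above: one quote instead of ten.
PAGE_SOURCE = r"""
<script>
    var data = [
    {
        "tags": ["change", "deep-thoughts"],
        "author": {"name": "Albert Einstein", "slug": "Albert-Einstein"},
        "text": "\u201cThe world as we have created it is a process of our thinking.\u201d"
    }
];
    for (var i in data) {
        var d = data[i];
    }
</script>
"""


def extract_embedded_quotes(html):
    """Pull the `var data = [...]` array out of the HTML and parse it as JSON."""
    match = re.search(r"var data = (\[.*?\]);", html, re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(1))


quotes = extract_embedded_quotes(PAGE_SOURCE)
for q in quotes:
    print(q["author"]["name"], "-", q["text"])
```

This only works when the data happens to be embedded in the page. Pages that fetch their data later, or build it in more complicated ways, still need a real browser, which is where Selenium comes in.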
The markup for the next-page link is:
<nav>
<ul class="pager">
<li class="next">
<a href="/js/page/2/">Next <span aria-hidden="true">→</span></a>
</li>
</ul>
</nav>
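The `href` in this link is relative, so it has to be combined with the site address before it can be requested. A minimal standard-library sketch of that step follows; `NextLinkParser` and `next_page_url` are hypothetical names, and the function returns `None` when the markup has no next link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

NAV_HTML = """
<nav>
    <ul class="pager">
        <li class="next">
            <a href="/js/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>
</nav>
"""


class NextLinkParser(HTMLParser):
    """Records the href of the first <a> inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self.in_next_item = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "li" and "next" in (attributes.get("class") or "").split():
            self.in_next_item = True
        elif tag == "a" and self.in_next_item and self.href is None:
            self.href = attributes.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_next_item = False


def next_page_url(current_url, html):
    """Return the absolute URL of the next page, or None if there is no link."""
    parser = NextLinkParser()
    parser.feed(html)
    return urljoin(current_url, parser.href) if parser.href else None


print(next_page_url("http://quotes.toscrape.com/js/", NAV_HTML))
```

Using `urljoin` instead of string concatenation also handles links that are already absolute or that are relative to the current directory.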
# Import the required modules
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as soup
chrome_options = Options()
chrome_options.add_argument('--headless')  # run without a browser window (needed in Docker, for example)
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# Start Chrome with the options above (PhantomJS support was removed in Selenium 4)
driver = webdriver.Chrome(options=chrome_options)
#driver = webdriver.PhantomJS()
Fetch the page source:
driver.get('http://quotes.toscrape.com/js/')
content=driver.page_source
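Pagination code:
host='http://quotes.toscrape.com'
biaoyus=[]  # collected quotes ("biaoyu" means slogan), one list per page
next_url='http://quotes.toscrape.com/js/'
for i in range(4):
    # Load the page with the driver
    driver.get(next_url)
    content=driver.page_source
    # Parse the rendered HTML with BeautifulSoup
    eles=soup(content,'html.parser')
    biaoyus.append(eles.find_all("div",{"class":"quote"}))
    print(len(biaoyus))
    # Build the URL of the next page from the relative link
    next_url=host+eles.find('li',{'class':'next'}).find('a')['href']
    print(next_url)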
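Complete code:
# Import the required modules
from selenium import webdriver
from bs4 import BeautifulSoup as soup
# Start Chrome (PhantomJS support was removed in Selenium 4)
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()
# Site address
host='http://quotes.toscrape.com'
biaoyus=[]  # collected quotes ("biaoyu" means slogan), one list per page
next_url='http://quotes.toscrape.com/js/'
for i in range(4):
    # Load the page with the driver
    driver.get(next_url)
    content=driver.page_source
    # Parse the rendered HTML with BeautifulSoup
    eles=soup(content,'html.parser')
    biaoyus.append(eles.find_all("div",{"class":"quote"}))
    print(len(biaoyus))
    # Build the URL of the next page from the relative link
    next_url=host+eles.find('li',{'class':'next'}).find('a')['href']
    print(next_url)
    #input()
# Print the text, author, and tags of every quote collected
for biaoyu in biaoyus:
    for quote in biaoyu:
        print(quote.find(class_='text').getText())
        print(quote.find(class_='author').getText())
        print(quote.find(class_='tags').getText())
        print('\n')
# Close the browser when finished
driver.quit()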
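I want to scrape the first 200 books returned by searching JD.com (京东) for the keyword "python".
[Screenshot: the search results page]
[Screenshot: the page source]
Each book is rendered on the page as a list item (li).
The page fills in books as the user scrolls. At first only part of the list is shown, and the remaining books are loaded only when the user scrolls down. The markup for the scroll-loading placeholder is: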
<span class="clr"></span>
<div id="J_scroll_loading" class="notice-loading-more"><span>正在加载中,请稍后~~</span></div>
<div class="page clearfix"><div id="J_bottomPage" class="p-wrap"></div></div>
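To collect information on more than 200 books, a single page is not enough. Use Selenium's simulated click to move through several result pages and scrape each one.
from selenium.webdriver.common.by import By
# Locate the next-page button by its class name
# (find_element_by_class_name was removed in Selenium 4.3)
next_button=driver.find_element(By.CLASS_NAME,'pn-next')
# Simulate a click
next_button.click()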
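After a simulated click the new page needs time to load. A fixed `time.sleep` works but either wastes time or is too short on a slow connection. Selenium provides `WebDriverWait` for this purpose; the sketch below is a library-free version of the same polling idea, with `wait_until` as a hypothetical helper name and a simple counter standing in for a real "has the page loaded?" check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` until it returns a truthy value or time runs out.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Demonstration with a counter standing in for "has the next page loaded?".
attempts = {"n": 0}


def page_ready():
    attempts["n"] += 1
    return attempts["n"] >= 3  # becomes true on the third check


wait_until(page_ready, timeout=5, poll=0.01)
print(attempts["n"])
```

With a real driver, the condition could be any callable that inspects the page, for example one that checks whether the list of book items is non-empty.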
# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
import time
import json
# Start Chrome (PhantomJS support was removed in Selenium 4)
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()
# Search URL
url='https://search.jd.com/Search?keyword=python'
# Load the page with the driver
driver.get(url)
booksstore=[]
# Output file: one JSON object per line
fi=open("books.txt","a",encoding='utf-8')
for j in range(4):
    # Scroll to the bottom so the lazily loaded books appear
    for i in range(2):
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        # Wait for the page to finish loading
        time.sleep(4)
    content=driver.page_source
    # Parse the rendered HTML with BeautifulSoup
    eles=soup(content,'html.parser')
    books=eles.find_all('li',{'class':'gl-item'})
    print(len(books))
    for item in books:
        name=item.find('div',{'class':'p-name'}).find('a').find('em').getText()
        price=item.find('div',{'class':'p-price'}).find('i').getText()
        commit='https:'+item.find('div',{'class':'p-commit'}).find('a')['href']
        shop=item.find('div',{'class':'p-shopnum'}).find_all('a')
        print(name)
        print(price)
        print(commit)
        # Keys: book title, book price, purchase URL
        book={'书籍名称':name,'书籍价格':price,'购买地址':commit}
        if len(shop)!=0:
            shopaddress=shop[0]['href']
            shopname=shop[0]['title']
            print("http:"+shopaddress)
            print(shopname)
            # Keys: shop URL, shop name
            book['商店地址']="http:"+shopaddress
            book['商店名称']=shopname
        booksstore.append(book)
        fi.write(json.dumps(book,ensure_ascii=False))
        fi.write("\n")
    # Go to the next page (find_element_by_class_name was removed in Selenium 4.3)
    next_button=driver.find_element(By.CLASS_NAME,'pn-next')
    print(next_button.text)
    next_button.click()
    time.sleep(4)
print(len(booksstore))
print(booksstore)
fi.close()
driver.quit()
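The script stores its results as one JSON object per line, which makes the file easy to load again later. The sketch below shows a round trip in that format using a temporary file; the two records are made up and only mimic the shape of the dictionaries built above, and `write_records` and `read_records` are hypothetical helper names.

```python
import json
import os
import tempfile


def write_records(path, records):
    """Append one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Load every non-empty line of the file back into a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Made-up records in the same shape as the ones the script writes.
sample = [
    {"书籍名称": "Example Python Book", "书籍价格": "59.00"},
    {"书籍名称": "Another Example", "书籍价格": "88.00", "商店名称": "Example Shop"},
]

path = os.path.join(tempfile.mkdtemp(), "books.txt")
write_records(path, sample)
loaded = read_records(path)
print(len(loaded), loaded[0]["书籍名称"])
```

Opening the file in a `with` block also guarantees it is closed if the scrape fails partway through, which the explicit `open` and `close` calls in the script do not.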
[Screenshot: the scraping result]
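Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com to request removal.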