Getting Started with Web Crawlers in Java

南风

Published 2018-07-02 15:10:04

This article is included in the column: Java大联盟

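This time I want to share a different side of Java: writing a simple crawler that downloads the illustrations from an article on a website. You can of course crawl any other resource that interests you.

A crawler parses an HTML page in full, locates the target elements precisely, and then uses IO streams to save the resources locally, which completes the information gathering.

Python is the mainstream language for crawlers. Its mature, extensive libraries and easy-to-read code style have made it the first choice for many people.

Java holds its own, though, and has its own libraries for parsing HTML. Today we will use Jsoup and HttpClient to build a simple image crawler.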

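Environment:

1. An IDE of your choice (this article uses IDEA).

2. The Maven package manager.

3. A computer with Internet access, and you, ready and eager to try.

Let's get started:

1. Add the required dependencies to pom.xml: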

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>Spider</groupId>
    <artifactId>SpiderByJAVA</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.8.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.2</version>
        </dependency>


    </dependencies>
</project>

2. Create a class named WeChat and start by defining the target page.

String url = "http://www.wubupua.com/html/7203.html";

3. To send a request to a website from Java, an HttpClient instance submits a prepared HttpGet object. That completes one request and returns a response.

CloseableHttpClient httpClient = HttpClients.createDefault();
HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36");
CloseableHttpResponse response = httpClient.execute(httpGet);

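As shown above, there are two things to keep in mind when building the HttpGet request:

1> If the site lets you save images manually without logging in, setting your default browser's "User-Agent" request header on the HttpGet is usually enough to get the complete response.

2> Conversely, if every operation on the site requires a logged-in session, log in manually first and then set the current user's cookie value in the request header of the HttpGet to get the complete response.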
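For the second case, here is a minimal sketch of how a Cookie header value could be assembled before it is set on the request. The class name CookieHeader, the helper build, and the cookie names and values are all hypothetical; in practice you would copy the real values from your browser after logging in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CookieHeader {

    // Joins cookie name/value pairs into the "name=value; name2=value2"
    // form used by the Cookie request header.
    public static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Hypothetical cookies, standing in for the ones a logged-in browser holds.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sessionid", "abc123");
        cookies.put("lang", "zh");

        String header = build(cookies);
        System.out.println(header); // prints: sessionid=abc123; lang=zh

        // With the HttpGet from step 3, the value would be applied like this:
        // httpGet.setHeader("Cookie", header);
    }
}
```

A LinkedHashMap is used so the cookies keep the order in which they were added.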

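4. The body of the response is the HTML. It has to be decoded with the right character encoding so that as much of the text as possible comes through correctly.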

HttpEntity entity = response.getEntity();
String html = EntityUtils.toString(entity, "UTF-8");

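First call getEntity() on the response to get the corresponding HttpEntity object, then let EntityUtils decode it into the HTML document as a String. The "UTF-8" argument is the charset EntityUtils falls back to when the response does not declare one itself.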
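A page normally announces its encoding in the Content-Type response header, and a default such as UTF-8 is only needed when that declaration is missing. The sketch below illustrates the idea with the standard library alone. It is not HttpClient's own implementation, and the class name CharsetSniffer and the method charsetOf are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSniffer {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or the fallback if none is usable.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String key = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" parameter
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                String name = p.substring(key.length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", utf8)); // prints: ISO-8859-1
        System.out.println(charsetOf("text/html", utf8));                     // prints: UTF-8
    }
}
```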

5. Parse the HTML document with Jsoup's parser.

Document document = Jsoup.parse(html);
Elements elements = document.select("img");

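The select method of the returned Document object roughly locates the images, that is, the tags that hold them. Here we select every img tag in the document, which gives an Elements object containing all the img tags in the HTML.

6. Iterate over the Elements object and use the attr() method to read the image link from each img tag's src attribute.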

for (Element element : elements) {
    // Read the image link from the src attribute of each <img> tag
    String img_url = element.attr("src");
    System.out.println(img_url);
}
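One caveat: a src attribute may hold a relative path such as /upload/a.jpg, and new URL(...) rejects a string that has no protocol. A minimal sketch of resolving such a value against the page address with java.net.URI follows. The class name ImageUrlResolver and the example paths are hypothetical. Jsoup can do the same job through element.absUrl("src") when the page address is passed as the base URI to Jsoup.parse(html, url).

```java
import java.net.URI;

public class ImageUrlResolver {

    // Resolves a src value against the address of the page it came from.
    // Absolute links are returned unchanged. Note that URI.create throws
    // IllegalArgumentException for strings with illegal characters such as spaces.
    public static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.wubupua.com/html/7203.html";

        // Hypothetical src values, for illustration only
        System.out.println(resolve(page, "/upload/a.jpg"));
        // prints: http://www.wubupua.com/upload/a.jpg
        System.out.println(resolve(page, "img/b.jpg"));
        // prints: http://www.wubupua.com/html/img/b.jpg
        System.out.println(resolve(page, "http://cdn.example.com/upload/c.jpg"));
        // prints: http://cdn.example.com/upload/c.jpg
    }
}
```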

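However, not every image link collected this way belongs to a body illustration, so the illustration links need a closer look. Use the element picker in Chrome's developer tools (F12) to locate the body illustrations by hand. On this site, the links of all body illustrations turn out to contain the keyword "upload".

7. Clean the img links obtained in step 6 so that only the real illustration links remain.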

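for (Element element : elements) {
    String img_url = element.attr("src");
    // indexOf returns 0 for a match at the very start, so compare with >= 0
    if (img_url.indexOf("upload") >= 0) {
        // img_url is the link of a body illustration
    }
}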

The indexOf() method of the String class finds the links that contain the keyword "upload". Every img_url that passes the check is a real illustration link.
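The same filtering step can be tried out on plain strings. The sketch below uses String.contains, which is equivalent to indexOf(...) >= 0. It also shows why the comparison matters: indexOf returns 0 when the keyword sits at the very start of the link. The class name UploadFilter and the sample links are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UploadFilter {

    // Keeps only the links that contain the keyword "upload" anywhere.
    public static List<String> keepUploads(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (link.contains("upload")) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical links, for illustration only
        List<String> links = Arrays.asList(
                "http://example.com/upload/1.jpg",
                "upload/2.jpg",
                "http://example.com/static/logo.png");

        System.out.println(keepUploads(links));
        // prints: [http://example.com/upload/1.jpg, upload/2.jpg]

        // A match at the start of the string has index 0,
        // so a "> 0" comparison would drop this link.
        System.out.println("upload/2.jpg".indexOf("upload")); // prints: 0
    }
}
```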

8. Use IO streams to save the files locally. First create a folder on one of your drives to hold the downloaded images; I put mine at F:\img.

int nameIndex = 1;
for (Element element : elements) {
    String img_url = element.attr("src");
    if (img_url.indexOf("upload") >= 0) {
        URL img_Url = new URL(img_url);
        URLConnection connection = img_Url.openConnection();
        File target = new File("f:\\img", nameIndex + ".jpg");
        // try-with-resources closes both streams even if the copy fails
        try (InputStream inputStream = connection.getInputStream();
             FileOutputStream outputStream = new FileOutputStream(target)) {
            byte[] buf = new byte[1024];
            int l;
            // Copy the image to disk one buffer at a time
            while ((l = inputStream.read(buf)) != -1) {
                outputStream.write(buf, 0, l);
            }
        }
        System.out.println("Downloaded image " + nameIndex + " to local disk");
        nameIndex++;
        // Pause briefly before the next download
        Thread.sleep(10);
    }
}
System.out.println("All images downloaded");
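As an alternative to the manual buffer loop, the standard library's Files.copy can write a stream straight to a file. The following is a minimal sketch of that approach, not the method used in this article. The class name ImageSaver and the helper save are hypothetical, and the demo writes a few fake bytes to a temporary directory instead of downloading a real image.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ImageSaver {

    // Saves the stream as "<index>.jpg" inside dir, creating dir if needed.
    // The stream is closed when the copy finishes or fails.
    public static Path save(InputStream in, Path dir, int index) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(index + ".jpg");
        try (InputStream source = in) {
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Temporary directory and fake bytes stand in for F:\img and a real image
        Path dir = Files.createTempDirectory("img");
        byte[] fakeImage = {1, 2, 3};

        Path saved = save(new ByteArrayInputStream(fakeImage), dir, 1);
        System.out.println(saved.getFileName() + " " + Files.size(saved)); // prints: 1.jpg 3
    }
}
```

In the crawler, the first argument would be the stream returned by connection.getInputStream().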

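In the code above, the variable nameIndex supplies the file name. It is 1 when the first image is downloaded, so the first image is saved as 1.jpg, the second as 2.jpg, and so on.

The download thread sleeps briefly after each image because requests that arrive too frequently can make the server suspicious and cause it to close the connection, at which point the crawler stops working. This is a beginner tutorial with only a handful of images. The goal is just to give you a feel for how a crawler is built in Java, which turns out to work in much the same way as one written in Python.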
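A fixed pause is the simplest option. A common variation is to add a random component so the requests are not evenly spaced. The sketch below shows one way to compute such a delay; the class name PoliteDelay, the helper nextDelayMillis, and the 500 ms values are assumptions for illustration, not figures from this tutorial.

```java
import java.util.Random;

public class PoliteDelay {

    // Returns a delay in the range [baseMillis, baseMillis + jitterMillis).
    public static long nextDelayMillis(long baseMillis, long jitterMillis, Random random) {
        return baseMillis + (long) (random.nextDouble() * jitterMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        Random random = new Random();
        for (int i = 1; i <= 3; i++) {
            // Assumed values: wait at least 500 ms, plus up to 500 ms more
            long delay = nextDelayMillis(500, 500, random);
            System.out.println("Request " + i + ": sleeping " + delay + " ms");
            Thread.sleep(delay);
        }
    }
}
```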

Complete code:

package com.mrlee.spider;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class WeChat {
    public static void main(String[] args) throws Exception {
        // Target page to crawl
        String url = "http://www.wubupua.com/html/7203.html";
        String html;
        // try-with-resources closes the client and the response when done
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet(url);
            httpGet.setHeader("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36");
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                HttpEntity entity = response.getEntity();
                // UTF-8 is the fallback when the response declares no charset
                html = EntityUtils.toString(entity, "UTF-8");
            }
        }

        // Parse the HTML and select every <img> tag
        Document document = Jsoup.parse(html);
        Elements elements = document.select("img");

        int nameIndex = 1;
        for (Element element : elements) {
            String img_url = element.attr("src");
            // Keep only body illustrations, whose links contain "upload"
            if (img_url.indexOf("upload") >= 0) {
                URL img_Url = new URL(img_url);
                URLConnection connection = img_Url.openConnection();
                File target = new File("f:\\img", nameIndex + ".jpg");
                // try-with-resources closes both streams even if the copy fails
                try (InputStream inputStream = connection.getInputStream();
                     FileOutputStream outputStream = new FileOutputStream(target)) {
                    byte[] buf = new byte[1024];
                    int l;
                    while ((l = inputStream.read(buf)) != -1) {
                        outputStream.write(buf, 0, l);
                    }
                }
                System.out.println("Downloaded image " + nameIndex + " to local disk");
                nameIndex++;
                // Pause briefly before the next download
                Thread.sleep(10);
            }
        }
        System.out.println("All images downloaded");
    }
}

[Figure: screenshot of the result]