前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Java爬取先知论坛文章

Java爬取先知论坛文章

作者头像
全栈程序员站长
发布2022-07-13 18:42:57
6690
发布2022-07-13 18:42:57
举报
文章被收录于专栏:全栈程序员必看

大家好,又见面了,我是全栈君,祝每个程序员都可以多学几门语言。

Java爬取先知论坛文章

0x00 前言

上篇文章写了部分爬虫代码,这里给出一个完整的爬取先知论坛文章代码,用于技术交流。

0x01 代码实现

pom.xml加入依赖:

代码语言:javascript
复制
<dependencies>

        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.7</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/junit/junit -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>





    </dependencies>
实现代码

实现类:

代码语言:javascript
复制
package xianzhi;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Climbimpl implements Runnable {
    private String url ;
    private int pages;
    private String filename;



    Lock lock = new ReentrantLock();

    public Climbimpl(String url, int pages,String filename) {
        this.url = url;
        this.pages = pages;
        this.filename = filename;
    }

    public void run() {
        File file = new File(this.filename);

        boolean mkdir = file.mkdir();

        if (mkdir){
            System.out.println("目录已创建");
        }

        lock.lock();

//        String url = "https://xz.aliyun.com/";

        for (int i = 1; i < this.pages; i++) {
            try {

            String requesturl = this.url+"?page="+i;
            Document doc = null;
            doc = Jsoup.parse(new URL(requesturl), 10000);
            Elements element = doc.getElementsByClass("topic-title");
            List<String> href = element.eachAttr("href");
                for (String s : href) {
                    try{
                        Document requests = Jsoup.parse(new URL(this.url+s), 100000);
//                        String topic_content = requests.getElementById("topic_content").text();
                        String titile = requests.getElementsByClass("content-title").first().text();
                        System.out.println("已爬取"+titile+"->"+this.filename+titile+".html");


                        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(new FileOutputStream(this.filename+titile+".html"));
                        bufferedOutputStream.write(requests.toString().getBytes());
                        bufferedOutputStream.flush();
                        bufferedOutputStream.close();


                    }catch (Exception e){
                        System.out.println("爬取"+this.url+s+"报错"+"报错信息"+e);
                    }
                }


            } catch (IOException e) {
                e.printStackTrace();
            }


        }
        lock.unlock();

    }
}

main类:

代码语言:javascript
复制
package xianzhi;

public class TestClimb {
    public static void main(String[] args) {
        int Threadlist_num = 10; //线程数
        String url = "https://xz.aliyun.com/";  //设置url
        int pages = 10; //读取页数
        String path = "D:\\paramss\\";  //设置保存路径

        Climbimpl climbimpl = new Climbimpl(url,pages,path);
        for (int i = 0; i < Threadlist_num; i++) {
            new Thread(climbimpl).start();

        }
    }
}

0x03 结尾

该爬虫总体的代码都比较简单。

发布者:全栈程序员栈长,转载请注明出处:https://javaforall.cn/119947.html原文链接:https://javaforall.cn

本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2021年12月,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • Java爬取先知论坛文章
    • 0x00 前言
      • 0x01 代码实现
        • 0x03 结尾
        领券
        问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档