Web Scraping in Java with XPath

java后端指南 (Java Backend Guide)

Published 2022-05-24 15:46:59

Included in the column: java后端

Today's topic: web scraping in Java with XPath

I had been looking for a convenient way to scrape web pages, and today I found one: parsing pages with XPath works very well.

Dependency

<!-- xsoup -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>xsoup</artifactId>
    <version>0.3.2</version>
</dependency>

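Xsoup is built on jsoup and brings it in with it, so this is the only dependency you need to declare for the parsing code. Note that the examples below also call HttpUtil.get to download the page. The article does not say where HttpUtil comes from; it looks like Hutool's cn.hutool.http.HttpUtil, in which case Hutool would need its own dependency.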

Reference: http://webmagic.io/docs/zh/posts/ch4-basic-page-processor/xsoup.html

Test code

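When scraping a page, you do not have to write the XPath by hand. Open the page in the browser's developer tools, right-click the element you want, and copy its XPath.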
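To make a copied path less opaque, here is a minimal sketch of how such a positional XPath is evaluated. It uses the JDK's built-in javax.xml.xpath rather than Xsoup so that it runs without extra dependencies. The PAGE markup and the class and method names are made up for illustration and are not the real cls.cn page. Each div[n] step picks the n-th div child, counting from 1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CopiedXPathDemo {
    // Hypothetical, well-formed stand-in for a page; NOT the real cls.cn markup.
    static final String PAGE =
            "<html><body><div id=\"__next\"><div>"
            + "<div>nav</div>"
            + "<div><div>Title text</div><div>Body text</div></div>"
            + "</div></div></body></html>";

    /** Evaluates an XPath against PAGE and returns the text of the first match ("" if none). */
    static String evaluate(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // div[2] is the second div child of the wrapper; div[1] is its first div child.
        System.out.println(evaluate("//*[@id=\"__next\"]/div/div[2]/div[1]")); // Title text
        System.out.println(evaluate("//*[@id=\"__next\"]/div/div[2]/div[2]")); // Body text
    }
}
```

Because every step depends on the position of a div among its siblings, a copied path like this can stop matching as soon as the page layout changes.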

Example: suppose we want to scrape the content of a single article, https://www.cls.cn/detail/973228

import cn.hutool.http.HttpUtil; // assumption: the article does not say which HttpUtil it uses; Hutool's is a likely candidate
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import us.codecraft.xsoup.Xsoup;

import java.util.List;

public class ClsArticleDemo {
    public static void main(String[] args) {
        // URL of a single cls.cn (财联社) article
        Document document = Jsoup.parse(HttpUtil.get("https://www.cls.cn/detail/973228"));
        // Title; get(0) throws if the path matches nothing, e.g. after a page layout change
        System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[1]").evaluate(document).getElements().get(0).text());
        System.out.println("--------------------------------------------");
        // Body text
        System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[2]").evaluate(document).getElements().get(0).text());
        System.out.println("--------------------------------------------");
        // A trailing /div without an index selects every child div; here each one holds an image
        Elements elements = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[3]/div").evaluate(document).getElements();
        for (Element element : elements) {
            System.out.println(element.select("img").attr("src"));
        }
        System.out.println("--------------------------------------------");
        // list() returns the same matches as a list of strings
        List<String> list = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[3]/div").evaluate(document).list();
        System.out.println(list);
    }
}

Scraping the cls.cn (财联社) telegraph feed:

// Goes inside a method; needs the jsoup (Jsoup, Document, Element, Elements) and Xsoup imports, plus whichever HttpUtil you use.
Document document = Jsoup.parse(HttpUtil.get("https://www.cls.cn/telegraph"));
// One element per entry in the telegraph list
Elements elements = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div").evaluate(document).getElements();
for (Element element : elements) {
    // Re-parse the entry as its own document so the class-based XPath only sees this entry
    Document document1 = Jsoup.parse(element.toString());
    Elements elements1 = Xsoup.compile("//*[@class=\"b-c-e6e7ea telegraph-list\"]/div/div/span[2]").evaluate(document1).getElements();
    // Skip entries that have no second span; reuse the result instead of evaluating the XPath twice
    if (elements1 != null && !elements1.isEmpty()) {
        System.out.println(elements1.get(0).select("span").text());
    }
}

This prints the text of every telegraph entry.
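The telegraph loop applies an XPath to one entry at a time and skips entries where nothing matches. As a rough sketch of the same idea without re-parsing each entry, an XPath can be evaluated relative to each matched node. This uses the JDK's javax.xml.xpath instead of Xsoup so it runs without extra dependencies. The SAMPLE markup, the element names list and entry, and the method secondSpans are hypothetical and are not taken from the cls.cn page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathDemo {
    // Hypothetical markup: three entries, the middle one has no second span.
    static final String SAMPLE =
            "<list>"
            + "<entry><div><span>10:01</span><span>First item</span></div></entry>"
            + "<entry><div><span>ad</span></div></entry>"
            + "<entry><div><span>10:05</span><span>Second item</span></div></entry>"
            + "</list>";

    /** Returns the text of the second span of every entry that has one. */
    static List<String> secondSpans(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList entries = (NodeList) xpath.evaluate("/list/entry", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                // Relative path: the current entry is the context node, so no re-parsing is needed.
                Node span = (Node) xpath.evaluate("div/span[2]", entries.item(i), XPathConstants.NODE);
                if (span != null) { // entries without a second span are skipped
                    texts.add(span.getTextContent().trim());
                }
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondSpans(SAMPLE)); // [First item, Second item]
    }
}
```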

Scraping the Wallstreetcn (华尔街见闻) RSS XML file:

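// Goes inside a method; needs the jsoup (Jsoup, Document, Element, Elements) and Xsoup imports, plus whichever HttpUtil you use.
// Jsoup.parse uses the HTML parser, which wraps the feed in <html><body>, hence the /html/body prefix.
Document document = Jsoup.parse(HttpUtil.get("https://dedicated.wallstreetcn.com/rss.xml"));
Elements elements = Xsoup.compile("/html/body/rss/channel/item").evaluate(document).getElements();
for (Element element : elements) {
    // In this parse the article URL is the third text node of each item,
    // presumably because the HTML parser treats <link> as an empty element.
    System.out.println(element.textNodes().get(2));
}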

Output (the article URLs):

https://wallstreetcn.com/articles/3655852
https://wallstreetcn.com/articles/3655850
https://wallstreetcn.com/articles/3655845
https://wallstreetcn.com/articles/3655851
https://wallstreetcn.com/articles/3655846
https://wallstreetcn.com/articles/3655844
https://wallstreetcn.com/articles/3655842
https://wallstreetcn.com/articles/3655831
https://wallstreetcn.com/articles/3655785
https://wallstreetcn.com/articles/3655820
https://wallstreetcn.com/articles/3655827
https://wallstreetcn.com/articles/3655830
https://wallstreetcn.com/articles/3655829
https://wallstreetcn.com/articles/3655824
https://wallstreetcn.com/articles/3655826
https://wallstreetcn.com/articles/3655825
https://wallstreetcn.com/articles/3655821
https://wallstreetcn.com/articles/3655817
https://wallstreetcn.com/articles/3655814
https://wallstreetcn.com/articles/3655812
https://wallstreetcn.com/articles/3655810
https://wallstreetcn.com/articles/3655802
https://wallstreetcn.com/articles/3655803
https://wallstreetcn.com/articles/3655793
https://wallstreetcn.com/articles/3655799
https://wallstreetcn.com/articles/3655798
https://wallstreetcn.com/articles/3655787
https://wallstreetcn.com/articles/3655790
https://wallstreetcn.com/articles/3655789
https://wallstreetcn.com/articles/3655782
https://wallstreetcn.com/articles/3655778
https://wallstreetcn.com/articles/3655746
https://wallstreetcn.com/articles/3655763
https://wallstreetcn.com/articles/3655774
https://wallstreetcn.com/articles/3655755
https://wallstreetcn.com/articles/3655771
https://wallstreetcn.com/articles/3655761
https://wallstreetcn.com/articles/3655734
https://wallstreetcn.com/articles/3655758
https://wallstreetcn.com/articles/3655749

Process finished with exit code 0
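The feed example reads the link by text-node position, which depends on how the HTML parser happens to split the item. As a hedged alternative sketch, the same links can be selected by name with an XML-aware parser. This uses the JDK's built-in DOM and javax.xml.xpath rather than jsoup and Xsoup. SAMPLE_FEED is a made-up two-item RSS fragment that reuses two URLs from the output above; the real feed's other fields are not reproduced, and downloading the feed is left out. The names RssLinksDemo and itemLinks are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssLinksDemo {
    // Made-up RSS 2.0 fragment for illustration; not a copy of the real feed.
    static final String SAMPLE_FEED =
            "<rss version=\"2.0\"><channel><title>sample</title>"
            + "<item><title>A</title><link>https://wallstreetcn.com/articles/3655852</link></item>"
            + "<item><title>B</title><link>https://wallstreetcn.com/articles/3655850</link></item>"
            + "</channel></rss>";

    /** Returns the text of every /rss/channel/item/link element, in document order. */
    static List<String> itemLinks(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations, a common precaution when parsing XML from the network.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links = (NodeList) xpath.evaluate(
                    "/rss/channel/item/link", doc, XPathConstants.NODESET);
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                urls.add(links.item(i).getTextContent().trim());
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read RSS links", e);
        }
    }

    public static void main(String[] args) {
        for (String url : itemLinks(SAMPLE_FEED)) {
            System.out.println(url);
        }
    }
}
```

Selecting link by name keeps working if the feed adds or reorders fields inside an item, whereas a fixed text-node index would not.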

The video version is on my Bilibili channel: java后端指南.

Originally published: 2022-04-17.
