I had been looking for a convenient way to scrape web pages, and today I found one: parsing pages with XPath works very well.
<!--xsoup-->
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>xsoup</artifactId>
<version>0.3.2</version>
</dependency>
xsoup already includes jsoup, so this one dependency is all you need to add.
Reference: http://webmagic.io/docs/zh/posts/ch4-basic-page-processor/xsoup.html
When scraping page content, you can right-click the relevant piece of markup in the browser and copy its XPath.
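To make a copied path easier to read, here is a minimal sketch of how its positional steps resolve. It rests on assumptions: it uses the JDK's built-in javax.xml.xpath engine rather than xsoup so that it runs with no extra dependency, and PAGE, CopiedXPathDemo and first are hypothetical names over made-up, well-formed markup, not the real page.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class CopiedXPathDemo {
    // Hypothetical stand-in for a page; not the markup of any real site.
    static final String PAGE =
        "<html><body><div id=\"app\">"
        + "<div>nav</div>"
        + "<div><div>Title</div><div>Body text</div></div>"
        + "</div></body></html>";

    // Returns the text of the first node matched by the expression,
    // or an empty string when nothing matches.
    static String first(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(PAGE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // div[2] is the second div child of the element with id "app";
        // the trailing div[1] and div[2] pick that element's first and second div children.
        System.out.println(first("//*[@id='app']/div[2]/div[1]")); // prints Title
        System.out.println(first("//*[@id='app']/div[2]/div[2]")); // prints Body text
    }
}
```

Because every step is positional, a path copied this way can stop matching when the page layout changes, so it is worth re-copying the path if the output suddenly comes back empty.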
Example: suppose we want to scrape the content of this article: https://www.cls.cn/detail/973228
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import us.codecraft.xsoup.Xsoup;
// HttpUtil is not part of xsoup; it is presumably Hutool's cn.hutool.http.HttpUtil (an assumption)

// URL of a single Cailian Press (财联社) article
Document document = Jsoup.parse(HttpUtil.get("https://www.cls.cn/detail/973228"));
// Title
System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[1]").evaluate(document).getElements().get(0).text());
System.out.println("--------------------------------------------");
// Body
System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[2]").evaluate(document).getElements().get(0).text());
//System.out.println("--------------------------------------------");
//System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[3]").evaluate(document).get());
System.out.println("--------------------------------------------");
// A trailing div step with no index matches every child div; these hold the images
Elements elements = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[3]/div").evaluate(document).getElements();
for (Element element : elements) {
    System.out.println(element.select("img").attr("src"));
}
System.out.println("--------------------------------------------");
// list() returns the same matches as strings
List<String> list = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div[3]/div").evaluate(document).list();
System.out.println(list);
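The image loop above depends on one XPath rule: a step without an index matches every sibling of that name, while an indexed step matches only one. The sketch below illustrates that rule and nothing more. It is not the xsoup API; it uses the JDK's javax.xml.xpath so it can run standalone, and PAGE, AllMatchesDemo and values are hypothetical names over made-up markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AllMatchesDemo {
    // Hypothetical gallery markup; the ids and paths are invented for illustration.
    static final String PAGE =
        "<div id=\"gallery\">"
        + "<div><img src=\"/a.png\"/></div>"
        + "<div><img src=\"/b.png\"/></div>"
        + "</div>";

    // Returns the text of every node matched by the expression, in document order.
    // An expression that matches nothing yields an empty list rather than an exception.
    static List<String> values(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(PAGE)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // No index on the div step: both child divs match, so both src values come back.
        System.out.println(values("//div[@id='gallery']/div/img/@src"));
        // div[1] narrows the match to the first child div only.
        System.out.println(values("//div[@id='gallery']/div[1]/img/@src"));
    }
}
```

Returning an empty list for a path that no longer matches avoids the IndexOutOfBoundsException that an unguarded get(0) would throw on an empty result.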
Scraping the Cailian Press telegraph feed:
Document document = Jsoup.parse(HttpUtil.get("https://www.cls.cn/telegraph"));
//System.out.println(Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div").evaluate(document).getElements());
Elements elements = Xsoup.compile("//*[@id=\"__next\"]/div/div[2]/div[2]/div[1]/div[2]/div").evaluate(document).getElements();
for (Element element : elements) {
    // Re-parse each entry so the class-based XPath runs against that entry only
    Document document1 = Jsoup.parse(element.toString());
    Elements elements1 = Xsoup.compile("//*[@class=\"b-c-e6e7ea telegraph-list\"]/div/div/span[2]").evaluate(document1).getElements();
    // Skip entries without a matching span, and reuse the result instead of evaluating the XPath twice
    if (null != elements1 && elements1.size() > 0) {
        System.out.println(elements1.get(0).select("span").text());
    }
}
This way all of the text content gets scraped.
Scraping the Wallstreetcn (华尔街见闻) RSS XML file:
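Document document = Jsoup.parse(HttpUtil.get("https://dedicated.wallstreetcn.com/rss.xml"));
// Jsoup.parse uses the HTML parser, so the feed ends up wrapped in /html/body
Elements elements = Xsoup.compile("/html/body/rss/channel/item").evaluate(document).getElements();
for (Element element : elements) {
    // For this feed, the third text node of each item is the article URL
    System.out.println(element.textNodes().get(2));
}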
Output (the article URLs):
https://wallstreetcn.com/articles/3655852
https://wallstreetcn.com/articles/3655850
https://wallstreetcn.com/articles/3655845
https://wallstreetcn.com/articles/3655851
https://wallstreetcn.com/articles/3655846
https://wallstreetcn.com/articles/3655844
https://wallstreetcn.com/articles/3655842
https://wallstreetcn.com/articles/3655831
https://wallstreetcn.com/articles/3655785
https://wallstreetcn.com/articles/3655820
https://wallstreetcn.com/articles/3655827
https://wallstreetcn.com/articles/3655830
https://wallstreetcn.com/articles/3655829
https://wallstreetcn.com/articles/3655824
https://wallstreetcn.com/articles/3655826
https://wallstreetcn.com/articles/3655825
https://wallstreetcn.com/articles/3655821
https://wallstreetcn.com/articles/3655817
https://wallstreetcn.com/articles/3655814
https://wallstreetcn.com/articles/3655812
https://wallstreetcn.com/articles/3655810
https://wallstreetcn.com/articles/3655802
https://wallstreetcn.com/articles/3655803
https://wallstreetcn.com/articles/3655793
https://wallstreetcn.com/articles/3655799
https://wallstreetcn.com/articles/3655798
https://wallstreetcn.com/articles/3655787
https://wallstreetcn.com/articles/3655790
https://wallstreetcn.com/articles/3655789
https://wallstreetcn.com/articles/3655782
https://wallstreetcn.com/articles/3655778
https://wallstreetcn.com/articles/3655746
https://wallstreetcn.com/articles/3655763
https://wallstreetcn.com/articles/3655774
https://wallstreetcn.com/articles/3655755
https://wallstreetcn.com/articles/3655771
https://wallstreetcn.com/articles/3655761
https://wallstreetcn.com/articles/3655734
https://wallstreetcn.com/articles/3655758
https://wallstreetcn.com/articles/3655749
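The RSS loop above reads the URL by text-node position, which depends on how the HTML parser happens to lay out each item. As an alternative sketch, and not what the original code does, the feed can be parsed as XML with the JDK's own parser and the link elements selected by name. RssLinks, links and SAMPLE are hypothetical names, and the sample feed uses placeholder example.com URLs rather than real data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssLinks {
    // Made-up two-item feed used only to exercise the method below.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel>"
        + "<title>Sample</title><link>https://example.com/</link>"
        + "<item><title>First</title><link>https://example.com/articles/1</link></item>"
        + "<item><title>Second</title><link>https://example.com/articles/2</link></item>"
        + "</channel></rss>";

    // Parses the feed as XML and returns the <link> text of every <item>, in feed order.
    static List<String> links(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/rss/channel/item/link", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read the feed as XML", e);
        }
    }

    public static void main(String[] args) {
        // The channel-level <link> is skipped because the path only selects item/link.
        for (String link : links(SAMPLE)) {
            System.out.println(link);
        }
    }
}
```

With a real feed, the string passed to links would be the response body fetched from the RSS URL. Selecting by element name keeps working if the feed adds or reorders fields inside an item.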
The video is on my Bilibili channel: java后端指南.