Java Data Scraping, Part 3: Scraping OSChina News (New Version)

geekfly

Published 2022-04-24 07:08:47

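Looking back at my earlier posts on web data scraping, I have been getting private messages from readers who want to discuss them, so I decided to tidy this material up again and will keep updating this series when I find the time.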

This post rewrites the scraping code for the new version of the OSChina (开源中国) news list.

URL: https://www.oschina.net/news
Jar: jsoup.1.7.2.jar
Project source: https://github.com/geekfly2016/Spider

Locating the news list

Inspecting the page shows that the entire news list sits inside the following div.

<div class="news-list-item" id="all-news">
<!-- article list -->
</div>

Each individual news item sits inside a div like this one.

<div class="item box"></div>

So the code that selects the news list is:

Elements items = document.select("#all-news .item");
System.out.println(items.size());

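Note: the div has two classes, item and box. Jsoup can require both, either with a compound selector such as .item.box or with two chained select calls, but here a single class is already enough to match precisely. See: http://blog.csdn.net/ywf008/article/details/53215648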

Analyzing a single news item

The title is in the first a tag, and the title link is its href attribute.

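// host is the site root URL, defined elsewhere in the program
String title = item.select("a").first().text();
String title_href = item.select("a").first().attr("href");
// Prepend the domain only when the link is not already absolute
if (!title_href.startsWith("https://")) {
    title_href = host + title_href;
}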

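Note: printing the links while scraping showed that some are already complete URLs while others still need the domain prepended, hence the check for whether the link starts with https://.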
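The startsWith("https://") check does not cover links that begin with http:// or protocol-relative links that begin with //. As one possible alternative, here is a minimal sketch that lets java.net.URI do the resolution. The class name LinkResolver, the BASE constant, and the sample paths are assumptions for illustration; BASE stands in for the host value used above.

```java
import java.net.URI;

class LinkResolver {
    // Assumed site root; plays the role of `host` in the snippet above.
    static final URI BASE = URI.create("https://www.oschina.net/");

    /**
     * Resolves href against BASE. Absolute URLs come back unchanged,
     * relative ones get the scheme and host from BASE.
     * URI.resolve(String) throws IllegalArgumentException on malformed input.
     */
    public static String toAbsolute(String href) {
        return BASE.resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample links, not taken from the live page.
        System.out.println(toAbsolute("/news/12345/example"));
        System.out.println(toAbsolute("https://my.oschina.net/u/1/blog/2"));
        System.out.println(toAbsolute("//static.oschina.net/img/a.png"));
    }
}
```

Because resolution follows the generic URI rules, the same method works for every link format without a growing list of startsWith checks.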

News description

String desc = item.select("div[class=sc sc-text text-gradient wrap summary]").text();

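When an attribute has several values, besides using one value that identifies the element (as above) or chaining several select calls, you can also match the whole attribute with the div[class=xx yy zz] pattern (the approach I recommend).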
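One thing to keep in mind when choosing between these forms: [class=xx yy zz] compares the attribute value as a whole string, so it depends on the order in which the classes are written, while a compound class selector does not. The sketch below illustrates the difference on a made-up HTML snippet; the class name SelectorComparison and the markup are assumptions, and it needs the jsoup jar on the classpath.

```java
import org.jsoup.Jsoup;

class SelectorComparison {
    /** Counts the elements in html that match the given CSS query. */
    public static int count(String html, String css) {
        return Jsoup.parse(html).select(css).size();
    }

    public static void main(String[] args) {
        // Hypothetical markup: same two classes, written in different order.
        String html = "<div class=\"item box\">a</div>"
                    + "<div class=\"box item\">b</div>";

        // Compound class selector: order of classes does not matter.
        System.out.println(count(html, "div.item.box"));

        // Attribute match: the full attribute string must be "item box".
        System.out.println(count(html, "div[class=item box]"));
    }
}
```

If the site ever reorders or adds a class, the attribute form stops matching, so it is worth rechecking after page redesigns.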

User avatar

String author_image = item.select("img[class=avatar]").attr("src");

or

String author_image = item.select("img").first().attr("src");

There is more than one way to get each of these values.

Author name and publish time

Element mr = item.select(".from .mr").get(0);
// Author
String author = mr.select("a").text();
// Remove the a tag from span[class=mr]; the remaining text is the publish time
mr.select("a").remove();
String published = mr.text();
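Removing the a tag changes the parsed document, which matters if the same element is read again later. A possible alternative, sketched here under the assumption that the publish time is a text node sitting directly inside the span, is Element.ownText(), which returns only the element's own text and leaves the tree untouched. The class name AuthorAndTime and the sample markup are hypothetical, and the jsoup jar is required.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AuthorAndTime {
    /** Returns {author, published}; both are empty if the span is missing. */
    public static String[] extract(String html) {
        Document doc = Jsoup.parse(html);
        Element mr = doc.select(".from .mr").first();
        if (mr == null) {
            return new String[] {"", ""};
        }
        String author = mr.select("a").text();
        // ownText() skips the text of child elements such as the a tag.
        String published = mr.ownText().trim();
        return new String[] {author, published};
    }

    public static void main(String[] args) {
        // Hypothetical markup modeled on the structure described above.
        String html = "<div class=\"from\">"
                    + "<span class=\"mr\"><a href=\"#\">alice</a> published 2 hours ago</span>"
                    + "<span class=\"mr\">3 comments</span>"
                    + "</div>";
        String[] result = extract(html);
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```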

Comment count

String number = item.select(".from .mr").last().text();
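The value read above is still a string. If a numeric count is needed, one option is to pull the first run of digits out of the text. This is a minimal sketch that assumes the text contains the count as plain digits, possibly next to a label; the class name CommentCount and the sample strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentCount {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns the first run of digits in text as an int, or 0 if there is none. */
    public static int parse(String text) {
        Matcher m = DIGITS.matcher(text);
        return m.find() ? Integer.parseInt(m.group()) : 0;
    }

    public static void main(String[] args) {
        // Hypothetical sample texts, not copied from the live page.
        System.out.println(parse("12 comments"));
        System.out.println(parse(""));
    }
}
```

Returning 0 for missing digits keeps the scraping loop running when an item has no count, at the cost of not distinguishing "zero comments" from "count not found".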

At this point we can extract all the news data on the current page. Note: the news list also contains one advertisement entry.

Filtering out the ad

// Skip advertisements
if(!item.attr("data-tracepid").isEmpty()){
    continue;
}
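To show where this check sits, here is a minimal sketch of the whole loop run against an inline HTML string instead of the live site. The class name AdFilter, the markup, and the data-tracepid value are assumptions for illustration; the selector and the filter condition are the ones used in this post. It needs the jsoup jar on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AdFilter {
    /** Returns the titles of all non-ad items inside #all-news. */
    public static List<String> titles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element item : doc.select("#all-news .item")) {
            // attr() returns "" when the attribute is absent, so only ads are skipped.
            if (!item.attr("data-tracepid").isEmpty()) {
                continue;
            }
            Element link = item.select("a").first();
            if (link == null) {
                continue; // no title link; skip rather than throw
            }
            titles.add(link.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment with one ad between two news items.
        String html = "<div class=\"news-list-item\" id=\"all-news\">"
                    + "<div class=\"item box\"><a href=\"/news/1\">First</a></div>"
                    + "<div class=\"item box\" data-tracepid=\"ad-1\"><a href=\"/ad\">Sponsored</a></div>"
                    + "<div class=\"item box\"><a href=\"/news/2\">Second</a></div>"
                    + "</div>";
        System.out.println(titles(html));
    }
}
```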

Repository: https://github.com/geekfly2016/Spider Code path: Spider/src/xyz/geekfly/oschina/News.java

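This article is part of the Tencent Cloud self-media sync and exposure program and is shared from the author's personal site/blog.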

Originally published: 2017/07/19. In case of infringement, contact cloudcommunity@tencent.com for removal.
