Full-Text Search in Practice 1: Simple Web Page Scraping and Search

Original

技术路漫漫

Last modified 2020-07-13 10:08:23

Part of the column: 技术路漫漫

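This article uses jsoup and Elasticsearch to scrape content from a given web page, store it in Elasticsearch (ES), and then provide full-text retrieval through ES's search capability.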

Setting up the environment

ES is installed with Docker. Because the current Spring Boot release targets ES 7.6.2, the same version is installed here for consistency:

docker run -d --name es-test -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2

The main dependencies for this example are:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>

Analyzing the target page structure

Before scraping with jsoup, first analyze the structure of the target page. The key structure of the target page looks like this:

<div class="titleList">
    <ul class="font02">
        <li class="titleList_bj">
            <div class="titleList_01">
                <a href="javascript:void(0);" onclick="showNewsDetail('014002003', '84168');"
                   title="X网电动汽车服务有限公司2020年第五次服务增补招标采购项目中标结果公告">
                    [X网电动汽车服务有限公司]
                    X网电动汽车服务有限公司2020年第五次服务增补招标采购项目中标结果公告
                </a></div>
            <div class="titleList_02">2020-07-11</div>
        </li>
        <li>
            <div class="titleList_01">
                <a href="javascript:void(0);" onclick="showNewsDetail('014002003', '84167');"
                   title="X网电动汽车服务有限公司2020年第五次服务招标采购项目中标结果公告">
                    [X网电动汽车服务有限公司]
                    X网电动汽车服务有限公司2020年第五次服务招标采购项目中标结果公告
                </a></div>
            <div class="titleList_02">2020-07-11</div>
        </li>
    </ul>
</div>

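After analyzing the page structure, the content to extract is:

The two arguments of the onclick handler, because they are needed to build the detail-page URL
The text of the hyperlink, which is the bulletin title
The content of the div with class titleList_02, which is the publish date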
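
To make the first item concrete, here is a minimal standalone sketch of turning an onclick value into a detail URL using only java.util.regex. The class name OnclickArgs and the method detailUrl are hypothetical and exist only for illustration; the URL layout is the one used by the service code later in this article, and the regular expression is a slightly stricter variant, not the author's exact pattern.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OnclickArgs {

    // Two single-quoted numeric arguments separated by a comma,
    // e.g. showNewsDetail('014002003', '84168');
    private static final Pattern ARGS = Pattern.compile("'(\\d+)'\\s*,\\s*'(\\d+)'");

    // Detail-page URL layout taken from the service code in this article.
    private static final String DETAIL_URL_FORMAT = "http://ecp.sgcc.com.cn/html/news/%s/%s.html";

    /** Returns the detail URL, or empty if the text holds no pair of numeric arguments. */
    public static Optional<String> detailUrl(String onclick) {
        Matcher m = ARGS.matcher(onclick);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(String.format(DETAIL_URL_FORMAT, m.group(1), m.group(2)));
    }

    public static void main(String[] args) {
        String onclick = "showNewsDetail('014002003', '84168');";
        System.out.println(detailUrl(onclick).orElse("no match"));
    }
}
```

Returning an Optional keeps the "no match" case explicit, so a list item with an unexpected onclick value can be skipped or logged instead of producing a record without a URL.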

Writing the page scraping service

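The main logic is:

Use jsoup's select selectors to pick out specific HTML elements and extract the required content.
Store the scraped content in ES through the ES repository.
Query specific fields through the repository's find methods.

The code is as follows: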

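import java.io.IOException;
import java.time.LocalDate;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

// Imports for Snowflake, Bulletin and BulletinRepository depend on the project layout and are omitted here.
@Service
public class PageParseService {

    @Value("${app.web.initUrl}")
    private String initUrl;

    private static final Pattern HREF_ID_PATTERN = Pattern.compile(".*'(\\d+)'.*'(\\d+)'.*");

    private static final String HREF_URL_FORMAT = "http://ecp.sgcc.com.cn/html/news/%s/%s.html";

    @Autowired
    private Snowflake snowflake;

    @Autowired
    private BulletinRepository bulletinRepo;

    /**
     * Scrapes the list page at the configured initial URL.
     *
     * @return number of records stored
     * @throws IOException if the page cannot be fetched
     */
    public int listPageParse() throws IOException {
        return this.listPageParse(initUrl);
    }

    /**
     * Scrapes the list page at the given URL and stores each entry in ES.
     *
     * @param listUrl URL of the list page
     * @return number of records stored
     * @throws IOException if the page cannot be fetched
     */
    public int listPageParse(String listUrl) throws IOException {
        Document document = Jsoup.connect(listUrl).get();
        // Select every li under any div whose class is titleList
        Elements elements = document.select("div.titleList li");
        int count = 0;
        for (Element e : elements) {
            // The hyperlink carries both the title and the onclick arguments
            Element href = e.selectFirst("a");
            if (href == null) {
                // Skip list items without a link instead of failing with a NullPointerException
                continue;
            }
            Bulletin bulletin = new Bulletin();
            bulletin.setId(snowflake.nextId());
            bulletin.setTitle(href.text());
            // Extract the two IDs from the onclick text and build the detail URL
            Matcher idMatcher = HREF_ID_PATTERN.matcher(href.attr("onclick"));
            if (idMatcher.matches()) {
                bulletin.setDetailUrl(
                        String.format(HREF_URL_FORMAT, idMatcher.group(1), idMatcher.group(2)));
            }
            // The date text is in ISO format (yyyy-MM-dd), which LocalDate.parse accepts directly
            String strDate = e.select("div.titleList_02").text();
            bulletin.setPublishDate(LocalDate.parse(strDate));
            // Save to ES
            bulletinRepo.save(bulletin);
            count++;
        }
        return count;
    }

    /**
     * Searches bulletins whose title matches the given keywords.
     *
     * @param words keywords to search for
     * @return matching bulletins, newest first
     */
    public List<Bulletin> searchByTitle(String words) {
        return bulletinRepo.findByTitleOrderByPublishDateDesc(words);
    }

}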

Writing the supporting data classes

First, the repository:

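// The ID type parameter is Long to match the Long id field of Bulletin
public interface BulletinRepository extends ElasticsearchRepository<Bulletin, Long> {

    /**
     * Searches bulletins whose title matches the given keywords and returns them
     * ordered by publish date, newest first.
     *
     * @param title keywords to match against the title
     * @return matching bulletins
     */
    List<Bulletin> findByTitleOrderByPublishDateDesc(String title);

}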

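One point needs special attention in the entity class. According to the official Spring Boot documentation, a field of date type must specify a format; otherwise ES stores it as a long by default, which causes an error when the content read back from ES is converted into a Java bean.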

The code is as follows:

@Data
@SuperBuilder
@AllArgsConstructor
@NoArgsConstructor
@Document(indexName = "bulletin")
public class Bulletin {

    @Id
    private Long id;
    private String title;

    /** A date field must specify a format; in testing, it was stored as a long when no format was given */
    @Field(type = FieldType.Date, format = DateFormat.date_optional_time)
    private LocalDate publishDate;

    private String detailUrl;
}
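
To illustrate the date-format note above, the following sketch uses only java.time to show the two representations involved: the ISO text that the list page provides (and that a date format such as date_optional_time can read) and the numeric epoch-milliseconds form that the note refers to as a long. The class PublishDateFormats and its methods are hypothetical names used for illustration. The sketch does not talk to ES and does not reproduce Spring Data's mapping logic; it only shows why a value written in one form is not interchangeable with the other when it is read back into a LocalDate.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PublishDateFormats {

    /** ISO-8601 date text such as "2020-07-11", the form used on the list page. */
    public static String toIsoText(LocalDate date) {
        return date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    /** Milliseconds since the epoch at the start of that day in UTC: the numeric (long) form. */
    public static long toEpochMillis(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        // LocalDate.parse expects ISO-8601 text by default
        LocalDate date = LocalDate.parse("2020-07-11");
        System.out.println(toIsoText(date));
        System.out.println(toEpochMillis(date));
    }
}
```

With an explicit format on the field, the stored value stays in the text form, which is also easier to read in the raw search responses.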

Writing the configuration class

This is mainly the ES configuration class:

@Configuration
@EnableElasticsearchRepositories
public class ElasticSearchConfig {

    @Value("${app.es.host}")
    private String host;

    @Bean
    public RestHighLevelClient client() {
        ClientConfiguration configuration = ClientConfiguration.builder().connectedTo(host).build();
        return RestClients.create(configuration).rest();
    }

    @Bean
    public ElasticsearchOperations elasticsearchTemplate() {
        // Note: the bean must be named elasticsearchTemplate, otherwise an error occurs
        return new ElasticsearchRestTemplate(client());
    }
}

Writing the configuration file

The YAML configuration file is as follows:

# Disable the Shiro-related auto-configuration
shiro:
  enabled: false
  web:
    enabled: false
  annotations:
    enabled: false

app:
  es:
    host: localhost:9200
  web:
    initUrl: http://ecp.sgcc.com.cn/topic_news_list.jsp?columnName=topic23

Writing the test class

Two test methods verify the service:

@SpringBootTest
@Slf4j
public class PageParseServiceTest {

    @Autowired
    PageParseService pageParseService;

    @Test
    void insertTest() throws IOException {
        int rows = pageParseService.listPageParse();
        log.info("rows:{},", rows);
    }

    @Test
    void findTest() {
        List<Bulletin> bulletins = pageParseService.searchByTitle("鲁能");
        log.info("size:{},detail:{}", bulletins.size(), bulletins);
    }
}

A sample result from findTest is shown below; the keyword 鲁能 matched 2 records:

size:2,detail:[Bulletin(id=1281947926733656064, title=[鲁能集团有限公司] 海阳富阳置业有限公司单一来源采购事前公示, publishDate=2020-07-08, detailUrl=http://ecp.sgcc.com.cn/html/news/014002005/84035.html), Bulletin(id=1281947927257944064, title=[鲁能集团有限公司] 海阳富阳置业有限公司单一来源采购事前公示, publishDate=2020-07-08, detailUrl=http://ecp.sgcc.com.cn/html/news/014002005/84022.html)]

Working with the ES REST API

Besides operating ES from code as above, ES also exposes a REST API that can be called with curl to work with the data.

# Check the running node by listing its indices
curl "localhost:9200/_cat/indices?v"

# Query everything in the bulletin index, i.e. with an empty search condition
curl -X GET "localhost:9200/bulletin/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} }
}
'

# Search by keyword, with results sorted by publishDate in descending order
curl -X GET "localhost:9200/bulletin/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "title": "鲁能" } },
  "sort": [
    { "publishDate": "desc" }
  ]
}
'

# Delete the bulletin index
curl -X DELETE "localhost:9200/bulletin?pretty"
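
The keyword query from the curl examples can also be built from plain Java, without Spring, using the JDK's java.net.http API (Java 11 or later). The sketch below is an illustration under assumptions: the class TitleSearchRequest and its methods are hypothetical names, the node is assumed to listen on localhost:9200 as in the Docker command earlier, and the request uses POST, which the _search endpoint accepts as an alternative to GET with a body. It only builds the request; the commented line shows how it might be sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class TitleSearchRequest {

    /** Escapes backslashes and double quotes; control characters are not handled in this sketch. */
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds the same match-and-sort body as the curl example. */
    public static String body(String keyword) {
        return "{"
                + "\"query\": { \"match\": { \"title\": \"" + escapeJson(keyword) + "\" } },"
                + "\"sort\": [ { \"publishDate\": \"desc\" } ]"
                + "}";
    }

    /** Builds, but does not send, a POST to <baseUrl>/bulletin/_search. */
    public static HttpRequest build(String baseUrl, String keyword) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/bulletin/_search?pretty"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(keyword), StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:9200", "鲁能");
        // Sending needs a running node:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).body()
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body("鲁能"));
    }
}
```

For anything beyond a quick check, a JSON library is a safer way to build the body than string concatenation, since it handles all escaping cases.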

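That completes a simple web scraping and search example; I hope it helps. The code is open source on Gitee: https://gitee.com/coolpine/backends. Later installments will extend this example, for instance by scraping detail pages, using ik for Chinese word segmentation, and highlighting results.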

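Original work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com for removal.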
