This article installs the ik analysis plugin into Elasticsearch, uses jsoup to parse a website's content into MySQL and ES, and builds a complete keyword full-text search with highlighted results.
For example, searching for the keywords 中国 鲁能 returns the matching articles with those keywords highlighted.
First, improve Chinese search by installing the ik analysis plugin:
# Create the container
docker run -d --name es-test -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.6.2
# Enter the container
docker exec -it es-test /bin/sh
# Check the ES root directory inside the container
sh-4.2# pwd
/usr/share/elasticsearch
# Copy the unpacked ik plugin from the host into the container
docker cp D:\ProgramData\docker\es\ik es-test:/usr/share/elasticsearch/plugins/ik
After copying, restart the es-test container; the plugin can then be verified following the official documentation:
# Create the iktest index
curl -XPUT http://localhost:9200/iktest
# Create the mapping
curl -XPOST http://localhost:9200/iktest/_mapping -H 'Content-Type:application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'
# Add test data
curl -XPOST http://localhost:9200/iktest/_create/1 -H 'Content-Type:application/json' -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}
'
curl -XPOST http://localhost:9200/iktest/_create/3 -H 'Content-Type:application/json' -d'
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
'
curl -XPOST http://localhost:9200/iktest/_create/4 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'
With the test data in place, the following call runs a tokenized query and shows the results:
# Tokenized query test
curl -XPOST http://localhost:9200/iktest/_search?pretty -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'
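To see why the mapping uses two different analyzers, the standard `_analyze` API can compare their tokenization of the same text (this assumes the same local ES instance with ik installed; the sample text is arbitrary):

```shell
# Fine-grained tokenization, used at index time to maximize recall
curl -XGET 'http://localhost:9200/_analyze' -H 'Content-Type:application/json' -d'
{"analyzer": "ik_max_word", "text": "中华人民共和国"}'
# Coarse-grained tokenization, used at search time to keep queries precise
curl -XGET 'http://localhost:9200/_analyze' -H 'Content-Type:application/json' -d'
{"analyzer": "ik_smart", "text": "中华人民共和国"}'
```

ik_max_word emits overlapping sub-words while ik_smart emits the fewest, longest tokens, which is why the former fits indexing and the latter fits querying.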
The next step uses jsoup to parse the site's list pages and store the entries in a MySQL database.
As before, set up the MySQL service with Docker:
docker run --name mysql-search -p 3306:3306 -e MYSQL_ROOT_PASSWORD=admin -e MYSQL_DATABASE=ecommerce -d mysql
Once the service is up, create the table:
drop table if exists t_news;
create table t_news (
    id           bigint(20) comment 'primary key',
    title        varchar(128) comment 'title',
    detail_url   varchar(128) comment 'detail page URL',
    publish_date timestamp comment 'publish time',
    create_time  timestamp comment 'creation time',
    update_time  timestamp comment 'update time',
    primary key (id)
);
Persistence is handled by the MyBatis-Plus framework, so only the entity, mapper, and related classes are needed.
The entity class:
@TableName("t_news")
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class News {
    private Long id;
    private String title;
    private String detailUrl;
    private LocalDate publishDate;
    private LocalDateTime createTime;
    private LocalDateTime updateTime;
}
The mapper interface has no custom methods for now:
public interface NewsMapper extends BaseMapper<News> { }
As in the previous article, single-page parsing is extended here with multi-page parsing. The single-page code:
/**
 * Parse one page of the list
 *
 * @param listUrl
 * @return
 */
@Transactional(rollbackFor = Exception.class)
public int listNewsParse(String listUrl) throws IOException {
    Document document = Jsoup.connect(listUrl).get();
    // Set the base URI explicitly; the auto-detected one is wrong
    document.setBaseUri("http://ecp.sgcc.com.cn/");
    Elements elements = document.select("div.titleList li");
    AtomicInteger count = new AtomicInteger(0);
    elements.forEach(e -> {
        Element href = e.selectFirst("a");
        String url = "";
        Matcher idMatcher = HREF_ID_PATTERN.matcher(href.attr("onclick"));
        if (idMatcher.matches()) {
            // Extract the ids and build the detail-page URL
            url = String.format(HREF_UFL_FORMAT, idMatcher.group(1), idMatcher.group(2));
        }
        // Extract the publish date text
        String strDate = e.select("div.titleList_02").text();
        News news = News.builder()
                        .id(snowflake.nextId())
                        .title(href.text())
                        .publishDate(LocalDate.parse(strDate))
                        .detailUrl(url)
                        .createTime(LocalDateTime.now())
                        .updateTime(LocalDateTime.now())
                        .build();
        // Save to the MySQL database
        newsMapper.insert(news);
        count.incrementAndGet();
    });
    // Push the next page's URL onto the queue
    detailUrlList.push(getNextPageUrl(document));
    return count.get();
}

/**
 * Extract the next-page URL from the document
 *
 * @param document
 * @return
 */
public String getNextPageUrl(Document document) {
    Element nextHref = document.selectFirst("b.next");
    return document.baseUri() + nextHref.parent().attr("href");
}
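The actual HREF_ID_PATTERN and URL template are not shown above. As an illustration only, extracting two ids from an onclick handler and building a detail URL might look like the following sketch; the pattern, the sample onclick value, and the URL format are all assumptions, not the project's real values:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefIdExtractor {
    // Hypothetical pattern for onclick="showDetails('2018','12345')"-style handlers
    static final Pattern HREF_ID_PATTERN = Pattern.compile(".*\\('(\\d+)','(\\d+)'\\).*");
    // Hypothetical detail-URL template
    static final String HREF_URL_FORMAT = "http://ecp.sgcc.com.cn/html/news/%s/%s.html";

    static String buildDetailUrl(String onclick) {
        Matcher m = HREF_ID_PATTERN.matcher(onclick);
        // matches() must consume the whole input, hence the leading/trailing .*
        return m.matches() ? String.format(HREF_URL_FORMAT, m.group(1), m.group(2)) : "";
    }

    public static void main(String[] args) {
        System.out.println(buildDetailUrl("showDetails('2018','12345')"));
        // -> http://ecp.sgcc.com.cn/html/news/2018/12345.html
    }
}
```

Note that `Matcher.matches()` anchors at both ends of the input, which is why the sketch wraps the group captures in `.*`; `find()` would work without them.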
Building on the single-page parser, batch multi-page parsing:
/**
 * Parse the list pages in batch
 *
 * @return
 * @throws IOException
 */
public int batchParseList() throws IOException {
    detailUrlList.add(INIT_URL);
    int left = MAX_ITEM;
    while (left > 0) {
        // Subtract the number of records already fetched
        left -= this.listNewsParse(detailUrlList.pop());
        try {
            // Pause briefly to avoid anti-crawler measures
            Thread.sleep(1 * 1000);
        } catch (InterruptedException e) {
            log.error("Interrupted while waiting", e);
        }
    }
    // Return the total number of records fetched
    return MAX_ITEM - left;
}
The ES side is implemented with Spring Data Elasticsearch, starting with the document class:
@Document(indexName = "news")
@Data
@Builder
public class DetailNews {
    @Id
    private Long id;
    @Field(type = FieldType.Text, analyzer = "ik_max_word", searchAnalyzer = "ik_smart")
    private String title;
    @Field(type = FieldType.Text, analyzer = "ik_max_word", searchAnalyzer = "ik_smart")
    private String detailText;
    @Field(type = FieldType.Date, format = DateFormat.date_optional_time)
    private LocalDate publishDate;
}
Save and other operations go through a repository interface, which currently has no custom methods:
public interface DetailNewsRepository extends ElasticsearchRepository<DetailNews, Long> { }
With the basic storage service in place, the next step parses each detail page and stores it in ES. The main code:
/**
 * For a summary record from the database, fetch the detail page and store it in ES
 *
 * @param news
 * @return
 * @throws IOException
 */
public DetailNews parseDetail(News news) throws IOException {
    Document document = Jsoup.connect(news.getDetailUrl()).get();
    // Extract the body text of the detail page
    String text = document.select("div.bot_list").text();
    DetailNews detailNews = DetailNews.builder()
                                      .id(news.getId())
                                      .publishDate(news.getPublishDate())
                                      .title(news.getTitle())
                                      .detailText(text)
                                      .build();
    // Save to ES
    return detailNewsRepository.save(detailNews);
}

/**
 * Process the detail records in batch and store them in ES
 *
 * @return
 */
public int batchParseDetail() {
    AtomicInteger total = new AtomicInteger();
    newsMapper.selectList(null).forEach(news -> {
        try {
            // Parse the detail page
            parseDetail(news);
            total.incrementAndGet();
        } catch (IOException e) {
            log.error("Failed to parse detail page", e);
        }
    });
    return total.get();
}
With these building blocks done, the remaining work is the keyword search and highlighting itself.
The search service builds a multi-field match query over title and detailText, attaches a highlighter that wraps matched terms in a span carrying the highlight CSS class, and executes the query through ElasticsearchRestTemplate:
@Service
@Slf4j
public class NewsSearchService {
    @Autowired
    private ElasticsearchRestTemplate elasticsearchRestTemplate;

    public SearchHits<DetailNews> search(String keywords) {
        QueryBuilder queryBuilder = multiMatchQuery(keywords).field("title")
                                                             .field("detailText")
                                                             .type(Type.BEST_FIELDS);
        HighlightBuilder highlightBuilder = new HighlightBuilder()
                .preTags("<span " + "class='highlight'>")
                .postTags("</span>")
                .field("title")
                .field("detailText");
        NativeSearchQuery searchQuery = new NativeSearchQueryBuilder().withQuery(queryBuilder)
                                                                      .withHighlightBuilder(highlightBuilder)
                                                                      .build();
        return elasticsearchRestTemplate.search(searchQuery, DetailNews.class);
    }
}
The following controller handles search requests from the front end:
@Controller
@RequestMapping("/search")
public class NewsSearchController {
    @Autowired
    private NewsSearchService searchService;

    @RequestMapping("/")
    public ModelAndView doSearch(String keyword) {
        String searchWord = StrUtil.isBlank(keyword) ? "" : keyword.trim();
        SearchHits<DetailNews> searchHits = searchService.search(searchWord);
        List<Map> items = new ArrayList<>();
        searchHits.forEach(hit -> {
            Map<String, String> item = new HashMap<>(2);
            String title = hit.getHighlightField("title").stream().collect(Collectors.joining());
            String detailText = hit.getHighlightField("detailText")
                                   .stream()
                                   .collect(Collectors.joining());
            // Fall back to the stored title when ES returned no highlight fragment
            item.put("title", StrUtil.isBlank(title) ? hit.getContent().getTitle() : title);
            item.put("detailText", detailText);
            items.add(item);
        });
        ModelAndView view = new ModelAndView("search/search");
        view.addObject("items", items);
        view.addObject("total", items.size());
        return view;
    }
}
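The fallback in the controller matters because ES only returns highlight fragments for fields that actually matched; a hit whose match is only in detailText has no title fragment. That logic can be isolated as a small sketch (the method name and sample data are for illustration only):

```java
import java.util.List;
import java.util.stream.Collectors;

public class HighlightFallback {
    // Joins highlight fragments; falls back to the raw field when there are none
    static String displayText(List<String> highlightFragments, String rawField) {
        String joined = highlightFragments.stream().collect(Collectors.joining());
        return joined.isEmpty() ? rawField : joined;
    }

    public static void main(String[] args) {
        // Fragments present: use the highlighted markup
        System.out.println(displayText(List.of("<span class='highlight'>鲁能</span>夺冠"), "鲁能夺冠"));
        // No fragments: fall back to the stored field
        System.out.println(displayText(List.of(), "鲁能夺冠"));
    }
}
```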
The front end is a simple search.html page rendered with Thymeleaf:
<!DOCTYPE html>
<html lang="en" xmlns:th="http://www.thymeleaf.org">
<head>
    <meta charset="UTF-8">
    <title>搜索</title>
</head>
<body>
<div>
    <form th:action="@{/search/}">
        <input type="text" name="keyword" class="text-input" placeholder="按标题或内容搜索">
        <button>搜索</button>
    </form>
</div>
<span>共找到<span th:text="${total}"/>条记录</span>
<div th:each="item,stat:${items}">
    <span th:text="${stat.index+1}"/>
    <span th:utext="${item['title']}" class="title"/>
    <div th:utext="${item['detailText']}"></div>
    <br/>
</div>
</body>
<style>
    .highlight {
        color: red;
    }
    .text-input {
        height: 28px;
    }
    .title {
        font-size: 20px;
        font-weight: bold;
    }
</style>
</html>
The project defines two configuration classes. The first sets up the Elasticsearch client and template:
@Configuration
@EnableElasticsearchRepositories
public class ElasticSearchConfig {
    @Value("${app.es.host}")
    private String host;

    @Bean
    public RestHighLevelClient client() {
        ClientConfiguration configuration = ClientConfiguration.builder().connectedTo(host).build();
        return RestClients.create(configuration).rest();
    }

    @Bean(name = {"elasticsearchRestTemplate", "elasticsearchTemplate"})
    public ElasticsearchRestTemplate elasticsearchRestTemplate() {
        return new ElasticsearchRestTemplate(client());
    }
}
The other registers MyBatis-Plus mapper scanning and the ID generator:
@Configuration
@MapperScan("pers.techlmm.search2.mapper")
public class MainConfig {
    @Bean
    public Snowflake snowflake() {
        // Create the snowflake ID generator
        return IdUtil.createSnowflake(1, 1);
    }
}
The application configuration mainly covers the database connection, ES host, and crawl settings:
app:
  es:
    host: localhost:9200
  web:
    initUrl: http://ecp.sgcc.com.cn/topic_news_list.jsp?columnName=topic23
    maxItem: 200
spring:
  datasource:
    password: admin
    username: root
    url: jdbc:mysql://localhost:3306/ecommerce
The following test class runs the two steps separately: first store the list entries into MySQL, then read them back, parse the detail pages, and store them in ES:
@SpringBootTest
@Slf4j
public class NewsParseServiceTest {
    @Autowired
    private NewsParseService newsParseService;

    @Test
    void listTest() throws IOException {
        log.info("{}", newsParseService.batchParseList());
    }

    @Test
    void detailTest() {
        log.info("{}", newsParseService.batchParseDetail());
    }
}
The search service can likewise be verified with the following test:
@SpringBootTest
@Slf4j
public class NewsSearchServiceTest {
    @Autowired
    private NewsSearchService newsSearchService;

    @Test
    void searchTest() {
        SearchHits<DetailNews> searchHits = newsSearchService.search("鲁能");
        log.info("{}", searchHits);
        searchHits.forEach(hit -> {
            log.info("content:{}", hit.getContent());
            hit.getHighlightFields().forEach((key, list) -> {
                log.info("{},{}", key, list);
            });
        });
    }
}
Finally, start the service, open http://localhost:8080/search/, and enter keywords to run a full-text search. With that, the full-text search feature is complete. The code is open-sourced at https://gitee.com/coolpine/backends/tree/master/hiboot/src/main/java/pers/techlmm/search2 for reference; feedback and questions are welcome.
Original work, published on Tencent Cloud+ Community with the author's authorization; do not reproduce without permission.