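For a recent project I needed to search thousands of emails for certain keywords and return the matching messages. After some research I chose Elasticsearch.
Elasticsearch, ES for short, is a full-text search server. It can also serve as a NoSQL database that stores documents and data of arbitrary structure, and it can be used for big-data analytics, which makes it an open-source product that spans several categories.
For the installation procedure, see: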
https://www.cnblogs.com/xxoome/p/6663993.html
Elasticsearch download page:
https://www.elastic.co/downloads/elasticsearch
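My environment: Elasticsearch 6.3.1, JDK 1.8, CentOS 7.0.
Out of the box, Elasticsearch does not handle Chinese text search very well, so an analysis plugin is needed. To install it, run the following from the bin directory of the ES installation: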
./elasticsearch-plugin install analysis-smartcn
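Once the plugin is installed and the node has been restarted, one way to check that it is active is to call the _analyze API with the smartcn analyzer that the plugin registers. The following is only a minimal sketch, assuming the node address used in the rest of this article; the helper names analyze_body and analyze_smartcn are hypothetical, and the naive JSON templating works only for text without double quotes or backslashes.

```shell
#!/bin/sh
# Node address taken from the examples in this article; adjust as needed.
ES_URL='http://192.168.111.130:9200'

# analyze_body TEXT: print the JSON body for an _analyze request.
# Naive templating: TEXT must not contain double quotes or backslashes.
analyze_body() {
  printf '{"analyzer": "smartcn", "text": "%s"}' "$1"
}

# analyze_smartcn TEXT: ask ES how the smartcn analyzer tokenizes TEXT.
analyze_smartcn() {
  curl -XPOST "$ES_URL/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1")"
}

# Usage (requires a running node with the plugin installed):
#   analyze_smartcn '携程的酒店服务非常好。'
```

If the plugin is missing, the call should fail with an error about an unknown analyzer rather than return a token list.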
ES has a few basic concepts, and knowing them helps a great deal in understanding ES.
Documents are added with a POST request:
curl -XPOST 'http://192.168.111.130:9200/index-instance/type-instance/1?pretty' -H 'Content-Type:application/json' -d '
{
"name ": "携程业务名称",
"type": "department",
"postDate": "2018-07-15",
"message": "携程有很多业务,如酒店、机票、火车票、旅游、度假等"
}
'
curl -XPOST 'http://192.168.111.130:9200/index-instance/type-instance/2?pretty' -H 'Content-Type:application/json' -d '
{
"name ": "携程业务名称",
"type": "department",
"postDate": "2018-07-15",
"message": "携程的酒店服务非常好。"
}
'
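Note: this adds two documents. index-instance is the Index, type-instance is the Type, and the numbers 1 and 2 are the document IDs (primary keys). An ID can take any form; if none is specified, ES generates a unique one automatically. pretty is optional and makes ES format its output so that it is easier to read.
On success, the output is: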
{
"_index" : "index-instance",
"_type" : "type-instance",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
{
"_index" : "index-instance",
"_type" : "type-instance",
"_id" : "2",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
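The note on document IDs says that ES generates an ID when none is given. In ES 6.x this is done by POSTing to the type endpoint without a trailing ID. The following is a minimal sketch under that assumption, reusing the index and type from this article; the helper names doc_url and index_doc are hypothetical.

```shell
#!/bin/sh
# Node address taken from the examples in this article; adjust as needed.
ES_URL='http://192.168.111.130:9200'

# doc_url INDEX TYPE [ID]: print the document endpoint.
# Without an ID the URL stops at the type, and a POST to it
# lets ES assign the ID itself.
doc_url() {
  if [ -n "$3" ]; then
    printf '%s/%s/%s/%s?pretty' "$ES_URL" "$1" "$2" "$3"
  else
    printf '%s/%s/%s?pretty' "$ES_URL" "$1" "$2"
  fi
}

# index_doc JSON [ID]: POST a document to index-instance/type-instance.
index_doc() {
  curl -XPOST "$(doc_url index-instance type-instance "$2")" \
    -H 'Content-Type: application/json' \
    -d "$1"
}

# Usage (requires a running node; $json holds the document body):
#   index_doc "$json"       # ES picks the ID
#   index_doc "$json" 3     # explicit ID 3
```

In the auto-ID case the generated value comes back in the _id field of the response, so it is worth capturing if the document has to be fetched later.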
To retrieve a document by its ID:
curl -XGET 'http://192.168.111.130:9200/index-instance/type-instance/1?pretty'
Note: 1 is the document ID.
If the document exists, the response is:
{
"_index" : "index-instance",
"_type" : "type-instance",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"name " : "携程业务名称",
"type" : "department",
"postDate" : "2018-07-15",
"message" : "携程有很多业务,如酒店、机票、火车票、旅游、度假等"
}
}
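ES provides powerful search features. Search parameters can be passed in the URL query string or placed in the request body. Using the GET method with a query-string search: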
curl -G --data-urlencode 'q=message:机票' 'http://192.168.111.130:9200/index-instance/type-instance/_search?pretty'
Note: this searches for "机票" (air tickets) and returns:
{
"took" : 85,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.68324494,
"hits" : [
{
"_index" : "index-instance",
"_type" : "type-instance",
"_id" : "1",
"_score" : 0.68324494,
"_source" : {
"name " : "携程业务名称",
"type" : "department",
"postDate" : "2018-07-15",
"message" : "携程有很多业务,如酒店、机票、火车票、旅游、度假等"
}
}
]
}
}
Searching again, this time for "酒店" (hotel):
curl -G --data-urlencode 'q=message:酒店' 'http://192.168.111.130:9200/index-instance/type-instance/_search?pretty'
This returns:
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "index-instance",
"_type" : "type-instance",
"_id" : "2",
"_score" : 0.5753642,
"_source" : {
"name " : "携程业务名称",
"type" : "department",
"postDate" : "2018-07-15",
"message" : "携程的酒店服务非常好。"
}
},
{
"_index" : "index-instance",
"_type" : "type-instance",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"name " : "携程业务名称",
"type" : "department",
"postDate" : "2018-07-15",
"message" : "携程有很多业务,如酒店、机票、火车票、旅游、度假等"
}
}
]
}
}
Searching with POST and a request body:
curl -XPOST 'http://192.168.111.130:9200/index-instance/type-instance/_search?pretty' -H 'Content-Type:application/json' -d'
{
"query" : {
"match": {"message": "酒店"}
}
}
'
This returns the same result:
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "index-instance",
"_type" : "type-instance",
"_id" : "2",
"_score" : 0.5753642,
"_source" : {
"name " : "携程业务名称",
"type" : "department",
"postDate" : "2018-07-15",
"message" : "携程的酒店服务非常好。"
}
},
{
"_index" : "index-instance",
"_type" : "type-instance",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"name " : "携程业务名称",
"type" : "department",
"postDate" : "2018-07-15",
"message" : "携程有很多业务,如酒店、机票、火车票、旅游、度假等"
}
}
]
}
}
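Note: because the keyword contains Chinese characters, curl has to URL-encode it, which is why the --data-urlencode option is used. The -G option tells curl to append that data to the URL and send a GET request. Without -G, curl sends the data as a POST body with Content-Type application/x-www-form-urlencoded, and Elasticsearch 6.x rejects the request with a 406 error because it does not support that content type.
hits contains the query results; in this example there are only 2. _index is index-instance, _type is type-instance, and the IDs are 1 and 2. _score is a search-engine concept that expresses relevance: the higher the score, the better the document matches the keyword.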
curl -XGET 'http://192.168.111.130:9200/index-instance/type-instance/_search?pretty' -H 'Content-Type:application/json' -d'
{
"query": {
"term": { "type": "department"}
}
}
'
This term query looks for the exact value department in the type field. To page through the results, add from and size:
curl -XGET 'http://192.168.111.130:9200/index-instance/type-instance/_search?pretty' -H 'Content-Type:application/json' -d'
{
"from": 0, "size": 5,
"query": {
"term": { "type": "department"}
}
}
'
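In a paged search, from is the number of hits to skip and size is the page length, so page n (counting from 1) starts at (n - 1) * size. The following is a minimal sketch of that arithmetic, reusing the term query from this article; the helper names page_body and search_page are hypothetical.

```shell
#!/bin/sh
# page_body PAGE SIZE: print a search body for the given 1-based page.
# "from" is the number of hits to skip, i.e. (PAGE - 1) * SIZE.
page_body() {
  from=$(( ($1 - 1) * $2 ))
  printf '{"from": %d, "size": %d, "query": {"term": {"type": "department"}}}' "$from" "$2"
}

# search_page PAGE SIZE: run the paged term query against the node.
search_page() {
  curl -XGET 'http://192.168.111.130:9200/index-instance/type-instance/_search?pretty' \
    -H 'Content-Type: application/json' \
    -d "$(page_body "$1" "$2")"
}

# Usage (requires a running node):
#   search_page 1 5    # hits 1 to 5
#   search_page 3 5    # hits 11 to 15
```

Note that by default ES refuses requests where from + size exceeds 10000 (the index.max_result_window setting), so this style of paging suits shallow result lists rather than full exports.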
If you only need the total number of matches, use _count instead of _search:
curl -XGET 'http://192.168.111.130:9200/index-instance/type-instance/_count?pretty' -H 'Content-Type:application/json' -d'
{
"query": {
"term": { "type": "department"}
}
}
'
The response is:
{
"count" : 2,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
}
}
To combine several conditions, use the must keyword of a bool query and list the clauses in an array:
curl -XGET 'http://192.168.111.130:9200/index-instance/type-instance/_count?pretty' -H 'Content-Type:application/json' -d'
{
"query": {
"bool":{
"must": [
{ "match": { "message": "酒店" } },
{ "term": { "type": "department" } }
]
}
}
}
'
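A bool query takes its must clauses as a JSON array, and a document has to satisfy every clause in it to count as a match. The following is a minimal sketch that builds such a body from a keyword and a type value; the helper names bool_must_body and count_matching are hypothetical, and the naive templating assumes arguments without double quotes or backslashes.

```shell
#!/bin/sh
# bool_must_body KEYWORD TYPE: print a bool query whose two must clauses
# both have to match (a full-text match on message AND a term on type).
bool_must_body() {
  printf '{"query": {"bool": {"must": [{"match": {"message": "%s"}}, {"term": {"type": "%s"}}]}}}' "$1" "$2"
}

# count_matching KEYWORD TYPE: count the documents that satisfy both clauses.
count_matching() {
  curl -XGET 'http://192.168.111.130:9200/index-instance/type-instance/_count?pretty' \
    -H 'Content-Type: application/json' \
    -d "$(bool_must_body "$1" "$2")"
}

# Usage (requires a running node):
#   count_matching '酒店' 'department'
```

Keeping the clauses in one array avoids repeating the must key, which is not valid as a way to express two conditions in a single JSON object.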
Since this is an introductory article, it covers only the basics of using ES. For more advanced usage, see the official ES website.