
ElasticSearch 6.x Study Notes: 16. Full-Text Search

程裕强
Published 2022-05-06 19:14:04

Official documentation for full text queries in ElasticSearch 6.x: https://www.elastic.co/guide/en/elasticsearch/reference/6.1/full-text-queries.html

The official documentation introduces them as follows: "The high-level full text queries are usually used for running full text queries on full text fields like the body of an email. They understand how the field being queried is analyzed and will apply each field's analyzer (or search_analyzer) to the query string before executing."

16.1 match query

(1) Motivating example

A term query for "centos升级" on the title field:

GET website/_search
{
  "query": {
    "term": {
        "title": "centos升级"
    }
  }
}
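
The term query does not analyze the query string, so it looks for the literal term "centos升级" in the title field. No such term was indexed, and the search returns no hits:
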
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
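
To see why the term query finds nothing while a match query on the same text can succeed, the lookup can be sketched in a few lines of Python. This is only an illustration, not how Elasticsearch is implemented: toy_analyze is a hypothetical stand-in for the analyzer configured on title (which this chapter does not show), and the index is reduced to the terms of a single title.

```python
import re

def toy_analyze(text):
    """Hypothetical analyzer (an assumption, not the real one): lowercase,
    keep letter runs and digit runs together, emit each CJK character alone."""
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]", text.lower())

# Terms that would end up in the inverted index for one title.
indexed_terms = set(toy_analyze("CentOS升级gcc"))

def term_query(value):
    # term: the value is NOT analyzed; it must equal an indexed term exactly.
    return value in indexed_terms

def match_query(text):
    # match: the query text is analyzed first, then each term is looked up
    # (any matching term is enough, mirroring the default "or" operator).
    return any(t in indexed_terms for t in toy_analyze(text))

print(toy_analyze("CentOS升级gcc"))   # ['centos', '升', '级', 'gcc']
print(term_query("centos升级"))       # False
print(match_query("centos升级"))      # True
```

The real analyzer may tokenize "升级" differently, but the point is the same: the whole string "centos升级" is never stored as one term, so only an analyzed query can find it.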

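(2) The and operator

A match query analyzes the query string first. With "operator": "and", a document must contain all of the resulting terms:
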
GET website/_search
{
  "query": {
    "match": {
        "title":{
          "query":"centos升级",
          "operator":"and"
        }
    }
  }
}

Response:

{
  "took": 36,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "url": "http://url/53868915"
        }
      }
    ]
  }
}

(3) The or operator

With "operator": "or" (the default), a document that contains any of the analyzed terms is returned:

GET website/_search
{
  "query": {
    "match": {
        "title":{
          "query":"centos升级",
          "operator":"or"
        }
    }
  }
}

Response:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "url": "http://url/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "url": "http://url/53868915"
        }
      }
    ]
  }
}
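
The effect of the two operators can be reproduced with a small Python sketch. It is a simplification under stated assumptions: toy_analyze is a hypothetical analyzer, the titles are ones that appear in this chapter's sample responses, and scoring is ignored, so hits come back in dictionary order rather than by _score.

```python
import re

def toy_analyze(text):
    """Hypothetical analyzer (an assumption): lowercase, keep letter runs and
    digit runs together, emit each CJK character as its own term."""
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]", text.lower())

# Titles taken from the sample responses in this chapter, keyed by _id.
titles = {
    "1": "Ambari源码编译",
    "2": "watchman源码编译",
    "3": "CentOS升级gcc",
    "6": "CentOS更换国内yum源",
    "7": "搭建Ember开发环境",
}

def match(query, operator="or"):
    """Return the ids whose title satisfies the query under the given operator."""
    query_terms = toy_analyze(query)
    combine = all if operator == "and" else any
    hits = []
    for doc_id, title in titles.items():
        doc_terms = set(toy_analyze(title))
        if combine(term in doc_terms for term in query_terms):
            hits.append(doc_id)
    return hits

print(match("centos升级", operator="and"))  # ['3']
print(match("centos升级", operator="or"))   # ['3', '6']
```

Under "and" only the title containing every term survives; under "or" the title that contains just "centos" is returned as well.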

16.2 match_phrase query (phrase query)

"Like the match query but used for matching exact phrases or word proximity matches." For this reason it is often called a phrase query.

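The match_phrase query analyzes the query text (the analyzer can be customized). A document is returned only if both of the following conditions hold:

  1. Every term produced by the analysis appears in the field.
  2. The terms appear in the field in the same order as in the query (and, with the default slop of 0, at adjacent positions).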

(1) Create the index and insert data

PUT test

PUT test/hello/1
{ "content":"World Hello"}

PUT test/hello/2
{ "content":"Hello World"}

PUT test/hello/3
{ "content":"I just said hello world"}

(2) Query "hello world" with match_phrase

GET test/_search
{
  "query": {
    "match_phrase": {
      "content": "hello world"
    }
  }
}

Documents 2 and 3 match and are returned. Document 1 contains the same words in a different order, so it does not match.

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}
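
The two conditions can be illustrated with a minimal Python sketch. It assumes a hypothetical whitespace analyzer and a slop of 0, and it only decides whether a document matches; it does not compute scores and is not how Lucene implements phrase queries.

```python
def analyze(text):
    # Hypothetical analyzer (an assumption): lowercase and split on whitespace.
    return text.lower().split()

def match_phrase(query, content):
    """True when all query terms occur at consecutive positions, in order."""
    q = analyze(query)
    d = analyze(content)
    # Slide a window of len(q) over the document terms and compare it to the query.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

# The three documents inserted into the test index.
docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}

hits = [doc_id for doc_id, content in docs.items() if match_phrase("hello world", content)]
print(hits)  # ['2', '3']
```

Document 1 satisfies the first condition (both terms are present) but fails the second, because "world" comes before "hello".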

16.3 match_phrase_prefix query (prefix query)

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-match-query-phrase-prefix.html

"The match_phrase_prefix is the same as match_phrase, except that it allows for prefix matches on the last term in the text." In other words, it extends match_phrase: the last term of the analyzed query only needs to match the beginning of a term in the field.

GET test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "hello wor"
    }
  }
}

Response:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}
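
A minimal Python sketch of the prefix rule, under the same assumptions as a plain phrase match (a hypothetical whitespace analyzer, slop 0, no scoring). In Elasticsearch the last term is expanded against the index, limited by max_expansions (default 50); the sketch skips that step and simply applies startswith to the term at the last position.

```python
def analyze(text):
    # Hypothetical analyzer (an assumption): lowercase and split on whitespace.
    return text.lower().split()

def match_phrase_prefix(query, content):
    """Phrase match in which the last query term only has to be a prefix."""
    q = analyze(query)
    d = analyze(content)
    if not q:
        return False
    *head, last = q
    for i in range(len(d) - len(q) + 1):
        window = d[i:i + len(q)]
        # All terms but the last must be equal; the last must start with the prefix.
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False

# The three documents inserted into the test index.
docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}

hits = [doc_id for doc_id, content in docs.items() if match_phrase_prefix("hello wor", content)]
print(hits)  # ['2', '3']
```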

16.4 multi_match query

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-multi-match-query.html

"The multi_match query builds on the match query to allow multi-field queries."

[Example] Search for "centos" in the title and abstract fields:

GET website/_search
{
  "query": {
    "multi_match": {
      "query": "centos",
      "fields": ["title","abstract"]
    }
  }
}

The results are as follows:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS更换国内yum源",
          "url": "http://url/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "2",
        "_score": 0.41360322,
        "_source": {
          "title": "watchman源码编译",
          "author": "程裕强",
          "postdate": "2016-12-23",
          "abstract": "CentOS7.x的watchman源码编译",
          "url": "http://url.cn/53844169"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url/53868915"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "7",
        "_score": 0.20055373,
        "_source": {
          "title": "搭建Ember开发环境",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS系统下搭建Ember开发环境",
          "url": "http://url/53947507"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "1",
        "_score": 0.1671281,
        "_source": {
          "title": "Ambari源码编译",
          "author": "程裕强",
          "postdate": "2016-12-21",
          "abstract": "CentOS7.x下的Ambari2.4源码编译",
          "url": "http://url.cn/53788351"
        }
      }
    ]
  }
}

As the results show, a document is returned as soon as either its title or its abstract field matches.
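
The "any field is enough" behaviour can be sketched in Python as one match per field. This is a simplification under stated assumptions: toy_analyze is a hypothetical analyzer, only three of the chapter's documents are included, and scoring is left out (with the default best_fields type, Elasticsearch takes the _score from the best matching field).

```python
import re

def toy_analyze(text):
    """Hypothetical analyzer (an assumption): lowercase, keep letter runs and
    digit runs together, emit each CJK character as its own term."""
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]", text.lower())

# A few documents from this chapter's sample responses, keyed by _id.
docs = {
    "6": {"title": "CentOS更换国内yum源", "abstract": "CentOS更换国内yum源"},
    "2": {"title": "watchman源码编译", "abstract": "CentOS7.x的watchman源码编译"},
    "9": {"title": "to be or not to be", "abstract": "to be or not to be,that is the question"},
}

def match(query, text):
    # Single-field match with the default "or" operator.
    doc_terms = set(toy_analyze(text))
    return any(term in doc_terms for term in toy_analyze(query))

def multi_match(query, fields):
    """Map each hit's id to the list of fields in which the query matched."""
    hits = {}
    for doc_id, source in docs.items():
        matched = [f for f in fields if match(query, source.get(f, ""))]
        if matched:
            hits[doc_id] = matched
    return hits

print(multi_match("centos", ["title", "abstract"]))
# {'6': ['title', 'abstract'], '2': ['abstract']}
```

Document 2 is a hit although its title does not contain "centos", because its abstract does.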

16.5 common_terms query (common terms query)

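https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-common-terms-query.html

(1) Stop words

Some words occur very frequently in text but contribute almost nothing to the information it carries, for example a, an, the and of in English, or "的", "了", "着", "是" and punctuation in Chinese. After analysis, stop words are usually filtered out and not indexed. If a user's query contains stop words, the search system filters them out as well, because the query string goes through the same analysis. Excluding stop words speeds up indexing and reduces the size of the index files.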

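(2) Although stop words have little effect on document scores, they sometimes carry real meaning, and removing them is then clearly inappropriate. Without stop words, "happy" and "not happy" cannot be told apart, "to be or not to be" cannot be indexed at all, and search precision drops.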

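(3) The common_terms query offers a solution. It divides the analyzed query terms into more important terms (low frequency terms) and less important terms ("high frequency terms which would previously have been stopwords"). The search first looks for documents that match the more important terms, then runs a second query for the high-frequency terms, which contribute less to the score.

"Terms are allocated to the high or low frequency groups based on the cutoff_frequency, which can be specified as an absolute frequency (>=1) or as a relative frequency (0.0 .. 1.0)." In other words, cutoff_frequency sets the threshold that separates high-frequency terms from low-frequency ones.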

GET website/_search
{
    "query": {
        "common": {
            "title": {
                "query": "to be",
                "cutoff_frequency": 0.0001,
                "low_freq_operator": "and"
            }
        }
    }
}

Response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.364739,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "9",
        "_score": 2.364739,
        "_source": {
          "title": "to be or not to be",
          "author": "somebody",
          "postdate": "2018-01-03",
          "abstract": "to be or not to be,that is the question",
          "url": "http://url/63991802"
        }
      }
    ]
  }
}
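
How cutoff_frequency divides the query terms can be sketched in Python. The function name, the document frequencies and the index size below are all made up for illustration, and the exact comparison is an assumption of this sketch: a cutoff of 1 or more is treated as an absolute document count, a smaller value as a fraction of all documents, and a term counts as high frequency when its document frequency exceeds that threshold.

```python
def split_by_frequency(query_terms, doc_freq, total_docs, cutoff_frequency):
    """Split terms into (low_freq, high_freq) groups.

    Assumption for this sketch: cutoff >= 1 is an absolute document count,
    cutoff < 1 is a fraction of total_docs, and a term is "high frequency"
    when its document frequency is greater than the resulting threshold.
    """
    if cutoff_frequency >= 1:
        threshold = cutoff_frequency
    else:
        threshold = cutoff_frequency * total_docs
    low, high = [], []
    for term in query_terms:
        group = high if doc_freq.get(term, 0) > threshold else low
        group.append(term)
    return low, high

# Hypothetical document frequencies in a 1000-document index.
doc_freq = {"the": 950, "quick": 12, "brown": 9, "fox": 4}
terms = ["the", "quick", "brown", "fox"]

# Relative cutoff: terms found in more than 5% of the documents are "high".
print(split_by_frequency(terms, doc_freq, 1000, 0.05))
# (['quick', 'brown', 'fox'], ['the'])

# Absolute cutoff: terms found in more than 10 documents are "high".
print(split_by_frequency(terms, doc_freq, 1000, 10))
# (['brown', 'fox'], ['the', 'quick'])
```

The low-frequency group drives which documents are found; low_freq_operator then decides whether all of those terms ("and") or any of them ("or") must be present.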

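16.6 query_string query

The query_string query is tightly coupled to the Lucene query syntax and allows several special operators and keywords within a single query string. It is recommended for users who are already familiar with Lucene query syntax.

16.7 simple_query_string query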

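https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-simple-query-string-query.html

Unlike query_string, simple_query_string does not throw an exception when parsing fails; it discards the invalid parts of the query instead.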

GET website/_search
{
  "query": {
    "simple_query_string" : {
        "query": "\"fried eggs\" +(eggplant | potato) -frittata",
        "fields": ["title^5", "abstract"],
        "default_operator": "and"
    }
  }
}

Response:

{
  "took": 20,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Originally published: 2018-01-17
