Elasticsearch 6.x Study Notes: 16. Full-Text Search

程裕强

Published 2022-05-06 19:14:04

Official Elasticsearch 6.x documentation on full-text queries: https://www.elastic.co/guide/en/elasticsearch/reference/6.1/full-text-queries.html

The high-level full text queries are usually used for running full text queries on full text fields like the body of an email. They understand how the field being queried is analyzed and will apply each field's analyzer (or search_analyzer) to the query string before executing.

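16.1 match query

(1) Motivating example: a term query does not analyze the query string, so "centos升级" is looked up as a single unanalyzed term and no document matches.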

GET website/_search
{
  "query": {
    "term": {
        "title": "centos升级"
    }
  }
}

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
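
To make the empty result above easier to follow, here is a minimal Python sketch of the difference between term and match. It is a toy model, not Elasticsearch code: the analyze function is a hypothetical stand-in for the real analyzer configured on the title field (which these notes do not show), and the index holds a single title taken from the responses in this section.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption, not the index's real one): lowercase, then
    split into runs of letters, digits, or CJK characters."""
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]+", text.lower())

# Hypothetical inverted index built from one title of the website index.
titles = {"3": "CentOS升级gcc"}
index = {}
for doc_id, title in titles.items():
    for term in analyze(title):
        index.setdefault(term, set()).add(doc_id)

def term_query(value):
    # term: the value is looked up verbatim, with no analysis.
    return sorted(index.get(value, set()))

def match_query(text):
    # match: the text is analyzed first; any matching term is enough ("or").
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return sorted(hits)

print(term_query("centos升级"))   # []
print(match_query("centos升级"))  # ['3']
```

The index only contains the individual terms centos, 升级, and gcc, so the verbatim lookup of "centos升级" finds nothing, while the analyzed query matches.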

(2) and operator: every term produced by analyzing the query must appear in the field.

GET website/_search
{
  "query": {
    "match": {
        "title":{
          "query":"centos升级",
          "operator":"and"
        }
    }
  }
}

{
  "took": 36,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "url": "http://url/53868915"
        }
      }
    ]
  }
}

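(3) or operator: a document matches if any one of the analyzed terms appears in the field (this is the default).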

GET website/_search
{
  "query": {
    "match": {
        "title":{
          "query":"centos升级",
          "operator":"or"
        }
    }
  }
}

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "url": "http://url/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "url": "http://url/53868915"
        }
      }
    ]
  }
}
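
The two operators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: analyze is a hypothetical toy tokenizer rather than the analyzer actually configured on title, and the two titles are copied from the responses above.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption, not the index's real one): lowercase, then
    split into runs of letters, digits, or CJK characters."""
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]+", text.lower())

# Two titles copied from the search responses in this section.
titles = {"3": "CentOS升级gcc", "6": "CentOS更换国内yum源"}

def match(query, operator="or"):
    terms = analyze(query)
    # "and" requires every query term in the field; "or" requires at least one.
    combine = all if operator == "and" else any
    return sorted(
        doc_id
        for doc_id, title in titles.items()
        if combine(term in analyze(title) for term in terms)
    )

print(match("centos升级", operator="and"))  # ['3']
print(match("centos升级", operator="or"))   # ['3', '6']
```

The toy does no scoring, so it returns ids in sorted order rather than by relevance as Elasticsearch does.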

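16.2 match_phrase query (phrase query)

Like the match query but used for matching exact phrases or word proximity matches.

match_phrase analyzes the query text (the analyzer can be customized). A document is returned only if it satisfies both of the following conditions:

All terms produced by analysis appear in the field
The terms appear in the field in the same order as in the query (and, with the default slop of 0, next to one another)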

(1) Create the index and insert data

PUT test

PUT test/hello/1
{ "content":"World Hello"}

PUT test/hello/2
{ "content":"Hello World"}

PUT test/hello/3
{ "content":"I just said hello world"}

(2) Use match_phrase to search for "hello world"

GET test/_search
{
  "query": {
    "match_phrase": {
      "content": "hello world"
    }
  }
}

The last two documents above match and are returned. In document 1 the word order differs from the query, so it does not match.

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}
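
The two matching conditions can be sketched with a toy phrase matcher over the three documents indexed above. This is an assumption-laden illustration: analyze simply lowercases and splits on whitespace, and the matcher checks for a contiguous run of tokens, which corresponds to a slop of 0. Elasticsearch itself works with term positions stored in the index, not with a scan like this.

```python
def analyze(text):
    # Toy analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()

# The three documents indexed into test/hello above.
docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}

def match_phrase(query):
    phrase = analyze(query)
    n = len(phrase)
    hits = []
    for doc_id, content in docs.items():
        tokens = analyze(content)
        # The phrase must appear as a contiguous, ordered run of tokens.
        if any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            hits.append(doc_id)
    return hits

print(match_phrase("hello world"))  # ['2', '3']
```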

16.3 match_phrase_prefix query (prefix query)

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-match-query-phrase-prefix.html

The match_phrase_prefix is the same as match_phrase, except that it allows for prefix matches on the last term in the text. In other words, it extends match_phrase: the last term of the analyzed query only needs to match as a prefix.

GET test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "hello wor"
    }
  }
}

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}
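
A toy version of the prefix rule, again as a sketch under assumptions: analyze is a whitespace tokenizer, and the matcher scans each document for a contiguous window whose last token merely starts with the last query term. It ignores details of the real query such as how many prefix expansions are considered.

```python
def analyze(text):
    # Toy analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()

# The three documents indexed into test/hello in section 16.2.
docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}

def match_phrase_prefix(query):
    *head, last = analyze(query)  # all terms but the last must match exactly
    n = len(head) + 1
    hits = []
    for doc_id, content in docs.items():
        tokens = analyze(content)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            # The final token only has to start with the last query term.
            if window[:-1] == head and window[-1].startswith(last):
                hits.append(doc_id)
                break
    return hits

print(match_phrase_prefix("hello wor"))  # ['2', '3']
```

With these three documents, "hello wor" matches documents 2 and 3; document 1 ("World Hello") has the words in the wrong order.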

16.4 multi_match query

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-multi-match-query.html

The multi_match query builds on the match query to allow multi-field queries. [Example] Search for "centos" in the title and abstract fields:

GET website/_search
{
  "query": {
    "multi_match": {
      "query": "centos",
      "fields": ["title","abstract"]
    }
  }
}

The result is as follows:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS更换国内yum源",
          "url": "http://url/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "2",
        "_score": 0.41360322,
        "_source": {
          "title": "watchman源码编译",
          "author": "程裕强",
          "postdate": "2016-12-23",
          "abstract": "CentOS7.x的watchman源码编译",
          "url": "http://url.cn/53844169"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url/53868915"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "7",
        "_score": 0.20055373,
        "_source": {
          "title": "搭建Ember开发环境",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS系统下搭建Ember开发环境",
          "url": "http://url/53947507"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "1",
        "_score": 0.1671281,
        "_source": {
          "title": "Ambari源码编译",
          "author": "程裕强",
          "postdate": "2016-12-21",
          "abstract": "CentOS7.x下的Ambari2.4源码编译",
          "url": "http://url.cn/53788351"
        }
      }
    ]
  }
}

As the result shows, a document is returned as soon as either its title or its abstract field matches.
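
The "any field is enough" behaviour can be sketched as follows. This is a toy model under assumptions: analyze is a hypothetical tokenizer, not the analyzer of the website index, and the three documents are copied from the response above. It reports which fields matched and does not attempt to reproduce scoring.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption, not the index's real one): lowercase, then
    split into runs of letters, digits, or CJK characters."""
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]+", text.lower())

# Three documents copied from the multi_match response.
docs = {
    "6": {"title": "CentOS更换国内yum源", "abstract": "CentOS更换国内yum源"},
    "2": {"title": "watchman源码编译", "abstract": "CentOS7.x的watchman源码编译"},
    "7": {"title": "搭建Ember开发环境", "abstract": "CentOS系统下搭建Ember开发环境"},
}

def multi_match(query, fields):
    terms = set(analyze(query))
    hits = {}
    for doc_id, doc in docs.items():
        # Run the same match against every listed field.
        matched = [f for f in fields if terms & set(analyze(doc.get(f, "")))]
        if matched:  # one matching field is enough
            hits[doc_id] = matched
    return hits

print(multi_match("centos", ["title", "abstract"]))
# {'6': ['title', 'abstract'], '2': ['abstract'], '7': ['abstract']}
```

Restricting fields to ["title"] leaves only document 6, since the other two mention CentOS only in their abstract.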

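16.5 common_terms query (common terms query)

https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-common-terms-query.html

(1) Stop words: some words occur very frequently in text but contribute almost nothing to the information it carries, for example a, an, the, and of in English, 的, 了, 着, and 是 in Chinese, and punctuation. After analysis, stop words are usually filtered out and not indexed. At search time, stop words in the user's query are filtered out as well, because the query string goes through the same analysis. Excluding stop words speeds up indexing and reduces the size of the index files.

(2) Although stop words have little effect on document scores, they are sometimes meaningful, and removing them is then clearly inappropriate. Without stop words, "happy" cannot be distinguished from "not happy", and "to be or not to be" cannot be indexed at all, so search precision drops.

(3) The common_terms query offers a solution. It divides the analyzed query terms into more important terms (low frequency terms) and less important terms (high frequency terms which would previously have been stopwords). It first searches for documents that match the important terms, then executes a second query for the high-frequency terms, which contribute less to the score.

Terms are allocated to the high or low frequency groups based on the cutoff_frequency, which can be specified as an absolute frequency (>=1) or as a relative frequency (0.0 .. 1.0).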

GET website/_search
{
    "query": {
        "common": {
            "title": {
                "query": "to be",
                "cutoff_frequency": 0.0001,
                "low_freq_operator": "and"
            }
        }
    }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.364739,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "9",
        "_score": 2.364739,
        "_source": {
          "title": "to be or not to be",
          "author": "somebody",
          "postdate": "2018-01-03",
          "abstract": "to be or not to be,that is the question",
          "url": "http://url/63991802"
        }
      }
    ]
  }
}
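
The high/low frequency split driven by cutoff_frequency can be sketched like this. Everything in the example is hypothetical: split_terms is an illustrative helper, not an Elasticsearch API, the document frequencies are made up, and the exact rounding and boundary handling inside Elasticsearch may differ from this toy.

```python
def split_terms(query_terms, doc_freq, total_docs, cutoff_frequency):
    """Split query terms into (low_frequency, high_frequency) groups.

    A cutoff >= 1 is treated as an absolute document count; a cutoff below 1
    is treated as a fraction of all documents. This mirrors the description
    in the text, not the exact internal implementation.
    """
    if cutoff_frequency >= 1:
        threshold = cutoff_frequency
    else:
        threshold = cutoff_frequency * total_docs
    low = [t for t in query_terms if doc_freq.get(t, 0) <= threshold]
    high = [t for t in query_terms if doc_freq.get(t, 0) > threshold]
    return low, high

# Made-up document frequencies for a hypothetical index of 1000 documents.
doc_freq = {"the": 900, "quick": 8, "fox": 3}
terms = ["the", "quick", "fox"]

# Relative cutoff: 1% of 1000 documents, so the threshold is about 10.
print(split_terms(terms, doc_freq, 1000, 0.01))  # (['quick', 'fox'], ['the'])

# Absolute cutoff: more than 5 documents counts as high frequency.
print(split_terms(terms, doc_freq, 1000, 5))     # (['fox'], ['the', 'quick'])
```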

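16.6 query_string query

The query_string query is tightly coupled to the Lucene query syntax and lets a single query string combine several special operators and keywords. It is recommended for users who are already familiar with Lucene query syntax.

16.7 simple_query_string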

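https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-simple-query-string-query.html

Unlike query_string, simple_query_string does not throw an exception when parsing fails; it discards the invalid parts of the query. In the query below, double quotes wrap a phrase, + means AND, | means OR, - negates a term, and parentheses group terms; title^5 boosts matches on title by a factor of 5.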

GET website/_search
{
  "query": {
    "simple_query_string" : {
        "query": "\"fried eggs\" +(eggplant | potato) -frittata",
        "fields": ["title^5", "abstract"],
        "default_operator": "and"
    }
  }
}

{
  "took": 20,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Originally published: 2018-01-17
