How to Improve Search Relevance in Elasticsearch

Author: 用户7634691
Published: 2021-03-18 20:49:03
Column: 犀牛饲养员的技术笔记

What Is Relevance

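First, what is relevance? By default, search results are sorted by relevance, with the most relevant documents first. Relevance is determined by a scoring mechanism: during a search, every matching document is assigned a _score, which is a floating-point number. The higher the _score, the more relevant the document.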

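The scoring algorithm itself is not the focus of this article, but a sample query gives a feel for how scoring works. ES provides an explain option for search requests. When it is set to true, each hit in the response carries extra information about how its score was computed. Let's look at that information.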

The demo query is:

GET kibana_sample_data_logs/_search
{
  "explain": true, 
  "size": 1, 
  "query": {
    "match": {
      "message": "metricbeat"
    }
  }
}

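Each hit in the result contains an _explanation field. It is made up of description, value, and details, which tell you the type of calculation, its result, and the details of how that result was computed.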

"_explanation" : {
          "value" : 2.912974,
          "description" : "weight(message:metricbeat in 6) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 2.912974,
              "description" : "score(freq=2.0), computed as boost * idf * tf from:",
              "details" : [
                {
                  "value" : 2.2,
                  "description" : "boost",
                  "details" : [ ]
                },
                {
                  "value" : 2.1402972,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1655,
                      "description" : "n, number of documents containing term",
                      "details" : [ ]
                    },
                    {
                      "value" : 14074,
                      "description" : "N, total number of documents with field",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 0.6186426,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    {
                      "value" : 2.0,
                      "description" : "freq, occurrences of term within document",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "k1, term saturation parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "b, length normalization parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 28.0,
                      "description" : "dl, length of field",
                      "details" : [ ]
                    },
                    {
                      "value" : 27.013002,
                      "description" : "avgdl, average length of field",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
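
The nested _explanation structure can be hard to read at a glance. As a minimal sketch (the helper name flatten_explanation is made up for illustration, and the sample dict is a trimmed copy of the response above with the leaf entries under idf and tf left out), a short recursive function can turn it into an indented outline:

```python
def flatten_explanation(node, depth=0):
    """Return (depth, value, description) tuples for an _explanation tree."""
    rows = [(depth, node["value"], node["description"])]
    for child in node.get("details", []):
        rows.extend(flatten_explanation(child, depth + 1))
    return rows


# Trimmed copy of the response above; descriptions of idf and tf are shortened.
explanation = {
    "value": 2.912974,
    "description": "weight(message:metricbeat in 6) [PerFieldSimilarity], result of:",
    "details": [
        {
            "value": 2.912974,
            "description": "score(freq=2.0), computed as boost * idf * tf from:",
            "details": [
                {"value": 2.2, "description": "boost", "details": []},
                {"value": 2.1402972, "description": "idf", "details": []},
                {"value": 0.6186426, "description": "tf", "details": []},
            ],
        }
    ],
}

rows = flatten_explanation(explanation)
for depth, value, description in rows:
    # Indent each entry by its nesting level.
    print("  " * depth + f"{value}  {description}")
```

Each level of indentation in the printed outline corresponds to one level of details nesting, which makes it easier to see which values feed into which.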

First, the opening lines:

"_explanation" : {
  "value" : 2.912974,
  "description" : "weight(message:metricbeat in 6) [PerFieldSimilarity], result of:",
  ......

These give the score for the term metricbeat in the message field. The 6 is the document's internal ID, which can be ignored.

Next comes the details field. It is a nested structure, and each entry can contain further details:

"details" : [
            {
              "value" : 2.912974,
              "description" : "score(freq=2.0), computed as boost * idf * tf from:",
              ......

This part tells us that 2.912974 is the product of three factors:

boost * idf * tf

The three values are 2.2, 2.1402972, and 0.6186426, and their product is indeed 2.912974.
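
The multiplication is easy to check by hand. A minimal sketch, using only the three values shown in the explain output:

```python
import math

# Values copied from the explain output above.
boost, idf, tf = 2.2, 2.1402972, 0.6186426

score = boost * idf * tf
print(round(score, 6))

# The product agrees with the reported _score up to rounding.
assert math.isclose(score, 2.912974, abs_tol=1e-5)
```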

The three nested details entries that follow correspond to these three factors and show how each one was computed. Take the idf part:

{
  "value" : 2.1402972,
  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
  "details" : [
    {
      "value" : 1655,
      "description" : "n, number of documents containing term",
      "details" : [ ]
    },
    {
      "value" : 14074,
      "description" : "N, total number of documents with field",
      "details" : [ ]
    }
  ]
},

This says that the idf value is computed with the following formula:

log(1 + (N - n + 0.5) / (n + 0.5))

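In this formula, n is 1655 and N is 14074, and the output also states what each letter stands for: n is the number of documents that contain the term metricbeat, and N is the total number of documents that have this field. Both counts are per shard.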
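
To make the formula concrete, here is a small sketch that plugs n and N from the explain output into it. The function name bm25_idf is chosen for illustration, and log is taken to be the natural logarithm, which is what reproduces the reported value:

```python
import math


def bm25_idf(n, N):
    """idf as written in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


# n and N are copied from the explain output above.
idf = bm25_idf(n=1655, N=14074)
print(idf)

# A term found in fewer documents gets a larger idf, so rare terms weigh more.
rare_idf = bm25_idf(n=10, N=14074)
print(rare_idf)
```

The first value comes out very close to the 2.1402972 in the response, and the second shows the general trend: the rarer the term, the larger its idf.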

Improving Search Relevance

Let's work through an example. First, index some test data:

PUT demo_idx/_doc/1
{
  "content": "Distributed nature, simple REST APIs, speed, and scalability"
}
PUT demo_idx/_doc/2
{
  "content": "Distributed nature, simple APIs, speed, and scalability"
}
PUT demo_idx/_doc/3
{
  "content": "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
}

Start with a basic query to see what comes back:

GET demo_idx/_search
{
  "query": {
    "match": {
      "content": {
        "query": "simple rest apis distributed nature"
      }
    }
  }
}

The response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.2689934,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2689934,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6970792,
        "_source" : {
          "content" : "Distributed nature, simple APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.69611007,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      }
    ]
  }
}

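Document 1 scores highest, followed by document 2 and then document 3. Is this a good result? Not necessarily. It depends on your use case, or more precisely, on what results you expect in that use case.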

By default, ES combines the terms of a match query with OR, so the query above is interpreted as

simple OR rest OR apis OR distributed OR nature

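The score of the whole query is the sum of the scores of the terms that match. Document 1 contains every query term and is short (length matters to the scoring algorithm), so it scores highest. Document 2 is also short, but it is missing one term (rest). Document 3 contains every query term, but it is so long that each matching term contributes much less to the score.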
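
The effect of document length can be reproduced with a toy scorer. The sketch below is not how ES is implemented. It assumes a single shard, a tokenizer that roughly mimics the default standard analyzer (lowercase, split on non-alphanumeric characters), and the boost, k1, and b values from the explain output earlier. It applies the same idf and tf formulas and sums the per-term scores:

```python
import math
import re

DOCS = {
    "1": "Distributed nature, simple REST APIs, speed, and scalability",
    "2": "Distributed nature, simple APIs, speed, and scalability",
    "3": "Known for its simple REST APIs, distributed nature, speed, and "
         "scalability, Elasticsearch is the central component of the Elastic "
         "Stack, a set of open source tools for data ingestion, enrichment, "
         "storage, analysis, and visualization.",
}
# Assumption: constants copied from the explain output shown earlier.
K1, B, BOOST = 1.2, 0.75, 2.2


def tokenize(text):
    # Rough stand-in for the standard analyzer (an assumption, not the real one).
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs):
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    avgdl = sum(len(t) for t in tokens.values()) / len(tokens)
    scores = {}
    for doc_id, doc_tokens in tokens.items():
        total = 0.0
        for term in tokenize(query):
            freq = doc_tokens.count(term)
            if freq == 0:
                continue  # OR semantics: a missing term contributes nothing
            n = sum(1 for t in tokens.values() if term in t)
            idf = math.log(1 + (len(tokens) - n + 0.5) / (n + 0.5))
            tf = freq / (freq + K1 * (1 - B + B * len(doc_tokens) / avgdl))
            total += BOOST * idf * tf
        if total > 0:
            scores[doc_id] = total
    return scores


scores = bm25_scores("simple rest apis distributed nature", DOCS)
for doc_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(doc_id, round(score, 4))
```

Under these assumptions the toy scores land close to the _score values in the response above. Document 3 matches the same five terms as document 1, but its field is 34 tokens long versus 8, which shrinks the tf factor of every term.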

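Note that the terms in documents 1 and 2 appear in a different order than in the query. This has no effect on the score, because an OR query ignores term order.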

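That is why I said above that whether this result is good depends on your use case. If term order matters a great deal to you, you might expect document 3 to score highest. If order does not matter but every query term must be present, then document 2 should not appear in the results at all. The examples below cover these scenarios.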
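
Before turning to the queries, the difference between "any term" and "every term" can be sketched with plain set operations. This only illustrates the matching logic. The tokenizer is a rough stand-in for the default analyzer, and document 3 is shortened because the rest of its text contains no query terms:

```python
import re

DOCS = {
    "1": "Distributed nature, simple REST APIs, speed, and scalability",
    "2": "Distributed nature, simple APIs, speed, and scalability",
    # Shortened for this sketch; the omitted text has no query terms.
    "3": "Known for its simple REST APIs, distributed nature, speed, and scalability",
}
QUERY = "simple rest apis distributed nature"


def tokenize(text):
    # Rough stand-in for the standard analyzer (an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


query_terms = set(tokenize(QUERY))

# For every document, which query terms are absent?
missing = {doc_id: query_terms - set(tokenize(text)) for doc_id, text in DOCS.items()}

# OR: at least one query term is present.
or_hits = [doc_id for doc_id, m in missing.items() if len(m) < len(query_terms)]
# AND: no query term is missing.
and_hits = [doc_id for doc_id, m in missing.items() if not m]

print("missing terms:", missing)
print("OR hits:", or_hits)
print("AND hits:", and_hits)
```

All three documents qualify under OR, while document 2 drops out under AND because it lacks the term rest.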

Scenario 1: all query terms must be present

This scenario calls for the AND operator:

GET demo_idx/_search
{
  "query": {
    "match": {
      "content": {
        "query": "simple rest apis distributed nature",
        "operator": "and"
      }
    }
  }
}

The result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.2689934,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2689934,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.69611007,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      }
    ]
  }
}

Only documents 1 and 3 are returned, as expected.

Scenario 2: term order matters

In this scenario, the terms should appear in the document in the same order as in the query. ES provides the match_phrase query for this.

GET demo_idx/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "simple rest apis distributed nature"
      }
    }
  }
}

The result:

{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6961101,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.6961101,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      }
    ]
  }
}
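
Only document 3 comes back, because with its default settings match_phrase requires the analyzed terms to appear next to each other and in the same order as in the query. A minimal sketch of that check, assuming a simplified tokenizer and a hypothetical helper named contains_phrase (document 3 is shortened here):

```python
import re

DOCS = {
    "1": "Distributed nature, simple REST APIs, speed, and scalability",
    "2": "Distributed nature, simple APIs, speed, and scalability",
    # Shortened for this sketch.
    "3": "Known for its simple REST APIs, distributed nature, speed, and scalability",
}


def tokenize(text):
    # Rough stand-in for the standard analyzer (an assumption).
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(doc_text, phrase):
    """True if the phrase tokens occur consecutively and in order in the document."""
    doc_tokens, phrase_tokens = tokenize(doc_text), tokenize(phrase)
    width = len(phrase_tokens)
    return any(
        doc_tokens[i:i + width] == phrase_tokens
        for i in range(len(doc_tokens) - width + 1)
    )


phrase_hits = [
    doc_id for doc_id, text in DOCS.items()
    if contains_phrase(text, "simple rest apis distributed nature")
]
print(phrase_hits)
```

Document 1 contains all five terms, but "distributed nature" comes before "simple REST APIs", so the sequence never lines up with the query.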

Scenario 3: combining conditions

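Suppose we want documents that either contain all the terms or contain the terms in the same order, where meeting either condition is enough. A bool query can combine the two conditions.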

GET demo_idx/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "content": {
              "query": "simple rest apis distributed nature",
              "operator": "and"
            }
          }
        },
        {
          "match_phrase": {
            "content": {
              "query": "simple rest apis distributed nature"
            }
          }
        }
      ]
    }
  }
}

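In this query, should contains two clauses. Every clause that a document matches contributes to its score, and by default the clauses carry equal weight. The result of this query is: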

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.3922203,
    "hits" : [
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.3922203,
        "_source" : {
          "content" : "Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization."
        }
      },
      {
        "_index" : "demo_idx",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.2689934,
        "_source" : {
          "content" : "Distributed nature, simple REST APIs, speed, and scalability"
        }
      }
    ]
  }
}

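Documents 3 and 1 are returned, with document 3 scoring higher. It scores higher because it satisfies both clauses, whereas document 1 satisfies only the first.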
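
The numbers bear this out. As a small sketch, adding up the per-clause scores reported in scenarios 1 and 2 reproduces the combined scores, up to floating-point rounding:

```python
import math

# Scores copied from the responses in scenario 1 and scenario 2.
and_scores = {"1": 1.2689934, "3": 0.69611007}
phrase_scores = {"3": 0.6961101}

# A should clause that does not match a document contributes 0 to its score.
combined = {
    doc_id: and_scores.get(doc_id, 0.0) + phrase_scores.get(doc_id, 0.0)
    for doc_id in and_scores.keys() | phrase_scores.keys()
}
ranking = sorted(combined, key=combined.get, reverse=True)

print(ranking)
print(combined)
```

Document 3 ends up at about 1.39222, the sum of its two clause scores, which is just enough to overtake document 1 at 1.2689934.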

Summary

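ES offers many kinds of queries, and none of them is best in every situation. In a real project, choose the query type that fits your use case in order to get the best search results.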

This article was shared from the WeChat official account 犀牛的技术笔记, where it was originally published on 2021-03-04.
