019.Elasticsearch搜索原理

CoderJed

发布于 2020-07-14 15:38:10

3220

发布于 2020-07-14 15:38:10

1. 倒排索引原理初探

有以下两个文档：

# doc1
I really liked my small dogs, and I think my mom also liked them.
# doc2
He never liked any dogs, so I hope that my mom will not expect me to liked him

对文档进行分词，统计其在这两个doc中是否存在，*代表存在

word	doc1	doc2
I	*	*
really	*
liked	*	*
my	*	*
small	*
dogs	*	*
and	*
think	*
mom	*	*
also	*
them	*
He		*
never		*
any		*
so		*
hope		*
that		*
will		*
not		*
expect		*
me		*
to		*
him		*

对单词进行常规化处理：

# 同义词转换
mom -> mother
small -> little
# 时态转换
liked -> like
# 类似：likes -> like
# 单复数转换
dogs -> dog
# 大小写转换
# 举例：Tom -> tom

重新建立倒排索引，加入常规化后的单词：

word	常规化单词	document id
I		[1、2]
really		[1]
liked	like	[1、2]
my		[1、2]
small	little	[1]
dogs	dog	[1、2]
and		[1]
think		[1]
mom	mother	[1、2]
also		[1]
them		[1]
He		[2]
never		[2]
any		[2]
so		[2]
hope		[2]
that		[2]
will		[2]
not		[2]
expect		[2]
me		[2]
to		[2]
him		[2]

搜索"mother like little dog"，首先分词，然后查看这些单词出现过的id，就返回了id为1和2的这两条文档

2. TF&IDF算法

ES使用TF&IDF算法(Term Frequency/Inverse Document Frequency)算法来计算document的评分，及其与搜索条件的匹配程度，此算法包括以下几个方面的影响因素：

Term Frequency：搜索词出现次数越多的文档越相关

doc1：hello you, and world is very good
doc2：hello, how are you

搜索请求：hello world
hello在doc1中出现1次，在doc2中出现1次
world在doc1中出现1次，在doc2中出现0次
所以doc1更相关

Inverse Document Frequency：搜索词在所有文档中出现的次数越多，则包含这个词的文档就越不相关

doc1：hello, today is very good
doc2：hi world, how are you
搜索请求：hello world
假设hello这个单词在全部文档中出现了10000次，world这个单词在全部文档中出现了100次
此时，doc1和doc2都是只能匹配一个单词，那么doc2的评分就比doc1靠前

Field-Length Norm：field的长度越长，相关度越弱

doc1：{ "title": "hello article", "content": "yes"(一万个yes)}
doc2：{ "title": "my article", "content": "world yes"(一万个yes)}
搜索请求：hello world
doc1和doc2同样都是匹配到一个单词，且都出现了1次，但是doc1匹配到单词的field(title)比doc2匹配到单词的field(content)长度短
所有doc1评分更高

使用执行计划查看API可以获取score的具体计算公式

GET /test_index/test_type/_search?explain
{
  "query": {
    "match": {
      "test_field": "test hello"
    }
  }
}

3. 正排索引

搜索的时候，要依靠倒排索引，排序的时候，需要依靠正排索引，将每个document的每个field，然后进行排序，就是所谓的正排索引，在建立索引的时候，一方面会建立倒排索引，以供搜索用，一方面会建立正排索引，供排序，聚合，过滤等操作使用

doc1: {"name": "jack", "age": 27}
doc2: {"name": "tom", "age": 30}

以上两条doc的正排索引如下：

document id	name	age
1	jack	27
2	tom	30

在内存资源充足的时候，正排索引会缓存在内存中，否则就会写入磁盘

4. 倒排索引的结构

一条倒排索引的信息中包含以下内容：

内容包含这个关键词的document list
内容包含这个关键词的所有document的数量：IDF
这个关键词在每个document中出现的次数：TF
这个关键词在这个document中的位置
内容包含这个关键词的每个document的长度：Length Norm
内容包含这个关键词的所有document的平均长度

5. 字符串排序问题

如果对一个字符串进行排序，结果往往不准确，因为分词后是多个单词，再排序就不是我们想要的结果了，通常解决方案是，将一个字符串建立两次索引，一个分词，用来进行搜索，一个不分词，用来进行排序：

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "fielddata": true,
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
  "title": "first article"
}

PUT /my_index/my_type/2
{
  "title": "second article"
}

PUT /my_index/my_type/3
{
  "title": "third article"
}

# 使用分词的title进行排序
GET /my_index/my_type/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "title": {
        "order": "desc"
      }
    }
  ]
}

{
  "took" : 25,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "title" : "third article"
        },
        "sort" : [
          "third"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "title" : "second article"
        },
        "sort" : [
          "second"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "title" : "first article"
        },
        "sort" : [
          "first"
        ]
      }
    ]
  }
}

# 使用不分词的title进行排序
GET /my_index/my_type/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "title.raw": {
        "order": "desc"
      }
    }
  ]
}

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : null,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "title" : "third article"
        },
        "sort" : [
          "third article"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "title" : "second article"
        },
        "sort" : [
          "second article"
        ]
      },
      {
        "_index" : "my_index",
        "_type" : "my_type",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "title" : "first article"
        },
        "sort" : [
          "first article"
        ]
      }
    ]
  }
}

本文参与腾讯云自媒体同步曝光计划，分享自作者个人站点/博客。

如有侵权请联系 cloudcommunity@tencent.com 删除

编程算法

本文分享自作者个人站点/博客前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

编程算法

登录后参与评论

0 条评论

热度