Using Elasticsearch: Vector API

Original

HLee

Last modified 2021-01-18 10:23:33

Included in the column: 房东的猫

Introduction

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html#vector-functions

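This feature is still experimental and may change in future releases. Vector-based scoring currently comes in two kinds, and both compute the score through functions.

In practice, keep in mind that the vector functions scan every matching document linearly, so query time is expected to grow linearly with the number of matching documents. We therefore recommend using the query parameter to limit the number of matching documents.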
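As a rough illustration of that recommendation, the following Python sketch builds a script_score request body whose inner query is a filter instead of match_all, so the vector function only runs on the filtered documents. The helper name build_vector_query and the choice of a term filter on category.keyword are assumptions made for this example; the field names come from the books index defined later in this article.

```python
import json


def build_vector_query(query_vector, category):
    """Build a script_score body that scores only documents in one category.

    Hypothetical helper: the term filter narrows the set of matching
    documents before the vector function is evaluated on each of them.
    """
    return {
        "query": {
            "script_score": {
                # Restrict the documents that the script has to scan.
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"category.keyword": category}}
                        ]
                    }
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, doc['vector_recommendation']) + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_vector_query([1, 5, 10], "databases")
    print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET books/_search request. Any other query that narrows the result set would serve the same purpose.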

Vector

First, we create an index called books and define its mapping as follows:

PUT books
{
  "mappings": {
    "properties": {
      "author": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "category": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "format": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "isbn13": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "pages": {
        "type": "long"
      },
      "price": {
        "type": "float"
      },
      "publisher": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "rating": {
        "type": "float"
      },
      "release_year": {
        "type": "date",
        "format": "strict_year"
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "vector_recommendation": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}

Then we import the data through the bulk API:

PUT books/_bulk
{ "index" : { "_id" : "database-internals" } }
{"isbn13":"978-1492040347","author":"Alexander Petrov", "title":"Database Internals: A deep-dive into how distributed data systems work","publisher":"O'Reilly","category":["databases","information systems"],"pages":350,"price":47.28,"format":"paperback","rating":4.5, "release_year" : "2019", "vector_recommendation" : [3.5, 4.5, 5.2]}
{ "index" : { "_id" : "designing-data-intensive-applications" } }
{"isbn13":"978-1449373320", "author":"Martin Kleppmann", "title":"Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems","publisher":"O'Reilly","category":["databases" ],"pages":590,"price":31.06,"format":"paperback","rating":4.4, "release_year" : "2017", "vector_recommendation" : [5.9, 4.4, 6.8]}
{ "index" : { "_id" : "kafka-the-definitive-guide" } }
{"isbn13":"978-1491936160","author":[ "Neha Narkhede", "Gwen Shapira", "Todd Palino"], "title":"Kafka: The Definitive Guide: Real-time data and stream processing at scale", "publisher":"O'Reilly","category":["databases" ],"pages":297,"price":37.31,"format":"paperback","rating":3.9, "release_year" : "2017", "vector_recommendation" : [2.97, 3.9, 6.2]}
{ "index" : { "_id" : "effective-java" } }
{"isbn13":"978-1491936160","author": "Joshua Block", "title":"Effective Java", "publisher":"Addison-Wesley", "category":["programming languages", "java" ],"pages":412,"price":27.91,"format":"paperback","rating":4.2, "release_year" : "2017", "vector_recommendation" : [4.12, 4.2, 7.2]}
{ "index" : { "_id" : "daemon" } }
{"isbn13":"978-1847249616","author":"Daniel Suarez", "title":"Daemon","publisher":"Quercus","category":["dystopia","novel"],"pages":448,"price":12.03,"format":"paperback","rating":4.0, "release_year" : "2011", "vector_recommendation" : [4.48, 4.0, 8.7]}
{ "index" : { "_id" : "cryptonomicon" } }
{"isbn13":"978-1847249616","author":"Neal Stephenson", "title":"Cryptonomicon","publisher":"Avon","category":["thriller", "novel" ],"pages":1152,"price":6.99,"format":"paperback","rating":4.0, "release_year" : "2002", "vector_recommendation" : [10.0, 4.0, 9.3]}
{ "index" : { "_id" : "garbage-collection-handbook" } }
{"isbn13":"978-1420082791","author": [ "Richard Jones", "Antony Hosking", "Eliot Moss" ], "title":"The Garbage Collection Handbook: The Art of Automatic Memory Management","publisher":"Taylor & Francis","category":["programming algorithms" ],"pages":511,"price":87.85,"format":"paperback","rating":5.0, "release_year" : "2011", "vector_recommendation" : [5.1, 5.0, 1.3] }
{ "index" : { "_id" : "radical-candor" } }
{"isbn13":"978-1250258403","author": "Kim Scott", "title":"Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity","publisher":"Macmillan","category":["human resources","management", "new work"],"pages":404,"price":7.29,"format":"paperback","rating":4.0, "release_year" : "2018", "vector_recommendation" : [4.0, 4.0, 9.2] }

The books index now contains 8 documents. Looking closely at the data we imported, each document has a field called vector_recommendation:

"vector_recommendation" : [3.5, 4.5, 5.2]

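The first value in the vector, 3.5, is this document's pages divided by 100. The more pages the book has, the larger this value. Its range is 0 to 10.
The second value in the vector is the book's rating, which can be found in this document's rating field. Its range is 0 to 5.
The third value in the vector reflects the book's price. The lower this value, the more expensive the book, because we all prefer cheaper books. 0 stands for books over 100 yuan, and 10 stands for books under 10 yuan.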
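The three rules above can be sketched in Python as follows. The article does not give exact formulas, so the ones used here (pages / 100 capped at 10, and 10 - price / 10 clamped to the range 0 to 10) are assumptions, and recommendation_vector is a hypothetical helper name. The values in the sample data appear to have been rounded by hand, so this sketch only approximates them.

```python
def recommendation_vector(pages, rating, price):
    """Derive a 3-dimensional vector from a book's pages, rating and price.

    Assumed formulas (not stated exactly in the article):
      - pages / 100, capped at 10
      - rating used as-is (0 to 5)
      - 10 - price / 10, clamped to [0, 10], so cheaper books score higher
    """
    def clamp(value, low=0.0, high=10.0):
        return max(low, min(high, value))

    return [
        clamp(pages / 100),
        float(rating),
        clamp(10 - price / 10),
    ]


if __name__ == "__main__":
    # "Database Internals": 350 pages, rating 4.5, price 47.28
    print(recommendation_vector(350, 4.5, 47.28))
```

For "Database Internals" this gives a third component of roughly 5.27, close to the 5.2 stored in the sample data.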

In the mapping above, we defined the field like this:

"vector_recommendation": { "type": "dense_vector", "dims": 3 }

This declares vector_recommendation as a dense_vector field that holds a 3-dimensional vector.

Our data is now ready. Next, let's run the kind of search we are interested in.

Vector search

We have now built our vector model. So how do we find books that have relatively few pages, are cheap, and are rated very highly? We can search like this:

GET books/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['vector_recommendation']) + 1.0",
        "params": {
          "query_vector": [
            1,
            5,
            10
          ]
        }
      }
    }
  }
}

In the search above, we use the script:

cosineSimilarity(params.query_vector, doc['vector_recommendation']) + 1.0

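to compute the search score. The 1.0 is added so that the final score is never negative: cosine similarity lies between -1 and 1, and Elasticsearch does not accept negative scores.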
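To make the script concrete, here is a small Python sketch of the same calculation outside Elasticsearch. The function names cosine_similarity and script_score are assumptions for this example. It uses plain double-precision arithmetic, so the last digits may differ slightly from what Elasticsearch returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score(query_vector, doc_vector):
    """Mirror the article's script: cosine similarity shifted by 1.0."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


if __name__ == "__main__":
    query = [1, 5, 10]
    # Vector of "Database Internals" from the sample data.
    print(script_score(query, [3.5, 4.5, 5.2]))
    # Opposite directions give cosine -1, so the shifted score is about 0.
    print(script_score(query, [-1, -5, -10]))
```

Shifting by 1.0 maps the cosine range of -1 to 1 onto 0 to 2, which is why every score in this article falls between 1 and 2 for vectors with only positive components.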

In the expression above:

"params": {"query_vector": [ 1, 5, 10 ] }

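The book we are looking for should ideally have about 100 pages, because the first value is 1. We also want a well-rated book, because the second value is 5. And we want the cheapest book, because the third value is 10. With these requirements, we get the following search results: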

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    },
    "max_score" : 1.9568613,
    "hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "radical-candor",
        "_score" : 1.9568613,
        "_source" : {
          "isbn13" : "978-1250258403",
          "author" : "Kim Scott",
          "title" : "Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity",
          "publisher" : "Macmillan",
          "category" : [
            "human resources",
            "management",
            "new work"
          ],
          "pages" : 404,
          "price" : 7.29,
          "format" : "paperback",
          "rating" : 4.0,
          "release_year" : "2018",
          "vector_recommendation" : [
            4.0,
            4.0,
            9.2
          ]
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "kafka-the-definitive-guide",
        "_score" : 1.9520907,
        "_source" : {
          "isbn13" : "978-1491936160",
          "author" : [
            "Neha Narkhede",
            "Gwen Shapira",
            "Todd Palino"
          ],
          "title" : "Kafka: The Definitive Guide: Real-time data and stream processing at scale",
          "publisher" : "O'Reilly",
          "category" : [
            "databases"
          ],
          "pages" : 297,
          "price" : 37.31,
          "format" : "paperback",
          "rating" : 3.9,
          "release_year" : "2017",
          "vector_recommendation" : [
            2.97,
            3.9,
            6.2
          ]
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "daemon",
        "_score" : 1.9394372,
        "_source" : {
          "isbn13" : "978-1847249616",
          "author" : "Daniel Suarez",
          "title" : "Daemon",
          "publisher" : "Quercus",
          "category" : [
            "dystopia",
            "novel"
          ],
          "pages" : 448,
          "price" : 12.03,
          "format" : "paperback",
          "rating" : 4.0,
          "release_year" : "2011",
          "vector_recommendation" : [
            4.48,
            4.0,
            8.7
          ]
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "effective-java",
        "_score" : 1.9305289,
        "_source" : {
          "isbn13" : "978-1491936160",
          "author" : "Joshua Block",
          "title" : "Effective Java",
          "publisher" : "Addison-Wesley",
          "category" : [
            "programming languages",
            "java"
          ],
          "pages" : 412,
          "price" : 27.91,
          "format" : "paperback",
          "rating" : 4.2,
          "release_year" : "2017",
          "vector_recommendation" : [
            4.12,
            4.2,
            7.2
          ]
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "database-internals",
        "_score" : 1.9005439,
        "_source" : {
          "isbn13" : "978-1492040347",
          "author" : "Alexander Petrov",
          "title" : "Database Internals: A deep-dive into how distributed data systems work",
          "publisher" : "O'Reilly",
          "category" : [
            "databases",
            "information systems"
          ],
          "pages" : 350,
          "price" : 47.28,
          "format" : "paperback",
          "rating" : 4.5,
          "release_year" : "2019",
          "vector_recommendation" : [
            3.5,
            4.5,
            5.2
          ]
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "designing-data-intensive-applications",
        "_score" : 1.8525991,
        "_source" : {
          "isbn13" : "978-1449373320",
          "author" : "Martin Kleppmann",
          "title" : "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems",
          "publisher" : "O'Reilly",
          "category" : [
            "databases"
          ],
          "pages" : 590,
          "price" : 31.06,
          "format" : "paperback",
          "rating" : 4.4,
          "release_year" : "2017",
          "vector_recommendation" : [
            5.9,
            4.4,
            6.8
          ]
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "cryptonomicon",
        "_score" : 1.7700485,
        "_source" : {
          "isbn13" : "978-1847249616",
          "author" : "Neal Stephenson",
          "title" : "Cryptonomicon",
          "publisher" : "Avon",
          "category" : [
            "thriller",
            "novel"
          ],
          "pages" : 1152,
          "price" : 6.99,
          "format" : "paperback",
          "rating" : 4.0,
          "release_year" : "2002",
          "vector_recommendation" : [
            10.0,
            4.0,
            9.3
          ]
        }
      },
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "garbage-collection-handbook",
        "_score" : 1.528916,
        "_source" : {
          "isbn13" : "978-1420082791",
          "author" : [
            "Richard Jones",
            "Antony Hosking",
            "Eliot Moss"
          ],
          "title" : "The Garbage Collection Handbook: The Art of Automatic Memory Management",
          "publisher" : "Taylor & Francis",
          "category" : [
            "programming algorithms"
          ],
          "pages" : 511,
          "price" : 87.85,
          "format" : "paperback",
          "rating" : 5.0,
          "release_year" : "2011",
          "vector_recommendation" : [
            5.1,
            5.0,
            1.3
          ]
        }
      }
    ]
  }
}

We can see that "Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity" is the closest match. Let's look at its vector_recommendation:

"vector_recommendation" : [ 4.0, 4.0, 9.2 ]

Of all the books, this is the one that best matches our search requirements.
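The ranking can also be checked offline. The Python sketch below recomputes the score for all eight vector_recommendation values from the sample data and sorts them. The names BOOKS, score and rank are assumptions for this example. It also shows one property worth keeping in mind: cosine similarity compares only the direction of two vectors, so scaling the query vector (for example to [2, 10, 20]) leaves the ranking unchanged.

```python
import math

# vector_recommendation values copied from the sample data above.
BOOKS = {
    "database-internals": [3.5, 4.5, 5.2],
    "designing-data-intensive-applications": [5.9, 4.4, 6.8],
    "kafka-the-definitive-guide": [2.97, 3.9, 6.2],
    "effective-java": [4.12, 4.2, 7.2],
    "daemon": [4.48, 4.0, 8.7],
    "cryptonomicon": [10.0, 4.0, 9.3],
    "garbage-collection-handbook": [5.1, 5.0, 1.3],
    "radical-candor": [4.0, 4.0, 9.2],
}


def score(query, vector):
    """Cosine similarity plus 1.0, as in the article's script."""
    dot = sum(q * v for q, v in zip(query, vector))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_v = math.sqrt(sum(v * v for v in vector))
    return dot / (norm_q * norm_v) + 1.0


def rank(query):
    """Return (book id, score) pairs sorted from best to worst match."""
    scored = [(book_id, score(query, vec)) for book_id, vec in BOOKS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for book_id, value in rank([1, 5, 10]):
        print(f"{value:.4f}  {book_id}")
```

Because only the direction matters, the query [1, 5, 10] favors books whose three values are in roughly the proportion 1 : 5 : 10, which is why a 404-page book can still rank first.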

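Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For infringement concerns, contact cloudcommunity@tencent.com to request removal.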
