Elasticsearch: Vector-Based Scoring

Author: Tencent Cloud Big Data (腾讯云大数据)
Last modified: 2021-01-08 16:00:51
Published in the column: Tencent Cloud Elasticsearch Service



This feature is still experimental and may change in future releases. Vector-based scoring currently comes in two forms.

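Both compute scores through functions. In practice, keep in mind that a vector function linearly scans every matching document, so query time grows linearly with the number of matching documents. We therefore recommend narrowing the set of matching documents with the query parameter.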
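
As a rough illustration of that recommendation, the sketch below builds a script_score request body as a Python dict whose inner query is a filter instead of match_all, so fewer documents reach the vector function. The field names (category.keyword, vector_recommendation) are those of the books index used later in this article, and the script string is the article's own. The helper name build_filtered_vector_query and the choice of a term filter are assumptions made purely for illustration.

```python
import json


def build_filtered_vector_query(query_vector, category):
    """Build a script_score body that only scores documents in one category.

    Hypothetical helper: the name and the term filter are illustrative
    assumptions, not part of the original article.
    """
    return {
        "query": {
            "script_score": {
                # Only documents passing this filter are scanned by the script.
                "query": {
                    "bool": {
                        "filter": [{"term": {"category.keyword": category}}]
                    }
                },
                "script": {
                    # Same script expression as used in the article.
                    "source": "cosineSimilarity(params.query_vector, "
                              "doc['vector_recommendation']) + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    }


body = build_filtered_vector_query([1, 5, 10], "databases")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a GET books/_search request; only books whose category is databases would then be scored.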

Preparing the data

We first create an index called books and define its mapping as follows:

PUT books
{
  "mappings": {
    "properties": {
      "author": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "category": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "format": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "isbn13": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "pages": { "type": "long" },
      "price": { "type": "float" },
      "publisher": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "rating": { "type": "float" },
      "release_year": { "type": "date", "format": "strict_year" },
      "title": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
      },
      "vector_recommendation": { "type": "dense_vector", "dims": 3 }
    }
  }
}

Next, we import the data through the bulk API:

PUT books/_bulk
{ "index" : { "_id" : "database-internals" } }
{"isbn13":"978-1492040347","author":"Alexander Petrov", "title":"Database Internals: A deep-dive into how distributed data systems work","publisher":"O'Reilly","category":["databases","information systems"],"pages":350,"price":47.28,"format":"paperback","rating":4.5, "release_year" : "2019", "vector_recommendation" : [3.5, 4.5, 5.2]}
{ "index" : { "_id" : "designing-data-intensive-applications" } }
{"isbn13":"978-1449373320", "author":"Martin Kleppmann", "title":"Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems","publisher":"O'Reilly","category":["databases" ],"pages":590,"price":31.06,"format":"paperback","rating":4.4, "release_year" : "2017", "vector_recommendation" : [5.9, 4.4, 6.8]}
{ "index" : { "_id" : "kafka-the-definitive-guide" } }
{"isbn13":"978-1491936160","author":[ "Neha Narkhede", "Gwen Shapira", "Todd Palino"], "title":"Kafka: The Definitive Guide: Real-time data and stream processing at scale", "publisher":"O'Reilly","category":["databases" ],"pages":297,"price":37.31,"format":"paperback","rating":3.9, "release_year" : "2017", "vector_recommendation" : [2.97, 3.9, 6.2]}
{ "index" : { "_id" : "effective-java" } }
{"isbn13":"978-1491936160","author": "Joshua Block", "title":"Effective Java", "publisher":"Addison-Wesley", "category":["programming languages", "java" ],"pages":412,"price":27.91,"format":"paperback","rating":4.2, "release_year" : "2017", "vector_recommendation" : [4.12, 4.2, 7.2]}
{ "index" : { "_id" : "daemon" } }
{"isbn13":"978-1847249616","author":"Daniel Suarez", "title":"Daemon","publisher":"Quercus","category":["dystopia","novel"],"pages":448,"price":12.03,"format":"paperback","rating":4.0, "release_year" : "2011", "vector_recommendation" : [4.48, 4.0, 8.7]}
{ "index" : { "_id" : "cryptonomicon" } }
{"isbn13":"978-1847249616","author":"Neal Stephenson", "title":"Cryptonomicon","publisher":"Avon","category":["thriller", "novel" ],"pages":1152,"price":6.99,"format":"paperback","rating":4.0, "release_year" : "2002", "vector_recommendation" : [10.0, 4.0, 9.3]}
{ "index" : { "_id" : "garbage-collection-handbook" } }
{"isbn13":"978-1420082791","author": [ "Richard Jones", "Antony Hosking", "Eliot Moss" ], "title":"The Garbage Collection Handbook: The Art of Automatic Memory Management","publisher":"Taylor & Francis","category":["programming algorithms" ],"pages":511,"price":87.85,"format":"paperback","rating":5.0, "release_year" : "2011", "vector_recommendation" : [5.1, 5.0, 1.3] }
{ "index" : { "_id" : "radical-candor" } }
{"isbn13":"978-1250258403","author": "Kim Scott", "title":"Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity","publisher":"Macmillan","category":["human resources","management", "new work"],"pages":404,"price":7.29,"format":"paperback","rating":4.0, "release_year" : "2018", "vector_recommendation" : [4.0, 4.0, 9.2] }

The books index now contains 8 documents. Looking closely at the data we just indexed, each document has a field called vector_recommendation:

"vector_recommendation" : [3.5, 4.5, 5.2]
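  • The first component, 3.5, is this document's pages value divided by 100. The more pages the book has, the larger the value. It ranges from 0 to 10.
  • The second component is the book's rating, which can be found in the document's rating field. It ranges from 0 to 5.
  • The third component encodes the book's price. The lower the value, the more expensive the book, since we all prefer cheaper books. 0 stands for a book costing more than 100 yuan, and 10 for a book costing less than 10 yuan.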
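
The three components described above can be sketched in Python as follows. The article only gives the end points of the price scale, so the linear mapping 10 - price / 10 used here is an assumption, as is capping the page component at 10. The function names are hypothetical. The results come out close to, but not always identical to, the hand-entered vectors in the bulk request.

```python
def pages_component(pages):
    """Pages divided by 100, capped at 10 (assumed from the stated 0-10 range)."""
    return min(pages / 100.0, 10.0)


def price_component(price):
    """ASSUMPTION: linear mapping 10 - price / 10, clamped to [0, 10].

    The article only states that 0 means "over 100" and 10 means "under 10";
    the exact formula is not given.
    """
    return max(0.0, min(10.0, 10.0 - price / 10.0))


def build_vector(pages, rating, price):
    """Assemble [pages component, rating, price component]."""
    return [pages_component(pages), rating, price_component(price)]


# Values taken from the "database-internals" document in the bulk request.
vector = build_vector(pages=350, rating=4.5, price=47.28)
print([round(component, 2) for component in vector])
```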

In the mapping above, the field is defined like this:

      "vector_recommendation": {
        "type": "dense_vector",
        "dims": 3
      }

This declares vector_recommendation as a field of type dense_vector holding a 3-dimensional vector.

Our data is now ready. Next, let's run some searches we are interested in.

Searching for short, cheap, highly rated books

We have now built our vector model. So how do we find books that are short, cheap, and very highly rated? We can search like this:

GET books/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['vector_recommendation']) + 1.0",
        "params": {
          "query_vector": [1, 5, 10]
        }
      }
    }
  }
}

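Here we use script_score. If you are not familiar with it, see my earlier article “Elasticsearch:使用function_score及soft_score定制搜索结果的分数” (on customizing search result scores with function_score and script_score) for more background.

In the search above, we use the script: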

cosineSimilarity(params.query_vector, doc['vector_recommendation']) + 1.0

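to compute the search score. Cosine similarity ranges from -1 to 1 and Elasticsearch does not accept negative scores, so we add 1 to keep the final score from going negative.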
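
To make the effect of the + 1.0 concrete, here is a minimal plain-Python sketch of the same expression. It is not how Elasticsearch evaluates the script internally; the function names are hypothetical. The document vector is the one stored for radical-candor in the bulk request.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def shifted_score(query_vector, doc_vector):
    """Mirror of the script: cosine similarity shifted into the range [0, 2]."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


query_vector = [1, 5, 10]

# Vector of the "radical-candor" document: the score is about 1.9569.
print(shifted_score(query_vector, [4.0, 4.0, 9.2]))

# A vector pointing the opposite way has cosine -1, so the shifted
# score lands at (approximately) 0 instead of a negative number.
print(shifted_score(query_vector, [-1, -5, -10]))
```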

In the request above, the parameters are:

      "params": {
        "query_vector": [1, 5, 10]
      }

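We are looking for a book of ideally about 100 pages, since the first component is 1. We also want a well-rated book, since the second component is 5, and the cheapest possible book, since the third component is 10. Cosine similarity compares only the direction of the two vectors, so these values act as relative preferences rather than exact targets. With these requirements, the search returns the following results: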

"hits" : [
  {
    "_index" : "books", "_type" : "_doc", "_id" : "radical-candor", "_score" : 1.9568613,
    "_source" : {
      "isbn13" : "978-1250258403",
      "author" : "Kim Scott",
      "title" : "Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity",
      "publisher" : "Macmillan",
      "category" : ["human resources", "management", "new work"],
      "pages" : 404, "price" : 7.29, "format" : "paperback", "rating" : 4.0, "release_year" : "2018",
      "vector_recommendation" : [4.0, 4.0, 9.2]
    }
  },
  {
    "_index" : "books", "_type" : "_doc", "_id" : "kafka-the-definitive-guide", "_score" : 1.9520907,
    "_source" : {
      "isbn13" : "978-1491936160",
      "author" : ["Neha Narkhede", "Gwen Shapira", "Todd Palino"],
      "title" : "Kafka: The Definitive Guide: Real-time data and stream processing at scale",
      "publisher" : "O'Reilly",
      "category" : ["databases"],
      "pages" : 297, "price" : 37.31, "format" : "paperback", "rating" : 3.9, "release_year" : "2017",
      "vector_recommendation" : [2.97, 3.9, 6.2]
    }
  },
  {
    "_index" : "books", "_type" : "_doc", "_id" : "daemon", "_score" : 1.9394372,
    "_source" : {
      "isbn13" : "978-1847249616",
      "author" : "Daniel Suarez",
      "title" : "Daemon",
      "publisher" : "Quercus",
      "category" : ["dystopia", "novel"],
      "pages" : 448, "price" : 12.03, "format" : "paperback", "rating" : 4.0, "release_year" : "2011",
      "vector_recommendation" : [4.48, 4.0, 8.7]
    }
  },
  {
    "_index" : "books", "_type" : "_doc", "_id" : "effective-java", "_score" : 1.9305289,
    "_source" : {
      "isbn13" : "978-1491936160",
      "author" : "Joshua Block",
      "title" : "Effective Java",
      "publisher" : "Addison-Wesley",
      "category" : ["programming languages", "java"],
      "pages" : 412, "price" : 27.91, "format" : "paperback", "rating" : 4.2, "release_year" : "2017",
      "vector_recommendation" : [4.12, 4.2, 7.2]
    }
  },
  {
    "_index" : "books", "_type" : "_doc", "_id" : "database-internals", "_score" : 1.9005439,
    "_source" : {
      "isbn13" : "978-1492040347",
      "author" : "Alexander Petrov",
      "title" : "Database Internals: A deep-dive into how distributed data systems work",
      "publisher" : "O'Reilly",
      "category" : ["databases", "information systems"],
      "pages" : 350, "price" : 47.28, "format" : "paperback", "rating" : 4.5, "release_year" : "2019",
      "vector_recommendation" : [3.5, 4.5, 5.2]
    }
  },
  {
    "_index" : "books", "_type" : "_doc", "_id" : "designing-data-intensive-applications", "_score" : 1.8525991,
    "_source" : {
      "isbn13" : "978-1449373320",
      "author" : "Martin Kleppmann",
      "title" : "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems",
      "publisher" : "O'Reilly",
      "category" : ["databases"],
      "pages" : 590, "price" : 31.06, "format" : "paperback", "rating" : 4.4, "release_year" : "2017",
      "vector_recommendation" : [5.9, 4.4, 6.8]
    }
  },
  {
    "_index" : "books", "_type" : "_doc", "_id" : "cryptonomicon", "_score" : 1.7700485,
    "_source" : {
      "isbn13" : "978-1847249616",
      "author" : "Neal Stephenson",
      "title" : "Cryptonomicon",
      "publisher" : "Avon",
      "category" : ["thriller", "novel"],
      "pages" : 1152, "price" : 6.99, "format" : "paperback", "rating" : 4.0, "release_year" : "2002",
      "vector_recommendation" : [10.0, 4.0, 9.3]
    }
  },
  {
    "_index" : "books", "_type" : "_doc", "_id" : "garbage-collection-handbook", "_score" : 1.528916,
    "_source" : {
      "isbn13" : "978-1420082791",
      "author" : ["Richard Jones", "Antony Hosking", "Eliot Moss"],
      "title" : "The Garbage Collection Handbook: The Art of Automatic Memory Management",
      "publisher" : "Taylor & Francis",
      "category" : ["programming algorithms"],
      "pages" : 511, "price" : 87.85, "format" : "paperback", "rating" : 5.0, "release_year" : "2011",
      "vector_recommendation" : [5.1, 5.0, 1.3]
    }
  }
]

We can see that “Radical Candor: Be a Kick-Ass Boss Without Losing Your Humanity” is the closest match. Let's look at its vector_recommendation:

          "vector_recommendation" : [4.0, 4.0, 9.2]

Of all the books, this is the one that comes closest to our search criteria.
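
The ranking can be cross-checked outside Elasticsearch with a small Python sketch that applies the same formula to the eight vectors from the bulk request. This is an independent recomputation under the assumption that the score is exactly cosine similarity plus 1; small differences in the last digits are possible because this sketch uses double precision. The helper name score is hypothetical.

```python
import math

QUERY_VECTOR = [1, 5, 10]

# vector_recommendation values copied from the bulk request in this article.
BOOK_VECTORS = {
    "database-internals": [3.5, 4.5, 5.2],
    "designing-data-intensive-applications": [5.9, 4.4, 6.8],
    "kafka-the-definitive-guide": [2.97, 3.9, 6.2],
    "effective-java": [4.12, 4.2, 7.2],
    "daemon": [4.48, 4.0, 8.7],
    "cryptonomicon": [10.0, 4.0, 9.3],
    "garbage-collection-handbook": [5.1, 5.0, 1.3],
    "radical-candor": [4.0, 4.0, 9.2],
}


def score(query, doc):
    """Cosine similarity between query and doc, plus 1.0."""
    dot = sum(q * d for q, d in zip(query, doc))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_d = math.sqrt(sum(d * d for d in doc))
    return dot / (norm_q * norm_d) + 1.0


# Sort book ids from highest to lowest score.
ranking = sorted(
    BOOK_VECTORS,
    key=lambda book_id: score(QUERY_VECTOR, BOOK_VECTORS[book_id]),
    reverse=True,
)

for book_id in ranking:
    print(f"{score(QUERY_VECTOR, BOOK_VECTORS[book_id]):.4f}  {book_id}")
```

Radical Candor comes first even though it has about 400 pages rather than 100: its vector [4.0, 4.0, 9.2] points in nearly the same direction as the query vector, and direction is all that cosine similarity measures.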

Reference:

【1】https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html#vector-functions



This article is a repost.

In case of copyright infringement, please contact cloudcommunity@tencent.com to have it removed.
