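In the previous article, I introduced Painless scripting and went through its syntax and usage in detail. That article also covered some best practices, such as why to use params, when to use "doc" values rather than "_source" to access document fields, and how to create fields dynamically.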
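In this article, we will explore more uses of Painless scripts: using them in the query context and the filter context, using conditions in scripts, deleting fields and nested fields, accessing nested objects, using scripts in scoring, and more.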
PUT tweets/_bulk
{"index":{"_id":1}}
{"username":"tom","posted_date":"2017/07/25","message":"I brought apple stock at the best price","tags":["stock","money"],"info":{"device":"mobile","os":"ios"},"likes":10}
{"index":{"_id":2}}
{"username":"mary","posted_date":"2017/06/25","message":"Machine learning is the future","tags":["ai","tech"],"info":{"device":"desktop","os":"ios"},"likes":100}
{"index":{"_id":3}}
{"username":"tom","posted_date":"2017/07/27","message":"just tweeting","tags":["confused"],"info":{"device":"mobile","os":"win"},"likes":0}
{"index":{"_id":4}}
{"username":"mary","posted_date":"2017/07/28","message":"exploring painless","tags":["elastic"],"info":{"device":"mobile","os":"linux"},"likes":100}
{"index":{"_id":5}}
{"username":"mary","posted_date":"2017/05/20","message":"painless is fun but its a new scripting language in the town","tags":["elastic","painless","scripting"],"info":{"device":"mobile","os":"linux"},"likes":1000}
The request above uses the bulk API to load our sample data into the tweets index.
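A script query lets us run a script against every document. Script queries are usually used in the filter context. To include a script in a query or filter context, make sure the script is embedded in a script object ("script": {}). That is why, in the example below, you will see a script element nested inside another script element.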
Let's try an example and find all tweets that contain the string "painless" and are longer than 25 characters.
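To make the two conditions concrete before looking at the request, here is a minimal sketch in plain Python rather than Elasticsearch. It assumes the five sample messages loaded earlier, approximates the match clause with a simple lowercase word check, and uses a hypothetical helper name, `matches`; it is an illustration of the logic, not of how Elasticsearch evaluates the query.

```python
# Sample messages keyed by document ID (copied from the bulk request).
messages = {
    1: "I brought apple stock at the best price",
    2: "Machine learning is the future",
    3: "just tweeting",
    4: "exploring painless",
    5: "painless is fun but its a new scripting language in the town",
}

def matches(message, term, min_length):
    """Rough stand-in for the bool query: term match AND length filter."""
    has_term = term in message.lower().split()  # approximates the match clause
    long_enough = len(message) > min_length     # mirrors the script filter
    return has_term and long_enough

# The threshold is passed in as an argument, much like "params" in a script.
hits = [doc_id for doc_id, msg in messages.items() if matches(msg, "painless", 25)]
print(hits)
```

Document 4 also contains "painless", but its message is only 18 characters long, so only document 5 passes both conditions. The Elasticsearch request expresses the same two conditions as a match clause plus a script filter.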
GET tweets/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "painless" } }
      ],
      "filter": [
        {
          "script": {
            "script": {
              "source": "doc['message.keyword'].value.length() > params.length",
              "params": { "length": 25 }
            }
          }
        }
      ]
    }
  }
}

"hits" : [
  {
    "_index" : "tweets",
    "_type" : "_doc",
    "_id" : "5",
    "_score" : 0.60910475,
    "_source" : {
      "username" : "mary",
      "posted_date" : "2017/05/20",
      "message" : "painless is fun but its a new scripting language in the town",
      "tags" : [ "elastic", "painless", "scripting" ],
      "info" : { "device" : "mobile", "os" : "linux" },
      "likes" : 1000
    }
  }
]
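Scripts can also be used in aggregations. Aggregations usually operate on the values of a (non-analyzed) field. With a script, you can extract a value from an existing field, or combine values from several fields, and then aggregate on the newly derived value.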
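As a rough illustration of aggregating on a value derived from several fields, the sketch below uses plain Python instead of Elasticsearch. It assumes the username and info.device values from the sample tweets, and the combined key produced by the hypothetical `derived_key` helper is only an example of a derived value, not something the article's requests compute.

```python
from collections import Counter

# Two fields per tweet, copied from the sample data: username and info.device.
tweets = [
    {"username": "tom", "info": {"device": "mobile"}},
    {"username": "mary", "info": {"device": "desktop"}},
    {"username": "tom", "info": {"device": "mobile"}},
    {"username": "mary", "info": {"device": "mobile"}},
    {"username": "mary", "info": {"device": "mobile"}},
]

def derived_key(tweet):
    """Hypothetical derived value: username and device joined into one key."""
    return tweet["username"] + "-" + tweet["info"]["device"]

# Count documents per derived key, similar to one bucket per distinct value.
bucket_counts = Counter(derived_key(tweet) for tweet in tweets)
print(dict(bucket_counts))
```

Each distinct derived key plays the role of a bucket, and its count plays the role of doc_count.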
The tweets above only carry a "posted_date". What if we want to find the number of tweets per month? The following example shows a script used in an aggregation:
GET tweets/_search
{
  "size": 0,
  "aggs": {
    "my_terms_agg": {
      "terms": {
        "script": {
          "source": """
            ZonedDateTime date = doc['posted_date'].value;
            return date.getMonth()
          """
        }
      }
    }
  }
}
Here the script derives the month for each document, and the aggregation then buckets on that generated month:

"aggregations" : {
  "my_terms_agg" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      { "key" : "JULY", "doc_count" : 3 },
      { "key" : "JUNE", "doc_count" : 1 },
      { "key" : "MAY", "doc_count" : 1 }
    ]
  }
}
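We can use a script to delete a field or a nested field. All you need to do is call the remove method and pass in the name of the field or nested field. For example, suppose we want to delete the nested field "device" from the document with ID 5.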
POST tweets/_update/5
{
  "script": {
    "source": "ctx._source.info.remove(params.fieldname)",
    "params": { "fieldname": "device" }
  }
}
GET tweets/_doc/5

"_source" : {
  "username" : "mary",
  "posted_date" : "2017/05/20",
  "message" : "painless is fun but its a new scripting language in the town",
  "tags" : [ "elastic", "painless", "scripting" ],
  "info" : { "os" : "linux" },
  "likes" : 1000
}
We can see that device has been removed from under info.
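The effect of the update script can be mimicked on an ordinary dictionary. The following is a minimal Python sketch, assuming a trimmed copy of document 5 and a hypothetical helper named `remove_nested_field`; it only mirrors what the script does to the document source and does not talk to Elasticsearch.

```python
# A trimmed copy of document 5 as it looked before the update.
source = {
    "username": "mary",
    "info": {"device": "mobile", "os": "linux"},
    "likes": 1000,
}

def remove_nested_field(doc_source, parent, fieldname):
    """Remove fieldname from the object stored under parent.

    pop() with a default returns None when the key is absent, in the same
    spirit as a map's remove returning null for a missing key.
    """
    return doc_source[parent].pop(fieldname, None)

removed = remove_nested_field(source, "info", "device")
print(removed, source["info"])
```

As in the script, the field name is supplied as a parameter, so the same code can remove a different field without being rewritten.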
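When we run a match query, Elasticsearch returns the matching documents and computes a score for each one to show how well it matches the query. The default algorithm, BM25, handles scoring and relevance well, but sometimes relevance has to be determined by a different algorithm or enhanced with additional scoring heuristics. This is where Elasticsearch's script_score and function_score features become useful.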
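Suppose we want to search for the text "painless" but show the tweets with more "likes" at the top of the results, much like a list of trending or popular tweets. Let's see it in action.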
GET tweets/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "message": "painless" }
      },
      "script_score": {
        "script": {
          "source": "1 + doc['likes'].value"
        }
      }
    }
  }
}

"hits" : [
  {
    "_index" : "tweets",
    "_type" : "_doc",
    "_id" : "5",
    "_score" : 529.9271,
    "_source" : {
      "username" : "mary",
      "posted_date" : "2017/05/20",
      "message" : "painless is fun but its a new scripting language in the town",
      "tags" : [ "elastic", "painless", "scripting" ],
      "info" : { "os" : "linux" },
      "likes" : 1000
    }
  },
  {
    "_index" : "tweets",
    "_type" : "_doc",
    "_id" : "4",
    "_score" : 98.51341,
    "_source" : {
      "username" : "mary",
      "posted_date" : "2017/07/28",
      "message" : "exploring painless",
      "tags" : [ "elastic" ],
      "info" : { "device" : "mobile", "os" : "linux" },
      "likes" : 100
    }
  }
]
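In the example above, if we had run a regular query without the custom score, document 4 would have come out on top: its message is shorter, so the default BM25 scoring, which normalizes for field length, gives it a higher score than document 5.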
GET tweets/_search
{
  "query": {
    "match": { "message": "painless" }
  }
}

"hits" : [
  {
    "_index" : "tweets",
    "_type" : "_doc",
    "_id" : "4",
    "_score" : 0.9753803,
    "_source" : {
      "username" : "mary",
      "posted_date" : "2017/07/28",
      "message" : "exploring painless",
      "tags" : [ "elastic" ],
      "info" : { "device" : "mobile", "os" : "linux" },
      "likes" : 100
    }
  },
  {
    "_index" : "tweets",
    "_type" : "_doc",
    "_id" : "5",
    "_score" : 0.5293977,
    "_source" : {
      "username" : "mary",
      "posted_date" : "2017/05/20",
      "message" : "painless is fun but its a new scripting language in the town",
      "tags" : [ "elastic", "painless", "scripting" ],
      "info" : { "os" : "linux" },
      "likes" : 1000
    }
  }
]
Here the document with ID 4 scores higher than the document with ID 5.
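The two result sets can be reconciled with a little arithmetic. The sketch below is plain Python, not Elasticsearch, and rests on one assumption worth stating: the function_score request does not set boost_mode, so the query score and the script score are combined by the default mode, multiplication. The helper name `boosted_score` is hypothetical.

```python
# Scores from the plain match query and the likes of each matching document.
query_scores = {4: 0.9753803, 5: 0.5293977}
likes = {4: 100, 5: 1000}

def boosted_score(doc_id):
    """Query score multiplied by the script score (1 + likes)."""
    return query_scores[doc_id] * (1 + likes[doc_id])

# Rank the documents by boosted score, highest first.
ranking = sorted(query_scores, key=boosted_score, reverse=True)
for doc_id in ranking:
    print(doc_id, round(boosted_score(doc_id), 4))
```

Under that assumption the products come to roughly 529.93 for document 5 and 98.51 for document 4, in line with the scores returned by the function_score query, which explains why the heavily liked tweet moves to the top.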
