Q: Exact count of distinct values for a combination of two fields in Elasticsearch

Stack Overflow user

Asked 2022-11-09 18:24:44

Answers: 2, Views: 38, Followers: 0, Votes: 0

My Elasticsearch index holds about 40 million records. I want to count the distinct values of a combination of two fields.

Given this example set of documents:

[
 {
  "JobId" : 2,
  "DesigId" : 12
 },
 {
  "JobId" : 2,
  "DesigId" : 4
 },
 {
  "JobId" : 3,
  "DesigId" : 5
 },
 {
  "JobId" : 2,
  "DesigId" : 4
 },
 {
  "JobId" : 3,
  "DesigId" : 5
 }
]

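For the example above, the count should be 3, because there are only 3 distinct combinations: (2,12), (2,4), (3,5).

I tried the cardinality aggregation for this, but it only returns an approximate count. I need the exact count.

Here is the query I used with the cardinality aggregation: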
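To make the expected result concrete, here is a minimal in-memory sketch in Python. It is not an Elasticsearch query; the helper name `distinct_pair_count` is made up for illustration and only shows what "distinct combination" means for the sample documents.

```python
# The sample documents from the question.
docs = [
    {"JobId": 2, "DesigId": 12},
    {"JobId": 2, "DesigId": 4},
    {"JobId": 3, "DesigId": 5},
    {"JobId": 2, "DesigId": 4},
    {"JobId": 3, "DesigId": 5},
]


def distinct_pair_count(documents):
    """Count distinct (JobId, DesigId) tuples; the set drops duplicates."""
    return len({(d["JobId"], d["DesigId"]) for d in documents})


print(distinct_pair_count(docs))  # prints 3
```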

"aggs": {
    "counts": {
        "cardinality": {
            "script": "doc['JobId'].value + ',' + doc['DesigId'].value",
            "precision_threshold": 40000
        }
    }
}

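I also tried a composite aggregation on the combination of the two fields, paging through the results with after_key and summing the number of buckets across pages, but this is very time-consuming and my query timed out.

Is there an optimal way to achieve this?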
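For reference, the composite approach described above can be sketched roughly as follows in Python. This is a sketch under assumptions, not the asker's actual code: `search` is a hypothetical callable that sends a request body to the index's _search endpoint and returns the parsed JSON response, and the aggregation name `pairs`, the source names `job` and `desig`, and the page size are all illustrative.

```python
def composite_body(after_key=None, page_size=1000):
    """Build the request body for one page of a composite aggregation."""
    composite = {
        "size": page_size,
        "sources": [
            {"job": {"terms": {"field": "JobId"}}},
            {"desig": {"terms": {"field": "DesigId"}}},
        ],
    }
    if after_key is not None:
        composite["after"] = after_key  # resume after the last bucket seen
    return {"size": 0, "aggs": {"pairs": {"composite": composite}}}


def count_distinct_pairs(search, page_size=1000):
    """Page through every composite bucket and count the buckets.

    `search` is a hypothetical callable: request body (dict) in,
    parsed _search response (dict) out.
    """
    total, after_key = 0, None
    while True:
        response = search(composite_body(after_key, page_size))
        agg = response["aggregations"]["pairs"]
        total += len(agg["buckets"])  # each bucket is one distinct pair
        after_key = agg.get("after_key")
        if not agg["buckets"] or after_key is None:
            return total
```

Each page costs one round trip, so the total time grows with the number of distinct pairs divided by the page size.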

elasticsearch

elasticsearch-aggregation

resthighlevelclient

elastic-rest-client

2 Answers

Stack Overflow user

Accepted answer

Posted 2022-11-10 05:18:38

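Scripts should be avoided because they hurt performance. For your use case, there are three ways to get the desired result:

1. Use the composite aggregation (you have already tried this).
2. Use the multi_terms aggregation, although this is not a memory-efficient solution.

Search query: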

{
    "size": 0,
    "aggs": {
        "jobId_and_DesigId": {
            "multi_terms": {
                "terms": [
                    {
                        "field": "JobId"
                    },
                    {
                        "field": "DesigId"
                    }
                ]
            }
        }
    }
}

Search result:

"aggregations": {
        "jobId_and_DesigId": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": [
                        2,
                        4
                    ],
                    "key_as_string": "2|4",
                    "doc_count": 2
                },
                {
                    "key": [
                        3,
                        5
                    ],
                    "key_as_string": "3|5",
                    "doc_count": 2
                },
                {
                    "key": [
                        2,
                        12
                    ],
                    "key_as_string": "2|12",
                    "doc_count": 1
                }
            ]
        }
    }
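In this response each bucket is one distinct combination, so the distinct count is the number of buckets. One caveat: like the terms aggregation, multi_terms returns only the top buckets (10 by default) unless `size` is raised. The Python sketch below adds an explicit `size` and counts the buckets; the function names and the chosen `size` are assumptions for illustration, and it treats a nonzero `sum_other_doc_count` as a sign that buckets were cut off.

```python
def multi_terms_body(size):
    """Same multi_terms request as above, with an explicit bucket limit."""
    return {
        "size": 0,
        "aggs": {
            "jobId_and_DesigId": {
                "multi_terms": {
                    "size": size,
                    "terms": [{"field": "JobId"}, {"field": "DesigId"}],
                }
            }
        },
    }


def distinct_from_buckets(response):
    """Return the number of buckets, refusing truncated results."""
    agg = response["aggregations"]["jobId_and_DesigId"]
    if agg["sum_other_doc_count"] > 0:
        raise ValueError("buckets were truncated; raise `size`")
    return len(agg["buckets"])


# The search result shown above, reduced to the fields used here.
sample_response = {
    "aggregations": {
        "jobId_and_DesigId": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": [2, 4], "key_as_string": "2|4", "doc_count": 2},
                {"key": [3, 5], "key_as_string": "3|5", "doc_count": 2},
                {"key": [2, 12], "key_as_string": "2|12", "doc_count": 1},
            ],
        }
    }
}

print(distinct_from_buckets(sample_response))  # prints 3
```

With tens of millions of documents, a `size` large enough to hold every combination is what makes this approach memory-hungry.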

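3. The best approach, however, is to store the combined field value (i.e. the combination of "JobId" and "DesigId") at index time. This is possible with the set processor in an ingest pipeline: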

PUT /_ingest/pipeline/concat
{
  "processors": [
    {
      "set": {
        "field": "combined_field",
        "value": "{{JobId}} {{DesigId}}"
      }
    }
  ]
}

Index API

Every time you index a document, you need to add the pipeline=concat query parameter. The index API call would look like this:

POST /<your-index>/_doc/1?pipeline=concat
{
    "JobId": 2,
    "DesigId": 12
}
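To illustrate what the pipeline does to each document, here is a small Python sketch that imitates the set processor locally. It does not call Elasticsearch; `apply_concat_pipeline` is a hypothetical helper that renders the same "{{JobId}} {{DesigId}}" template into `combined_field`.

```python
def apply_concat_pipeline(doc):
    """Imitate the set processor: write "<JobId> <DesigId>" to combined_field."""
    enriched = dict(doc)  # leave the input document untouched
    enriched["combined_field"] = f'{doc["JobId"]} {doc["DesigId"]}'
    return enriched


docs = [
    {"JobId": 2, "DesigId": 12},
    {"JobId": 2, "DesigId": 4},
    {"JobId": 3, "DesigId": 5},
    {"JobId": 2, "DesigId": 4},
    {"JobId": 3, "DesigId": 5},
]

indexed = [apply_concat_pipeline(d) for d in docs]
print(indexed[0])
# prints {'JobId': 2, 'DesigId': 12, 'combined_field': '2 12'}
print(len({d["combined_field"] for d in indexed}))  # prints 3
```

Because the two IDs are joined into a single string, any aggregation that works on one keyword field can then be applied to the pair.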

Search query:

{
    "size": 0,
    "aggs": {
        "jobId_and_DesigId": {
            "terms": {
                "field":"combined_field.keyword"
            }
        }
    }
}

Search result:

 "aggregations": {
        "jobId_and_DesigId": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "2 4",
                    "doc_count": 2
                },
                {
                    "key": "3 5",
                    "doc_count": 2
                },
                {
                    "key": "2 12",
                    "doc_count": 1
                }
            ]
        }
    }

Votes: 1

Stack Overflow user

Posted 2022-11-10 05:35:06

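The cardinality aggregation only gives an approximate count. precision_threshold is capped at 40,000, and since you have far more than 40K documents, setting it will not make the count exact either.

You can use a scripted metric aggregation instead. It gives an exact count, but it is much slower than the cardinality aggregation.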

{
  "aggs": {
    "Distinct_Count": {
      "scripted_metric": {
        "init_script": "state.list = []",
        "map_script": """
                            state.list.add(doc['JobId'].value+'-'+doc['DesigId'].value);
                      """,
        "combine_script": "return state.list;",
        "reduce_script":"""
                          Map uniqueValueMap = new HashMap(); 
                          int count = 0;
                          for(shardList in states) {
                            if(shardList != null) { 
                              for(key in shardList) {
                                if(!uniqueValueMap.containsKey(key)) {
                                  count +=1;
                                  uniqueValueMap.put(key, key);
                                }
                              }
                            }
                          } 
                          return count;
                        """
      }
    }
  }
}
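The four scripts run in phases: init and map on every shard, combine once per shard, and reduce on the coordinating node. The Python sketch below imitates that flow locally so the logic is easier to follow. It is an illustration only: the split of the sample documents into two shards is an assumption, and the reduce step uses a set where the Painless script uses a HashMap.

```python
def init_state():
    """init_script: every shard starts with an empty list."""
    return {"list": []}


def map_doc(state, doc):
    """map_script: record one "JobId-DesigId" key per document."""
    state["list"].append(f'{doc["JobId"]}-{doc["DesigId"]}')


def combine(state):
    """combine_script: each shard returns its full list of keys."""
    return state["list"]


def reduce_states(states):
    """reduce_script: merge all shard lists and count the unique keys."""
    unique = set()
    for shard_list in states:
        if shard_list is not None:
            unique.update(shard_list)
    return len(unique)


def scripted_metric(shards):
    """Run the four phases over a list of shards (lists of documents)."""
    states = []
    for shard_docs in shards:
        state = init_state()
        for doc in shard_docs:
            map_doc(state, doc)
        states.append(combine(state))
    return reduce_states(states)


# Hypothetical split of the question's sample documents across two shards.
shards = [
    [{"JobId": 2, "DesigId": 12}, {"JobId": 2, "DesigId": 4}, {"JobId": 3, "DesigId": 5}],
    [{"JobId": 2, "DesigId": 4}, {"JobId": 3, "DesigId": 5}],
]

print(scripted_metric(shards))  # prints 3
```

Note that every shard sends one key per document, duplicates included, to the reduce phase, which is a large part of why this approach is slow and memory-intensive on a big index.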

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/74379718
