前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >ES系列16:管道聚合你都不会?那你如何做聚合分析

ES系列16:管道聚合你都不会?那你如何做聚合分析

作者头像
方才编程_公众号同名
发布2020-11-13 10:38:33
1.2K0
发布2020-11-13 10:38:33
举报
文章被收录于专栏:方才编程方才编程
本文目标

学习管道聚合,是为了完成更复杂的聚合分析,通过本文,你将对管道聚合的各种类型的功用和使用场景有一个全面的掌握。当遇到聚合需求时,可以快速反应,选用合适的聚合类型。

ps:本文基于ES 7.7.1【文末附《管道聚合详解》xmind 获取方式】

管道聚合详解

前两天,我们已经学习ES的桶聚合和指标聚合,这是学习 Pipeline Agg 的基础,如果对这两个聚合还没有整体概念的伙伴,可点击:ES系列14:你知道25种(桶聚合)Bucket Aggs 类型各自的使用场景么ES系列15:ES的指标聚合有哪些呢?在这里,我都给你总结好了

在掌握了Bucket Agg 和 Metric Agg 后,对于很多聚合分析场景,就已经可以得心应手了,如果有更为复杂的,比如说二次聚合,可能就需要用到管道聚合 Pipeline Agg 了。

什么叫 Pipeline Agg 呢?就是管道聚合:对其他聚合结果进行二次聚合。注意,管道聚合不能具有子聚合,但是根据其类型,它可以引用buckets_path 允许管道聚合链接的另一个管道。例如,您可以将两个导数链接在一起以计算第二个导数(即导数的导数)。

在系统学习管道聚合之前,我们需要先掌握管道聚合的必填参数 buckets_path 的语法。

01 buckets_path 的语法

1.1 与操作的聚合对象同级,就是 Agg_Name

代码语言:javascript
复制
POST /_search
{
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"timestamp",
                "calendar_interval":"day"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "lemmings" }
                },
                "the_movavg":{
                    "moving_avg":{ "buckets_path": "the_sum" }
                }
            }
        }
    }
}

1.2 与操作的聚合对象不同级,就是 Agg_Name的相对路径,多个 Agg_Name间使用“>”连接

代码语言:javascript
复制
POST /_search
{
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "calendar_interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "max_monthly_sales": {
            "max_bucket": {
                "buckets_path": "sales_per_month>sales"
            }
        }
    }
}

1.3 操作对象是多值桶某个key 的聚合,就是 Agg_Name1['keyName']>Agg_Name2

代码语言:javascript
复制
POST /_search
{
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "calendar_interval" : "month"
            },
            "aggs": {
                "sale_type": {
                    "terms": {
                        "field": "type"
                    },
                    "aggs": {
                        "sales": {
                            "sum": {
                                "field": "price"
                            }
                        }
                    }
                },
                "hat_vs_bag_ratio": {
                    "bucket_script": {
                        "buckets_path": {
                            "hats": "sale_type['hat']>sales",
                            "bags": "sale_type['bag']>sales"
                        },
                        "script": "params.hats / params.bags"
                    }
                }
            }
        }
    }
}

1.4 特殊路径

代码语言:javascript
复制
POST /_search
{
    "aggs": {
        "my_date_histo": {
            "date_histogram": {
                "field":"timestamp",
                "calendar_interval":"day"
            },
            "aggs": {
                "the_movavg": {
                    "moving_avg": { "buckets_path": "_count" }
                }
            }
        }
    }
}

或者
POST /sales/_search
{
  "size": 0,
  "aggs": {
    "histo": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "day"
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "category"
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "categories._bucket_count"
            },
            "script": {
              "source": "params.count != 0"
            }
          }
        }
      }
    }
  }
}

了解了buckets_path 的语法,接下来我们就详细学习管道聚合,在学习之前,我们要知道管道聚合根据输出结果的位置分为Parent【结果内嵌到现有的聚合分析结果中】 和 Sibling【结果和现有分析结果同级】 两类

02 Parent 分类

2.1 Bucket Script 桶脚本聚合

场景示例:计算出每月T恤销售额与总销售额的比例百分比

代码语言:javascript
复制
POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "calendar_interval" : "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "t-shirts": {
                  "filter": {
                    "term": {
                      "type": "t-shirt"
                    }
                  },
                  "aggs": {
                    "sales": {
                      "sum": {
                        "field": "price"
                      }
                    }
                  }
                },
                "t-shirt-percentage": {
                    "bucket_script": {
                        "buckets_path": {
                          "tShirtSales": "t-shirts>sales",
                          "totalSales": "total_sales"
                        },
                        "script": "params.tShirtSales / params.totalSales * 100"
                    }
                }
            }
        }
    }
}

结果:

代码语言:javascript
复制
{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "total_sales": {
                   "value": 550.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 200.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 36.36363636363637
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "total_sales": {
                   "value": 60.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 10.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 16.666666666666664
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "total_sales": {
                   "value": 375.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 175.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 46.666666666666664
               }
            }
         ]
      }
   }
}

2.2 Bucket Selector 桶选择器聚合

场景示例:获取当月总销售额超过200的存储分区桶

代码语言:javascript
复制
POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "calendar_interval" : "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_bucket_filter": {
                    "bucket_selector": {
                        "buckets_path": {
                          "totalSales": "total_sales"
                        },
                        "script": "params.totalSales > 200"
                    }
                }
            }
        }
    }
}

结果:

代码语言:javascript
复制
{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "total_sales": {
                   "value": 550.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "total_sales": {
                   "value": 375.0
               }
            }
         ]
      }
   }
}

2.3 Bucket Sort 桶排序聚合

场景示例:按降序返回总销售额最高的3个月相对应的存储桶

代码语言:javascript
复制
POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "calendar_interval" : "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_bucket_sort": {
                    "bucket_sort": {
                        "sort": [
                          {"total_sales": {"order": "desc"}}
                        ],
                        "size": 3,				
		        "from": 0
                    }
                }
            }
        }
    }
}

结果:

代码语言:javascript
复制
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "total_sales": {
                   "value": 550.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "total_sales": {
                   "value": 375.0
               },
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "total_sales": {
                   "value": 60.0
               },
            }
         ]
      }
   }

2.4 Cumulative Cardinality 累积基数聚合

场景示例1:获取网站每天的新访问者总的累计数量【后一天会累加前一天的,就是以第一天为基准】

代码语言:javascript
复制
GET /user_hits/_search
{
    "size": 0,
    "aggs" : {
        "users_per_day" : {
            "date_histogram" : {
                "field" : "timestamp",
                "calendar_interval" : "day"
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "user_id"
                    }
                },
                "total_new_users": {
                    "cumulative_cardinality": {
                        "buckets_path": "distinct_users"
                    }
                }
            }
        }
    }
}

结果:

代码语言:javascript
复制
   "aggregations": {
      "users_per_day": {
         "buckets": [
            {
               "key_as_string": "2019-01-01T00:00:00.000Z",
               "key": 1546300800000,
               "doc_count": 2,
               "distinct_users": {
                  "value": 2
               },
               "total_new_users": {
                  "value": 2
               }
            },
            {
               "key_as_string": "2019-01-02T00:00:00.000Z",
               "key": 1546387200000,
               "doc_count": 2,
               "distinct_users": {
                  "value": 2
               },
               "total_new_users": {
                  "value": 3
               }
            },
            {
               "key_as_string": "2019-01-03T00:00:00.000Z",
               "key": 1546473600000,
               "doc_count": 3,
               "distinct_users": {
                  "value": 3
               },
               "total_new_users": {
                  "value": 4
               }
            }
         ]
      }
   }
}

结果分析:
请注意,第二天2019-01-02拥有两个不同的用户,
但total_new_users累积管道agg生成的指标仅增加到三个。
这意味着当天的两个用户中只有一个是新用户,
而在前一天已经看到了另一个用户。
第三天再次发生这种情况,三个用户中只有一个是全新的。

添加聚合:derivative 增量累积基数

场景示例2:获取网站每天增加了多少新用户【数据以前一天为基准】

代码语言:javascript
复制
GET /user_hits/_search
{
    "size": 0,
    "aggs" : {
        "users_per_day" : {
            "date_histogram" : {
                "field" : "timestamp",
                "calendar_interval" : "day"
            },
            "aggs": {
                "distinct_users": {
                    "cardinality": {
                        "field": "user_id"
                    }
                },
                "total_new_users": {
                    "cumulative_cardinality": {
                        "buckets_path": "distinct_users"
                    }
                },
                "incremental_new_users": {
                    "derivative": {
                        "buckets_path": "total_new_users"
                    }
                }
            }
        }
    }
}

结果:

代码语言:javascript
复制
   "aggregations": {
      "users_per_day": {
         "buckets": [
            {
               "key_as_string": "2019-01-01T00:00:00.000Z",
               "key": 1546300800000,
               "doc_count": 2,
               "distinct_users": {
                  "value": 2
               },
               "total_new_users": {
                  "value": 2
               }
            },
            {
               "key_as_string": "2019-01-02T00:00:00.000Z",
               "key": 1546387200000,
               "doc_count": 2,
               "distinct_users": {
                  "value": 2
               },
               "total_new_users": {
                  "value": 3
               },
               "incremental_new_users": {
                  "value": 1.0
               }
            },
            {
               "key_as_string": "2019-01-03T00:00:00.000Z",
               "key": 1546473600000,
               "doc_count": 3,
               "distinct_users": {
                  "value": 3
               },
               "total_new_users": {
                  "value": 4
               },
               "incremental_new_users": {
                  "value": 1.0
               }
            }
         ]
      }
   }
}

2.5 Cumulative Sum 累积总和聚合

场景示例:计算到当月为止,每月累计销售金额的总和

代码语言:javascript
复制
POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "calendar_interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "cumulative_sales": {
                    "cumulative_sum": {
                        "buckets_path": "sales"
                    }
                }
            }
        }
    }
}

结果:

代码语言:javascript
复制
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               },
               "cumulative_sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "cumulative_sales": {
                  "value": 610.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               },
               "cumulative_sales": {
                  "value": 985.0
               }
            }
         ]
      }
   }
}

2.6 剩下4个Parent类型的管道聚合

先了解下,如有需要再深入学习

03 Sibling 分类

3.1 六种统计类管道聚合

我觉得对于这6种管道聚合,应该不用一一介绍了,就和指标聚合类似,Avg、Max、Min、Sum,一看就知道是什么意思,简单看一个示例即可。

场景示例:计算每月销售额总量的平均值

代码语言:javascript
复制
POST /_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    },
    "avg_monthly_sales": {
      "avg_bucket": {
        "buckets_path": "sales_per_month>sales"
      }
    }
  }
}

结果:

代码语言:javascript
复制
{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "avg_monthly_sales": {
          "value": 328.33333333333333
      }
   }
}

对于 Stats Bucket 统计数据桶聚合,我们看下响应结果,知道统计了哪些指标即可:

代码语言:javascript
复制
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "stats_monthly_sales": {
         "count": 3,
         "min": 60.0,
         "max": 550.0,
         "avg": 328.3333333333333,
         "sum": 985.0
      }
   }
}

Extended Stats Bucket 扩展统计数据桶聚合,也直接看统计结果:

代码语言:javascript
复制
示例结果:
      "stats_monthly_sales": {
         "count": 3,
         "min": 60.0,
         "max": 550.0,
         "avg": 328.3333333333333,
         "sum": 985.0,
         "sum_of_squares": 446725.0,
         "variance": 41105.55555555556,
         "std_deviation": 202.74505063146563,
         "std_deviation_bounds": {
           "upper": 733.8234345962646,
           "lower": -77.15676792959795
         }
      }
   }

2.2 Percentiles Bucket 百分数桶聚合

场景示例:计算每月总销售额存储桶对应的百分比位置的金额

代码语言:javascript
复制
POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "calendar_interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "percentiles_monthly_sales": {
            "percentiles_bucket": {
                "buckets_path": "sales_per_month>sales",
                "percents": [ 25.0, 50.0, 75.0 ]
            }
        }
    }
}

结果:

代码语言:javascript
复制
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "percentiles_monthly_sales": {
        "values" : {
            "25.0": 375.0,
            "50.0": 375.0,
            "75.0": 550.0
         }
      }
   }
}

写在最后

对于需要使用聚合分析的小伙伴,建议一定要对ES的3种聚合有一个整体的概念,知道ES的聚合能做哪些数据操作,从而面对各种聚合分析的需求时候,才能快速反应,知道该用什么样的操作,而不是绞尽脑汁,使用自己仅知道的Max、Sum等简单聚合去组合DSL。

本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2020-07-10,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 方才编程 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档