Testing date_histogram aggregation on a timestamp (long) field in Elasticsearch with Python

sparkexpert

Published 2019-05-26 14:09:41


This article is part of the column: 大数据智能实战 (Big Data Intelligence in Practice)

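Because older versions of Elasticsearch did not support the date type, the storage we designed earlier (on version 5.0) used long timestamps instead.

When a newer ES version (6.0) offered the date_histogram aggregation, we found that its interval parameter is quite flexible and can express many kinds of intervals, as follows: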

Here are the valid time specifications and their meanings:

milliseconds (ms)

Fixed length interval; supports multiples.

seconds (s)

1000 milliseconds; fixed length interval (except for the last second of a minute that contains a leap-second, which is 2000ms long); supports multiples.

minutes (m)

All minutes begin at 00 seconds.

One minute (1m) is the interval between 00 seconds of the first minute and 00 seconds of the following minute in the specified timezone, compensating for any intervening leap seconds, so that the number of minutes and seconds past the hour is the same at the start and end.
Multiple minutes (nm) are intervals of exactly 60x1000=60,000 milliseconds each.

hours (h)

All hours begin at 00 minutes and 00 seconds.

One hour (1h) is the interval between 00:00 minutes of the first hour and 00:00 minutes of the following hour in the specified timezone, compensating for any intervening leap seconds, so that the number of minutes and seconds past the hour is the same at the start and end.
Multiple hours (nh) are intervals of exactly 60x60x1000=3,600,000 milliseconds each.

days (d)

All days begin at the earliest possible time, which is usually 00:00:00 (midnight).

One day (1d) is the interval between the start of the day and the start of the following day in the specified timezone, compensating for any intervening time changes.
Multiple days (nd) are intervals of exactly 24x60x60x1000=86,400,000 milliseconds each.

weeks (w)

One week (1w) is the interval between the start day_of_week:hour:minute:second and the same day of the week and time of the following week in the specified timezone.
Multiple weeks (nw) are not supported.

months (M)

One month (1M) is the interval between the start day of the month and time of day and the same day of the month and time of the following month in the specified timezone, so that the day of the month and time of day are the same at the start and end.
Multiple months (nM) are not supported.

quarters (q)

One quarter (1q) is the interval between the start day of the month and time of day and the same day of the month and time of day three months later, so that the day of the month and time of day are the same at the start and end.
Multiple quarters (nq) are not supported.

years (y)

One year (1y) is the interval between the start day of the month and time of day and the same day of the month and time of day the following year in the specified timezone, so that the date and time are the same at the start and end.
Multiple years (ny) are not supported.
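The difference between calendar-aware units such as 1M and fixed multiples such as nd can be checked with the standard library alone. The sketch below is only an illustration, not part of the original test: the helper name `month_length_ms` is made up here, and it measures months in UTC to keep daylight-saving changes out of the picture.

```python
from datetime import datetime, timezone

DAY_MS = 24 * 60 * 60 * 1000  # 86,400,000 ms, the fixed length used for "nd"


def month_length_ms(year, month):
    """Milliseconds from the start of a month to the start of the next (UTC)."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return int((end - start).total_seconds() * 1000)


# A calendar month is not a fixed number of milliseconds.
for month in (2, 4, 7):
    print(2018, month, month_length_ms(2018, month) // DAY_MS, "days")
```

Since the length varies from month to month, no single fixed interval can stand in for 1M.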

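However, for timestamps stored the old way, many posts online say that date_histogram cannot be used on them directly. Setting interval to the equivalent number of seconds does not help either, because a fixed number of seconds cannot be recognized as a calendar week or month.

In actual testing, however, ES turned out to interpret the data automatically. The test scripts are as follows:

(1) Write to ES, storing the timestamp as a long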
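To see why a fixed number of seconds does not behave like a calendar week, the following sketch floors a timestamp to a multiple of 604,800 seconds. It assumes that fixed-length buckets are aligned to the Unix epoch, which is my understanding of how fixed multiples are rounded and not something stated in this post; the function names are hypothetical.

```python
from datetime import datetime, timezone

WEEK_MS = 7 * 24 * 60 * 60 * 1000  # 604,800 seconds in milliseconds
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def floor_to_fixed_interval(ts_ms, interval_ms):
    """Round an epoch-millisecond timestamp down to a multiple of interval_ms."""
    return ts_ms - ts_ms % interval_ms


def weekday_of(ts_ms):
    """Weekday index (Monday=0) of an epoch-millisecond timestamp, in UTC."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).weekday()


sample = int(datetime(2018, 11, 6, 8, 44, 3, tzinfo=timezone.utc).timestamp() * 1000)
bucket_start = floor_to_fixed_interval(sample, WEEK_MS)
print(WEEKDAYS[weekday_of(sample)], "->", WEEKDAYS[weekday_of(bucket_start)])
```

Under that assumption every bucket starts on a Thursday, because 1970-01-01 was a Thursday, so the buckets do not line up with calendar weeks.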

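import datetime
import time
from random import randint

from elasticsearch import Elasticsearch, helpers


def WriteES():
    '''Write test documents to ES, one per day, with a long (epoch ms) timestamp.'''
    es = Elasticsearch()

    base = datetime.datetime.today()
    numdays = 100

    actions = []
    # One document per day, from today back to numdays days ago (inclusive).
    for j in range(numdays + 1):
        d1 = base - datetime.timedelta(days=j)
        # Local time -> epoch milliseconds, stored as a plain long.
        ts = int(time.mktime(d1.timetuple()) * 1000)
        action = {
            "_index": "tickets",
            "_type": "last",
            "_id": j,
            "_source": {
                "count": randint(0, 1000),
                "timestamp": ts
            }
        }
        actions.append(action)

    # Send all documents in a single bulk request.
    helpers.bulk(es, actions)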

(2) Aggregation test:

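import datetime

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search


def AggES():
    client = Elasticsearch()

    # Restrict the search to the index written by WriteES().
    s = Search(using=client, index='tickets')
    # Weekly buckets on the long timestamp field, summing count per bucket.
    s.aggs.bucket('per_tag', 'date_histogram', field='timestamp', interval='week') \
        .metric('clicks_per_day', 'sum', field='count')

    response = s.execute()

    print('Query results')
    for hit in response:
        # Stored value is epoch milliseconds; convert to seconds for display.
        st = datetime.datetime.fromtimestamp(hit.timestamp // 1000).strftime('%Y-%m-%d %H:%M:%S')
        print(hit.meta.score, hit.count, st)

    print('Aggregation results')
    for tag in response.aggregations.per_tag.buckets:
        # Bucket key is the bucket start, also in epoch milliseconds.
        st = datetime.datetime.fromtimestamp(tag.key // 1000).strftime('%Y-%m-%d %H:%M:%S')
        print(st, tag.clicks_per_day.value)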
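For readers who prefer to see the raw request, the aggregation above corresponds roughly to the JSON body built below. This is a minimal sketch: `build_histogram_body` is a hypothetical helper, not part of elasticsearch_dsl, and the exact body the DSL sends may contain additional keys.

```python
import json


def build_histogram_body(field, interval, metric_field,
                         bucket_name="per_tag", metric_name="clicks_per_day"):
    """Build a date_histogram request body with a nested sum metric."""
    return {
        "aggs": {
            bucket_name: {
                "date_histogram": {"field": field, "interval": interval},
                "aggs": {metric_name: {"sum": {"field": metric_field}}},
            }
        }
    }


body = build_histogram_body("timestamp", "week", "count")
print(json.dumps(body, indent=2))
```

With the low-level client, a body like this could presumably be sent with `es.search(index="tickets", body=body)`. Note that from ES 7.2 on, `interval` is deprecated in favour of `calendar_interval` and `fixed_interval`, so the key would need adjusting on newer clusters.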

(3) Printed output, which shows that weekly statistics are obtained quickly

Query results
1.0 720 2018-11-06 16:44:03
1.0 438 2018-10-23 16:44:03
1.0 403 2018-10-18 16:44:03
1.0 113 2018-10-15 16:44:03
1.0 503 2018-10-13 16:44:03
1.0 928 2018-10-12 16:44:03
1.0 89 2018-10-11 16:44:03
1.0 590 2018-10-08 16:44:03
1.0 854 2018-09-27 16:44:03
1.0 846 2018-09-26 16:44:03
Aggregation results
2018-07-23 08:00:00 618.0
2018-07-30 08:00:00 3657.0
2018-08-06 08:00:00 4519.0
2018-08-13 08:00:00 3609.0
2018-08-20 08:00:00 3204.0
2018-08-27 08:00:00 3378.0
2018-09-03 08:00:00 3365.0
2018-09-10 08:00:00 4609.0
2018-09-17 08:00:00 3594.0
2018-09-24 08:00:00 3918.0
2018-10-01 08:00:00 3098.0
2018-10-08 08:00:00 4251.0
2018-10-15 08:00:00 3235.0
2018-10-22 08:00:00 2689.0
2018-10-29 08:00:00 4493.0
2018-11-05 08:00:00 1254.0
work done!
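Every bucket in the output starts at 08:00:00 rather than midnight. A plausible reading, assuming the script ran on a machine set to UTC+8 (the post does not say), is that ES aligned each weekly bucket to Monday 00:00:00 UTC and `fromtimestamp` then rendered the key in local time. The sketch below reproduces that conversion for the first bucket; `LOCAL_TZ` and `render_bucket_key` are names introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the machine that printed the output used UTC+8 local time.
LOCAL_TZ = timezone(timedelta(hours=8))
FMT = '%Y-%m-%d %H:%M:%S'


def render_bucket_key(key_ms):
    """Return a bucket key (epoch ms) as (UTC string, local string)."""
    utc_dt = datetime.fromtimestamp(key_ms / 1000, tz=timezone.utc)
    return utc_dt.strftime(FMT), utc_dt.astimezone(LOCAL_TZ).strftime(FMT)


# Epoch milliseconds for Monday 2018-07-23 00:00:00 UTC.
key_ms = int(datetime(2018, 7, 23, tzinfo=timezone.utc).timestamp() * 1000)
print(render_bucket_key(key_ms))
```

If the buckets should start at local midnight instead, the date_histogram aggregation accepts a `time_zone` parameter for that purpose.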

(4) Monthly statistics: only the corresponding setting (the interval of the aggregation) needs to be changed
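In the aggregation call this presumably means passing `interval='month'` in place of `interval='week'`; `month` is one of the values the ES 6.x interval parameter accepts. To sanity-check the monthly sums independently of ES, a small local sketch such as the following can group the same kind of documents by calendar month. The helper names and sample values are made up for illustration, and months are taken in UTC.

```python
from datetime import datetime, timezone


def sum_by_month(docs):
    """Sum `count` per UTC calendar month for docs with an epoch-ms `timestamp`."""
    totals = {}
    for doc in docs:
        dt = datetime.fromtimestamp(doc["timestamp"] / 1000, tz=timezone.utc)
        key = dt.strftime("%Y-%m")
        totals[key] = totals.get(key, 0) + doc["count"]
    return dict(sorted(totals.items()))


def to_ms(year, month, day):
    """Epoch milliseconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


# Illustrative sample documents, not data from the original test.
docs = [
    {"timestamp": to_ms(2018, 10, 30), "count": 5},
    {"timestamp": to_ms(2018, 10, 31), "count": 7},
    {"timestamp": to_ms(2018, 11, 1), "count": 11},
]
print(sum_by_month(docs))
```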