在Hive 1.2.1中有以下数据(我的实际数据集要大得多,但数据结构类似):
id radar_id car_id datetime
1 A21 123 2017-03-08 17:31:19.0
2 A21 555 2017-03-08 17:32:00.0
3 A21 777 2017-03-08 17:33:00.0
4 B15 123 2017-03-08 17:35:22.0
5 B15 555 2017-03-08 17:34:05.0
6 B15 777 2017-03-08 20:50:12.0
7 C09 777 2017-03-08 20:55:00.0
8 A21 123 2017-03-09 11:00:00.0
9 C11 664 2017-03-09 11:10:00.0
10 A21 123 2017-03-09 11:12:00.0
11 A21 555 2017-03-09 11:12:10.0
12 B15 123 2017-03-09 11:14:00.0
13 B15 555 2017-03-09 11:20:00.0
14 A21 444 2017-03-09 10:00:00.0
15 C09 444 2017-03-09 10:20:00.0
16 B15 444 2017-03-09 10:05:00.0
我想去最常用的两条路。路由是由radar_id
命令的datetime
序列。我想得出以下结果:
route frequency
A21->B15 2
A21->B15-C09 1
频率是车辆(不是唯一的,不需要考虑car_id
)每天经过一条路线的平均次数。对于路由A21->B15
,频率为2,因为2017-03-08
上有3次,2017-03-09
上有1次。重要的是车辆123
在date 2017-03-09
上做了一条路线A21->A21->B15
。它与A21->B15
不一样。所以,我想考虑从最初的雷达到最后的雷达在白天捕获那辆车的路线。
当骑行从23:55开始,到00:22结束的情况下,应该被认为是两条不同的路线。
我如何使用Hive 1.2.1来实现它?
更新:
正如答案中所建议的那样,我测试了这个查询,但是route
不包含->
。路由的值看起来像000021
或0450001
等等。
df = sqlContext.sql("select regexp_replace(route,'(?<=^|->)\\d{5}','') as route " +
",count(*) / min(days) as frequency " +
"from (select concat_ws('->',sort_array(collect_list(radarids))) as route " +
",count(distinct dt) over() as days " +
"from (select car_id " +
",to_date(datetime) as dt " +
",concat(printf('%05d',row_number() over " +
"(partition by car_id,to_date(datetime) " +
"order by to_unix_timestamp(datetime))),cast(radarid as string)) as radarids " +
"from mytable " +
") t " +
"group by car_id " +
",dt " +
") t " +
"group by route " +
"order by frequency desc " +
"limit 5")
发布于 2017-08-06 05:04:43
select regexp_replace(route,'(?<=^|->)\\d{5}','') as route
,count(*) / min(days) as frequency
from (select concat_ws('->',sort_array(collect_list(radar_ids))) as route
,count(distinct dt) over() as days
from (select car_id
,to_date(datetime) as dt
,concat(printf('%05d',row_number() over (partition by car_id,to_date(datetime) order by datetime)),radar_id) as radar_ids
from mytable
) t
group by car_id
,dt
) t
group by route
order by frequency desc
limit 2
;
+---------------+-----------+
| route | frequency |
+---------------+-----------+
| A21->B15 | 1.5 |
+---------------+-----------+
| A21->B15->C09 | 1.0 |
+---------------+-----------+
发布于 2017-08-05 19:57:33
从文档看来,HIVE不支持递归CTE,但幸运的是它支持子查询、group by
clasue、row_number
分析函数、trunc(string date, string format)
函数、concat
函数和LIMIT x
子句。
我没有访问Hive的权限,但是我可以展示如何在PostgreSQL上构建这样的查询,它们之间的差别很小,所以我相信您可以重写它。我认为唯一要替换的是来自postgres的date_trunc('day', datetim )
函数,它由Hive的trunc(datetim , 'DD')
代替。
SELECT route, avg( cnt ) as average
FROM (
SELECT concat(route1, '>', route2, '>', route3, '>', route4) as Route,
count(*) as cnt
FROM (
SELECT date_trunc('day', datetim ) As datetim, car_id,
max( case when rn = 1 then radar_id end ) as route1,
max( case when rn = 2 then radar_id end ) as route2,
max( case when rn = 3 then radar_id end ) as route3,
max( case when rn = 4 then radar_id end ) as route4
/* max( case when rn = 5 then radar_id end ) as route5
......
max( case when rn = N then radar_id end ) as routeN */
FROM (
select t.*,
row_number() over (
partition by date_trunc('day', datetim ),car_id
order by datetim
) as rn
from table111 t
) x
GROUP BY date_trunc('day', datetim ), car_id
) x
group by concat(route1, '>', route2, '>', route3, '>', route4)
) x
GROUP BY route
order by avg( cnt ) desc
LIMIT 2
;
演示:http://sqlfiddle.com/#!15/53c7e/27
| route | average |
|--------------|---------|
| A21>B15>> | 3 |
| A21>B15>C09> | 2 |
https://stackoverflow.com/questions/45522929
复制相似问题