首页
学习
活动
专区
圈层
工具
发布
首页
学习
活动
专区
圈层
工具
MCP广场
社区首页 >问答首页 >如何从历史数据中获取最常用的路径?

如何从历史数据中获取最常用的路径?
EN

Stack Overflow用户
提问于 2017-08-05 14:19:05
回答 2查看 207关注 0票数 1

在Hive 1.2.1中有以下数据(我的实际数据集要大得多,但数据结构类似):

代码语言:javascript
运行
复制
id    radar_id     car_id     datetime
1     A21          123        2017-03-08 17:31:19.0
2     A21          555        2017-03-08 17:32:00.0
3     A21          777        2017-03-08 17:33:00.0
4     B15          123        2017-03-08 17:35:22.0
5     B15          555        2017-03-08 17:34:05.0
6     B15          777        2017-03-08 20:50:12.0
7     C09          777        2017-03-08 20:55:00.0
8     A21          123        2017-03-09 11:00:00.0
9     C11          664        2017-03-09 11:10:00.0
10    A21          123        2017-03-09 11:12:00.0
11    A21          555        2017-03-09 11:12:10.0
12    B15          123        2017-03-09 11:14:00.0
13    B15          555        2017-03-09 11:20:00.0
14    A21          444        2017-03-09 10:00:00.0
15    C09          444        2017-03-09 10:20:00.0
16    B15          444        2017-03-09 10:05:00.0

我想去最常用的两条路。路由是由radar_id命令的datetime序列。我想得出以下结果:

代码语言:javascript
运行
复制
route          frequency
A21->B15       2
A21->B15-C09   1

频率是车辆(不是唯一的,不需要考虑car_id)每天经过一条路线的平均次数。对于路由A21->B15,频率为2,因为2017-03-08上有3次,2017-03-09上有1次。重要的是车辆123在date 2017-03-09上做了一条路线A21->A21->B15。它与A21->B15不一样。所以,我想考虑从最初的雷达到最后的雷达在白天捕获那辆车的路线。

当骑行从23:55开始,到00:22结束的情况下,应该被认为是两条不同的路线。

我如何使用Hive 1.2.1来实现它?

更新:

正如答案中所建议的那样,我测试了这个查询,但是route不包含->。路由的值看起来像0000210450001等等。

代码语言:javascript
运行
复制
df = sqlContext.sql("select      regexp_replace(route,'(?<=^|->)\\d{5}','')  as route " +
                                      ",count(*) / min(days)                        as frequency " +

                           "from       (select      concat_ws('->',sort_array(collect_list(radarids))) as route " +
                                                  ",count(distinct dt) over()                           as days " +
                                       "from       (select  car_id " +
                                                  ",to_date(datetime)   as dt " +
                                                  ",concat(printf('%05d',row_number() over " +
                                                  "(partition by car_id,to_date(datetime) " +
                                                  "order by to_unix_timestamp(datetime))),cast(radarid as string)) as radarids " +
                                                  "from    mytable " +
                                                  ") t " +
                                       "group by    car_id " +
                                      ",dt " +
                                      ") t " +
                           "group by    route " +      
                           "order by    frequency desc " +
                           "limit       5")
EN

回答 2

Stack Overflow用户

回答已采纳

发布于 2017-08-06 05:04:43

代码语言:javascript
运行
复制
select      regexp_replace(route,'(?<=^|->)\\d{5}','')  as route
           ,count(*) / min(days)                        as frequency

from       (select      concat_ws('->',sort_array(collect_list(radar_ids))) as route
                       ,count(distinct dt) over()                           as days
            from       (select  car_id
                               ,to_date(datetime)   as dt
                               ,concat(printf('%05d',row_number() over (partition by car_id,to_date(datetime) order by datetime)),radar_id) as radar_ids
                        from    mytable
                        ) t
            group by    car_id
                       ,dt
            ) t
group by    route          
order by    frequency desc
limit       2 
;
代码语言:javascript
运行
复制
+---------------+-----------+
| route         | frequency |
+---------------+-----------+
| A21->B15      | 1.5       |
+---------------+-----------+
| A21->B15->C09 | 1.0       |
+---------------+-----------+
票数 4
EN

Stack Overflow用户

发布于 2017-08-05 19:57:33

文档看来,HIVE不支持递归CTE,但幸运的是它支持子查询、group by clasue、row_number分析函数、trunc(string date, string format)函数、concat函数和LIMIT x子句。

我没有访问Hive的权限,但是我可以展示如何在PostgreSQL上构建这样的查询,它们之间的差别很小,所以我相信您可以重写它。我认为唯一要替换的是来自postgres的date_trunc('day', datetim )函数,它由Hive的trunc(datetim , 'DD')代替。

代码语言:javascript
运行
复制
SELECT route, avg( cnt ) as average
FROM (
        SELECT concat(route1, '>', route2, '>', route3, '>', route4) as Route,
               count(*) as cnt
        FROM (
                SELECT date_trunc('day', datetim ) As datetim, car_id,
                    max( case when rn = 1 then radar_id end ) as route1,
                    max( case when rn = 2 then radar_id end ) as route2,
                    max( case when rn = 3 then radar_id end ) as route3,
                    max( case when rn = 4 then radar_id end ) as route4
                /*  max( case when rn = 5 then radar_id end ) as route5
                    ......
                    max( case when rn = N then radar_id end ) as routeN */
                FROM (
                    select t.*,
                           row_number() over (
                               partition by date_trunc('day', datetim ),car_id 
                               order by datetim 
                           ) as rn
                    from table111 t
                ) x
                GROUP BY date_trunc('day', datetim ), car_id
        ) x
        group by concat(route1, '>', route2, '>', route3, '>', route4)
) x
GROUP BY route
order by avg( cnt ) desc
LIMIT 2
;

演示:http://sqlfiddle.com/#!15/53c7e/27

代码语言:javascript
运行
复制
|        route | average |
|--------------|---------|
|    A21>B15>> |       3 |
| A21>B15>C09> |       2 |
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/45522929

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档