文章/答案/技术大牛

发布

社区首页 >问答首页 >如何从历史数据中获取最常用的路径？

问如何从历史数据中获取最常用的路径？
EN

Stack Overflow用户

提问于 2017-08-05 14:19:05

回答 2查看 207关注 0票数 1

在Hive 1.2.1中有以下数据(我的实际数据集要大得多，但数据结构类似)：

id    radar_id     car_id     datetime
1     A21          123        2017-03-08 17:31:19.0
2     A21          555        2017-03-08 17:32:00.0
3     A21          777        2017-03-08 17:33:00.0
4     B15          123        2017-03-08 17:35:22.0
5     B15          555        2017-03-08 17:34:05.0
6     B15          777        2017-03-08 20:50:12.0
7     C09          777        2017-03-08 20:55:00.0
8     A21          123        2017-03-09 11:00:00.0
9     C11          664        2017-03-09 11:10:00.0
10    A21          123        2017-03-09 11:12:00.0
11    A21          555        2017-03-09 11:12:10.0
12    B15          123        2017-03-09 11:14:00.0
13    B15          555        2017-03-09 11:20:00.0
14    A21          444        2017-03-09 10:00:00.0
15    C09          444        2017-03-09 10:20:00.0
16    B15          444        2017-03-09 10:05:00.0

我想去最常用的两条路。路由是由radar_id命令的datetime序列。我想得出以下结果：

route          frequency
A21->B15       2
A21->B15-C09   1

频率是车辆(不是唯一的，不需要考虑car_id)每天经过一条路线的平均次数。对于路由A21->B15，频率为2，因为2017-03-08上有3次，2017-03-09上有1次。重要的是车辆123在date 2017-03-09上做了一条路线A21->A21->B15。它与A21->B15不一样。所以，我想考虑从最初的雷达到最后的雷达在白天捕获那辆车的路线。

当骑行从23:55开始，到00:22结束的情况下，应该被认为是两条不同的路线。

我如何使用Hive 1.2.1来实现它？

更新：

正如答案中所建议的那样，我测试了这个查询，但是route不包含->。路由的值看起来像000021或0450001等等。

df = sqlContext.sql("select      regexp_replace(route,'(?<=^|->)\\d{5}','')  as route " +
                                      ",count(*) / min(days)                        as frequency " +

                           "from       (select      concat_ws('->',sort_array(collect_list(radarids))) as route " +
                                                  ",count(distinct dt) over()                           as days " +
                                       "from       (select  car_id " +
                                                  ",to_date(datetime)   as dt " +
                                                  ",concat(printf('%05d',row_number() over " +
                                                  "(partition by car_id,to_date(datetime) " +
                                                  "order by to_unix_timestamp(datetime))),cast(radarid as string)) as radarids " +
                                                  "from    mytable " +
                                                  ") t " +
                                       "group by    car_id " +
                                      ",dt " +
                                      ") t " +
                           "group by    route " +      
                           "order by    frequency desc " +
                           "limit       5")

sql

hive

hiveql

回答 2

Stack Overflow用户

回答已采纳

发布于 2017-08-06 05:04:43

select      regexp_replace(route,'(?<=^|->)\\d{5}','')  as route
           ,count(*) / min(days)                        as frequency

from       (select      concat_ws('->',sort_array(collect_list(radar_ids))) as route
                       ,count(distinct dt) over()                           as days
            from       (select  car_id
                               ,to_date(datetime)   as dt
                               ,concat(printf('%05d',row_number() over (partition by car_id,to_date(datetime) order by datetime)),radar_id) as radar_ids
                        from    mytable
                        ) t
            group by    car_id
                       ,dt
            ) t
group by    route          
order by    frequency desc
limit       2 
;

+---------------+-----------+
| route         | frequency |
+---------------+-----------+
| A21->B15      | 1.5       |
+---------------+-----------+
| A21->B15->C09 | 1.0       |
+---------------+-----------+

票数 4

Stack Overflow用户

发布于 2017-08-05 19:57:33

从文档看来，HIVE不支持递归CTE，但幸运的是它支持子查询、group by clasue、row_number分析函数、trunc(string date, string format)函数、concat函数和LIMIT x子句。

我没有访问Hive的权限，但是我可以展示如何在PostgreSQL上构建这样的查询，它们之间的差别很小，所以我相信您可以重写它。我认为唯一要替换的是来自postgres的date_trunc('day', datetim )函数，它由Hive的trunc(datetim , 'DD')代替。

SELECT route, avg( cnt ) as average
FROM (
        SELECT concat(route1, '>', route2, '>', route3, '>', route4) as Route,
               count(*) as cnt
        FROM (
                SELECT date_trunc('day', datetim ) As datetim, car_id,
                    max( case when rn = 1 then radar_id end ) as route1,
                    max( case when rn = 2 then radar_id end ) as route2,
                    max( case when rn = 3 then radar_id end ) as route3,
                    max( case when rn = 4 then radar_id end ) as route4
                /*  max( case when rn = 5 then radar_id end ) as route5
                    ......
                    max( case when rn = N then radar_id end ) as routeN */
                FROM (
                    select t.*,
                           row_number() over (
                               partition by date_trunc('day', datetim ),car_id 
                               order by datetim 
                           ) as rn
                    from table111 t
                ) x
                GROUP BY date_trunc('day', datetim ), car_id
        ) x
        group by concat(route1, '>', route2, '>', route3, '>', route4)
) x
GROUP BY route
order by avg( cnt ) desc
LIMIT 2
;

演示：http://sqlfiddle.com/#!15/53c7e/27

|        route | average |
|--------------|---------|
|    A21>B15>> |       3 |
| A21>B15>C09> |       2 |

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/45522929

复制

相似问题

问如何从历史数据中获取最常用的路径？
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何从历史数据中获取最常用的路径？EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何从历史数据中获取最常用的路径？
EN