专栏首页Jed的技术阶梯Hive窗口函数04-LAG、LEAD、FIRST_VALUE、LAST_VALUE

Hive窗口函数04-LAG、LEAD、FIRST_VALUE、LAST_VALUE

Hive窗口函数LAG、LEAD、FIRST_VALUE、LAST_VALUE入门

1. 数据说明

现有 hive 表 cookie4, 内容如下:

hive> select * from cookie4;

cookie4.cookieid    cookie4.createtime          cookie4.url
cookie1             2015-04-10 10:00:02         url2
cookie1             2015-04-10 10:00:00         url1
cookie1             2015-04-10 10:03:04         1url3
cookie1             2015-04-10 10:50:05         url6
cookie1             2015-04-10 11:00:00         url7
cookie1             2015-04-10 10:10:00         url4
cookie1             2015-04-10 10:50:01         url5
cookie2             2015-04-10 10:00:02         url22
cookie2             2015-04-10 10:00:00         url11
cookie2             2015-04-10 10:03:04         1url33
cookie2             2015-04-10 10:50:05         url66
cookie2             2015-04-10 11:00:00         url77
cookie2             2015-04-10 10:10:00         url44
cookie2             2015-04-10 10:50:01         url55
  • 其中字段意义: cookieid(string), createtime(string), url(int)
  • 分别代表: cookieid, 创建时间, 访问的url

2. lag()操作

LAG(col,n,DEFAULT)用于统计窗口内往上第n行值 第一个参数为列名 第二个参数为往上第n行(可选,默认为1) 第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

--(1)
hive> SELECT cookieid, createtime, url,
    > LAG(createtime, 2) OVER (PARTITION BY cookieid ORDER BY createtime) AS last_2_time 
    > FROM cookie4;
    
结果:因为没有设置默认值,当没有上两行时显示为NULL
cookieid    createtime          url     last_2_time
cookie1     2015-04-10 10:00:00 url1    NULL 
cookie1     2015-04-10 10:00:02 url2    NULL
cookie1     2015-04-10 10:03:04 1url3   2015-04-10 10:00:00
cookie1     2015-04-10 10:10:00 url4    2015-04-10 10:00:02
cookie1     2015-04-10 10:50:01 url5    2015-04-10 10:03:04
cookie1     2015-04-10 10:50:05 url6    2015-04-10 10:10:00
cookie1     2015-04-10 11:00:00 url7    2015-04-10 10:50:01
cookie2     2015-04-10 10:00:00 url11   NULL
cookie2     2015-04-10 10:00:02 url22   NULL
cookie2     2015-04-10 10:03:04 1url33  2015-04-10 10:00:00
cookie2     2015-04-10 10:10:00 url44   2015-04-10 10:00:02
cookie2     2015-04-10 10:50:01 url55   2015-04-10 10:03:04
cookie2     2015-04-10 10:50:05 url66   2015-04-10 10:10:00
cookie2     2015-04-10 11:00:00 url77   2015-04-10 10:50:01

--(2)
hive> SELECT cookieid, createtime, url,
    > LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time 
    > FROM cookie4;
    
结果:
cookieid    createtime          url     last_1_time
cookie1     2015-04-10 10:00:00 url1    1970-01-01 00:00:00 (显示默认值)
cookie1     2015-04-10 10:00:02 url2    2015-04-10 10:00:00
cookie1     2015-04-10 10:03:04 1url3   2015-04-10 10:00:02
cookie1     2015-04-10 10:10:00 url4    2015-04-10 10:03:04
cookie1     2015-04-10 10:50:01 url5    2015-04-10 10:10:00
cookie1     2015-04-10 10:50:05 url6    2015-04-10 10:50:01
cookie1     2015-04-10 11:00:00 url7    2015-04-10 10:50:05
cookie2     2015-04-10 10:00:00 url11   1970-01-01 00:00:00 (显示默认值)
cookie2     2015-04-10 10:00:02 url22   2015-04-10 10:00:00
cookie2     2015-04-10 10:03:04 1url33  2015-04-10 10:00:02
cookie2     2015-04-10 10:10:00 url44   2015-04-10 10:03:04
cookie2     2015-04-10 10:50:01 url55   2015-04-10 10:10:00
cookie2     2015-04-10 10:50:05 url66   2015-04-10 10:50:01
cookie2     2015-04-10 11:00:00 url77   2015-04-10 10:50:05

3. lead()操作

lead的作用与lag相反 LEAD(col,n,DEFAULT)用于统计窗口内往下第n行值 第一个参数为列名 第二个参数为往下第n行(可选,默认为1) 第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

hive> SELECT cookieid, createtime, url,
    > LEAD(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time,
    > LEAD(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time
    > FROM cookie4;
    
结果:
cookieid    createtime          url     next_2_time         next_1_time
cookie1     2015-04-10 10:00:00 url1    2015-04-10 10:03:04 2015-04-10 10:00:02
cookie1     2015-04-10 10:00:02 url2    2015-04-10 10:10:00 2015-04-10 10:03:04
cookie1     2015-04-10 10:03:04 1url3   2015-04-10 10:50:01 2015-04-10 10:10:00
cookie1     2015-04-10 10:10:00 url4    2015-04-10 10:50:05 2015-04-10 10:50:01
cookie1     2015-04-10 10:50:01 url5    2015-04-10 11:00:00 2015-04-10 10:50:05
cookie1     2015-04-10 10:50:05 url6    NULL                2015-04-10 11:00:00
cookie1     2015-04-10 11:00:00 url7    NULL                1970-01-01 00:00:00
cookie2     2015-04-10 10:00:00 url11   2015-04-10 10:03:04 2015-04-10 10:00:02
cookie2     2015-04-10 10:00:02 url22   2015-04-10 10:10:00 2015-04-10 10:03:04
cookie2     2015-04-10 10:03:04 1url33  2015-04-10 10:50:01 2015-04-10 10:10:00
cookie2     2015-04-10 10:10:00 url44   2015-04-10 10:50:05 2015-04-10 10:50:01
cookie2     2015-04-10 10:50:01 url55   2015-04-10 11:00:00 2015-04-10 10:50:05
cookie2     2015-04-10 10:50:05 url66   NULL                2015-04-10 11:00:00
cookie2     2015-04-10 11:00:00 url77   NULL                1970-01-01 00:00:00

4.FIRST_VALUE()操作

取分组内排序后,截止到当前行,第一个值

hive> SELECT cookieid, createtime, url,
    > FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first 
    > FROM cookie4;
    
结果:
cookieid    createtime          url     first
cookie1     2015-04-10 10:00:00 url1    url1
cookie1     2015-04-10 10:00:02 url2    url1
cookie1     2015-04-10 10:03:04 1url3   url1
cookie1     2015-04-10 10:10:00 url4    url1
cookie1     2015-04-10 10:50:01 url5    url1
cookie1     2015-04-10 10:50:05 url6    url1
cookie1     2015-04-10 11:00:00 url7    url1
cookie2     2015-04-10 10:00:00 url11   url11
cookie2     2015-04-10 10:00:02 url22   url11
cookie2     2015-04-10 10:03:04 1url33  url11
cookie2     2015-04-10 10:10:00 url44   url11
cookie2     2015-04-10 10:50:01 url55   url11
cookie2     2015-04-10 10:50:05 url66   url11
cookie2     2015-04-10 11:00:00 url77   url11

5.LAST_VALUE()操作

取分组内排序后,截止到当前行,最后一个值

hive> SELECT cookieid, createtime, url,
    > LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last
    > FROM cookie4;
    
结果:注意,是截止到当前行的最后一个值,其实就是它本身
cookieid    createtime          url     last
cookie1     2015-04-10 10:00:00 url1    url1
cookie1     2015-04-10 10:00:02 url2    url2
cookie1     2015-04-10 10:03:04 1url3   1url3
cookie1     2015-04-10 10:10:00 url4    url4
cookie1     2015-04-10 10:50:01 url5    url5
cookie1     2015-04-10 10:50:05 url6    url6
cookie1     2015-04-10 11:00:00 url7    url7
cookie2     2015-04-10 10:00:00 url11   url11
cookie2     2015-04-10 10:00:02 url22   url22
cookie2     2015-04-10 10:03:04 1url33  1url33
cookie2     2015-04-10 10:10:00 url44   url44
cookie2     2015-04-10 10:50:01 url55   url55
cookie2     2015-04-10 10:50:05 url66   url66
cookie2     2015-04-10 11:00:00 url77   url77

参考文章:Hive分析窗口函数(四) LAG,LEAD,FIRST_VALUE,LAST_VALUE

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

我来说两句

0 条评论
登录 后参与评论

相关文章

  • Redis 3.x 单节点和伪分布式安装

    安装的时候指定端口号和命令执行路径,其余保持默认即可,这里演示另一个服务的安装并省略部分重复内容

    CoderJed
  • 通过已有的虚拟机克隆四台虚拟机

    环境准备:一个已经安装好的虚拟机 要求: 删除了 /etc/udev/rules.d/70-persistent-net.rules 这个文件后,关机,保存...

    CoderJed
  • Hive窗口函数05-GROUPING SETS、GROUPING__ID、CUBE、ROLLUP

    Hive窗口函数GROUPING SETS、GROUPING__ID、CUBE、ROLLUP入门

    CoderJed
  • 注意:C++中double的表示是有误差的

    注意:C++中double的表示是有误差的,直接通过下面的例子看一下 1 #include<iostream> 2 using namespace std;...

    用户1215536
  • 自增自减表达式-c语言学习笔记

    Youngxj
  • 不同时间复杂度的规模上限

    饶文津
  • Python基础——切片实例

    py3study
  • Python之指数与E记法

    py3study
  • Leetcode-Easy 806. Number of Lines To Write String

    给一个字符串S,从左到右将它们排列行,每行最大长度为100,,同时给定一个数组withds,widths[0]对应着 a的宽度, widths[1]对应着b的宽...

    致Great
  • LeetCode 806. 写字符串需要的行数

    我们要把给定的字符串 S 从左到右写到每一行上,每一行的最大宽度为100个单位,如果我们在写某个字母的时候会使这行超过了100 个单位,那么我们应该把这个字母写...

    Michael阿明

扫码关注云+社区

领取腾讯云代金券