
Hive SQL (10) — Writing a query that returns the top three scores in each class, with ties sharing a rank, plus the score gap between consecutive ranks

Author: 大数据最后一公里 | Published: 2021-08-05

Requirement

Write a SQL query that returns the top three scores in each class, with equal scores sharing the same rank,
and at the same time compute the score gap between consecutive ranks among those top three.

Table creation statement

create table student(
    sid string,   -- student id
    cid string,   -- class id
    score string  -- score (stored as string; Hive casts it to double for arithmetic, hence the .0 in the results)
)
row format delimited fields terminated by '\t'
;

Data

# Notes:
Data set 1 is a coincidental case that suits the no-duplicates scenario; it can be handled with Variant 1 from the Extension section.
Data set 2 is the general case and suits data containing duplicate scores.
 
# Data set 1
insert overwrite table student values
("1","1901","90"),
("2","1901","90"),
("3","1901","83"),
("4","1901","60"),
("5","1902","66"),
("6","1902","23"),
("7","1902","99"),
("8","1902","67"),
("9","1902","87");
 
# Data set 2
insert overwrite table student values
("1","1901","90"),
("2","1901","80"),
("3","1901","80"),
("4","1901","70"),
("5","1901","75"),
("6","1902","60"),
("7","1902","66"),
("8","1902","23"),
("9","1902","99"),
("10","1902","67");

Implementation

select
    t1.cid,
    t1.score,
    t1.rank,
    t2.diff_score
from
    (select
        cid,
        score,
        dense_rank() over(partition by cid order by score desc) rank
    from
        student
    )t1
    left join
    (select
        cid,
        score,
        nvl(score-lag(score) over(partition by cid order by score desc),0) diff_score
    from
        student
    group by cid,score
    )t2
    on t1.cid = t2.cid and t1.score = t2.score -- join on the class as well, so equal scores from different classes never mix
where t1.rank <= 3
;

Result (run against data set 2)

t1.cid  t1.score  t1.rank  t2.diff_score
1901  90  1  0.0
1901  80  2  -10.0
1901  80  2  -10.0
1901  75  3  -5.0
1902  99  1  0.0
1902  67  2  -32.0
1902  66  3  -1.0
Time taken: 88.539 seconds, Fetched: 7 row(s)

Analysis

1. The requirement calls for windowing by class, so use an over clause with the class id as the partition key.
2. Order by score, descending as you would expect, with equal scores sharing the same rank, so use dense_rank.
3. To compute the gap between ranks, use the lag function to fetch the score a given number of rows before the current one and take the difference.
4. For the first row in each partition lag has nothing to fetch and returns NULL, which has to be converted to 0; nvl does that in one call and avoids an extra if expression (see the sketch below).
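
Point 4 can be seen by writing the nvl form and the equivalent if form side by side (a minimal sketch against the same student table; prev_score is just an illustrative alias for the lag result, and both diff columns return the same values):

select
    cid,
    score,
    nvl(score - prev_score, 0)                    as diff_nvl,
    if(prev_score is null, 0, score - prev_score) as diff_if  -- same result, more verbose
from
    (select
        cid,
        score,
        lag(score) over(partition by cid order by score desc) as prev_score
    from
        student
    ) t
;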
 
# Supplement
By default lag takes the previous row according to the ordering. If the ordered scores are [90, 80, 80, ...], the second 80 sees the preceding 80, so the gaps come out as [0, -10, 0], which is clearly not what we want.

Solution (compute the rank and the gap separately, then join): first do the ranking (table a), then build the score-gap table b by deduplicating the scores per class, ordering them, and computing the gap over the distinct values, which yields rank, score, and gap; finally join a with b (on class and score in the query above) so that every ranked row gets its gap.
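
Table b can be inspected on its own: because the group by collapses duplicate scores before the window functions run, each distinct score appears once per class and lag only ever steps from one distinct score to the next (a minimal sketch, essentially the t2 subquery from the Implementation with the rank added for readability):

select
    cid,
    score,
    dense_rank() over(partition by cid order by score desc) rank,
    nvl(score - lag(score) over(partition by cid order by score desc), 0) diff_score
from
    student
group by cid, score  -- deduplicate (cid, score) first; the window functions then run over the distinct rows
;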

Extension

# Variant 1 (results below were produced with data set 1)
select
    t1.cid,
    t1.score,
    t1.rank,
    t1.diff_score
from
    (select
        cid,
        score,
        dense_rank() over(partition by cid order by score desc) rank,
        nvl(score-lag(score) over(partition by cid order by score desc),0) diff_score
    from
        student
    )t1
where rank <= 3
;
OK
1901  90  1  0.0
1901  90  1  0.0
1901  83  2  -7.0
1901  60  3  -23.0
1902  99  1  0.0
1902  87  2  -12.0
1902  67  3  -20.0
Time taken: 29.836 seconds, Fetched: 7 row(s)
 
# Variant 2 (results below were produced with data set 1)
select
    t1.cid,
    t1.score,
    t1.rank,
    nvl(t1.score-lag(t1.score) over(partition by t1.cid order by t1.score desc),0) diff_score
from
    (select
        cid,
        score,
        dense_rank() over(partition by cid order by score desc) rank
    from
        student
    )t1
where t1.rank <= 3
;
Total MapReduce CPU Time Spent: 6 seconds 740 msec
OK
1901  90  1  0.0
1901  90  1  0.0
1901  83  2  -7.0
1901  60  3  -23.0
1902  99  1  0.0
1902  87  2  -12.0
1902  67  3  -20.0
Time taken: 49.058 seconds, Fetched: 7 row(s)
 
# Variant 2 log
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-04-15 15:05:31,734 Stage-1 map = 0%,  reduce = 0%
2021-04-15 15:05:39,196 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.4 sec
2021-04-15 15:05:46,505 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.88 sec
MapReduce Total cumulative CPU time: 3 seconds 880 msec
Ended Job = job_1618468831901_0003
Launching Job 2 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1618468831901_0004, Tracking URL = http://my-hdp2:8088/proxy/application_1618468831901_0004/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job  -kill job_1618468831901_0004
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2021-04-15 15:05:55,424 Stage-2 map = 0%,  reduce = 0%
2021-04-15 15:06:02,852 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.97 sec
2021-04-15 15:06:11,485 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 2.86 sec
MapReduce Total cumulative CPU time: 2 seconds 860 msec
Ended Job = job_1618468831901_0004
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.88 sec   HDFS Read: 8747 HDFS Write: 278 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 2.86 sec   HDFS Read: 7919 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 740 msec
 
# Variant 1 log
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1618468831901_0002, Tracking URL = http://my-hdp2:8088/proxy/application_1618468831901_0002/
Kill Command = /usr/hdp/2.6.5.0-292/hadoop/bin/hadoop job  -kill job_1618468831901_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-04-15 14:59:54,428 Stage-1 map = 0%,  reduce = 0%
2021-04-15 15:00:06,550 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.97 sec
2021-04-15 15:00:14,991 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1618468831901_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.91 sec   HDFS Read: 10536 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 910 msec
 
 
Comparing these with the SQL in the Implementation section, the console job logs show that Variant 1 performs best and takes the least time: it needs a single MapReduce stage, whereas Variant 2 launches two.
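
The stage breakdown can also be checked without running the jobs: prefixing a query with explain prints its plan, and the STAGE PLANS section lists the stages it needs (a quick sketch, output omitted):

explain
select
    t1.cid,
    t1.score,
    t1.rank,
    nvl(t1.score - lag(t1.score) over(partition by t1.cid order by t1.score desc), 0) diff_score
from
    (select
        cid,
        score,
        dense_rank() over(partition by cid order by score desc) rank
    from
        student
    ) t1
where t1.rank <= 3
;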
 
# Default lag behavior
hive> select
    >     cid,
    >     score,
    >     dense_rank() over(partition by cid order by score desc) rank,
    >     lag(score) over(partition by cid order by score desc)
    > from
    >     student
    > ;
OK
1901  90  1  NULL
1901  90  1  90
1901  83  2  90
1901  60  3  83
1902  99  1  NULL
1902  87  2  99
1902  67  3  87
1902  66  4  67
1902  23  5  66
Time taken: 22.886 seconds, Fetched: 9 row(s)

Knowledge points

1. nvl(value1, value2): returns value2 if value1 is NULL, otherwise returns value1.
2. dense_rank(): a ranking function used with the over window clause; equal values receive the same rank and the next rank is not skipped, giving sequences such as 1,2,3 / 1,1,2,3 / 1,2,2,3.
3. lag(column, n): used with the over window clause; returns the value n rows before the current row according to the ordering; if n is omitted it defaults to the previous row.
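
To see why dense_rank is the right ranking function here, it helps to put it next to rank and row_number, which are easy to confuse (a minimal sketch against the same student table): rank skips a position after a tie (1,1,3), dense_rank does not (1,1,2), and row_number breaks the tie arbitrarily (1,2,3).

select
    cid,
    score,
    rank()       over(partition by cid order by score desc) as rk,   -- 1,1,3: skips after ties
    dense_rank() over(partition by cid order by score desc) as drk,  -- 1,1,2: no gap after ties
    row_number() over(partition by cid order by score desc) as rn    -- 1,2,3: unique per row
from
    student
;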