Hadoop Fundamentals Tutorial - Chapter 11 Hive: SQL on Hadoop (11.8 HQL: Sorting) (Draft)

Chapter 11 Hive: SQL on Hadoop

11.8 HQL: Sorting

11.8.1 order by

Hive's order by has the same semantics as order by in standard SQL: it performs a global sort of the query result. However, a Hive statement is ultimately compiled into a MapReduce job that runs on the Hadoop cluster, and the output of every mapper must converge on a single reducer. If the result set is large, that one reducer becomes a bottleneck and the reduce phase can take a very long time.

hive> select * from emp
    > order by hiredate;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814100601_114fbe7b-a71a-4575-8a85-582f83ead490
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0001, Tracking URL = http://node1:8088/proxy/application_1502717288397_0001/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502717288397_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:06:36,164 Stage-1 map = 0%,  reduce = 0%
2017-08-14 10:06:57,150 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.01 sec
2017-08-14 10:07:13,160 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.46 sec
MapReduce Total cumulative CPU time: 7 seconds 460 msec
Ended Job = job_1502717288397_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.46 sec   HDFS Read: 10295 HDFS Write: 821 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 460 msec
OK
7369    SMITH   CLERK   7902    1980-12-17  800.0   0.0 20
7499    ALLEN   SALESMAN    7698    1981-02-20  1600.0  300.0   30
7521    WARD    SALESMAN    7698    1981-02-22  1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02  2975.0  0.0 20
7698    BLAKE   MANAGER 7839    1981-05-01  2850.0  0.0 30
7782    CLARK   MANAGER 7839    1981-06-09  2450.0  0.0 10
7844    TURNER  SALESMAN    7698    1981-09-08  1500.0  0.0 30
7654    MARTIN  SALESMAN    7698    1981-09-28  1250.0  1400.0  30
7839    KING    PRESIDENT   0   1981-11-17  5000.0  0.0 10
7900    JAMES   CLERK   7698    1981-12-03  950.0   0.0 30
7902    FORD    ANALYST 7566    1981-12-03  3000.0  0.0 20
7934    MILLER  CLERK   7782    1982-01-23  1300.0  0.0 10
Time taken: 74.031 seconds, Fetched: 12 row(s)
hive> 
hive> select * from emp
    > where did=30
    > order by sal desc,eid asc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814101221_ac53277e-41d7-42a7-aa11-3ba2fe041795
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0002, Tracking URL = http://node1:8088/proxy/application_1502717288397_0002/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502717288397_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:12:45,371 Stage-1 map = 0%,  reduce = 0%
2017-08-14 10:13:06,327 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.0 sec
2017-08-14 10:13:23,493 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.38 sec
MapReduce Total cumulative CPU time: 9 seconds 380 msec
Ended Job = job_1502717288397_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 9.38 sec   HDFS Read: 10967 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 380 msec
OK
7698    BLAKE   MANAGER 7839    1981-05-01  2850.0  0.0 30
7499    ALLEN   SALESMAN    7698    1981-02-20  1600.0  300.0   30
7844    TURNER  SALESMAN    7698    1981-09-08  1500.0  0.0 30
7521    WARD    SALESMAN    7698    1981-02-22  1250.0  500.0   30
7654    MARTIN  SALESMAN    7698    1981-09-28  1250.0  1400.0  30
7900    JAMES   CLERK   7698    1981-12-03  950.0   0.0 30
Time taken: 65.629 seconds, Fetched: 6 row(s)
hive> 
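The single-reducer funnel behind order by can be sketched as a small Python simulation. This is purely illustrative: the sample rows are loosely based on the emp table above, and the merge step is an assumption about how a sort-merge reducer behaves, not Hive's actual internals.

```python
import heapq

# Simulated sorted output runs from three mappers: (eid, hiredate).
mapper_outputs = [
    [(7369, "1980-12-17"), (7839, "1981-11-17")],
    [(7499, "1981-02-20"), (7934, "1982-01-23")],
    [(7566, "1981-04-02"), (7900, "1981-12-03")],
]

# ORDER BY hiredate: every mapper's run is shipped to ONE reducer,
# which merges them all into a single, globally sorted result.
globally_sorted = list(heapq.merge(*mapper_outputs, key=lambda r: r[1]))
print(globally_sorted)
```

Because one process must see every row, the cost grows with the full result size, which is exactly why the reduce phase drags when the result set is large.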

Note: avoid order by in Hive unless you are certain that the result set is small. Sorting requirements still arise, of course, and Hive offers the following alternatives.

11.8.2 sort by

sort by sorts within each reducer; it is a local sort. It guarantees that each reducer's output is ordered by the specified columns, but rows with the same key may be sent to different reducers. So although every reducer's output is sorted, the concatenated result is generally not globally sorted.
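The difference from order by can be sketched in Python. This is an illustrative simulation: the round-robin partitioning and the sample (eid, sal) rows are assumptions for the sketch, not Hive's actual hash partitioner.

```python
# SORT BY sal, simulated: rows are split across two reducers
# (round-robin here), and each reducer sorts only its own slice.
rows = [(7839, 5000.0), (7369, 800.0), (7902, 3000.0), (7900, 950.0)]
num_reducers = 2

partitions = [rows[i::num_reducers] for i in range(num_reducers)]
reducer_outputs = [sorted(p, key=lambda r: r[1]) for p in partitions]

for i, out in enumerate(reducer_outputs):
    print(f"reducer {i}: {out}")

# Each reducer's output is sorted by sal, but the concatenation of the
# outputs is NOT globally sorted.
concatenated = reducer_outputs[0] + reducer_outputs[1]
print(concatenated == sorted(concatenated, key=lambda r: r[1]))
```

This mirrors what the 2-reducer transcript below shows: within each reducer's block the salaries ascend, but the blocks do not combine into one ordered list.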

When a Hive query runs, the job output includes information like the following:

Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

  • hive.exec.reducers.bytes.per.reducer: introduced in Hive 0.2.0. Before Hive 0.14.0 the default was 1 GB; from Hive 0.14.0 onward the default is 256 MB.
  • mapred.reduce.tasks: if this property is set, Hive uses its value directly as the number of reducers. If it is not set (the default is -1), Hive estimates the number of reducers from the size of the input files. That estimate is not always accurate, because the reducers actually consume the mappers' output, which may be smaller than the job input; the most accurate count would therefore be derived from the map output size. (In Hadoop 2.x this property is named mapreduce.job.reduces; mapred.reduce.tasks is the deprecated alias.)
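The estimate these two properties drive can be approximated as below. This is a simplified sketch, not Hive's exact code; the 256 MB figure is the Hive 0.14.0+ default for hive.exec.reducers.bytes.per.reducer, and 1009 is the documented default of hive.exec.reducers.max.

```python
import math

def estimate_reducers(input_bytes,
                      bytes_per_reducer=256 * 1024 * 1024,  # Hive 0.14.0+ default
                      max_reducers=1009):                   # hive.exec.reducers.max default
    """Simplified sketch of Hive's reducer estimate: one reducer per
    bytes_per_reducer of input, at least 1, capped at max_reducers."""
    return max(1, min(max_reducers, math.ceil(input_bytes / bytes_per_reducer)))

print(estimate_reducers(1 * 1024**3))  # 1 GB of input
print(estimate_reducers(10 * 1024))    # tiny input: still one reducer
```

With a 1 GB input this yields 4 reducers; a tiny input collapses to a single reducer, which is why small demo tables often show "number of reducers: 1" even without any explicit setting.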
[root@node3 ~]# cd /opt/hive-2.1.1/conf/
[root@node3 conf]# vi hive-site.xml 
[root@node3 conf]# cat hive-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
          <name>hive.metastore.local</name>
          <!-- Local metastore: MySQL and Hive are installed on the same node -->
          <value>true</value>
    </property>
    <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://node3:3306/hive?createDatabaseIfNotExist=true</value>      
    </property>
    <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>          
          <name>javax.jdo.option.ConnectionUserName</name>        
          <!-- MySQL database username -->
          <value>root</value>
    </property>
    <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>123456</value>
    </property>
    <property>
          <name>mapred.reduce.tasks</name>
          <value>2</value>
    </property>

</configuration>
[root@node3 conf]# 
hive> select eid,ename,sal from emp sort by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820064647_81b26e52-d158-4861-89f0-677ec0db0b52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0009, Tracking URL = http://node1:8088/proxy/application_1503220733636_0009/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 06:47:02,958 Stage-1 map = 0%,  reduce = 0%
2017-08-20 06:47:12,622 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.47 sec
2017-08-20 06:47:24,950 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.35 sec
MapReduce Total cumulative CPU time: 6 seconds 350 msec
Ended Job = job_1503220733636_0009
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 6.35 sec   HDFS Read: 12635 HDFS Write: 559 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 350 msec
OK
8888    HADRON  NULL
7369    SMITH   800.0
7900    JAMES   950.0
7521    WARD    1250.0
7934    MILLER  1300.0
7844    TURNER  1500.0
7782    CLARK   2450.0
7698    BLAKE   2850.0
7566    JONES   2975.0
7654    MARTIN  1250.0
7499    ALLEN   1600.0
7902    FORD    3000.0
7839    KING    5000.0
Time taken: 39.092 seconds, Fetched: 13 row(s)
hive> 

11.8.3 distribute by

distribute by specifies how the map output is partitioned among the reducers. For example, distribute by cid guarantees that all rows with the same cid are sent to the same reducer; adding sort by cid then orders the rows by cid within each reducer. This still does not produce a global order: it only guarantees that rows with equal values of the distribution column end up together, and that each reducer's output is locally sorted. Note that distribute by must be written before sort by.
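The distribute by did sort by sal combination used in the query below can be simulated as follows. The modulo partitioner and the sample (eid, ename, sal, did) rows are illustrative stand-ins for Hive's hash partitioning, not its actual implementation.

```python
# DISTRIBUTE BY did SORT BY sal, simulated: partition rows by did,
# then sort each partition by sal.
rows = [
    (7369, "SMITH", 800.0, 20), (7900, "JAMES", 950.0, 30),
    (7934, "MILLER", 1300.0, 10), (7499, "ALLEN", 1600.0, 30),
    (7566, "JONES", 2975.0, 20), (7839, "KING", 5000.0, 10),
]
num_reducers = 2

partitions = [[] for _ in range(num_reducers)]
for row in rows:
    did = row[3]
    partitions[(did // 10) % num_reducers].append(row)  # toy partitioner

reducer_outputs = [sorted(p, key=lambda r: r[2]) for p in partitions]
for i, out in enumerate(reducer_outputs):
    print(f"reducer {i}: {[r[0] for r in out]}")

# Rows sharing a did always land in the same reducer, and each reducer's
# output is sorted by sal -- but there is still no global order.
```

Each department's rows travel together, so per-department aggregation or sorting works, while the overall output remains only locally sorted.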

hive> select eid,ename,sal,did from emp distribute by did sort by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820090113_637965de-369d-4a26-b233-0ee69513d6f3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0014, Tracking URL = http://node1:8088/proxy/application_1503220733636_0014/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 09:01:36,959 Stage-1 map = 0%,  reduce = 0%
2017-08-20 09:01:57,587 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.93 sec
2017-08-20 09:02:14,251 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.04 sec
MapReduce Total cumulative CPU time: 9 seconds 40 msec
Ended Job = job_1503220733636_0014
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 9.04 sec   HDFS Read: 13200 HDFS Write: 598 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 40 msec
OK
8888    HADRON  NULL    NULL
7369    SMITH   800.0   20
7900    JAMES   950.0   30
7654    MARTIN  1250.0  30
7521    WARD    1250.0  30
7934    MILLER  1300.0  10
7844    TURNER  1500.0  30
7499    ALLEN   1600.0  30
7782    CLARK   2450.0  10
7698    BLAKE   2850.0  30
7566    JONES   2975.0  20
7902    FORD    3000.0  20
7839    KING    5000.0  10
Time taken: 61.594 seconds, Fetched: 13 row(s)
hive> 

11.8.4 cluster by

If distribute by and sort by use the same column, the two clauses can be abbreviated as cluster by. (Note that cluster by sorts in ascending order only; it does not accept asc/desc.)
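The shorthand can be illustrated with the same kind of simulation, using one key function for both partitioning and sorting. The toy partitioner and sample (eid, sal) rows are assumptions for the sketch, not Hive internals.

```python
# CLUSTER BY sal, i.e. DISTRIBUTE BY sal SORT BY sal, simulated.
rows = [(7839, 5000.0), (7369, 800.0), (7566, 2975.0),
        (7934, 1300.0), (7900, 950.0)]
num_reducers = 2
key = lambda r: r[1]  # sal drives BOTH the distribution and the sort

partitions = [[] for _ in range(num_reducers)]
for r in rows:
    partitions[int(key(r)) % num_reducers].append(r)  # toy partitioner

reducer_outputs = [sorted(p, key=key) for p in partitions]
for i, out in enumerate(reducer_outputs):
    print(f"reducer {i}: {out}")

# Equal sal values always share a reducer, and each reducer's output
# is sorted ascending by sal -- exactly the cluster by guarantee.
```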

hive> select eid,ename,sal,did from emp cluster by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820090813_f764950e-bcdd-4100-9554-2fc8a9c2bde1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0015, Tracking URL = http://node1:8088/proxy/application_1503220733636_0015/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 09:08:28,249 Stage-1 map = 0%,  reduce = 0%
2017-08-20 09:08:37,776 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.92 sec
2017-08-20 09:08:49,509 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.1 sec
MapReduce Total cumulative CPU time: 7 seconds 100 msec
Ended Job = job_1503220733636_0015
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 7.1 sec   HDFS Read: 13186 HDFS Write: 598 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 100 msec
OK
8888    HADRON  NULL    NULL
7369    SMITH   800.0   20
7900    JAMES   950.0   30
7654    MARTIN  1250.0  30
7521    WARD    1250.0  30
7934    MILLER  1300.0  10
7844    TURNER  1500.0  30
7499    ALLEN   1600.0  30
7782    CLARK   2450.0  10
7698    BLAKE   2850.0  30
7566    JONES   2975.0  20
7902    FORD    3000.0  20
7839    KING    5000.0  10
Time taken: 37.126 seconds, Fetched: 13 row(s)
hive> 

This article was published as part of the Tencent Cloud self-media sharing program.
