Hadoop基础教程-第11章 Hive:SQL on Hadoop(11.8 HQL:排序)(草稿)

第11章 Hive:SQL on Hadoop

11.8 HQL:排序

11.8.1 order by

Hive 中的 order by与SQL 中的order by语义相同,会对查询结果进行全局排序,但是Hive 语句最终要转换为 MapReduce 程序放到 Hadoop 分布式集群上去执行,多个Mapper 后汇集到一个 Reducer 上执行,如果结果数据量大,那就会造成 Reduce 执行相当漫长。

hive> select * from emp
    > order by hiredate;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814100601_114fbe7b-a71a-4575-8a85-582f83ead490
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0001, Tracking URL = http://node1:8088/proxy/application_1502717288397_0001/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502717288397_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:06:36,164 Stage-1 map = 0%,  reduce = 0%
2017-08-14 10:06:57,150 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.01 sec
2017-08-14 10:07:13,160 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.46 sec
MapReduce Total cumulative CPU time: 7 seconds 460 msec
Ended Job = job_1502717288397_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.46 sec   HDFS Read: 10295 HDFS Write: 821 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 460 msec
OK
7369    SMITH   CLERK   7902    1980-12-17  800.0   0.0 20
7499    ALLEN   SALESMAN    7698    1981-02-20  1600.0  300.0   30
7521    WARD    SALESMAN    7698    1981-02-22  1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02  2975.0  0.0 20
7698    BLAKE   MANAGER 7839    1981-05-01  2850.0  0.0 30
7782    CLARK   MANAGER 7839    1981-06-09  2450.0  0.0 10
7844    TURNER  SALESMAN    7698    1981-09-08  1500.0  0.0 30
7654    MARTIN  SALESMAN    7698    1981-09-28  1250.0  1400.0  30
7839    KING    PRESIDENT   0   1981-11-17  5000.0  0.0 10
7900    JAMES   CLERK   7698    1981-12-03  950.0   0.0 30
7902    FORD    ANALYST 7566    1981-12-03  3000.0  0.0 20
7934    MILLER  CLERK   7782    1982-01-23  1300.0  0.0 10
Time taken: 74.031 seconds, Fetched: 12 row(s)
hive> 
hive> select * from emp
    > where did=30
    > order by sal desc,eid asc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814101221_ac53277e-41d7-42a7-aa11-3ba2fe041795
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0002, Tracking URL = http://node1:8088/proxy/application_1502717288397_0002/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502717288397_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:12:45,371 Stage-1 map = 0%,  reduce = 0%
2017-08-14 10:13:06,327 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.0 sec
2017-08-14 10:13:23,493 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.38 sec
MapReduce Total cumulative CPU time: 9 seconds 380 msec
Ended Job = job_1502717288397_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 9.38 sec   HDFS Read: 10967 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 380 msec
OK
7698    BLAKE   MANAGER 7839    1981-05-01  2850.0  0.0 30
7499    ALLEN   SALESMAN    7698    1981-02-20  1600.0  300.0   30
7844    TURNER  SALESMAN    7698    1981-09-08  1500.0  0.0 30
7521    WARD    SALESMAN    7698    1981-02-22  1250.0  500.0   30
7654    MARTIN  SALESMAN    7698    1981-09-28  1250.0  1400.0  30
7900    JAMES   CLERK   7698    1981-12-03  950.0   0.0 30
Time taken: 65.629 seconds, Fetched: 6 row(s)
hive> 

请注意 , Hive中尽量不要用Order By, 除非非常确定结果集很小 。但是排序的需求总是有的,Hive 中使用下面的几种排序来满足需求。

11.8.2 sort by

sort by是在每个reduce中进行排序,是一个局部排序,可以保证每个 Reducer 中是按照指定字段进行排好序的,但是全局上来说,相同字段可以被分配到不同的Reduce 上,虽然在各个 Reduce 上是排好序的,但是全局上不一定是排好序的。

在执行Hive查询时,可以看到如下信息:

Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

  • hive.exec.reducers.bytes.per.reducer 此参数从Hive 0.2.0开始引入。在Hive 0.14.0版本之前默认值是1G;而从Hive 0.14.0开始以后的版本,默认值变成了256M。
  • mapred.reduce.tasks 如果设置了mapred.reduce.tasks,那么Hive会直接使用它的值作为Reduce的个数; 如果mapred.reduce.tasks的值没有设置(默认值就是-1),那么Hive会根据输入文件的大小估算出Reduce的个数。根据输入文件估算Reduce的个数可能未必很准确,因为Reduce的输入是Map的输出,而Map的输出可能会比输入要小,所以最准确的数根据Map的输出估算Reduce的个数。
[root@node3 ~]# cd /opt/hive-2.1.1/conf/
[root@node3 conf]# vi hive-site.xml 
[root@node3 conf]# cat hive-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
          <name>hive.metastore.local</name>
          <!--本地存储,MySQL和hive安装在同一节点-->
          <value>true</value>
    </property>
    <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://node3:3306/hive?createDatabaseIfNotExist=true</value>      
    </property>
    <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>          
          <name>javax.jdo.option.ConnectionUserName</name>        
          <!--mysql数据库用户名-->            
          <value>root</value>
    </property>
    <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>123456</value>
    </property>
    <property>
          <name>mapred.reduce.tasks</name>
          <value>2</value>
    </property>

</configuration>
[root@node3 conf]# 
hive> select eid,ename,sal from emp sort by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820064647_81b26e52-d158-4861-89f0-677ec0db0b52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0009, Tracking URL = http://node1:8088/proxy/application_1503220733636_0009/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 06:47:02,958 Stage-1 map = 0%,  reduce = 0%
2017-08-20 06:47:12,622 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.47 sec
2017-08-20 06:47:24,950 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.35 sec
MapReduce Total cumulative CPU time: 6 seconds 350 msec
Ended Job = job_1503220733636_0009
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 6.35 sec   HDFS Read: 12635 HDFS Write: 559 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 350 msec
OK
8888    HADRON  NULL
7369    SMITH   800.0
7900    JAMES   950.0
7521    WARD    1250.0
7934    MILLER  1300.0
7844    TURNER  1500.0
7782    CLARK   2450.0
7698    BLAKE   2850.0
7566    JONES   2975.0
7654    MARTIN  1250.0
7499    ALLEN   1600.0
7902    FORD    3000.0
7839    KING    5000.0
Time taken: 39.092 seconds, Fetched: 13 row(s)
hive> 

11.8.3 distribute by

distribute by 指定 map 输出结果怎么样划分后分配到各个 Reduce 上去,比如 distribute by cid,就可以保证 cid 字段相同的结果被分配到同一个 reduce 上去执行。然后再指定 sort by cid,则在 Reduce 上进行按照 cid 进行排序。但是这种还是不能做到全局排序,只能保证排序字段值相同的放在一起,并且在 reduce 上局部是排好序的。 需要注意的是 distribute by 必须写在 sort by 前面。

hive> select eid,ename,sal,did from emp distribute by did sort by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820090113_637965de-369d-4a26-b233-0ee69513d6f3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0014, Tracking URL = http://node1:8088/proxy/application_1503220733636_0014/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 09:01:36,959 Stage-1 map = 0%,  reduce = 0%
2017-08-20 09:01:57,587 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.93 sec
2017-08-20 09:02:14,251 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.04 sec
MapReduce Total cumulative CPU time: 9 seconds 40 msec
Ended Job = job_1503220733636_0014
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 9.04 sec   HDFS Read: 13200 HDFS Write: 598 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 40 msec
OK
8888    HADRON  NULL    NULL
7369    SMITH   800.0   20
7900    JAMES   950.0   30
7654    MARTIN  1250.0  30
7521    WARD    1250.0  30
7934    MILLER  1300.0  10
7844    TURNER  1500.0  30
7499    ALLEN   1600.0  30
7782    CLARK   2450.0  10
7698    BLAKE   2850.0  30
7566    JONES   2975.0  20
7902    FORD    3000.0  20
7839    KING    5000.0  10
Time taken: 61.594 seconds, Fetched: 13 row(s)
hive> 

11.8.4 cluster by

如果 distribute by 和 sort by 的字段是同一个,可以简写为cluster by

hive> select eid,ename,sal,did from emp cluster by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820090813_f764950e-bcdd-4100-9554-2fc8a9c2bde1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0015, Tracking URL = http://node1:8088/proxy/application_1503220733636_0015/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 09:08:28,249 Stage-1 map = 0%,  reduce = 0%
2017-08-20 09:08:37,776 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.92 sec
2017-08-20 09:08:49,509 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.1 sec
MapReduce Total cumulative CPU time: 7 seconds 100 msec
Ended Job = job_1503220733636_0015
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 7.1 sec   HDFS Read: 13186 HDFS Write: 598 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 100 msec
OK
8888    HADRON  NULL    NULL
7369    SMITH   800.0   20
7900    JAMES   950.0   30
7654    MARTIN  1250.0  30
7521    WARD    1250.0  30
7934    MILLER  1300.0  10
7844    TURNER  1500.0  30
7499    ALLEN   1600.0  30
7782    CLARK   2450.0  10
7698    BLAKE   2850.0  30
7566    JONES   2975.0  20
7902    FORD    3000.0  20
7839    KING    5000.0  10
Time taken: 37.126 seconds, Fetched: 13 row(s)
hive> 

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏CSDN技术头条

大数据同步工具DataX与Sqoop之比较

DataX是一个在异构的数据库/文件系统之间高速交换数据的工具,实现了在任意的数据处理系统(RDBMS/Hdfs/Local filesystem)之间的数据交...

61610
来自专栏华章科技

大数据技术Hadoop面试题,看看你能答对多少?答案在后面

a)NameNode b)Jobtracker c)Datanode d)secondaryNameNode e)tasktracker

862
来自专栏流柯技术学院

jmeter之JDBC Request各种数据库配置

jdbc:oracle:thin:user/pass@//host:port/service

851
来自专栏Spark学习技巧

Spark设计理念和基本架构

1575
来自专栏行者悟空

SparkContext初始化过程

1354
来自专栏大数据学习笔记

Spark2.x学习笔记:11、RDD依赖关系与stage划分

11、 RDD依赖关系与stage划分 Spark中RDD的高效与DAG图有着莫大的关系,在DAG调度中需要对计算过程划分stage,而划分依据就是RDD之间的...

2295
来自专栏https://www.cnblogs.com/L

【Spark篇】---Spark中资源和任务调度源码分析与资源配置参数应用

Spark中资源调度是一个非常核心的模块,尤其对于我们提交参数来说,需要具体到某些配置,所以提交配置的参数于源码一一对应,掌握此节对于Spark在任务执行过程中...

1383
来自专栏行者悟空

Spark DAG调度

993
来自专栏行者悟空

Spark RDD中的持久化

843
来自专栏懒人开发

hadoop(1):hadoop概述

hadoop是 Doug Cutting 在 Lucene 之后的一个项目 主要用于 计算 是一个 开源,可靠,可扩展 的分布式计算框架 主要有

703

扫码关注云+社区