Hadoop Fundamentals Tutorial - Chapter 11 Hive: SQL on Hadoop (11.8 HQL: Sorting) (Draft)

Chapter 11 Hive: SQL on Hadoop

11.8 HQL: Sorting

11.8.1 order by

Hive's order by has the same semantics as order by in standard SQL: it sorts the entire query result globally. A Hive statement, however, is compiled into a MapReduce job that runs on the Hadoop cluster, and a global sort requires the output of all mappers to be funneled into a single reducer. Note in the transcript below that the reducer count is fixed at compile time to 1; if the result set is large, that lone reducer becomes the bottleneck and the reduce phase can take a very long time.

hive> select * from emp
    > order by hiredate;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814100601_114fbe7b-a71a-4575-8a85-582f83ead490
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0001, Tracking URL = http://node1:8088/proxy/application_1502717288397_0001/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502717288397_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:06:36,164 Stage-1 map = 0%,  reduce = 0%
2017-08-14 10:06:57,150 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.01 sec
2017-08-14 10:07:13,160 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.46 sec
MapReduce Total cumulative CPU time: 7 seconds 460 msec
Ended Job = job_1502717288397_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 7.46 sec   HDFS Read: 10295 HDFS Write: 821 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 460 msec
OK
7369    SMITH   CLERK   7902    1980-12-17  800.0   0.0 20
7499    ALLEN   SALESMAN    7698    1981-02-20  1600.0  300.0   30
7521    WARD    SALESMAN    7698    1981-02-22  1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-04-02  2975.0  0.0 20
7698    BLAKE   MANAGER 7839    1981-05-01  2850.0  0.0 30
7782    CLARK   MANAGER 7839    1981-06-09  2450.0  0.0 10
7844    TURNER  SALESMAN    7698    1981-09-08  1500.0  0.0 30
7654    MARTIN  SALESMAN    7698    1981-09-28  1250.0  1400.0  30
7839    KING    PRESIDENT   0   1981-11-17  5000.0  0.0 10
7900    JAMES   CLERK   7698    1981-12-03  950.0   0.0 30
7902    FORD    ANALYST 7566    1981-12-03  3000.0  0.0 20
7934    MILLER  CLERK   7782    1982-01-23  1300.0  0.0 10
Time taken: 74.031 seconds, Fetched: 12 row(s)
hive> 
hive> select * from emp
    > where did=30
    > order by sal desc,eid asc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814101221_ac53277e-41d7-42a7-aa11-3ba2fe041795
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0002, Tracking URL = http://node1:8088/proxy/application_1502717288397_0002/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502717288397_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:12:45,371 Stage-1 map = 0%,  reduce = 0%
2017-08-14 10:13:06,327 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.0 sec
2017-08-14 10:13:23,493 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.38 sec
MapReduce Total cumulative CPU time: 9 seconds 380 msec
Ended Job = job_1502717288397_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 9.38 sec   HDFS Read: 10967 HDFS Write: 462 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 380 msec
OK
7698    BLAKE   MANAGER 7839    1981-05-01  2850.0  0.0 30
7499    ALLEN   SALESMAN    7698    1981-02-20  1600.0  300.0   30
7844    TURNER  SALESMAN    7698    1981-09-08  1500.0  0.0 30
7521    WARD    SALESMAN    7698    1981-02-22  1250.0  500.0   30
7654    MARTIN  SALESMAN    7698    1981-09-28  1250.0  1400.0  30
7900    JAMES   CLERK   7698    1981-12-03  950.0   0.0 30
Time taken: 65.629 seconds, Fetched: 6 row(s)
hive> 
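
Note the tie-breaker in the second result set: WARD (eid 7521) and MARTIN (eid 7654) both earn 1250.0, and WARD comes first because eid asc decides the order between rows with equal sal.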

Note: avoid order by in Hive unless you are certain the result set is small. When you do need it, one common mitigation is shown below; beyond that, sorting requirements never go away, so Hive provides the constructs covered in the rest of this section.
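
If only the top few rows are needed, pairing order by with limit at least keeps the single reducer's final output small (a minimal sketch against the emp table above):

hive> select eid,ename,sal from emp
    > order by sal desc
    > limit 3;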

11.8.2 sort by

sort by sorts within each reducer: it is a local sort. It guarantees that every reducer's output is ordered by the specified columns, but rows with equal values in those columns may be assigned to different reducers, so although each reducer's slice is sorted, the concatenated result is generally not in global order.

When a Hive query executes, you will see information like the following:

Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>

These parameters are documented at https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties; the two most relevant ones are:

  • hive.exec.reducers.bytes.per.reducer — introduced in Hive 0.2.0. Prior to Hive 0.14.0 the default was 1 GB; from Hive 0.14.0 onward it is 256 MB.
  • mapred.reduce.tasks — if this parameter is set, Hive uses its value directly as the number of reducers; if it is left unset (the default is -1), Hive estimates the reducer count from the size of the input files. That estimate may well be off, because the reducers actually consume the mappers' output, which can be much smaller than the job input, so an estimate based on the map output would be more accurate. Both parameters can also be set per session, as the sketch after this list shows.
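
A minimal sketch of the session-level route, run from the Hive CLI (the listing below takes the hive-site.xml route instead):

hive> set hive.exec.reducers.bytes.per.reducer=134217728;
hive> set mapreduce.job.reduces=2;

The first line asks Hive to aim for roughly 128 MB of input per reducer when estimating the count; the second pins the reducer count at 2 outright.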
[root@node3 ~]# cd /opt/hive-2.1.1/conf/
[root@node3 conf]# vi hive-site.xml 
[root@node3 conf]# cat hive-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
          <name>hive.metastore.local</name>
          <!-- local metastore: MySQL and Hive are installed on the same node -->
          <value>true</value>
    </property>
    <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://node3:3306/hive?createDatabaseIfNotExist=true</value>      
    </property>
    <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>          
          <name>javax.jdo.option.ConnectionUserName</name>        
          <!-- MySQL user name -->
          <value>root</value>
    </property>
    <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>123456</value>
    </property>
    <property>
          <name>mapred.reduce.tasks</name>
          <value>2</value>
    </property>

</configuration>
[root@node3 conf]# 
hive> select eid,ename,sal from emp sort by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820064647_81b26e52-d158-4861-89f0-677ec0db0b52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0009, Tracking URL = http://node1:8088/proxy/application_1503220733636_0009/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 06:47:02,958 Stage-1 map = 0%,  reduce = 0%
2017-08-20 06:47:12,622 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.47 sec
2017-08-20 06:47:24,950 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.35 sec
MapReduce Total cumulative CPU time: 6 seconds 350 msec
Ended Job = job_1503220733636_0009
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 6.35 sec   HDFS Read: 12635 HDFS Write: 559 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 350 msec
OK
8888    HADRON  NULL
7369    SMITH   800.0
7900    JAMES   950.0
7521    WARD    1250.0
7934    MILLER  1300.0
7844    TURNER  1500.0
7782    CLARK   2450.0
7698    BLAKE   2850.0
7566    JONES   2975.0
7654    MARTIN  1250.0
7499    ALLEN   1600.0
7902    FORD    3000.0
7839    KING    5000.0
Time taken: 39.092 seconds, Fetched: 13 row(s)
hive> 
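
Reading the output: the first reducer produced the run from HADRON (NULL) through JONES (2975.0), the second the run from MARTIN (1250.0) through KING (5000.0). Each run is sorted ascending on sal (NULL sorts first), but the concatenated result is plainly not in global order. To inspect the two reducer outputs as separate files, you can write the result to HDFS instead of the console (a sketch, assuming write access to /tmp):

hive> insert overwrite directory '/tmp/sortby_demo'
    > select eid,ename,sal from emp sort by sal;

Afterwards hdfs dfs -ls /tmp/sortby_demo should list one file per reducer (000000_0 and 000001_0), each internally sorted on sal.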

11.8.3 distribute by

distribute by specifies how map output rows are partitioned among the reducers. For example, distribute by cid guarantees that all rows sharing the same cid value are sent to the same reducer; following it with sort by cid then orders the rows by cid within each reducer. This still falls short of a global sort: it only guarantees that rows with equal values of the distribution column end up together and that each reducer's slice is locally sorted. Note that distribute by must be written before sort by. A useful variant is sketched below.
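
For instance, to keep each department's rows together and rank its employees by salary within the department, distribute on did and sort on did plus sal (a sketch against the same emp table):

hive> select eid,ename,sal,did from emp
    > distribute by did
    > sort by did, sal;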

hive> select eid,ename,sal,did from emp distribute by did sort by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820090113_637965de-369d-4a26-b233-0ee69513d6f3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0014, Tracking URL = http://node1:8088/proxy/application_1503220733636_0014/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 09:01:36,959 Stage-1 map = 0%,  reduce = 0%
2017-08-20 09:01:57,587 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.93 sec
2017-08-20 09:02:14,251 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.04 sec
MapReduce Total cumulative CPU time: 9 seconds 40 msec
Ended Job = job_1503220733636_0014
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 9.04 sec   HDFS Read: 13200 HDFS Write: 598 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 40 msec
OK
8888    HADRON  NULL    NULL
7369    SMITH   800.0   20
7900    JAMES   950.0   30
7654    MARTIN  1250.0  30
7521    WARD    1250.0  30
7934    MILLER  1300.0  10
7844    TURNER  1500.0  30
7499    ALLEN   1600.0  30
7782    CLARK   2450.0  10
7698    BLAKE   2850.0  30
7566    JONES   2975.0  20
7902    FORD    3000.0  20
7839    KING    5000.0  10
Time taken: 61.594 seconds, Fetched: 13 row(s)
hive> 
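
A caveat about this particular run: the result happens to read as one globally sorted sequence, but that is a coincidence of the data. With two reducers and did values of 10, 20 and 30 (all even, plus one NULL), the default hash partitioning most likely routed every row to the same reducer and left the other one empty; with did values that hash to both reducers, the output would again be several locally sorted runs.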

11.8.4 cluster by

If distribute by and sort by refer to the same column(s), the pair can be abbreviated as cluster by. Note that cluster by always sorts in ascending order; it does not accept asc or desc.
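
In other words, the query below should be exactly equivalent to spelling the distribution and the sort out in full:

hive> select eid,ename,sal,did from emp
    > distribute by sal
    > sort by sal;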

hive> select eid,ename,sal,did from emp cluster by sal;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170820090813_f764950e-bcdd-4100-9554-2fc8a9c2bde1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1503220733636_0015, Tracking URL = http://node1:8088/proxy/application_1503220733636_0015/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1503220733636_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-20 09:08:28,249 Stage-1 map = 0%,  reduce = 0%
2017-08-20 09:08:37,776 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.92 sec
2017-08-20 09:08:49,509 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 7.1 sec
MapReduce Total cumulative CPU time: 7 seconds 100 msec
Ended Job = job_1503220733636_0015
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 2   Cumulative CPU: 7.1 sec   HDFS Read: 13186 HDFS Write: 598 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 100 msec
OK
8888    HADRON  NULL    NULL
7369    SMITH   800.0   20
7900    JAMES   950.0   30
7654    MARTIN  1250.0  30
7521    WARD    1250.0  30
7934    MILLER  1300.0  10
7844    TURNER  1500.0  30
7499    ALLEN   1600.0  30
7782    CLARK   2450.0  10
7698    BLAKE   2850.0  30
7566    JONES   2975.0  20
7902    FORD    3000.0  20
7839    KING    5000.0  10
Time taken: 37.126 seconds, Fetched: 13 row(s)
hive> 
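
As in the distribute by example, the output happens to read as one fully sorted run; with only 13 rows and two reducers this can occur by chance. cluster by still guarantees only per-reducer ordering, not a global sort.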
