Hadoop基础教程-第11章 Hive:SQL on Hadoop(11.7 HQL:数据查询)(草稿)

第11章 Hive:SQL on Hadoop

11.7 HQL:数据查询

本节操作使用默认default数据库。

hive> use default;
OK
Time taken: 0.035 seconds
hive>

11.7.1 limit 子句

hive> select * from emp limit 5;
OK
7782    CLARK   MANAGER 7839    1981-06-09  2450.0  0.0 10
7839    KING    PRESIDENT   0   1981-11-17  5000.0  0.0 10
7934    MILLER  CLERK   7782    1982-01-23  1300.0  0.0 10
7369    SMITH   CLERK   7902    1980-12-17  800.0   0.0 20
7566    JONES   MANAGER 7839    1981-04-02  2975.0  0.0 20
Time taken: 2.18 seconds, Fetched: 5 row(s)
hive> 
hive> select * from emp limit 1,3;
OK
7839    KING    PRESIDENT   0   1981-11-17  5000.0  0.0 10
7934    MILLER  CLERK   7782    1982-01-23  1300.0  0.0 10
7369    SMITH   CLERK   7902    1980-12-17  800.0   0.0 20
Time taken: 0.303 seconds, Fetched: 3 row(s)
hive> 

11.7.2 case when then 语句

hive> select eid,ename,
    > case
    > when did=10 then 'd10'
    > when did=20 then 'd20'
    > when did=30 then 'd30'
    > end
    > from emp;
OK
7782    CLARK   d10
7839    KING    d10
7934    MILLER  d10
7369    SMITH   d20
7566    JONES   d20
7902    FORD    d20
7499    ALLEN   d30
7521    WARD    d30
7654    MARTIN  d30
7698    BLAKE   d30
7844    TURNER  d30
7900    JAMES   d30
Time taken: 0.469 seconds, Fetched: 12 row(s)
hive> 

11.7.3 where 子句

可以用orand连接多个表达式。

操作符

数据类型

说明

A=B

基本类型

A等于B返回true,否则返回false

A<>B,A!=B

基本类型

A或B为null,返回null; A和B都不空时,A不等于B返回true,否则返回false

A<B

基本类型

A或B为null,返回null; A小于B返回true,否则返回false

A<=B

基本类型

A或B为null,返回null; A小于等于B返回true,否则返回false

A>B

基本类型

A或B为null,返回null; A大于B返回true,否则返回false

A>=B

基本类型

A或B为null,返回null; A大于等于B返回true,否则返回false

A Between B And C

基本类型

筛选 A 的值处于 B 和 C 之间

A not between B and C

基本类型

筛选 A 的值不处于 B 和 C 之间

A is null

所有类型

如果A等于null,返回true,否则返回false。注意,不能使用A=null

A is not null

所有类型

筛选 A 值不是null的

A like B

string

B是一个SQL正则表达式,%表示一个或者多个字符,_表示一个字符

A not like B

string

与正则表达式B相反结果

A rlike B

string

正则匹配

hive> select * from emp
    > where did=30 and comm<>0.0;
OK
7499    ALLEN   SALESMAN    7698    1981-02-20  1600.0  300.0   30
7521    WARD    SALESMAN    7698    1981-02-22  1250.0  500.0   30
7654    MARTIN  SALESMAN    7698    1981-09-28  1250.0  1400.0  30
Time taken: 0.667 seconds, Fetched: 3 row(s)
hive> 
hive> select * from emp
    > where hiredate between '1981-01-01' and '1981-12-30';
OK
7782    CLARK   MANAGER 7839    1981-06-09  2450.0  0.0 10
7839    KING    PRESIDENT   0   1981-11-17  5000.0  0.0 10
7566    JONES   MANAGER 7839    1981-04-02  2975.0  0.0 20
7902    FORD    ANALYST 7566    1981-12-03  3000.0  0.0 20
7499    ALLEN   SALESMAN    7698    1981-02-20  1600.0  300.0   30
7521    WARD    SALESMAN    7698    1981-02-22  1250.0  500.0   30
7654    MARTIN  SALESMAN    7698    1981-09-28  1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-05-01  2850.0  0.0 30
7844    TURNER  SALESMAN    7698    1981-09-08  1500.0  0.0 30
7900    JAMES   CLERK   7698    1981-12-03  950.0   0.0 30
Time taken: 0.311 seconds, Fetched: 10 row(s)
hive>

11.7.4 group by 子句

【例】查询每个部门的人数

hive> select count(*) from emp
    > group by did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814101839_fe4cccbd-62c4-48cb-986b-a70f93c19535
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0003, Tracking URL = http://node1:8088/proxy/application_1502717288397_0003/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502717288397_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:19:10,549 Stage-1 map = 0%,  reduce = 0%
2017-08-14 10:19:27,526 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.86 sec
2017-08-14 10:19:42,984 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.84 sec
MapReduce Total cumulative CPU time: 6 seconds 840 msec
Ended Job = job_1502717288397_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.84 sec   HDFS Read: 9667 HDFS Write: 129 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 840 msec
OK
3
3
6
Time taken: 66.588 seconds, Fetched: 3 row(s)
hive> 

与Oracle数据库一样,如果使用group by子句,那么查询的字段如果没有出现在group by子句的后面,则必须使用聚合函数。

hive> select ename,avg(sal) from emp group by did;
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'
hive>

如果想对分组结果进行过滤,可以使用having子句

hive> select avg(sal) from emp 
    > group by did
    > having avg(sal)>2000;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170814103035_2106518a-aab4-4cdd-9b39-7b08fb1a44b6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502717288397_0007, Tracking URL = http://node1:8088/proxy/application_1502717288397_0007/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502717288397_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-08-14 10:31:08,601 Stage-1 map = 0%,  reduce = 0%
2017-08-14 10:31:25,888 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.88 sec
2017-08-14 10:31:43,122 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.34 sec
MapReduce Total cumulative CPU time: 8 seconds 340 msec
Ended Job = job_1502717288397_0007
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.34 sec   HDFS Read: 10807 HDFS Write: 149 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 340 msec
OK
2916.6666666666665
2258.3333333333335
Time taken: 70.202 seconds, Fetched: 2 row(s)
hive> 

11.7.5 内连接

(1)准备阶段 部门数据

[root@node3 data]# vi dept
[root@node3 data]# cat dept
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
[root@node3 data]# 

创建部门表dept,并导入数据

hive> create table if not exists dept(
    > did    int,
    > dname string,
    > dloc string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.185 seconds
hive> load data local inpath '/root/data/dept' into table dept;
Loading data to table default.dept
OK
Time taken: 0.705 seconds
hive> select * from dept;
OK
10  ACCOUNTING  NEW YORK
20  RESEARCH    DALLAS
30  SALES   CHICAGO
40  OPERATIONS  BOSTON
Time taken: 0.155 seconds, Fetched: 4 row(s)
hive> 

(2)等值连接 连接的两个表中,只有同时满足连接条件的记录才会放入结果表中。执行查询命令select eid,ename,dname from emp,dept where emp.did=dept.did;

hive> select eid,ename,dname
    > from emp,dept
    > where emp.did=dept.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170815101902_18fb066b-11f2-438d-8a92-9d4490b80ec2
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-15 10:19:13 Starting to launch local task to process map join;  maximum memory = 518979584
2017-08-15 10:19:17 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-19-02_412_1947825427115470836-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2017-08-15 10:19:17 Uploaded 1 File to: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-19-02_412_1947825427115470836-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (373 bytes)
2017-08-15 10:19:17 End of local task; Time Taken: 3.36 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502805552881_0002, Tracking URL = http://node1:8088/proxy/application_1502805552881_0002/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502805552881_0002
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-15 10:19:38,347 Stage-3 map = 0%,  reduce = 0%
2017-08-15 10:19:52,530 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.69 sec
MapReduce Total cumulative CPU time: 2 seconds 690 msec
Ended Job = job_1502805552881_0002
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 2.69 sec   HDFS Read: 7746 HDFS Write: 459 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 690 msec
OK
7782    CLARK   ACCOUNTING
7839    KING    ACCOUNTING
7934    MILLER  ACCOUNTING
7369    SMITH   RESEARCH
7566    JONES   RESEARCH
7902    FORD    RESEARCH
7499    ALLEN   SALES
7521    WARD    SALES
7654    MARTIN  SALES
7698    BLAKE   SALES
7844    TURNER  SALES
7900    JAMES   SALES
Time taken: 52.488 seconds, Fetched: 12 row(s)
hive> 

内连接另一种形式:select e.eid,e.ename,d.dname from emp e join dept d on e.did=d.did; 其中,e和d分别是表emp和dept的别名。

hive> select e.eid,e.ename,d.dname
    > from emp e join dept d
    > on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170815102454_0ca5cc0f-1802-4b4f-8156-d055482971b7
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-15 10:25:06 Starting to launch local task to process map join;  maximum memory = 518979584
2017-08-15 10:25:10 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-24-54_978_8533839352082414207-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2017-08-15 10:25:10 Uploaded 1 File to: file:/tmp/root/6e48493b-cca6-4657-a459-7ff36c6137b1/hive_2017-08-15_10-24-54_978_8533839352082414207-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (373 bytes)
2017-08-15 10:25:10 End of local task; Time Taken: 3.864 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502805552881_0003, Tracking URL = http://node1:8088/proxy/application_1502805552881_0003/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502805552881_0003
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-15 10:25:27,803 Stage-3 map = 0%,  reduce = 0%
2017-08-15 10:25:38,323 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.24 sec
MapReduce Total cumulative CPU time: 2 seconds 240 msec
Ended Job = job_1502805552881_0003
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 2.24 sec   HDFS Read: 7746 HDFS Write: 459 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 240 msec
OK
7782    CLARK   ACCOUNTING
7839    KING    ACCOUNTING
7934    MILLER  ACCOUNTING
7369    SMITH   RESEARCH
7566    JONES   RESEARCH
7902    FORD    RESEARCH
7499    ALLEN   SALES
7521    WARD    SALES
7654    MARTIN  SALES
7698    BLAKE   SALES
7844    TURNER  SALES
7900    JAMES   SALES
Time taken: 44.558 seconds, Fetched: 12 row(s)
hive> 

11.7.6 外连接

内连接只显示了俩张表都存在的数据,而外连接则是显示出所有的数据 。外连接又分为左外连接、右外连接以及全外连接,可以简称左连接、右连接和全连接。其中左连接是左边的表输出完整数据。右连接是右边的表输出完整数据,全连接就是两张表的所有行全部输出。

(1)补充数据 为了测试外连接,需要在雇员表中添加一个没有部门的新员工信息

hive> insert into emp(eid,ename,hiredate) values(8888,'HADRON','2016-08-31'); 
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816091805_3f28cfe1-278a-486e-9ed4-34ee791f5322
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0005, Tracking URL = http://node1:8088/proxy/application_1502887685471_0005/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502887685471_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-08-16 09:18:20,468 Stage-1 map = 0%,  reduce = 0%
2017-08-16 09:18:32,863 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.48 sec
MapReduce Total cumulative CPU time: 2 seconds 480 msec
Ended Job = job_1502887685471_0005
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://cetc/user/hive/warehouse/emp/.hive-staging_hive_2017-08-16_09-18-05_560_4497324549163312419-1/-ext-10000
Loading data to table default.emp
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 2.48 sec   HDFS Read: 4861 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 480 msec
OK
Time taken: 30.102 seconds
hive>

(2)左连接

hive> select e.eid,e.ename,d.dname from emp e left join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816091950_1483c055-0461-4293-9387-3221eb15b448
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:20:01 Starting to launch local task to process map join;  maximum memory = 518979584
2017-08-16 09:20:04 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-19-50_831_2141891134326428737-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2017-08-16 09:20:04 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-19-50_831_2141891134326428737-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (373 bytes)
2017-08-16 09:20:04 End of local task; Time Taken: 2.606 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0006, Tracking URL = http://node1:8088/proxy/application_1502887685471_0006/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502887685471_0006
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:20:20,596 Stage-3 map = 0%,  reduce = 0%
2017-08-16 09:20:30,466 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.82 sec
MapReduce Total cumulative CPU time: 1 seconds 820 msec
Ended Job = job_1502887685471_0006
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 1.82 sec   HDFS Read: 7158 HDFS Write: 486 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 820 msec
OK
7782    CLARK   ACCOUNTING
7839    KING    ACCOUNTING
7934    MILLER  ACCOUNTING
7369    SMITH   RESEARCH
7566    JONES   RESEARCH
7902    FORD    RESEARCH
7499    ALLEN   SALES
7521    WARD    SALES
7654    MARTIN  SALES
7698    BLAKE   SALES
7844    TURNER  SALES
7900    JAMES   SALES
8888    HADRON  NULL
Time taken: 41.877 seconds, Fetched: 13 row(s)
hive> 

(3)右连接

hive> select e.eid,e.ename,d.dname from emp e right join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092117_f97eea90-4039-45f9-8097-900c457f50f3
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:21:28 Starting to launch local task to process map join;  maximum memory = 518979584
2017-08-16 09:21:31 Dump the side-table for tag: 0 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-21-17_901_4210650042244478897-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable
2017-08-16 09:21:31 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-21-17_901_4210650042244478897-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable (498 bytes)
2017-08-16 09:21:31 End of local task; Time Taken: 2.599 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0007, Tracking URL = http://node1:8088/proxy/application_1502887685471_0007/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502887685471_0007
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:21:48,072 Stage-3 map = 0%,  reduce = 0%
2017-08-16 09:21:58,404 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.95 sec
MapReduce Total cumulative CPU time: 1 seconds 950 msec
Ended Job = job_1502887685471_0007
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 1.95 sec   HDFS Read: 6169 HDFS Write: 488 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 950 msec
OK
7782    CLARK   ACCOUNTING
7839    KING    ACCOUNTING
7934    MILLER  ACCOUNTING
7369    SMITH   RESEARCH
7566    JONES   RESEARCH
7902    FORD    RESEARCH
7499    ALLEN   SALES
7521    WARD    SALES
7654    MARTIN  SALES
7698    BLAKE   SALES
7844    TURNER  SALES
7900    JAMES   SALES
NULL    NULL    OPERATIONS
Time taken: 42.768 seconds, Fetched: 13 row(s)
hive> 

(4)全连接

hive> select e.eid,e.ename,d.dname from emp e full join dept d on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092230_3b3f51e8-2753-41a2-9f41-af98770847fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1502887685471_0008, Tracking URL = http://node1:8088/proxy/application_1502887685471_0008/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502887685471_0008
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2017-08-16 09:22:45,280 Stage-1 map = 0%,  reduce = 0%
2017-08-16 09:23:06,469 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.28 sec
2017-08-16 09:23:16,137 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.52 sec
MapReduce Total cumulative CPU time: 8 seconds 520 msec
Ended Job = job_1502887685471_0008
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 8.52 sec   HDFS Read: 16627 HDFS Write: 515 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 520 msec
OK
8888    HADRON  NULL
7782    CLARK   ACCOUNTING
7934    MILLER  ACCOUNTING
7839    KING    ACCOUNTING
7902    FORD    RESEARCH
7566    JONES   RESEARCH
7369    SMITH   RESEARCH
7654    MARTIN  SALES
7521    WARD    SALES
7900    JAMES   SALES
7499    ALLEN   SALES
7844    TURNER  SALES
7698    BLAKE   SALES
NULL    NULL    OPERATIONS
Time taken: 47.921 seconds, Fetched: 14 row(s)
hive> 

(5)左半连接

hive> select e.eid,e.ename from emp e 
    > left semi join dept d
    > on e.did=d.did;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816093310_97649892-9cb8-4b26-954a-1a301976184d
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:33:22 Starting to launch local task to process map join;  maximum memory = 518979584
2017-08-16 09:33:25 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-33-10_356_3322488615725018370-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile51--.hashtable
2017-08-16 09:33:25 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-33-10_356_3322488615725018370-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile51--.hashtable (332 bytes)
2017-08-16 09:33:25 End of local task; Time Taken: 2.642 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0010, Tracking URL = http://node1:8088/proxy/application_1502887685471_0010/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502887685471_0010
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:33:43,441 Stage-3 map = 0%,  reduce = 0%
2017-08-16 09:33:55,350 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.93 sec
MapReduce Total cumulative CPU time: 2 seconds 930 msec
Ended Job = job_1502887685471_0010
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 2.93 sec   HDFS Read: 7152 HDFS Write: 363 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 930 msec
OK
7782    CLARK
7839    KING
7934    MILLER
7369    SMITH
7566    JONES
7902    FORD
7499    ALLEN
7521    WARD
7654    MARTIN
7698    BLAKE
7844    TURNER
7900    JAMES
Time taken: 46.179 seconds, Fetched: 12 row(s)
hive> 

左半连接和内连接的结果一样,只不过使用IN查询。

hive> select e.eid,e.ename from emp e 
    > where e.did in(select d.did from dept d);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20170816092716_b479bf9e-fd03-48a1-be72-fb64d90f2efc
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-08-16 09:27:27 Starting to launch local task to process map join;  maximum memory = 518979584
2017-08-16 09:27:30 Dump the side-table for tag: 1 with group count: 4 into file: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-27-16_222_8300160340562225012-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile41--.hashtable
2017-08-16 09:27:30 Uploaded 1 File to: file:/tmp/root/50703eac-08fc-4214-9bd0-488ad09f51e2/hive_2017-08-16_09-27-16_222_8300160340562225012-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile41--.hashtable (332 bytes)
2017-08-16 09:27:30 End of local task; Time Taken: 2.987 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1502887685471_0009, Tracking URL = http://node1:8088/proxy/application_1502887685471_0009/
Kill Command = /opt/hadoop-2.7.3/bin/hadoop job  -kill job_1502887685471_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2017-08-16 09:27:46,861 Stage-3 map = 0%,  reduce = 0%
2017-08-16 09:27:59,352 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1502887685471_0009
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 1   Cumulative CPU: 2.91 sec   HDFS Read: 7152 HDFS Write: 363 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 910 msec
OK
7782    CLARK
7839    KING
7934    MILLER
7369    SMITH
7566    JONES
7902    FORD
7499    ALLEN
7521    WARD
7654    MARTIN
7698    BLAKE
7844    TURNER
7900    JAMES
Time taken: 44.323 seconds, Fetched: 12 row(s)
hive> 

注意,左半连接通常比内连接要高效,因为对于左表的一条 指定的记录,在右表一旦找到匹配的记录,Hive就会停止扫描。

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏Hadoop实操

Spark2Streaming读Kerberos环境的Kafka并写数据到Hive

在前面的文章Fayson介绍了一些关于Spark2Streaming的示例如《Spark2Streaming读Kerberos环境的Kafka并写数据到HBas...

3363
来自专栏青枫的专栏

day35_Spring学习笔记_03

掌握:PROPAGATION_REQUIRED、PROPAGATION_REQUIRES_NEW、PROPAGATION_NESTED

602
来自专栏YoungGy

SparkSQL入门_1

概述 DataFrame SQL query ReadWrite Example 概述 先说说准备工作吧。 目前使用的是伪分布式模式,hadoop,spar...

18210
来自专栏北京马哥教育

Spark:一个高效的分布式计算系统

马哥linux运维 | 最专业的linux培训机构 ---- 概述 什么是Spark Spark是UC Berkeley AMP lab所开源的类Hado...

3666
来自专栏华章科技

SparkR:数据科学家的新利器

摘要:R是数据科学家中最流行的编程语言和环境之一,在Spark中加入对R的支持是社区中较受关注的话题。作为增强Spark对数据科学家群体吸引力的最新举措,最近发...

542
来自专栏数据科学与人工智能

【数据科学家】SparkR:数据科学家的新利器

摘要:R是非常流行的数据统计分析和制图的语言及环境,有调查显示,R语言在数据科学家中使用的程度仅次于SQL,但大数据时代的海量数据处理对R构成了挑战。 摘要:R...

2149
来自专栏行者悟空

将SparkSQL计算结果写入Mysql中

854
来自专栏个人分享

SparkSql官方文档中文翻译(java版本)

Spark SQL是Spark的一个组件,用于结构化数据的计算。Spark SQL提供了一个称为DataFrames的编程抽象,DataFrames可以充当分布...

1043
来自专栏云计算与大数据

Import Kafka data into OSS using E-MapReduce service

Overview Kafka is a frequently-used message queue in open-source communities. A...

602
来自专栏数据科学与人工智能

【Spark研究】用Apache Spark进行大数据处理第二部分:Spark SQL

在Apache Spark文章系列的前一篇文章中,我们学习了什么是Apache Spark框架,以及如何用该框架帮助组织处理大数据处理分析的需求。 Spark ...

27110

扫码关注云+社区