With multiple partition columns, Hive supports both semi-automatic (partly static) and fully automatic dynamic partitioning; choose whichever fits your needs.
For semi-automatic partitioning, I will demonstrate partitions with both English and Chinese values.
hive> create table user_3g_age(id int ,name string ,city string ,age int)
row format delimited
fields terminated by '\t';
OK
Time taken: 0.463 seconds
hive> load data local inpath '/root/book/3guserage.txt' overwrite into table user_3g_age;
Loading data to table test.user_3g_age
Table test.user_3g_age stats: [numFiles=1, numRows=0, totalSize=153, rawDataSize=0]
OK
Time taken: 0.678 seconds
hive> select * from user_3g_age;
OK
1 刘备 shu 30
2 关羽 shu 29
3 张飞 shu 28
4 曹操 wei 32
5 曹丕 wei 5
6 孙权 wu 30
7 孙策 wu 28
8 孙尚香 wu 20
9	夏侯惇	wei	30
Time taken: 0.427 seconds, Fetched: 9 row(s)
hive> create table user_3g_age_area(id int, name string)
partitioned by(city string, age int)
row format delimited
fields terminated by '\t';
OK
Time taken: 0.079 seconds
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict; -- optional for semi-automatic partitioning
hive> insert into table user_3g_age_area
partition(city='wu', age)
select id, name, age -- note: list only the regular columns and the dynamic partition column
from user_3g_age
where city='wu';
Query ID = root_20200617140809_f71a64ab-8a96-4d30-b933-aecdc9c08b5f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0006, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0006/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 14:08:18,789 Stage-1 map = 0%, reduce = 0%
2020-06-17 14:08:27,359 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
MapReduce Total cumulative CPU time: 1 seconds 820 msec
Ended Job = job_1592152196890_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_age_area/city=wu/.hive-staging_hive_2020-06-17_14-08-09_773_4025041538196608852-1/-ext-10000
Loading data to table test.user_3g_age_area partition (city=wu, age=null)
Time taken for load dynamic partitions : 424
Loading partition {city=wu, age=20}
Loading partition {city=wu, age=28}
Loading partition {city=wu, age=30}
Time taken for adding to write entity : 2
Partition test.user_3g_age_area{city=wu, age=20} stats: [numFiles=2, numRows=2, totalSize=24, rawDataSize=22]
Partition test.user_3g_age_area{city=wu, age=28} stats: [numFiles=2, numRows=2, totalSize=18, rawDataSize=16]
Partition test.user_3g_age_area{city=wu, age=30} stats: [numFiles=2, numRows=2, totalSize=18, rawDataSize=16]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.82 sec HDFS Read: 4620 HDFS Write: 232 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 820 msec
OK
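To make explicit what the insert above does, here is a rough Python simulation (my own sketch, not Hive internals) of how rows are routed: the static partition value comes from the PARTITION clause, while each dynamic value is taken from the trailing column of the selected row.

```python
# Simulate semi-automatic dynamic partitioning for
# partition(city='wu', age) ... where city='wu'
rows = [  # (id, name, city, age) rows from user_3g_age
    (6, "孙权", "wu", 30),
    (7, "孙策", "wu", 28),
    (8, "孙尚香", "wu", 20),
]

static_city = "wu"  # fixed by the PARTITION clause
partitions = {}
for id_, name, city, age in rows:
    if city != static_city:  # the WHERE city='wu' filter
        continue
    # directory layout under the table path: city=<static>/age=<dynamic>
    part_dir = f"city={static_city}/age={age}"
    partitions.setdefault(part_dir, []).append((id_, name))

for d in sorted(partitions):
    print(d, partitions[d])
```

The three directories printed match the three `Loading partition` lines in the transcript above.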
Whether the query that feeds the insert needs a trailing WHERE clause depends on your data.
The following are some incorrect ways to write the import, shown for reference:
hive> insert into table user_3g_age_area
partition(city='wu', age)
select id, name, city -- wrong: selects the static partition column instead of the dynamic one
from user_3g_age where city='wu';
Query ID = root_20200617140142_fa4672a3-3d7d-43f6-901e-88b5392c119b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0003, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0003/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 14:01:51,192 Stage-1 map = 0%, reduce = 0%
2020-06-17 14:02:00,899 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.97 sec
MapReduce Total cumulative CPU time: 1 seconds 970 msec
Ended Job = job_1592152196890_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_age_area/city=wu/.hive-staging_hive_2020-06-17_14-01-42_632_4964455617566151732-1/-ext-10000
Loading data to table test.user_3g_age_area partition (city=wu, age=null)
Time taken for load dynamic partitions : 120
Loading partition {city=wu, age=wu}
Time taken for adding to write entity : 1
Partition test.user_3g_age_area{city=wu, age=wu} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.97 sec HDFS Read: 4613 HDFS Write: 122 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 970 msec
OK
Time taken: 19.886 seconds
hive> select * from user_3g_age_area;
OK
6 孙权 wu NULL
7 孙策 wu NULL
8 孙尚香 wu NULL
Time taken: 0.107 seconds, Fetched: 3 row(s)
hive> insert into table user_3g_age_area
partition(city='wu', age)
select id, name, city, age -- wrong: selects every column, including the static partition column
from user_3g_age;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target table because column number/types are different 'age': Table insclause-0 has 3 columns, but query has 4 columns.
hive> show partitions user_3g_age_area;
OK
city=wu/age=20
city=wu/age=28
city=wu/age=30
Time taken: 0.064 seconds, Fetched: 7 row(s)
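The `hive.exec.dynamic.partition.mode` setting matters because strict mode rejects an insert in which every partition column is dynamic. A hypothetical Python sketch of that rule (illustrative only, `check_dynamic_partition_spec` is my own helper, not Hive's actual validation code):

```python
def check_dynamic_partition_spec(spec, mode="strict"):
    """spec maps partition column -> static value, or None if dynamic."""
    if mode == "strict" and all(v is None for v in spec.values()):
        raise ValueError(
            "strict mode requires at least one static partition column; "
            "set hive.exec.dynamic.partition.mode=nonstrict"
        )
    return True

# partition(city='wu', age) has one static column, so it passes
# even in strict mode:
assert check_dynamic_partition_spec({"city": "wu", "age": None})
```

This is why the comment above says the `nonstrict` setting is optional for the semi-automatic case: the static `city='wu'` already satisfies strict mode.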
If you run into problems with Chinese partition values, see: Hive Dynamic Partitions.
The steps for creating Chinese partitions are the same, so I will skip the detailed explanation and go straight to the demonstration:
# create the table
hive> create table user_xy_race(id int, name string, city string , race string)
row format delimited
fields terminated by '\t';
OK
Time taken: 0.04 seconds
# load the data
hive> load data local inpath '/root/book/xyuserrace.txt' overwrite into table user_xy_race;
Loading data to table test.user_xy_race
Table test.user_xy_race stats: [numFiles=1, numRows=0, totalSize=272, rawDataSize=0]
OK
Time taken: 0.558 seconds
# view the raw data
hive> select * from user_xy_race;
OK
1 孙悟空 花果山 妖
2 猪八戒 高老庄 妖
3 沙和尚 流沙河 妖
4 唐玄奘 西安 人
5 小白龙 清潭涧 仙
6 唐玄宗 西安 人
7 观世音 南海 佛
8 玉皇大帝 凌霄宝殿 仙
9 黄风怪 小雷音寺 妖
10 如来佛祖 雷音寺 佛
Time taken: 0.059 seconds, Fetched: 10 row(s)
# create the partitioned table
hive> create table user_xy_race_area(id int, name string)
partitioned by(city string,race string)
row format delimited
fields terminated by '\t';
OK
Time taken: 0.059 seconds
# set the dynamic-partition parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# semi-automatic dynamic partitioning
hive> insert into table user_xy_race_area
partition(city='西安',race)
select id,name,race
from user_xy_race
where city='西安';
Query ID = root_20200617174813_5d9c2185-926d-4948-9894-96e4e1b30749
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0009, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0009/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:48:22,325 Stage-1 map = 0%, reduce = 0%
2020-06-17 17:48:31,876 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1592152196890_0009
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_xy_race_area/city=西安/.hive-staging_hive_2020-06-17_17-48-13_156_5420308159658982499-1/-ext-10000
Loading data to table test.user_xy_race_area partition (city=西安, race=null)
Time taken for load dynamic partitions : 93
Loading partition {city=西安, race=人}
Time taken for adding to write entity : 0
Partition test.user_xy_race_area{city=西安, race=人} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.78 sec HDFS Read: 4654 HDFS Write: 124 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
Time taken: 20.209 seconds
# view the data
hive> select * from user_xy_race_area;
OK
4 唐玄奘 西安 人
6 唐玄宗 西安 人
Time taken: 0.063 seconds, Fetched: 2 row(s)
# view the partitions
hive> show partitions user_xy_race_area;
OK
city�/race=�
Time taken: 0.098 seconds, Fetched: 1 row(s)
Chinese partition names display as mojibake in Hive, but this does not affect normal use. MySQL can show mojibake too, but MySQL offers `set names` to configure the client encoding.
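The mojibake typically comes from an encoding mismatch, for example UTF-8 partition names read back through a Latin-1 metastore connection; the underlying bytes survive, which is why queries keep working. A small Python illustration of this mechanism (an assumption about the cause, not Hive's exact code path):

```python
# UTF-8 bytes misread as Latin-1 display as garbage, but the bytes
# themselves are unchanged, so the round trip back to UTF-8 is lossless.
name = "city=西安/race=人"
stored = name.encode("utf-8")        # bytes as written to storage
garbled = stored.decode("latin-1")   # what a Latin-1 client displays
restored = garbled.encode("latin-1").decode("utf-8")

print(garbled)  # unreadable mojibake
assert restored == name  # data is intact, so queries still work
```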
Fully automatic partitioning reuses the data from the semi-automatic example; the steps are as follows:
# create the table
hive> create table user_3g_area_all (id int,name string)
partitioned by (city string, age string) -- the type is deliberately wrong here (age should be int)
row format delimited
fields terminated by '\t';
OK
Time taken: 0.17 seconds
# fix the partition column type
hive> alter table user_3g_area_all partition column (age int);
OK
Time taken: 0.097 seconds
# set the parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# insert with fully dynamic partitioning
hive> insert into table user_3g_area_all
partition(city,age)
select id, name, city, age -- list every column here, including both partition columns
from user_3g_age;
Query ID = root_20200617171344_b3dae4d6-e248-440b-b977-2db4e191e298
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0008, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0008/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:13:54,523 Stage-1 map = 0%, reduce = 0%
2020-06-17 17:14:03,066 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.4 sec
MapReduce Total cumulative CPU time: 1 seconds 400 msec
Ended Job = job_1592152196890_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_area_all/.hive-staging_hive_2020-06-17_17-13-44_647_9180685517392249162-1/-ext-10000
Loading data to table test.user_3g_area_all partition (city=null, age=null)
Time taken for load dynamic partitions : 632
Loading partition {city=wu, age=28}
Loading partition {city=shu, age=28}
Loading partition {city=shu, age=29}
Loading partition {city=wei, age=32}
Loading partition {city=wu, age=20}
Loading partition {city=wei, age=30}
Loading partition {city=wu, age=30}
Loading partition {city=shu, age=30}
Loading partition {city=wei, age=5}
Time taken for adding to write entity : 3
Partition test.user_3g_area_all{city=shu, age=28} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=shu, age=29} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=shu, age=30} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wei, age=30} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_3g_area_all{city=wei, age=32} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wei, age=5} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wu, age=20} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_3g_area_all{city=wu, age=28} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wu, age=30} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.4 sec HDFS Read: 4247 HDFS Write: 624 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 400 msec
OK
Time taken: 22.086 seconds
# view the data
hive> select * from user_3g_area_all ;
OK
3 张飞 shu 28
2 关羽 shu 29
1 刘备 shu 30
9 夏侯惇 wei 30
4 曹操 wei 32
5 曹丕 wei 5
8 孙尚香 wu 20
7 孙策 wu 28
6 孙权 wu 30
Time taken: 0.125 seconds, Fetched: 9 row(s)
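As a sanity check on the result above, this Python sketch (a simulation, not Hive internals) groups the nine source rows by every (city, age) pair, which is exactly what fully dynamic partitioning does: each distinct pair becomes its own directory.

```python
# Fully dynamic partitioning: both partition values come from the query.
rows = [  # (id, name, city, age) rows from user_3g_age
    (1, "刘备", "shu", 30), (2, "关羽", "shu", 29), (3, "张飞", "shu", 28),
    (4, "曹操", "wei", 32), (5, "曹丕", "wei", 5),  (6, "孙权", "wu", 30),
    (7, "孙策", "wu", 28),  (8, "孙尚香", "wu", 20), (9, "夏侯惇", "wei", 30),
]

partitions = {}
for id_, name, city, age in rows:
    partitions.setdefault(f"city={city}/age={age}", []).append((id_, name))

print(len(partitions))  # 9 distinct (city, age) pairs -> 9 partitions
```

Nine rows with nine distinct (city, age) pairs yield nine partitions, matching the `show partitions` output below.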
If a column is omitted from the query, the following error is raised:
hive> insert into table user_3g_area_all partition(city,age) select id, name, age from user_3g_age;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target table because column number/types are different 'age': Table insclause-0 has 4 columns, but query has 3 columns.
View the partitions:
hive> show partitions user_3g_area_all;
OK
city=shu/age=28
city=shu/age=29
city=shu/age=30
city=wei/age=30
city=wei/age=32
city=wei/age=5
city=wu/age=20
city=wu/age=28
city=wu/age=30
Time taken: 0.097 seconds, Fetched: 9 row(s)
The process is the same as with English partition values, so I will only show the commands:
# create the partitioned table
hive> create table user_xy_area_all (id int,name string)
partitioned by(city string,race string)
row format delimited
fields terminated by '\t';
OK
Time taken: 0.059 seconds
# set the parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# insert the data
hive> insert into table user_xy_area_all
partition(city,race)
select id,name,city,race
from user_xy_race;
Query ID = root_20200617175222_16bf2b69-361f-4df3-939f-78c3bb8f7ef5
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0010, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0010/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:52:30,712 Stage-1 map = 0%, reduce = 0%
2020-06-17 17:52:39,123 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
MapReduce Total cumulative CPU time: 1 seconds 390 msec
Ended Job = job_1592152196890_0010
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_xy_area_all/.hive-staging_hive_2020-06-17_17-52-22_348_8759507579611250211-1/-ext-10000
Loading data to table test.user_xy_area_all partition (city=null, race=null)
Time taken for load dynamic partitions : 592
Loading partition {city=花果山, race=妖}
Loading partition {city=西安, race=人}
Loading partition {city=雷音寺, race=佛}
Loading partition {city=凌霄宝殿, race=仙}
Loading partition {city=高老庄, race=妖}
Loading partition {city=流沙河, race=妖}
Loading partition {city=清潭涧, race=仙}
Loading partition {city=南海, race=佛}
Loading partition {city=小雷音寺, race=妖}
Time taken for adding to write entity : 4
Partition test.user_xy_area_all{city=凌霄宝殿, race=仙} stats: [numFiles=1, numRows=1, totalSize=15, rawDataSize=14]
Partition test.user_xy_area_all{city=南海, race=佛} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=小雷音寺, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=流沙河, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=清潭涧, race=仙} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=花果山, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=西安, race=人} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
Partition test.user_xy_area_all{city=雷音寺, race=佛} stats: [numFiles=1, numRows=1, totalSize=16, rawDataSize=15]
Partition test.user_xy_area_all{city=高老庄, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.39 sec HDFS Read: 4321 HDFS Write: 749 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 390 msec
OK
Time taken: 19.139 seconds
# view the data
hive> select * from user_xy_area_all;
OK
8 玉皇大帝 凌霄宝殿 仙
7 观世音 南海 佛
9 黄风怪 小雷音寺 妖
3 沙和尚 流沙河 妖
5 小白龙 清潭涧 仙
1 孙悟空 花果山 妖
4 唐玄奘 西安 人
6 唐玄宗 西安 人
10 如来佛祖 雷音寺 佛
2 猪八戒 高老庄 妖
Time taken: 0.063 seconds, Fetched: 10 row(s)
# view the partitions
hive> show partitions user_xy_area_all;
OK
city=� ��/race=�
city=Ww/race=[
city=���/race=�
city=A��/race=�
city=m�/race=�
city=��q/race=�
city�/race=�
city=���/race=[
city=� �/race=�
Time taken: 0.062 seconds, Fetched: 9 row(s)
That completes the full walkthrough!