前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Apache Hive MSCK

Apache Hive MSCK

作者头像
DataScience
发布2019-12-30 17:34:58
8970
发布2019-12-30 17:34:58
举报
文章被收录于专栏:A2DataA2Data

Apache Hive MSCK

(自动添加分区)

Apache Hive MSCK

修复Hive表分区

⊙一起学Hive——使用MSCK命令修复Hive分区

代码语言:javascript
复制
最近在使用Hive的过程中,在备份数据时,经常会使用cp或mv命令来拷贝数据,
将数据拷贝到我们新建备份表的目录下面,如果不是分区表,则上面的操作之后,
新建的备份表可以正常使用,但是如果是分区表的,一般都是使用alter table add partition命令将分区信息添加到新建的表中,每添加一条分区信息就执行一个alter table add partition命令,如果分区数量少还好办,但是遇到分区数量多的情况,特别是分区数量大于50的情况,如果还是使用alter命令添加分区,那是一件耗时耗力的事情,还容易出错。

MSCK命令

代码语言:javascript
复制
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command), the metastore (and hence Hive) will not be aware of these partitions unless the user runs ALTER TABLE table_name ADD PARTITION commands on each of the newly added partitions.
However, users can run a metastore check command with the repair table option:
MSCK REPAIR TABLE table_name;
which will add metadata about partitions to the Hive metastore for partitions for which such metadata doesn't already exist. In other words, it will add any partitions that exist on HDFS but not in metastore to the metastore. See HIVE-874 for more details. When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME. By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once.
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE table_name RECOVER PARTITIONS;
Starting with Hive 1.3, MSCK will throw exceptions if directories with disallowed characters in partition values are found on HDFS. Use hive.msck.path.validation setting on the client to alter this behavior; "skip" will simply skip the directories. "ignore" will try to create partitions anyway (old behavior). This may or may not work.

翻译成中文的大概意思就是:Hive将每个表的分区信息保存在metastore中,如果通过hadoop fs -put命令直接将分区信息添加到HDFS,metastore是不会感知到这些新增的分区,除非执行了ALTER TABLE table_name ADD PARTITION命令。但是用户可以运行metastore检查命令MSCK REPAIR TABLE table_name;该命令将关于分区的元信息添加到Hive metastore中,这是对于那些没有元信息的分区来说的。换句话说,就是将任何存在于HDFS上但不在metastore上的分区添加到metastore。

实用案例化

代码语言:javascript
复制
下面介绍如果使用MSCK命令,创建了一个a2data.msck_table表,然后通过cp命令将几个目录的文件拷贝到a2data.msck_table目录下面,然后执行show partitions a2data.msck_table命令,将不会显示分区的信息:

查看分区

代码语言:javascript
复制
hive> show partitions a2data.msck_table;
OK
Time taken: 1.121 seconds
然后执行MSCK REPAIR TABLE a2data.msck_table;命令添加分区:

hive> MSCK REPAIR TABLE a2data.msck_table;
19/05/25 17:27:15 WARN log: Updating partition stats fast for: a2data.msck_table;
19/05/25 17:27:15 WARN log: Updated size to 10124
19/05/25 17:27:15 WARN log: Updating partition stats fast for: a2data.msck_table;
19/05/25 17:27:15 WARN log: Updated size to 21234
19/05/25 17:27:15 WARN log: Updating partition stats fast for: a2data.msck_table;
19/05/25 17:27:15 WARN log: Updated size to 346783
19/05/25 17:27:15 WARN log: Updating partition stats fast for: a2data.msck_table;
19/05/25 17:27:15 WARN log: Updated size to 2532162
19/05/25 17:27:15 WARN log: Updating partition stats fast for: a2data.msck_table;
19/05/25 17:27:15 WARN log: Updated size to 2901198
19/05/25 17:27:15 WARN log: Updating partition stats fast for: a2data.msck_table;
19/05/25 17:27:15 WARN log: Updated size to 3129087
19/05/25 17:27:15 WARN log: Updating partition stats fast for: a2data.msck_table;
19/05/25 17:27:15 WARN log: Updated size to 23190876
OK
Partitions not in metastore:    a2data.msck_table;:dt=2019-05-15    a2data.msck_table;:dt=2019-05-16    a2data.msck_table;:dt=2019-05-17    a2data.msck_table;:dt=2019-05-18    a2data.msck_table;:dt=2019-05-19     a2data.msck_table;:dt=2019-05-20    a2data.msck_table;:dt=2019-05-21
Repair: Added partition to metastore a2data.msck_table;:dt=2019-05-15
Repair: Added partition to metastore a2data.msck_table;:dt=2019-05-16
Repair: Added partition to metastore a2data.msck_table;:dt=2019-05-17
Repair: Added partition to metastore a2data.msck_table;:dt=2019-05-18
Repair: Added partition to metastore a2data.msck_table;:dt=2019-05-19
Repair: Added partition to metastore a2data.msck_table;:dt=2019-05-20
Repair: Added partition to metastore a2data.msck_table;:dt=2019-05-21
Time taken: 0.613 seconds, Fetched 8 row(s) 
通过上述的结果可以看到已经将a2data.msck_table表的分区信息添加到Hive metastore中,和add partition命令比起来既方便又高效。

特别注意

代码语言:javascript
复制
有点需要注意的是,分区的目录结构必遵循
/partition_name=partition_value/结构,
否则msck无法自动添加分区,只能使用add partition命令。
本文参与 腾讯云自媒体分享计划,分享自微信公众号。
原始发表:2019-05-26,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 DataScience 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
大数据
全栈大数据产品,面向海量数据场景,帮助您 “智理无数,心中有数”!
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档