Syncing Hive to CDP with DistCp and HMS-Mirror

大数据杂货铺

Published 2022-03-29 19:42:39

Part of the column: 大数据杂货铺

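Purpose

When migrating Hive to CDP, if the source platform is CDH and you hold a Cloudera license, Replication Manager in CDP can move Hive over with little effort. If the source is not CDH, or you have no Cloudera license, use the method described in this article.

The method applies to migrations from Hive 1/2 to Hive 3 and supports moving Hive to CDP from CDH, HDP, AWS EMR, HDInsight, Tencent EMR, Alibaba EMR, and similar platforms.

The example uses CDH5: Hive data in a non-secure CDH5 cluster is migrated to Hive in a secure CDP cluster.

Overview

This article walks through migrating Hive data from a non-secure CDH5 cluster to Hive in a secure CDP cluster. Working through the steps should give you a clearer picture of how a Hive migration is carried out.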

Test environment

	Source cluster	Target cluster
CDH/CDP version	5.16.2	7.1.7
Kerberos enabled	No	Yes
Hive version	1.1.0+cdh5.16.2+1450	3.1.3000.7.1.7.0-551
Cloudera Manager version	5.16.2.4505	7.4.4

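Test content

Migrate Hive data and metadata from a CDH cluster without Kerberos to a secured CDP platform.

Migration steps

Use DistCp to copy the Hive data to the corresponding directory on the CDP platform.
Use HMS Mirror to migrate the Hive metadata to the CDP platform.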

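Confirming the lab environment

Source cluster

Kerberos is not enabled on the source cluster.

The test_db database contains four tables, one per file format: Avro, ORC, Parquet, and text.

All four are managed (internal) tables.

Target cluster

The target is a CDP Base cluster with Kerberos enabled.

The CDP Base cluster runs Hive 3.1.3.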

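Migrating data to CDP with DistCp

The source data can be copied directly with distcp. If the data is still being updated, take a snapshot of the data to be migrated first: this avoids inconsistencies caused by updates during the copy and makes the later incremental comparison easier.

Creating a snapshot

HDFS snapshots can be created in several ways, as covered in earlier articles; here we use the hdfs command line.

Allowing snapshots on a directory

This requires superuser privileges. Allow Snapshots marks a directory as eligible for snapshots; once the operation succeeds, the directory becomes a snapshottable directory.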

[root@ccycloud hive-testbench]# hdfs dfsadmin -allowSnapshot /user/hive/warehouse/test_db.db
Allowing snaphot on /user/hive/warehouse/test_db.db succeeded
[root@ccycloud hive-testbench]#

Creating a named snapshot

[root@ccycloud hive-testbench]# hdfs dfs -createSnapshot /user/hive/warehouse/test_db.db test_db_snapshot
Created snapshot /user/hive/warehouse/test_db.db/.snapshot/test_db_snapshot
[root@ccycloud hive-testbench]#

The snapshot appears under the .snapshot directory of the snapshotted directory:

[root@ccycloud hive-testbench]# hdfs dfs -ls /user/hive/warehouse/test_db.db/.snapshot
Found 1 items
drwxrwxrwt   - admin hive          0 2022-01-18 03:28 /user/hive/warehouse/test_db.db/.snapshot/test_db_snapshot
[root@ccycloud hive-testbench]#
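
The three snapshot steps above can be collected into one small script. This is only a sketch, not part of the original walkthrough: the run helper and the DRY_RUN switch are hypothetical conveniences, and the dated snapshot name is just one possible naming scheme. By default the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: enable snapshots on a warehouse directory and take a named snapshot.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute.

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build a snapshot name such as test_db_20220118 from a prefix and a date stamp.
snapshot_name() {
  printf '%s_%s\n' "$1" "${2:-$(date +%Y%m%d)}"
}

# Allow snapshots (needs HDFS superuser), create one, then list .snapshot.
take_snapshot() {
  dir=$1
  name=$2
  run hdfs dfsadmin -allowSnapshot "$dir"
  run hdfs dfs -createSnapshot "$dir" "$name"
  run hdfs dfs -ls "$dir/.snapshot"
}

take_snapshot /user/hive/warehouse/test_db.db "$(snapshot_name test_db)"
```

Run it once with the default dry-run setting to review the commands, then again with DRY_RUN=0 as an HDFS superuser on the source cluster.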

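Full data migration to CDP with DistCp

We run the migration as the hive user. In this exercise the source cluster has no Kerberos and the target cluster does; the -D ipc.client.fallback-to-simple-auth-allowed=true option lets the Kerberos-enabled client talk to the non-secure source.

hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true hdfs://172.27.38.73:8020/user/hive/warehouse/test_db.db hdfs://172.27.116.128:8020/user/hive/warehouse/test_db.db

For details on the distcp command, see the official documentation.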
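
If a snapshot was taken as recommended earlier, distcp can read from the read-only snapshot path instead of the live directory. The following is a sketch under assumptions: it reuses the snapshot name test_db_snapshot and the NameNode addresses from this article, adds the -update flag, and uses a hypothetical run helper that only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: full copy from the read-only snapshot path instead of the live directory.
# DRY_RUN=1 (default) prints the command; DRY_RUN=0 runs it on the CDP side.

SRC_NN="hdfs://172.27.38.73:8020"      # source NameNode (from this article)
DST_NN="hdfs://172.27.116.128:8020"    # target NameNode (from this article)
DB_DIR="/user/hive/warehouse/test_db.db"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Path of a named snapshot under a snapshottable directory.
snapshot_path() {
  printf '%s/.snapshot/%s\n' "$1" "$2"
}

# -update syncs the snapshot's contents into the target directory even if that
# directory already exists (without it, distcp would nest the source under it).
full_copy_from_snapshot() {
  run hadoop distcp \
    -D ipc.client.fallback-to-simple-auth-allowed=true \
    -update \
    "${SRC_NN}$(snapshot_path "$DB_DIR" "$1")" \
    "${DST_NN}${DB_DIR}"
}

full_copy_from_snapshot test_db_snapshot
```

Copying from the snapshot means files that change on the source during the copy do not affect the result.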

Checking the synced data

[root@ccycloud 79-hive_on_tez-HIVESERVER2]# hdfs dfs -ls /user/hive/warehouse/test_db.db
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Found 4 items
drwxr-xr-x   - hive hive          0 2022-01-23 02:00 /user/hive/warehouse/test_db.db/supplier_avro
drwxr-xr-x   - hive hive          0 2022-01-23 02:00 /user/hive/warehouse/test_db.db/supplier_orc
drwxr-xr-x   - hive hive          0 2022-01-23 02:00 /user/hive/warehouse/test_db.db/supplier_parquet
drwxr-xr-x   - hive hive          0 2022-01-23 02:00 /user/hive/warehouse/test_db.db/supplier_text
[root@ccycloud 79-hive_on_tez-HIVESERVER2]# hdfs dfs -du -h /user/hive/warehouse/test_db.db
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
2.7 M     2.7 M     /user/hive/warehouse/test_db.db/supplier_avro
1003.2 K  1003.2 K  /user/hive/warehouse/test_db.db/supplier_orc
2.9 M     2.9 M     /user/hive/warehouse/test_db.db/supplier_parquet
2.7 M     2.7 M     /user/hive/warehouse/test_db.db/supplier_text
[root@ccycloud 79-hive_on_tez-HIVESERVER2]#
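
Beyond eyeballing the listings, the file count and total size on both sides can be compared with hdfs dfs -count. The sketch below assumes the source is read through the test_db_snapshot snapshot path and that the commands run on the CDP side; the function names are hypothetical. It compares only the file count and byte size, not the directory count.

```shell
#!/bin/sh
# Sketch: compare file count and byte size reported by `hdfs dfs -count`.
# A -count line has the form: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME

# Keep only FILE_COUNT and CONTENT_SIZE from a -count line.
count_fields() {
  printf '%s\n' "$1" | awk '{ print $2, $3 }'
}

# Succeeds (exit 0) when both lines report the same file count and size.
same_content() {
  [ "$(count_fields "$1")" = "$(count_fields "$2")" ]
}

# Example with the article's addresses; only runs when hdfs is on the PATH.
if command -v hdfs >/dev/null 2>&1; then
  src=$(hdfs dfs -D ipc.client.fallback-to-simple-auth-allowed=true -count \
        hdfs://172.27.38.73:8020/user/hive/warehouse/test_db.db/.snapshot/test_db_snapshot)
  dst=$(hdfs dfs -count /user/hive/warehouse/test_db.db)
  if same_content "$src" "$dst"; then echo "match"; else echo "MISMATCH"; fi
fi
```

Matching counts do not prove the contents are identical, but a mismatch is a quick signal that the copy needs another look.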

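Migrating metadata to CDP with HMS-Mirror

We use HMS Mirror to migrate the metadata.

HMS Mirror is a command-line utility that manages metadata and data replication between two Hive platforms; at its core it is a replication tool for Hive Metastore metadata.

You can link clusters and copy metadata when testing against data in a lower-environment cluster, or migrate the data with distcp and copy the metadata to a new cluster or to CDP Cloud.

It supports schema synchronization and "read-only" DR scenarios.

HMS Mirror project page: https://github.com/dstreev/hms-mirror

HMS Mirror offers several strategies; for a migration-and-upgrade scenario we use Schema Only.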

Installing HMS Mirror

Pick one server in the CDP cluster (an edge node) for the installation.

Download: https://github.com/dstreev/hms-mirror/releases

Download the prebuilt package hms-mirror-dist.tar.gz and extract it. We install as root; the setup script installs the program under /usr/local/hms-mirror and creates symlinks for convenience.

wget https://github.com/dstreev/hms-mirror/releases/download/1.4.0.2-SNAPSHOT/hms-mirror-dist.tar.gz
tar zxvf hms-mirror-dist.tar.gz
hms-mirror-install/setup.sh

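Configuration

hms-mirror needs a configuration file that describes the connections to the LEFT (source) and RIGHT (target) clusters. There are two ways to create it:

hms-mirror --setup: asks a series of questions about the LEFT and RIGHT clusters and builds a default configuration file.
Start from the default configuration template: edit it and place a copy at $HOME/.hms-mirror/cfg/default.yaml.

The source cluster has no Kerberos and the target cluster does. Here we created a separate configuration file outside the default location; a sample follows: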

[root@ccycloud hms-test]# cat default-template.yaml
# Copyright 2021 Cloudera, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


transfer:
  # Optional (default: 4)
  concurrency: 10
  # Optional (default: 'transfer_')
  transferPrefix: "transfer_"
  # This directory is appended to the 'clusters:...:hcfsNamespace' value to store the transfer package for hive export/import.
  # Optional (default: '/apps/hive/warehouse/export_')
  exportBaseDirPrefix: "/apps/hive/warehouse/export_"
clusters:
  LEFT:
    # Set for Hive 1/2 environments
    legacyHive: true
    # Is the 'Hadoop COMPATIBLE File System' used to prefix data locations for this cluster.
    # It is mainly used as the transfer location for metadata (export)
    # If the primary storage for this cluster is 'hdfs' than use 'hdfs://...'
    # If the primary storage for this action is cloud storage, use the
    #    cloud storage prefix. IE: s3a://my_bucket
    hcfsNamespace: "hdfs://172.27.38.73:8020"
    hiveServer2:
      # URI is the Hive JDBC URL in the form of:
      # jdbc:hive2://<server>:<port>
      # See docs for restrictions
      uri: "jdbc:hive2://172.27.38.73:10000"
      connectionProperties:
        user: "*****"
        password: "*****"
      # Standalone jar file used to connect via JDBC to the LEFT environment Hive Server 2
      # NOTE: Hive 3 jars will NOT work against Hive 1.  The protocol isn't compatible.
      jarFile: "/root/hms-test/aux_libs/hive-jdbc-1.1.0-cdh5.16.2-standalone.jar"
  RIGHT:
    legacyHive: false
    # Is the 'Hadoop COMPATIBLE File System' used to prefix data locations for this cluster.
    # It is mainly used to as a baseline for where "DATA" will be transfered in the
    # STORAGE stage.  The data location in the source location will be move to this
    # base location + the extended path where it existed in the source system.
    # The intent is to keep the data in the same relative location for this new cluster
    # as the old cluster.
    # If the LEFT and RIGHT clusters are share the same cloud storage, then use the same
    # hcfs base location as the LEFT cluster.
    hcfsNamespace: "hdfs://172.27.116.128:8020"
    hiveServer2:
      # URI is the Hive JDBC URL in the form of:
      # jdbc:hive2://<server>:<port>
      # See docs for restrictions
      uri: "jdbc:hive2://172.27.116.128:10000/default;principal=hive/ccycloud.xfwangcdp.root.hwx.site@CLOUDERA.COM"
      connectionProperties:
        user: "*****"
        password: "*****"
      # Standalone jar file used to connect via JDBC to the LEFT environment Hive Server 2
      # NOTE: Hive 3 jars will NOT work against Hive 1.  The protocol isn't compatible.
      #jarFile: "<environment-specific-jdbc-standalone-driver>"
    partitionDiscovery:
      # Addition HMS configuration needed for this "discover.partitions"="true"
      auto: true
      # When a table is created, run MSCK when there are partitions.
      initMSCK: true

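Notes on the configuration file: the transfer section is not used here. The clusters section is split into LEFT and RIGHT.

LEFT cluster: the source, a CDH5 cluster with no Kerberos and no high availability. The jarFile is the standalone JDBC jar taken from the CDH5 cluster. user and password are not relevant here (the cluster is not secured; on an LDAP-enabled cluster they would likely be the LDAP user and password).

RIGHT cluster: the target, a CDP7 cluster with Kerberos and no high availability. The JDBC jar is taken from the CDP7 cluster. user and password are not relevant (authentication already goes through a Kerberos keytab). The documentation says that for a Kerberos cluster you should not set the jar path but place the jar in the aux_libs directory instead; we did not work out the reason for this.

The HiveServer2 URI uses the Kerberos form of the HS2 JDBC URL. It can also be taken from the beeline command line (which gives the high-availability form).

In addition, the standalone JDBC jar for each of the two Hive versions is required. Put the jar for the non-Kerberos source version in a directory other than $HOME/.hms-mirror/aux_libs, and put the jar for the target version under $HOME/.hms-mirror/aux_libs.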

[root@ccycloud ~]# tree ~/.hms-mirror/
/root/.hms-mirror/
|-- aux_libs
|   `-- hive-jdbc-3.1.3000.7.1.7.0-551-standalone.jar
|-- cfg
|   `-- default.yaml
`-- retry


3 directories, 2 files
[root@ccycloud ~]#

[root@ccycloud ~]# tree hms-test/
hms-test/
|-- aux_libs
|   `-- hive-jdbc-1.1.0-cdh5.16.2-standalone.jar
`-- default-template.yaml


1 directory, 2 files
[root@ccycloud ~]#
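
The directory layout shown in the two trees can be prepared with a few commands. This is a minimal sketch: place_jars is a hypothetical helper, and the jar source paths you pass in are assumptions that depend on where the standalone JDBC jars live on your clusters.

```shell
#!/bin/sh
# Sketch: lay out the two JDBC standalone jars the way the trees above show.
# The jar source paths passed in are assumptions; locate them on your clusters.

# $1 = Hive 3 (target) standalone jar, placed in $HOME/.hms-mirror/aux_libs
# $2 = Hive 1 (source) standalone jar, referenced explicitly via jarFile
# $3 = working directory that holds the config file (hms-test in this article)
place_jars() {
  mkdir -p "$HOME/.hms-mirror/aux_libs" "$HOME/.hms-mirror/cfg" "$3/aux_libs" || return 1
  cp "$1" "$HOME/.hms-mirror/aux_libs/" || return 1
  cp "$2" "$3/aux_libs/"
}

# Example call (paths are placeholders):
# place_jars /path/to/hive-jdbc-3.1.3000.7.1.7.0-551-standalone.jar \
#            /path/to/hive-jdbc-1.1.0-cdh5.16.2-standalone.jar /root/hms-test
```

Keeping the Hive 1 jar out of aux_libs matches the placement rule described above: only the target-version jar lives under $HOME/.hms-mirror/aux_libs.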

Usage

hms-mirror -cfg /root/hms-test/default-template.yaml  -db test_db -o temp

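Here test_db is the database name and temp is the output directory.

Run the command from the working directory, and make sure you hold a valid Kerberos ticket first.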
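
One way to check for a ticket before starting is sketched below. The keytab path and principal are placeholders rather than values from this environment, and ensure_ticket is a hypothetical helper; the check relies on klist -s, which is silent and reports the ticket state through its exit code.

```shell
#!/bin/sh
# Sketch: make sure a Kerberos ticket exists before calling hms-mirror.
# The keytab path and principal are placeholders, not values from the article.

KEYTAB="${KEYTAB:-/path/to/hive.keytab}"       # hypothetical keytab location
PRINCIPAL="${PRINCIPAL:-hive@CLOUDERA.COM}"    # hypothetical principal

# `klist -s` prints nothing and exits 0 only when a valid ticket cache exists.
ensure_ticket() {
  if klist -s; then
    echo "ticket present"
  else
    kinit -kt "$KEYTAB" "$PRINCIPAL" && echo "ticket obtained"
  fi
}

# Example:
# ensure_ticket && hms-mirror -cfg /root/hms-test/default-template.yaml -db test_db -o temp
```

Chaining with && keeps hms-mirror from starting when no ticket could be obtained.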

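The execution log shows that 1 database and 4 tables were involved, and all tables completed.

When the run finishes, a set of files is generated in the temp directory.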

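Creating the tables on the target cluster

The SQL to execute is in <database>_RIGHT_execute.sql (here test_db_RIGHT_execute.sql). It contains one error: the HDFS address in each LOCATION clause still points at the source cluster, for reasons we did not determine. We bulk-replace it with the target cluster's address and then run the file with beeline -f (if the target database does not exist yet, create it first).

Open test_db_RIGHT_execute.sql in vi and run a substitution similar to the following: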

:1,$s/ccycloud.cdh5.root.hwx.site/ccycloud.xfwangcdp.root.hwx.site/g

After the substitution, run the file with beeline:

beeline -f test_db_RIGHT_execute.sql
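
For repeated runs, the same substitution can be scripted with sed instead of vi. This is a sketch: rewrite_host is a hypothetical helper, and the hostnames in the example are the ones from the vi command above.

```shell
#!/bin/sh
# Sketch: replace the source NameNode host in the generated SQL without opening vi.

# rewrite_host FILE OLD_HOST NEW_HOST
# Dots in OLD_HOST are escaped so they match literally; a FILE.bak copy is kept.
rewrite_host() {
  old_re=$(printf '%s' "$2" | sed 's/\./\\./g')
  sed -i.bak "s|${old_re}|$3|g" "$1"
}

# Example, followed by the beeline steps described in the text:
# rewrite_host test_db_RIGHT_execute.sql \
#   ccycloud.cdh5.root.hwx.site ccycloud.xfwangcdp.root.hwx.site
# beeline -e "CREATE DATABASE IF NOT EXISTS test_db"   # only if the database is missing
# beeline -f test_db_RIGHT_execute.sql
```

The .bak copy makes it easy to diff the file before and after the rewrite and confirm that only the LOCATION addresses changed.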

Because the script runs MSCK automatically, run it only after the DistCp copy has finished; otherwise, run MSCK manually afterwards.
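
If the tables were created before the data arrived, the manual MSCK pass can be generated rather than typed. The sketch below uses a hypothetical msck_statements helper and the four table names from the directory listing earlier; MSCK only has an effect on partitioned tables.

```shell
#!/bin/sh
# Sketch: generate MSCK REPAIR statements for the migrated tables.

# msck_statements DB TABLE...  -> one statement per table on stdout
msck_statements() {
  db=$1
  shift
  for tbl in "$@"; do
    printf 'MSCK REPAIR TABLE %s.%s;\n' "$db" "$tbl"
  done
}

msck_statements test_db supplier_avro supplier_orc supplier_parquet supplier_text
```

Redirect the output to a file such as msck.sql (a hypothetical name) and run it with beeline -f once the data copy is complete.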

Verification

The test_db database now contains the synced tables.

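Syncing incremental data to CDP with DistCp

Modifying table data on the source cluster

Insert two rows through Hive.

Creating a new snapshot

Create a new snapshot of the test_db.db directory through the HDFS file browser (any other way of creating a snapshot works too).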
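
As a command-line alternative to the file browser, the second snapshot can be created with hdfs dfs -createSnapshot, and hdfs snapshotDiff then lists what changed between the two snapshots. In this sketch the name test_db_snapshot2 and the run helper are hypothetical, and nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: take a second snapshot from the command line and list what changed.
# The name test_db_snapshot2 is hypothetical; DRY_RUN=1 (default) only prints.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# snapshot_and_diff DIR OLD_SNAPSHOT NEW_SNAPSHOT
snapshot_and_diff() {
  run hdfs dfs -createSnapshot "$1" "$3"
  # Reports created (+), deleted (-), modified (M) and renamed (R) paths.
  run hdfs snapshotDiff "$1" "$2" "$3"
}

snapshot_and_diff /user/hive/warehouse/test_db.db test_db_snapshot test_db_snapshot2
```

The diff report is a quick preview of what the incremental copy is expected to transfer.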

Syncing the incremental HDFS data

Run the distcp command on the target cluster to sync the incremental data.
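
The exact command is not reproduced here, so the following is only one possible form, using distcp's -update -diff options. It rests on assumptions the article does not confirm: the source holds two snapshots (test_db_snapshot and a hypothetical test_db_snapshot2), the target directory has a snapshot with the same name as the older one, and the target has not changed since that snapshot. The run helper is hypothetical and prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: incremental sync driven by the snapshot diff (distcp -update -diff).
# Assumes the target directory also has a snapshot named like the OLD source
# snapshot and has not changed since; the article does not show this step.

SRC="hdfs://172.27.38.73:8020/user/hive/warehouse/test_db.db"
DST="hdfs://172.27.116.128:8020/user/hive/warehouse/test_db.db"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# incremental_sync OLD_SNAPSHOT NEW_SNAPSHOT
incremental_sync() {
  run hadoop distcp \
    -D ipc.client.fallback-to-simple-auth-allowed=true \
    -update -diff "$1" "$2" \
    "$SRC" "$DST"
}

incremental_sync test_db_snapshot test_db_snapshot2
```

If those preconditions do not hold, a plain -update run with the new snapshot path (.snapshot/test_db_snapshot2) as the source is a simpler fallback; it rescans the tree and copies only files that differ.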

Checking the data

hdfs dfs -ls /user/hive/warehouse/test_db.db
hdfs dfs -du -h /user/hive/warehouse/test_db.db

Checking the Hive data

Check the data in the changed tables through Hue or beeline.

Troubleshooting

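If distcp fails during the sync because the user running it is not a superuser, grant the required permissions through Ranger.

Solution: in Ranger, grant the user that runs the sync full permissions on the /user/hdfs directory.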

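Summary

When Cloudera Replication Manager cannot be used to replicate Hive data and metadata, for example when the source is EMR or HDP, distcp and HMS-Mirror together can handle the full and incremental migration of Hive data and metadata to the CDP platform.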

Originally published: 2022-01-25
