I had originally planned to write a Spark job to move the data, but time was short. To get the data from HDFS cluster A over to cluster B quickly, I fell back on the hadoop distcp
command. The specific command is as follows.
hadoop distcp hdfs://clusterA/xxx hdfs://clusterB/xxx
To my surprise, it failed with an error.
The error message is easy to analyze: the Check-sum file is simply missing. Take a look at the help output.
# bin/hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -async                 Should distcp execution be blocking
 -atomic                Commit all changes or none
 -bandwidth <arg>       Specify bandwidth per map in MB
 -delete                Delete from target, files missing in source
 -f <arg>               List of files that need to be copied
 -filelimit <arg>       (Deprecated!) Limit number of files copied to <= n
 -i                     Ignore failures during copy
 -log <arg>             Folder on DFS where distcp execution logs are
                        saved
 -m <arg>               Max number of concurrent maps to use for copy
 -mapredSslConf <arg>   Configuration for ssl config file, to use with
                        hftps://
 -overwrite             Choose to overwrite target files unconditionally,
                        even if they exist.
 -p <arg>               preserve status (rbugp)(replication, block-size,
                        user, group, permission)
 -sizelimit <arg>       (Deprecated!) Limit number of files copied to <= n
                        bytes
 -skipcrccheck          Whether to skip CRC checks between source and
                        target paths.
 -strategy <arg>        Copy strategy to use. Default is dividing work
                        based on file sizes
 -tmp <arg>             Intermediate work path to be used for atomic
                        commit
 -update                Update target, copying only missing files or
                        directories
Note that the -skipcrccheck
and -update
options must be used together. With both set, distcp no longer verifies the Check-sum after copying the data files.
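Putting the two flags together, the earlier command becomes the following (the xxx paths are the same placeholders as above):

hadoop distcp -update -skipcrccheck hdfs://clusterA/xxx hdfs://clusterB/xxx

With -update, distcp decides whether to copy each file by comparing it against the target, and -skipcrccheck removes the CRC comparison from that check, so a missing or mismatched checksum (a common symptom when the two clusters differ in configuration such as block size) no longer aborts the copy.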