Sqoop is a bridge between relational databases and Hadoop: (1) it imports data from relational databases into Hadoop and related systems (HBase and Hive); (2) it exports data from Hadoop back into relational databases. Sqoop uses MapReduce to speed up the transfer and moves data in batch jobs.
Sqoop1 and Sqoop2 are completely incompatible. Apache separates them by version number: 1.4.x is Sqoop1, 1.99.x is Sqoop2. Sqoop2 improves on Sqoop1 in several ways: it introduces a Sqoop server that centrally manages connectors; it offers multiple access methods (CLI, Web UI, REST API); and it adds a role-based security mechanism. The figure below shows the Sqoop architecture:
3. The three elements of Sqoop
The import and export flows are similar, each consisting of two main steps: 1. obtain the table's metadata; 2. submit a map-only job. Taking import as an example, the flow is illustrated below.
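You can observe the metadata step on its own: before moving any data, Sqoop queries the table definitions over JDBC, which is essentially what the built-in list-tables tool does. A quick check, reusing the connection settings from the examples below:
bin/sqoop list-tables \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456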
Installation: extract the tarball:
$ tar -zxvf zookeeper-3.4.5-cdh5.3.6.tar.gz -C ./
Edit sqoop-env.sh in Sqoop's conf/ directory:
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/cdh2.3.6/hadoop-2.5.0-cdh5.3.6
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/cdh2.3.6/hadoop-2.5.0-cdh5.3.6
#set the path to where bin/hbase is available
#HBASE_HOME=
#Set the path to where bin/hive is available
export HIVE_HOME=/opt/cdh2.3.6/hive-0.13.1-cdh5.3.6
#Set the path for where zookeeper config dir is
#export ZOOCFGDIR=/opt/cdh2.3.6/hive-0.13.1-cdh5.3.6
export ZOOKEEPER_HOME=/opt/cdh2.3.6/zookeeper-3.4.5-cdh5.3.6
In bin/configure-sqoop, comment out the environment checks for components that are not installed (HBase, HCatalog, Accumulo), so they are only checked at runtime:
## Moved to be a runtime check in sqoop.
#if [ ! -d "${HBASE_HOME}" ]; then
# echo "Warning: $HBASE_HOME does not exist! HBase imports will fail."
# echo 'Please set $HBASE_HOME to the root of your HBase installation.'
#fi
## Moved to be a runtime check in sqoop.
#if [ ! -d "${HCAT_HOME}" ]; then
# echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."
# echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'
#fi
#if [ ! -d "${ACCUMULO_HOME}" ]; then
# echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
# echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
To verify the installation, run the following from the Sqoop installation directory:
$ bin/sqoop help
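If the help text prints, the basic configuration is working. As an extra sanity check, sqoop version prints the build details:
$ bin/sqoop version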
Install the database driver by copying the MySQL JDBC driver into Sqoop's lib/ directory:
cp /opt/modules/hive-0.13.1/lib/mysql-connector-java-5.1.27-bin.jar lib/
In MySQL, create a table in the hadoop database and load some sample data:
CREATE TABLE `my_user` (
`id` tinyint(4) NOT NULL AUTO_INCREMENT,
`account` varchar(255) DEFAULT NULL,
`passwd` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `my_user` VALUES ('1', 'admin', 'admin');
INSERT INTO `my_user` VALUES ('2', 'pu', '12345');
INSERT INTO `my_user` VALUES ('3', 'system', 'system');
INSERT INTO `my_user` VALUES ('4', 'zxh', 'zxh');
INSERT INTO `my_user` VALUES ('5', 'test', 'test');
INSERT INTO `my_user` VALUES ('6', 'pudong', 'pudong');
INSERT INTO `my_user` VALUES ('7', 'qiqi', 'qiqi');
Import the table into HDFS:
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--target-dir /user/hadoop/sqoop/import/imp_my_user \
--delete-target-dir \
--direct \
--fields-terminated-by "\t"
Explanation of the options:
--delete-target-dir: delete the target directory if it already exists
--target-dir: output directory; if omitted, output goes to the user's HDFS home directory
--num-mappers 1: number of map tasks
--direct: use the MySQL-specific fast path (the log shows manager.DirectMySQLManager: Beginning mysqldump fast path import.); noticeably faster, but only usable when the MySQL client tools are installed on the machine
--fields-terminated-by "\t": field delimiter of the output files
--as-parquetfile: write the output as Parquet files
--columns <col,col,col...>: import only the listed columns
--query: import the result of a free-form SQL query
--where <where clause>: import only rows matching the condition
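Once the job finishes, you can inspect the imported data on HDFS. A quick check from the Hadoop installation directory, assuming the default part-file name produced by a single-mapper text import:
$ bin/hdfs dfs -cat /user/hadoop/sqoop/import/imp_my_user/part-m-00000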
Four warning lines from the log worth noting:
WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
WARN manager.MySQLManager: It looks like you are importing from mysql.
WARN manager.MySQLManager: This transfer can be faster! Use the --direct
WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
Incremental import (append mode; only rows with id greater than the last value are imported):
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--target-dir /user/hadoop/sqoop/import/imp_my_user \
--fields-terminated-by "\t" \
--check-column id \
--incremental append \
--last-value 4
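Tracking --last-value by hand is error-prone. Sqoop can persist it as a saved job, updating the stored value after each successful run. A sketch (the job name imp_my_user_inc is arbitrary; depending on your metastore settings you may be prompted for the password at execution time):
bin/sqoop job --create imp_my_user_inc -- import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--target-dir /user/hadoop/sqoop/import/imp_my_user \
--fields-terminated-by "\t" \
--check-column id \
--incremental append \
--last-value 4
bin/sqoop job --exec imp_my_user_inc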
Import as Parquet files:
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--target-dir /user/hadoop/sqoop/import/imp_my_user_parquet \
--fields-terminated-by "\t" \
--as-parquetfile
Install the native Snappy libraries for Hadoop and verify that they are picked up:
$ tar -zxvf cdh5.3.6-snappy-lib-natirve.tar.gz
$ cp -r ./lib hadoop-2.5.0-cdh5.3.6
$ bin/hadoop checknative
Import data to HDFS with Snappy compression:
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--target-dir /user/hadoop/sqoop/import/imp_my_user_snappy \
--fields-terminated-by "\t" \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
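To confirm the compression took effect, list the output directory; with a text import each part file carries a .snappy suffix:
$ bin/hdfs dfs -ls /user/hadoop/sqoop/import/imp_my_user_snappy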
Import only selected columns:
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--columns id \
--target-dir /user/hadoop/sqoop/import/imp_my_user_id \
--fields-terminated-by "\t"
Import with a free-form query (the \$CONDITIONS placeholder is mandatory; Sqoop substitutes its split conditions there):
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--num-mappers 1 \
--query "select id,account from my_user where \$CONDITIONS" \
--target-dir /user/hadoop/sqoop/import/imp_my_user_quary \
--delete-target-dir \
--fields-terminated-by "\t"
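Note that with --num-mappers greater than 1, a free-form query must also specify --split-by so Sqoop knows which column to partition on; otherwise the job refuses to run. A sketch with two mappers (the target directory name here is arbitrary):
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--num-mappers 2 \
--query "select id,account from my_user where \$CONDITIONS" \
--split-by id \
--target-dir /user/hadoop/sqoop/import/imp_my_user_split \
--delete-target-dir \
--fields-terminated-by "\t"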
Import with a WHERE filter:
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--num-mappers 1 \
--table my_user \
--where "id>3" \
--target-dir /user/hadoop/sqoop/import/imp_my_user_quary \
--delete-target-dir \
--fields-terminated-by "\t"
Incremental import combined with a free-form query:
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--query "select id,account from my_user where \$CONDITIONS" \
--where "id>3" \
--num-mappers 1 \
--target-dir /user/hadoop/sqoop/import/imp_my_user \
--fields-terminated-by "\t" \
--check-column id \
--incremental append \
--last-value 4
Summary:
1) With --query you write the SQL yourself.
2) Imports can be full or incremental:
full: all rows;
incremental: either via the --incremental/--check-column/--last-value options, or via a WHERE condition in the query.
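Besides append mode, Sqoop also supports --incremental lastmodified for tables that carry an update timestamp. A sketch only: it assumes a hypothetical updated_at timestamp column, which my_user as defined above does not have:
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--target-dir /user/hadoop/sqoop/import/imp_my_user \
--fields-terminated-by "\t" \
--check-column updated_at \
--incremental lastmodified \
--merge-key id \
--last-value "2015-01-01 00:00:00"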
Export from HDFS back to MySQL. First create the target table:
CREATE TABLE `my_user2` (
`id` tinyint(4) NOT NULL AUTO_INCREMENT,
`account` varchar(255) DEFAULT NULL,
`passwd` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
);
bin/sqoop export \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user2 \
--num-mappers 1 \
--input-fields-terminated-by "\t" \
--export-dir /user/hadoop/sqoop/import/imp_my_user
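By default an export only inserts, so re-running it fails on duplicate primary keys. The update options turn the export into an upsert (a sketch; with MySQL, allowinsert updates existing rows and inserts new ones):
bin/sqoop export \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user2 \
--num-mappers 1 \
--input-fields-terminated-by "\t" \
--export-dir /user/hadoop/sqoop/import/imp_my_user \
--update-key id \
--update-mode allowinsert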
Hive import options:
--hive-home: override $HIVE_HOME
--hive-import: import the data into Hive, using Hive's default delimiters
--hive-overwrite: overwrite existing data in the Hive table
--create-hive-table: create the table; the job fails if the table already exists
--hive-table [table]: name of the target Hive table
--hive-drop-import-delims: drop \n, \r, and \01 from string fields when importing into Hive
--hive-delims-replacement: replace \n, \r, and \01 in string fields with a user-defined character
--hive-partition-key: Hive partition key
--hive-partition-value: Hive partition value
--map-column-hive: type mapping from SQL types to Hive types
Handling NULLs for Hive: by default Sqoop writes SQL NULL as the literal string null, while Hive represents NULL as \N, and fixing this up after the import does not work. Use --null-string and --null-non-string at import time so that NULL values are written as \N:
sqoop import ... --null-string '\\N' --null-non-string '\\N'
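A complete version of that command for this article's my_user table, combined with the Hive import shown below (a sketch reusing the same connection settings):
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--fields-terminated-by "\t" \
--delete-target-dir \
--hive-import \
--hive-database db_1206 \
--hive-table h_user \
--null-string '\\N' \
--null-non-string '\\N'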
Create a table in Hive:
drop table if exists db_1206.h_user ;
create table db_1206.h_user(
id int,
account string,
password string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ;
Import data into Hive:
bin/sqoop import \
--connect jdbc:mysql://luffy.onepiece1.host:3306/hadoop \
--username root \
--password 123456 \
--table my_user \
--num-mappers 1 \
--fields-terminated-by "\t" \
--delete-target-dir \
--hive-database db_1206 \
--hive-import \
--hive-table h_user
Process: MySQL table -> staging directory on HDFS (/user/beifeng/my_user/xx, which you can inspect while it exists) -> the Hive table's HDFS directory.
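You can check the result from the Hive CLI (run from the Hive installation directory):
$ bin/hive -e "select * from db_1206.h_user;"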
Export a Hive table to MySQL. Create the target table in MySQL:
CREATE TABLE `user_export` (
`id` tinyint(4) NOT NULL AUTO_INCREMENT,
`account` varchar(255) DEFAULT NULL,
`passwd` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
);
bin/sqoop export \
--connect jdbc:mysql://luffy.onepiece1.host:3306/db_1206 \
--username root \
--password 123456 \
--table user_export \
--num-mappers 1 \
--input-fields-terminated-by "\t" \
--export-dir /user/hive/warehouse/db_1206.db/h_user
A typical workflow, driven by a shell script:
## step 1
load data .....
## step 2
bin/hive -f xxx.sql
## step 3
bin/sqoop --options-file yy.txt
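A slightly more concrete version of such a driver script (a sketch; xxx.sql, yy.txt, and the paths are placeholders):
#!/bin/bash
# step 1: land the raw data on HDFS
bin/hdfs dfs -put /path/to/local/data /user/hadoop/raw/
# step 2: transform it with Hive
bin/hive -f xxx.sql
# step 3: move the result with Sqoop, arguments wrapped in an options file
bin/sqoop --options-file yy.txt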
Wrap the Sqoop arguments in import.txt, one option or value per line:
import
--connect
jdbc:mysql://luffy.onepiece1.host:3306/hadoop
--username
root
--password
123456
--table
my_user
--num-mappers
1
--target-dir
/user/hadoop/sqoop/import/imp_my_user
--delete-target-dir
Run it:
bin/sqoop --options-file /opt/cdh2.3.6/sqoop-1.4.5-cdh5.3.6/import.txt
As an aside, Sqoop can be replaced entirely by Spark.