
Sqoop Journey 3: Data Import

Author: 皮大大 · Published 2021-03-02

sqoop-import

Introduction

The basic syntax for imports is:

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

The general command format is:

$ sqoop import \
--connect \                 (connect string)
--username \                (user name)
--password | -P | --password-file \  (plain-text password, interactive prompt, or password file; the file's permissions are normally 400)
--query \                   (free-form query string)
--warehouse-dir \           (parent HDFS directory for the import)
--fields-terminated-by \    (field delimiter, comma by default)
-m [--num-mappers]          (number of map tasks, controls import parallelism)
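
Putting these options together, a minimal end-to-end command might look like the following sketch. The host name, database, table, and HDFS paths are placeholders, not values from this article.

# A hedged example combining the common options above; all names and paths are illustrative
$ sqoop import \
--connect jdbc:mysql://mysql01:3306/testdb \
--username root \
--password-file /user/hadoop/.password \
--table emp \
--warehouse-dir /user/hadoop/warehouse \
--fields-terminated-by '\t' \
-m 4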


Connecting to a Database Server

Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port.

  • Connect to the database server
  • Specify the connect string with the --connect argument
  • The connect string is URL-like and sometimes must also include the port
# 1. Basic form
# Connects to the employees database on the host database.example.com
$ sqoop import --connect jdbc:mysql://database.example.com/employees

# 2. With a user name and a plain-text password
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password 12345

# 3. With the password kept out of the command line
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password-file ${user.name}/.password

# 4. Specifying the port
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db
# ubuntu is the host name and db the database name; on a distributed cluster do not use
# localhost as the host, because each node would then connect to itself rather than the server
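
The password file referenced above can be prepared roughly as follows. This is an illustrative sketch and the paths and table name are placeholders; note that a trailing newline in the file would become part of the password, hence echo -n.

# Illustrative: create the password file and restrict it to permissions 400
$ echo -n "12345" > .password
$ hdfs dfs -put .password /user/hadoop/.password
$ hdfs dfs -chmod 400 /user/hadoop/.password

# Reference it on the import command line
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password-file /user/hadoop/.password \
--table emp     # emp is a placeholder table name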


Selecting the Data to Import

Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database. By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order". You can select a subset of columns and control their ordering by using the --columns argument. For example: --columns "name,employee_id,jobtitle". You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column_list> FROM <table_name>. Append a WHERE clause to this with the --where argument. For example: --where "id > 400".

  • By default, --table imports all rows and columns of the named table
  • The target can be a table, a view, or another table-like entity
  • Columns are imported in their natural order by default, but --columns lets you select a subset and control their ordering
  • Rows can be filtered by adding a WHERE clause with --where, or by using a full query
# With a WHERE condition
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password 12345 \
--where "name='xiaoming'" \
--table help_keyword \
--target-dir /sqoop/hadoop/data \
-m 1
# --target-dir: destination HDFS directory; -m 1: run a single map task

# Selecting specific columns (here only name, to match the equivalent SQL below)
$ sqoop import \
--connect jdbc:mysql://database:3306/mysql \
--username root \
--password root \
--columns "name" \
--where "name='xiaoming'" \
--table employees \
--target-dir /sqoop/hadoop/data \
-m 1
# Equivalent SQL: select name from employees where name="xiaoming"

# Supplying a custom query
$ sqoop import \
--connect jdbc:mysql://database:3306/mysql \
--username root \
--password root \
--target-dir /sqoop/hadoop/data \
--query 'select id,name from mysql.employees where $CONDITIONS and name="xiaoming"' \
--split-by id \
-m 4

# Importing only part of a table
# ubuntu is the host name, sqooptest the database name
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/sqooptest \
--username root \
--password 123456 \
--query "select * from sqoop where student_id <= 500 AND \$CONDITIONS" \
--target-dir /sqoop/hadoop/data \
-m 1
# --target-dir is required with --query; -m 1 avoids the need for --split-by

Two points to remember:

  1. Quoting: if the outer quotes are single quotes and the inner ones double quotes, the $ in $CONDITIONS does not need to be escaped; if the outer quotes are double quotes and the inner ones single quotes, it must be written as \$CONDITIONS.
  2. A custom query passed to --query must contain WHERE $CONDITIONS, with any additional filters joined by AND.

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS so that your shell does not treat it as a shell variable.
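
A quick side-by-side sketch of the two quoting styles; the table and column names are placeholders and the remaining arguments are elided with ...:

# Outer single quotes: the shell passes $CONDITIONS through literally, no escaping needed
$ sqoop import ... --query 'select id,name from emp where name="xiaoming" and $CONDITIONS'

# Outer double quotes: escape the $ so the shell does not expand it as a variable
$ sqoop import ... --query "select id,name from emp where name='xiaoming' and \$CONDITIONS"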


Free-form Query Imports

Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument. When importing a free-form query, you must specify a destination directory with --target-dir. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

  • Besides specifying a table, columns, and a WHERE condition, you can import the result of an arbitrary query with --query
  • A free-form query import must specify the destination directory with --target-dir
  • The query must contain the token $CONDITIONS, and a splitting column must be chosen with --split-by (unless only one mapper is used)
  • Pay attention to the single-quote/double-quote rules described above
$ sqoop import \
--query 'select a.*,b.* from a join b on (a.id = b.id) where $CONDITIONS' \
--split-by a.id \
--target-dir /usr/foo/joinresults
# --split-by: the splitting column; --target-dir: the destination directory


$ sqoop import \
--query 'select a.*,b.* from a join b on (a.id = b.id) where $CONDITIONS' \
-m 1 \
--target-dir /usr/foo/joinresults

Controlling Parallelism

Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. If a table does not have a primary key defined and the --split-by <col> is not provided, then the import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.

  • -m (or --num-mappers) takes an integer and sets the number of map tasks, i.e. the degree of parallelism of the import
  • During the import Sqoop balances the load using the splitting column (--split-by): it retrieves the column's min and max values from the database and splits that range evenly across the tasks
  • The default parallelism is 4; for some databases raising it to 8 or 16 improves performance (see the sketch after this list)
  • If the splitting column is not evenly distributed the tasks will be unbalanced; Sqoop currently cannot split on multiple columns
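
As a hedged sketch of raising the parallelism, the following assumes a numeric, evenly distributed primary key named id; the table name and paths are placeholders:

# Illustrative: split on the id column and run 8 map tasks instead of the default 4
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/sqooptest \
--username root \
--password 123456 \
--table sqoop \
--split-by id \
--num-mappers 8 \
--target-dir /sqoop/hadoop/parallel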
Controlling Distributed Cache
  • Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every time a Sqoop job starts.
  • Using the --skip-dist-cache option when the job is launched by Oozie skips the step in which Sqoop copies its dependencies to the job cache, saving massive I/O.
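
For example, when launching from Oozie the flag is simply appended to the import command (a sketch; the host, database, and table names are placeholders and the remaining arguments are elided):

# Illustrative: skip copying $SQOOP_HOME/lib into the job cache on every launch
$ sqoop import --skip-dist-cache \
--connect jdbc:mysql://ubuntu:3306/db \
--table emp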

Controlling the Import Process

By default, the import process will use JDBC which provides a reasonable cross-vendor import channel. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC. By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. --target-dir is incompatible with --warehouse-dir.

  • Sqoop uses JDBC as the import channel by default; for some databases a database-specific tool can speed this up, e.g. mysqldump for MySQL via --direct (a sketch follows the code below)
  • By default Sqoop imports a table named foo into a foo directory under the current user's HDFS home directory
  • --target-dir and --warehouse-dir cannot be used at the same time
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/mysql \
--table foo \
--warehouse-dir /shared
# the data ends up under /shared/foo/

$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/mysql \
--table foo \
--target-dir /dest
# the data ends up directly under /dest/
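
The --direct channel mentioned above is just another flag on the import; the following sketch assumes mysqldump is available on the cluster nodes, and the names and paths are placeholders:

# Illustrative: use MySQL's direct path (mysqldump) instead of plain JDBC
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/mysql \
--username root \
--password 123456 \
--table foo \
--direct \
--target-dir /dest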

Controlling transaction isolation

By default, Sqoop uses the read committed transaction isolation level in the mappers to import data. This may not be ideal in all ETL workflows, and it may be desirable to reduce the isolation guarantees. The --relaxed-isolation option can be used to instruct Sqoop to use the read uncommitted isolation level. The option --relaxed-isolation may not be supported on all databases.

  • By default Sqoop imports data using the read committed isolation level in the mappers, which is not ideal for every workflow
  • --relaxed-isolation tells Sqoop to use the read uncommitted isolation level instead; not every database supports this option (see the sketch below)
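
Where the source database supports it, the option is added like any other flag (a sketch; the host, database, and table names are placeholders and the remaining arguments are elided):

# Illustrative: import with the read uncommitted isolation level
$ sqoop import --relaxed-isolation \
--connect jdbc:mysql://ubuntu:3306/db \
--table emp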
File Formats

Two file formats: delimited text or SequenceFiles. Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. These delimiters may be commas, tabs, or other characters. SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes. By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.

  • There are two main file formats: delimited text and SequenceFiles; delimited text is the default
  • The delimiter may be a comma, a tab, or another character
  • SequenceFiles are a binary format that stores records in custom record-specific data types, manifested as generated Java classes
  • Data is not compressed by default; it can be compressed with -z/--compress or with a codec given via --compression-codec (a sketch follows this list)
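
A sketch of the format and compression options; the codec class shown is Hadoop's standard gzip codec, and the host, table, and path are placeholders:

# Illustrative: store the import as SequenceFiles and compress the output
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--table emp \
--as-sequencefile \
--compress \
--compression-codec org.apache.hadoop.io.compress.GzipCodec \
--target-dir /sqoop/hadoop/emp_seq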

Importing Data Into Hive

Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line. If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced. Sqoop will by default import NULL values as the string null. You should append the parameters --null-string and --null-non-string in case of an import job, or --input-null-string and --input-null-non-string in case of an export job, if you wish to properly preserve NULL values. You can import compressed tables into Hive using the --compress and --compression-codec options.

$ sqoop import  ... --null-string '\\N' --null-non-string '\\N'
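
A hedged sketch combining the Hive options mentioned above; the host, database, table, and path names are placeholders:

# Illustrative: import into Hive, overwriting the existing table and preserving NULLs
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password 123456 \
--table emp \
--hive-import \
--hive-overwrite \
--hive-table default.emp \
--null-string '\\N' \
--null-non-string '\\N'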
Importing Data Into HBase

By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. You can manually specify the row key column with --hbase-row-key. If the input table has a composite key, the --hbase-row-key must be a comma-separated list of the composite key attributes. Each output column will be placed in the same column family, which must be specified with --column-family.
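
A minimal sketch of an HBase-bound import; the HBase table name, row key, and column family are placeholders:

# Illustrative: write to an HBase table instead of an HDFS directory
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password 123456 \
--table emp \
--hbase-table emp \
--column-family info \
--hbase-row-key id \
-m 1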
