The main syntax for an import is
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
The command format is:
$ sqoop import
--connect \ (connect string)
--username \ (username)
--password [-P | --password-file] \ (password given inline, prompted for with -P, or read from a file; the file permission is typically 400)
--query \ (query string)
--warehouse-dir \ (HDFS parent directory to import into)
--fields-terminated-by \ (field delimiter, comma by default)
-m [--num-mappers] (number of MR tasks, controls import parallelism)
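A minimal sketch that combines most of the arguments listed above (using --table instead of --query for brevity; the host, database, table, and HDFS path are hypothetical placeholders):
$ sqoop import \
--connect jdbc:mysql://database.example.com:3306/employees \
--username root \
--password 12345 \
--table employees \
--warehouse-dir /user/hive/warehouse \
--fields-terminated-by ',' \
-m 4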
Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port.
--connect takes a URL; sometimes the port must be specified as well.
# 1. Basic form
# Connects to the employees database on the host database.example.com
$ sqoop import --connect jdbc:mysql://database.example.com/employees
# 2. With username and password (plaintext)
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password 12345
# 3. Password in hidden form (password file)
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password-file ${user.home}/.password
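As a sketch of how such a password file might be prepared (assuming it lives in the local home directory; it can also be placed on HDFS), avoiding a trailing newline and locking the permissions down to 400:
$ echo -n "12345" > ${HOME}/.password
$ chmod 400 ${HOME}/.password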
# 4. Specifying the port
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db
# ubuntu is the hostname and db is the database name; on a distributed cluster, do not use localhost as the hostname, because that refers to the local machine rather than the remote database server
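To verify that a connect string (including the port) works before running an import, one option is Sqoop's list-tables tool; -P prompts for the password interactively (host, port, and database reuse the hypothetical values above):
$ sqoop list-tables \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
-P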
Sqoop typically imports data in a table-centric fashion. Use the --table argument to select the table to import. For example, --table employees. This argument can also identify a VIEW or other table-like entity in a database. By default, all columns within a table are selected for import. Imported data is written to HDFS in its "natural order". You can select a subset of columns and control their ordering by using the --columns argument. For example: --columns "name,employee_id,jobtitle". You can control which rows are imported by adding a SQL WHERE clause to the import statement. By default, Sqoop generates statements of the form SELECT <column_list> FROM <table_name>. Append a WHERE clause to this with the --where argument. For example: --where "id > 400".
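For instance, a hedged sketch of an import that picks and reorders a subset of columns and filters rows (the target directory is a hypothetical path, other values reuse the employees example above):
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password 12345 \
--table employees \
--columns "name,employee_id,jobtitle" \
--where "id > 400" \
--target-dir /sqoop/hadoop/columns_demo \
-m 1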
The --table argument imports the full contents of a table; a where condition restricts the rows, and --query lets you use a query statement instead. The examples below show both.
# With a where condition
$ sqoop import \
--connect jdbc:mysql://database.example.com/employees \
--username root \
--password 12345 \
--where "name='xiaoming'" \
--table help_keyword \
--target-dir /sqoop/hadoop/data \
-m 1
# --target-dir is the HDFS directory to import into; -m is the number of map tasks
# Querying specified columns
$ sqoop import \
--connect jdbc:mysql://database:3306/mysql \
--username root \
--password root \
--columns "name" \
--where "name='xiaoming'" \
--table employees \
--target-dir /sqoop/hadoop/data \
-m 1
# Equivalent SQL (the --columns argument restricts the output to the name column):
select name from employees where name="xiaoming"
# Specifying a custom SQL query
$ sqoop import \
--connect jdbc:mysql://database:3306/mysql \
--username root \
--password root \
--target-dir /sqoop/hadoop/data \
--query 'select id,name from mysql.employees where $CONDITIONS and name="xiaoming"' \
--split-by id \
-m 4
# Importing a subset of the data
# ubuntu is the hostname and sqooptest is the database name; the username and password follow
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/sqooptest \
--username root \
--password 123456 \
--query "select * from sqoop where student_id <= 500 AND \$CONDITIONS" \
--target-dir /sqoop/hadoop/data \
-m 1
# A free-form --query import also requires --target-dir and either --split-by or -m 1; the directory above reuses the one from the earlier examples
Two points to summarize:
1. The SQL statement given to --query must contain where $CONDITIONS; any additional conditions must be joined with AND.
2. If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS so your shell does not treat it as a shell variable.
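To make the quoting rule concrete, here is the same hypothetical query written both ways:
# Single quotes: the shell leaves $CONDITIONS alone
--query 'select * from sqoop where student_id <= 500 AND $CONDITIONS'
# Double quotes: escape the dollar sign so the shell does not expand it
--query "select * from sqoop where student_id <= 500 AND \$CONDITIONS"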
Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument. When importing a free-form query, you must specify a destination directory with --target-dir. Your query must include the token $CONDITIONS, which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
When importing with --query, specify a target directory and include $CONDITIONS in the query; also specify the splitting column:
$ sqoop import \
--query 'select a.*,b.* from a join b on (a.id = b.id) where $CONDITIONS' \
--split-by a.id \
--target-dir /usr/foo/joinresults
# --split-by names the splitting column; --target-dir names the destination directory
$ sqoop import \
--query 'select a.*,b.* from a join b on (a.id = b.id) where $CONDITIONS' \
-m 1 \
--target-dir /usr/foo/joinresults
Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. If a table does not have a primary key defined and no --split-by column is provided, the import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.
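As a hedged sketch of controlling the parallelism: the first command raises the mapper count and names a split column explicitly; the second assumes a table without a primary key and lets Sqoop fall back to one mapper (hosts, tables, and credentials are hypothetical):
# 8 parallel mappers, splitting on the range of the id column
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password 12345 \
--table employees \
--split-by id \
--num-mappers 8
# A table without a primary key: fall back to a single mapper automatically
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password 12345 \
--table logs_no_pk \
--autoreset-to-one-mapper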
-m specifies the number of MapReduce tasks; it takes an integer that sets the parallelism of the import.
Adding --skip-dist-cache to a Sqoop command launched by Oozie will skip the step in which Sqoop copies its dependencies to the job cache, saving massive I/O.
By default, the import process will use JDBC, which provides a reasonable cross-vendor import channel. Some databases can perform imports in a more high-performance fashion by using database-specific data movement tools. For example, MySQL provides the mysqldump tool. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC. By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. --target-dir is incompatible with --warehouse-dir.
Sqoop uses JDBC as the default import channel; it can be switched to a database-specific tool for better speed, for example mysqldump for MySQL.
By default, Sqoop imports a table foo into a directory named foo under the current user's HDFS home directory.
--target-dir and --warehouse-dir cannot be used at the same time.
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/mysql \
--table foo \
--warehouse-dir /shared
# The data will be imported under /shared/foo/
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/mysql \
--table foo \
--target-dir /dest
# The data will be imported into /dest/
By default, Sqoop uses the read committed transaction isolation level in the mappers to import data. This may not be ideal in all ETL workflows, and it may be desirable to reduce the isolation guarantees. The --relaxed-isolation option can be used to instruct Sqoop to use the read uncommitted isolation level. The --relaxed-isolation option may not be supported on all databases.
In short, --relaxed-isolation tells Sqoop to read with the uncommitted isolation level, but this parameter is not supported by every database.
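A hedged sketch of enabling it (connection details reuse the hypothetical MySQL host from the earlier examples):
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password 12345 \
--table employees \
--relaxed-isolation \
-m 4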
There are two file formats: delimited text or SequenceFiles. Delimited text is the default import format. You can also specify it explicitly by using the --as-textfile argument. The delimiters may be commas, tabs, or other characters. SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes. By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
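For example, a hedged sketch of a compressed, tab-delimited text import using gzip (paths and credentials are the hypothetical ones used above):
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password 12345 \
--table employees \
--as-textfile \
--fields-terminated-by '\t' \
--compress \
--compression-codec org.apache.hadoop.io.compress.GzipCodec \
--target-dir /sqoop/hadoop/employees_gz \
-m 1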
Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line. If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced. Sqoop will by default import NULL values as the string null. You should append the parameters --null-string and --null-non-string in the case of an import job, or --input-null-string and --input-null-non-string in the case of an export job, if you wish to properly preserve NULL values. You can import compressed tables into Hive using the --compress and --compression-codec options.
$ sqoop import ... --null-string '\\N' --null-non-string '\\N'
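A hedged sketch of a full Hive import putting these options together (the Hive table name and connection details are hypothetical):
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password 12345 \
--table employees \
--hive-import \
--hive-overwrite \
--hive-table default.employees \
--null-string '\\N' \
--null-non-string '\\N' \
-m 1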
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. You can manually specify the row key column with --hbase-row-key. If the input table has a composite key, --hbase-row-key must be a comma-separated list of the composite key attributes. Each output column will be placed in the same column family, which must be specified with --column-family.
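To close, a hedged sketch of an HBase import (the HBase table, column family, and row key column are hypothetical):
$ sqoop import \
--connect jdbc:mysql://ubuntu:3306/db \
--username root \
--password 12345 \
--table employees \
--hbase-table employees_hbase \
--column-family info \
--hbase-row-key employee_id \
-m 1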