Sqoop is an open-source tool designed primarily for transferring data between Hadoop and traditional relational databases such as MySQL, PostgreSQL, and Oracle. It can import data from a relational database into Hadoop's HDFS, and export it back again. A significant highlight of Sqoop is that it performs these imports through Hadoop's MapReduce, so transfers run in parallel.
This document describes Sqoop's incremental import operations, that is, synchronizing changes in the database to the data already imported into HDFS as rows are added or updated. Incremental import has two modes: append and lastmodified. The append mode applies only when database rows are added but never updated, while the lastmodified mode handles both additions and updates.
Development Preparations
Ensure that you have activated Tencent Cloud and have created an EMR cluster. During the creation of the EMR cluster, it is necessary to select the Sqoop component in the software configuration interface.
Sqoop and related software are installed under /usr/local/service/ on the EMR cloud servers.
Utilizing the Append Mode
This section will continue to utilize the use case from the previous section.
Enter the EMR console and copy the instance ID of the target cluster (the instance ID is also the cluster name). Then go to the relational database console, use Ctrl+F to search for the MySQL database corresponding to the cluster, and note its internal address, referred to below as $mysqlIP.
Log in to any machine in the EMR cluster, preferably the Master node. For instructions, see Logging into Linux Instance. Here we log in via WebShell: click Login to the right of the corresponding cloud server to open the login page. The username defaults to root, and the password is the one you set when creating the EMR cluster. Once it is accepted, you reach the command-line interface.
In the EMR command line, first use the following command to switch to the Hadoop user and enter the Sqoop folder:
[root@172 ~]# su hadoop
[hadoop@172 ~]$ cd /usr/local/service/sqoop
Connect to MySQL Database:
[hadoop@172 sqoop]$ mysql -h $mysqlIP -p
Enter password:
The password is the one you set when creating the EMR cluster.
After connecting to the MySQL database, add a new piece of data to the sqoop_test table as follows:
mysql> use test;
Database changed
mysql> insert into sqoop_test values(null, 'forth', now(), 'hbase');
Query OK, 1 row affected (0.00 sec)
View the data in the table:
mysql> select * from sqoop_test;
+----+--------+---------------------+---------+
| id | title  | time                | content |
+----+--------+---------------------+---------+
|  1 | first  | 2018-07-03 15:29:37 | hdfs    |
|  2 | second | 2018-07-03 15:30:57 | mr      |
|  3 | third  | 2018-07-03 15:31:07 | yarn    |
|  4 | forth  | 2018-07-03 15:39:38 | hbase   |
+----+--------+---------------------+---------+
4 rows in set (0.00 sec)
Utilize the append mode to synchronize the newly added data to the HDFS path where the data was stored in the previous section:
[hadoop@172 sqoop]$ bin/sqoop-import --connect jdbc:mysql://$mysqlIP/test --username root -P --table sqoop_test --check-column id --incremental append --last-value 3 --target-dir /sqoop
Where $mysqlIP is the internal address of your MySQL database.
Executing the command prompts you for the database password, which is the one you set when creating the EMR cluster. This command takes more parameters than a plain sqoop-import: --check-column names the column used to decide which rows to import, --incremental sets the import mode (append here), and --last-value is the reference value for that column. Every row whose check column is greater than this value is imported into HDFS.
Upon successful execution, you can view the updated data in the corresponding HDFS directory:
[hadoop@172 sqoop]$ hadoop fs -cat /sqoop/*
1,first,2018-07-03 15:29:37.0,hdfs
2,second,2018-07-03 15:30:57.0,mr
3,third,2018-07-03 15:31:07.0,yarn
4,forth,2018-07-03 15:39:38.0,hbase
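The row-selection rule of append mode can be sketched locally with standard shell tools. This is only an illustration of the logic, not what Sqoop runs internally; the /tmp path is an arbitrary choice, and the rows mirror the sqoop_test table above:

```shell
# The four rows currently in sqoop_test, as comma-separated records
cat > /tmp/sqoop_test.csv <<'EOF'
1,first,2018-07-03 15:29:37,hdfs
2,second,2018-07-03 15:30:57,mr
3,third,2018-07-03 15:31:07,yarn
4,forth,2018-07-03 15:39:38,hbase
EOF
# Append mode imports only rows whose check column (id, field 1)
# is greater than --last-value (3)
awk -F, '$1 > 3' /tmp/sqoop_test.csv
```

With --last-value 3, only the newly inserted row with id 4 is selected, which is exactly the row the import above added to /sqoop.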
Utilize Sqoop job
Synchronizing data to HDFS with append requires entering --last-value manually each time. Instead, you can use a Sqoop job, in which case Sqoop automatically saves the last-value from the most recent successful import. Using Sqoop jobs requires the sqoop-metastore process. The steps are as follows:
First, enable automatic connection to the metastore in conf/sqoop-site.xml:
<property>
  <name>sqoop.metastore.client.enable.autoconnect</name>
  <value>true</value>
</property>
Then start the sqoop-metastore service from the bin directory:
./sqoop-metastore &
Use the following command to create a Sqoop job:
Note
This command is applicable for Sqoop version 1.4.6.
[hadoop@172 sqoop]$ bin/sqoop job --create job1 -- import --connect jdbc:mysql://$mysqlIP/test --username root -P --table sqoop_test --check-column id --incremental append --last-value 4 --target-dir /sqoop
Where $mysqlIP is the internal address of your MySQL database. This command creates a Sqoop job named job1. Each execution automatically continues from the last-value saved by the previous run.
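The benefit of the saved state can be sketched with a hypothetical stand-in for the metastore. The /tmp paths and the state-file name are illustrative, not anything Sqoop actually creates:

```shell
# Hypothetical stand-in for the metastore's saved state for job1
STATE=/tmp/job1.last-value
echo 4 > "$STATE"              # the --last-value recorded at job creation
# Table contents at the time of the next run (row 5 was just inserted)
cat > /tmp/sqoop_test.csv <<'EOF'
1,first,2018-07-03 15:29:37,hdfs
2,second,2018-07-03 15:30:57,mr
3,third,2018-07-03 15:31:07,yarn
4,forth,2018-07-03 15:39:38,hbase
5,fifth,2018-07-03 16:02:29,hive
EOF
last=$(cat "$STATE")
# Rows this run imports: check column greater than the saved last-value
awk -F, -v last="$last" '$1 > last' /tmp/sqoop_test.csv
# After a successful run, the job stores the new highest id as last-value
awk -F, '$1 > m {m=$1} END {print m}' /tmp/sqoop_test.csv > "$STATE"
```

Because the state file is rewritten after every run, no manual --last-value bookkeeping is needed, which is the point of using a job instead of plain sqoop-import.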
Add a new record to the sqoop_test table in MySQL:
mysql> insert into sqoop_test values(null, 'fifth', now(), 'hive');
Query OK, 1 row affected (0.00 sec)
mysql> select * from sqoop_test;
+----+--------+---------------------+---------+
| id | title  | time                | content |
+----+--------+---------------------+---------+
|  1 | first  | 2018-07-03 15:29:37 | hdfs    |
|  2 | second | 2018-07-03 15:30:57 | mr      |
|  3 | third  | 2018-07-03 15:31:07 | yarn    |
|  4 | forth  | 2018-07-03 15:39:38 | hbase   |
|  5 | fifth  | 2018-07-03 16:02:29 | hive    |
+----+--------+---------------------+---------+
5 rows in set (0.00 sec)
Then execute the Sqoop job:
[hadoop@172 sqoop]$ bin/sqoop job --exec job1
Executing this command will prompt you to enter the MySQL password. Upon successful execution, you can view the updated data in the corresponding HDFS directory:
[hadoop@172 sqoop]$ hadoop fs -cat /sqoop/*
1,first,2018-07-03 15:29:37.0,hdfs
2,second,2018-07-03 15:30:57.0,mr
3,third,2018-07-03 15:31:07.0,yarn
4,forth,2018-07-03 15:39:38.0,hbase
5,fifth,2018-07-03 16:02:29.0,hive
Utilizing the lastmodified Mode
You can create a Sqoop job that runs sqoop-import in lastmodified mode directly. First, query the latest update time in sqoop_test:
mysql> select max(time) from sqoop_test;
Create a Sqoop job:
[hadoop@172 sqoop]$ bin/sqoop job --create job2 -- import --connect jdbc:mysql://$mysqlIP/test --username root -P --table sqoop_test --check-column time --incremental lastmodified --merge-key id --last-value '2018-07-03 16:02:29' --target-dir /sqoop
Parameter Description:
$mysqlIP refers to the internal network address of your MySQL.
The --check-column must utilize a column of timestamp type.
The --incremental mode is set to lastmodified.
The --merge-key is set to the id column, so a re-imported row replaces the existing row with the same id instead of being appended as a duplicate.
The --last-value represents the most recent update time we have queried in the table. Any updates made after this time will be synchronized to HDFS, and the Sqoop job will automatically save and update this value each time.
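How --merge-key reconciles re-imported rows with the data already in HDFS can be sketched locally. This is only an illustration of the merge logic; the /tmp paths are arbitrary, and the rows mirror the sqoop_test example:

```shell
# Data already in HDFS before the lastmodified run
cat > /tmp/existing.csv <<'EOF'
1,first,2018-07-03 15:29:37,hdfs
2,second,2018-07-03 15:30:57,mr
3,third,2018-07-03 15:31:07,yarn
4,forth,2018-07-03 15:39:38,hbase
5,fifth,2018-07-03 16:02:29,hive
EOF
# Rows whose time column changed after --last-value: one update, one insert
cat > /tmp/changed.csv <<'EOF'
1,first,2018-07-03 16:07:46,spark
6,sixth,2018-07-03 16:09:58,sqoop
EOF
# Merge on the key (id, field 1): rows from the later file overwrite
# earlier rows with the same key; genuinely new keys are added
awk -F, '{row[$1]=$0} END {for (k in row) print row[k]}' \
  /tmp/existing.csv /tmp/changed.csv | sort -t, -k1,1n
```

The result has six rows, with the id 1 row carrying its updated time and content, which matches the merged output Sqoop writes back to /sqoop.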
Add data to the sqoop_test table in MySQL and make modifications:
mysql> insert into sqoop_test values(null, 'sixth', now(), 'sqoop');
Query OK, 1 row affected (0.00 sec)
mysql> update sqoop_test set time=now(), content='spark' where id = 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
mysql> select * from sqoop_test;
+----+--------+---------------------+---------+
| id | title  | time                | content |
+----+--------+---------------------+---------+
|  1 | first  | 2018-07-03 16:07:46 | spark   |
|  2 | second | 2018-07-03 15:30:57 | mr      |
|  3 | third  | 2018-07-03 15:31:07 | yarn    |
|  4 | forth  | 2018-07-03 15:39:38 | hbase   |
|  5 | fifth  | 2018-07-03 16:02:29 | hive    |
|  6 | sixth  | 2018-07-03 16:09:58 | sqoop   |
+----+--------+---------------------+---------+
6 rows in set (0.00 sec)
Execute the Sqoop job:
[hadoop@172 sqoop]$ bin/sqoop job --exec job2
Executing this command will prompt you to enter the MySQL password. Upon successful execution, you can view the updated data in the corresponding HDFS directory:
[hadoop@172 sqoop]$ hdfs dfs -cat /sqoop/*
1,first,2018-07-03 16:07:46.0,spark
2,second,2018-07-03 15:30:57.0,mr
3,third,2018-07-03 15:31:07.0,yarn
4,forth,2018-07-03 15:39:38.0,hbase
5,fifth,2018-07-03 16:02:29.0,hive
6,sixth,2018-07-03 16:09:58.0,sqoop