CDH (Cloudera's Distribution including Apache Hadoop) is one of the most popular Hadoop distributions in the industry. This document describes how to use COSN, a flexible and cost-effective big data solution, in a CDH environment to separate big data computing from storage.
Note
COSN is an abbreviation for Hadoop-COS file system.
The table below shows which big data modules are supported by COSN:
| Module | Supported | Service Module to Restart |
| --- | --- | --- |
| YARN | Yes | NodeManager |
| Hive | Yes | HiveServer and HiveMetastore |
| Spark | Yes | NodeManager |
| Sqoop | Yes | NodeManager |
| Presto | Yes | HiveServer, HiveMetastore, and Presto |
| Flink | Yes | Not required |
| Impala | Yes | Not required |
| EMR | Yes | Not required |
| Self-built components | Will be supported later | - |
| HBase | Not recommended | - |
Versions
This example uses the following software versions:
CDH 5.16.1
Hadoop 2.6.0
How to Use
Configuring storage environment
1. Log in to the CDH management page.
2. On the system homepage, select Configuration > Service Scope > Advanced to access the advanced configuration code snippet page, as shown below:
3. Specify your COSN settings in the configuration snippet Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.
<property>
    <name>fs.cosn.userinfo.secretId</name>
    <value>AK***</value>
</property>
<property>
    <name>fs.cosn.userinfo.secretKey</name>
    <value></value>
</property>
<property>
    <name>fs.cosn.impl</name>
    <value>org.apache.hadoop.fs.CosFileSystem</value>
</property>
<property>
    <name>fs.AbstractFileSystem.cosn.impl</name>
    <value>org.apache.hadoop.fs.CosN</value>
</property>
<property>
    <name>fs.cosn.bucket.region</name>
    <value>ap-shanghai</value>
</property>
The following are the required COSN configuration items (to be added to core-site.xml). For other COSN configuration items, refer to the Hadoop Tools documentation.

| COSN Parameter | Value | Description |
| --- | --- | --- |
| fs.cosn.userinfo.secretId | AKxxxx | SecretId of the account's API key |
| fs.cosn.userinfo.secretKey | Wpxxxx | SecretKey of the account's API key |
| fs.cosn.bucket.region | ap-shanghai | Bucket region |
| fs.cosn.impl | org.apache.hadoop.fs.CosFileSystem | The implementation class of COSN for FileSystem, which is always org.apache.hadoop.fs.CosFileSystem |
| fs.AbstractFileSystem.cosn.impl | org.apache.hadoop.fs.CosN | The implementation class of COSN for AbstractFileSystem, which is always org.apache.hadoop.fs.CosN |
4. Deploy the updated client configuration for your HDFS service. The core-site.xml settings above will then apply to the servers in the cluster.
5. Place the latest COSN SDK package in the JAR package path of the CDH HDFS service, replacing the placeholders with your actual values as shown below:
1. MapReduce
Operation Steps
(1) Follow the instructions in the Data Migration section to configure the HDFS settings and place the COSN SDK JAR package in the appropriate HDFS directory.
(2) On the CDH system homepage, locate YARN and restart the NodeManager service. (The TeraGen command does not require a restart, but TeraSort does due to its internal business logic; restarting the NodeManager service in both cases is recommended.)
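As a rough sketch, the JAR placement in the steps above might look like the following; the SDK jar name, the CDH parcel path, and the HDFS directory are assumptions that vary by version and cluster layout:

```shell
# Hypothetical jar name and paths; adjust to your CDH version and layout.
# Copy the COSN SDK jar into the local Hadoop classpath on each node.
cp hadoop-cos-shaded.jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/
# Also upload it to HDFS so that jobs can pick it up from there.
hadoop fs -mkdir -p /user/hadoop/lib/
hadoop fs -put -f hadoop-cos-shaded.jar /user/hadoop/lib/
```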
Example
The example below runs the standard Hadoop TeraGen and TeraSort benchmarks:
hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar teragen -Dmapred.job.maps=500 -Dfs.cosn.upload.buffer=mapped_disk -Dfs.cosn.upload.buffer.size=-1 1099 cosn://examplebucket-1250000000/terasortv1/1k-input
hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar terasort -Dmapred.max.split.size=134217728 -Dmapred.min.split.size=134217728 -Dfs.cosn.read.ahead.block.size=4194304 -Dfs.cosn.read.ahead.queue.size=32 cosn://examplebucket-1250000000/terasortv1/1k-input cosn://examplebucket-1250000000/terasortv1/1k-output
Note
Replace the cosn:// paths above with the bucket path used by your own big data workloads.
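Before running the benchmark, a quick connectivity check against the bucket (using the same placeholder bucket name as above) can confirm that the core-site.xml settings are correct:

```shell
# List the bucket root through the COSN filesystem; replace the bucket
# name with your own. A successful listing confirms the secretId/secretKey
# and region settings in core-site.xml.
hadoop fs -ls cosn://examplebucket-1250000000/
# Create the directory used by the TeraGen example above.
hadoop fs -mkdir -p cosn://examplebucket-1250000000/terasortv1
```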
2. Hive
2.1 MR engine
Operation Steps
(1) Follow the Data Migration section to configure the relevant HDFS settings and place the COSN SDK jar file in the appropriate HDFS directory.
(2) On the CDH main page, locate the Hive service and restart the HiveServer2 and HiveMetastore roles.
Example
To query your actual business data, use the Hive command line to query a partitioned table whose location is on COSN:
select count(1) from report.report_o2o_pid_credit_detail_grant_daily;
The observed results are as follows:
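For context, such a partitioned table can be created with its location on COSN beforehand. The following is a hedged sketch: the column definitions and bucket path are hypothetical, and only the table name comes from the query above.

```shell
# Hypothetical DDL: the columns and the bucket path are placeholders;
# the "report" database is assumed to exist.
hive -e "
CREATE TABLE IF NOT EXISTS report.report_o2o_pid_credit_detail_grant_daily (
  pid STRING,
  grant_amount DOUBLE
)
PARTITIONED BY (dt STRING)
LOCATION 'cosn://examplebucket-1250000000/warehouse/report_daily';
"
```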
2.2 Tez engine
You need to package the COSN JAR file into the Tez tar.gz file. The following example uses apache-tez-0.8.5:
Operation Steps
(1) Locate the Tez package installed in the CDH cluster and extract it, for example, /usr/local/service/tez/tez-0.8.5.tar.gz.
(2) Place the COSN .jar file in the extracted directory, and then re-compress it into a new archive.
(3) Upload the new archive to the path specified by tez.lib.uris (replace the existing file if it already exists).
(4) On the CDH main page, locate the Hive service and restart the HiveServer2 and HiveMetastore roles.
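Steps (1) through (3) might look like the following shell sketch; the local Tez path, the COSN SDK jar name, and the tez.lib.uris target path are assumptions to adapt to your cluster:

```shell
# Hypothetical paths and jar name; adjust to your environment.
cd /usr/local/service/tez
mkdir -p tez-unpacked
tar -xzf tez-0.8.5.tar.gz -C tez-unpacked
# Add the COSN SDK jar to the extracted Tez libraries.
cp /path/to/hadoop-cos-shaded.jar tez-unpacked/lib/
# Re-compress and upload to the path referenced by tez.lib.uris.
tar -czf tez-0.8.5-cosn.tar.gz -C tez-unpacked .
hadoop fs -put -f tez-0.8.5-cosn.tar.gz /apps/tez/tez-0.8.5.tar.gz
```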
3. Spark
Operation Steps
(1) Follow the instructions in the Data Migration section to configure the HDFS settings and place the COSN SDK jar file in the appropriate HDFS directory.
(2) Restart the NodeManager service.
Example
The following takes a Spark word count test conducted with COSN as an example.
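A hedged sketch of such a test follows; the examples jar location and the input path are assumptions, while JavaWordCount ships with the standard Spark examples:

```shell
# Hypothetical jar path and input file; adjust to your CDH layout.
spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples.jar \
  cosn://examplebucket-1250000000/wordcount/input.txt
```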
4. Presto
Operation Steps
(1) Follow the instructions in the Data Migration section to configure the HDFS settings and place the COSN SDK JAR package in the appropriate HDFS directory.
(2) The COSN SDK JAR package also needs to be placed in the Presto directory (e.g., /usr/local/services/cos_presto/plugin/hive-hadoop2).
(3) Since Presto does not load gson-2...jar from the Hadoop common directory, you also need to place gson-2...jar in the Presto directory (e.g., /usr/local/services/cos_presto/plugin/hive-hadoop2); only CHDFS depends on gson.
(4) Restart the HiveServer, HiveMetaStore, and Presto services.
Example
The example below queries a table created in Hive whose location uses the cosn scheme:
select * from cosn_test_table where bucket is not null limit 1;
Note
cosn_test_table is a table whose location uses the cosn scheme.
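The query above can also be issued from the Presto CLI; the coordinator address and the catalog/schema names here are assumptions that depend on your deployment:

```shell
# Hypothetical coordinator address; catalog and schema depend on your setup.
presto --server localhost:8080 --catalog hive --schema default \
  --execute "select * from cosn_test_table where bucket is not null limit 1;"
```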