CDH (Cloudera's Distribution including Apache Hadoop) is one of the most popular Hadoop distributions in the industry. This document describes how to access a COS bucket over the HDFS protocol in a CDH environment, a flexible, cost-effective big data solution that separates computing from storage.
Note
To access a COS bucket via the HDFS protocol, you must first enable metadata acceleration capabilities.
Currently, the support for big data modules by COS is as follows:

Module                   Supported by CHDFS         Service Module to Restart
Yarn                     Supported                  NodeManager
Hive                     Supported                  HiveServer and HiveMetastore
Spark                    Supported                  NodeManager
Sqoop                    Supported                  NodeManager
Presto                   Supported                  HiveServer, HiveMetastore, and Presto
Flink                    Supported                  Not required
Impala                   Supported                  Not required
EMR                      Supported                  Not required
Self-built components    Will be supported later    -
HBase                    Not recommended            -
Versions
This example uses the following software versions:
CDH 5.16.1
Hadoop 2.6.0
How to Use
Configuring the storage environment
1. Log in to the CDH management page.
2. On the system homepage, select Configuration > Service-Wide > Advanced to open the advanced configuration snippet page.
3. Specify your COS settings in the configuration snippet Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml. The implementation class of CHDFS for FileSystem is fixed at com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter, and the implementation class for AbstractFileSystem is fixed at com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter:
<property>
	<name>fs.ofs.impl</name>
	<value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
</property>
<property>
	<name>fs.AbstractFileSystem.ofs.impl</name>
	<value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
</property>
<!--Temporary directory of the local cache. For data read/write, data will be written to the local disk when the memory cache is insufficient. This path will be created automatically if it does not exist-->
<property>
	<name>fs.ofs.tmp.cache.dir</name>
	<value>/data/chdfs_tmp_cache</value>
</property>
4. Deploy the client configuration for your HDFS service. The core-site.xml settings above will then apply to servers in the cluster.
5. Place the latest client installation package in the JAR package path of the CDH HDFS service, replacing the relevant information with your actual values.
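The placement step above can be sketched as follows. The package name chdfs_hadoop_plugin.jar and the parcel paths are assumptions; substitute the actual package name and the paths of your CDH installation:

```shell
# Assumed jar name and CDH parcel paths; replace with your actual values.
JAR=chdfs_hadoop_plugin.jar
CDH_JARS=/opt/cloudera/parcels/CDH/jars

# Copy the CHDFS client jar into the CDH jar directory on every node,
# then link it into the Hadoop lib directory so the HDFS service can load it.
cp "$JAR" "$CDH_JARS/"
ln -sf "$CDH_JARS/$JAR" "/opt/cloudera/parcels/CDH/lib/hadoop/$JAR"
```

Repeat this on each node in the cluster (or distribute the jar with your usual fleet tooling).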
MapReduce
Operation Steps
1. Configure HDFS as instructed in Data migration and put the client installation package of COS in the correct HDFS directory.
2. On the Cloudera Manager homepage, find YARN and restart the NodeManager service (recommended). You can skip the restart for the TeraGen command, but you must restart for the TeraSort command because of its internal business logic.
Example
The example below runs TeraGen and TeraSort from the standard Hadoop benchmark:
hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar teragen -Dmapred.map.tasks=41099 ofs://examplebucket-1250000000/teragen_5/
hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar terasort -Dmapred.map.tasks=4 ofs://examplebucket-1250000000/teragen_5/ ofs://examplebucket-1250000000/result14
Note
Replace the path after the ofs:// scheme with your actual CHDFS mount point path.
Hive
MR engine
Operation Steps
1. Configure HDFS as instructed in Data migration and put the client installation package of COS in the correct HDFS directory.
2. On the Cloudera Manager homepage, find Hive and restart the HiveServer2 and HiveMetastore roles.
Example
The example below queries actual business data. Use the Hive command line to create a partitioned table whose location is on CHDFS:
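A minimal DDL sketch of such a table. The column names are hypothetical (bucket mirrors the column referenced by the sample query later in this document), and the ofs:// path must be replaced with your actual mount point:

```sql
-- Hypothetical partitioned table whose data lives on CHDFS.
CREATE TABLE IF NOT EXISTS chdfs_test_table (
    id BIGINT,
    bucket STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'ofs://examplebucket-1250000000/hive/chdfs_test_table';
```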
Presto
Operation Steps
1. Configure HDFS as instructed in Data migration and put the client installation package of COS in the correct HDFS directory.
2. Put the client installation package of COS in the presto directory, for example, /usr/local/services/cos_presto/plugin/hive-hadoop2.
3. Since Presto does not load gson-2...jar from the Hadoop common directory, you also need to place gson-2...jar in the Presto directory (e.g., /usr/local/services/cos_presto/plugin/hive-hadoop2); only COS depends on gson.
4. Restart HiveServer, HiveMetaStore, and Presto.
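Steps 2 and 3 above can be sketched as follows. The jar file names are assumptions (the gson version is elided in this document), and the plugin path is the example path given above:

```shell
# Assumed file names; replace with the actual package and gson jar.
PRESTO_PLUGIN=/usr/local/services/cos_presto/plugin/hive-hadoop2

cp chdfs_hadoop_plugin.jar "$PRESTO_PLUGIN/"   # COS/CHDFS client package (step 2)
cp gson-*.jar "$PRESTO_PLUGIN/"                # gson jar from the Hadoop common directory (step 3)
```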
Example
The example below queries a table whose location was created in Hive with the COS scheme:
select * from chdfs_test_table where bucket is not null limit 1;
Note
chdfs_test_table is a table whose location uses the ofs scheme.