Accessing COS over HDFS in CDH Cluster

Last updated: 2023-09-20 12:16:58

Feature Overview

CDH (Cloudera's distribution, including Apache Hadoop) is one of the most popular Hadoop distributions in the industry. This document describes how to access a COS bucket over the HDFS protocol, a flexible, cost-effective big-data solution, in a CDH environment to separate big data computing from storage.
Note
To access a COS bucket via the HDFS protocol, you must first enable the metadata acceleration capability.
Currently, COS supports big data modules as follows:

| Module | Supported by CHDFS | Service Module to Restart |
| --- | --- | --- |
| Yarn | Supported | NodeManager |
| Hive | Supported | HiveServer and HiveMetastore |
| Spark | Supported | NodeManager |
| Sqoop | Supported | NodeManager |
| Presto | Supported | HiveServer, HiveMetastore, and Presto |
| Flink | Supported | Not required |
| Impala | Supported | Not required |
| EMR | Supported | Not required |
| Self-built components | Will be supported later | - |
| HBase | Not recommended | - |

Versions

This example uses the following software versions:
CDH 5.16.1
Hadoop 2.6.0

How to Use

Configuring storage environment

1. Log in to the CDH management page.
2. On the system homepage, select Configuration > Service-Wide > Advanced to access the advanced configuration code snippet page, as shown below:


3. Specify your COS settings in the configuration snippet Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:
<property>
    <name>fs.AbstractFileSystem.ofs.impl</name>
    <value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
</property>
<property>
    <name>fs.ofs.impl</name>
    <value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
</property>
<!-- Temporary directory of the local cache. During data reads/writes, data is written to the local disk when the memory cache is insufficient. This path is created automatically if it does not exist. -->
<property>
    <name>fs.ofs.tmp.cache.dir</name>
    <value>/data/emr/hdfs/tmp/chdfs/</value>
</property>
<!-- appid -->
<property>
    <name>fs.ofs.user.appid</name>
    <value>1250000000</value>
</property>
The following configuration items are required (add them to core-site.xml). For other configuration items, see Mounting COS Buckets in a Computing Cluster.

| Configuration Item | Value | Description |
| --- | --- | --- |
| fs.ofs.user.appid | 1250000000 | User appid |
| fs.ofs.tmp.cache.dir | /data/emr/hdfs/tmp/chdfs/ | Temporary directory of the local cache |
| fs.ofs.impl | com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter | The FileSystem implementation class of CHDFS, fixed at com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter |
| fs.AbstractFileSystem.ofs.impl | com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter | The AbstractFileSystem implementation class of CHDFS, fixed at com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter |
4. Deploy the client configuration for your HDFS service (Actions > Deploy Client Configuration). The core-site.xml settings above will then apply to the servers in the cluster.
5. Place the latest client installation package in the JAR package path of the CDH HDFS service, replacing the version and path below with the actual values:
cp chdfs_hadoop_plugin_network-2.0.jar /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop-hdfs/
Note
The SDK package must be placed in the same location on each machine within the cluster.
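The copy to every node can be sketched as a small script. The node names below are placeholder assumptions; substitute your own hostnames (the script only prints the commands as a dry run):

```shell
# Sketch: distribute the CHDFS plugin jar to the same path on every node.
# The node names are placeholder assumptions; replace them with your hosts.
JAR=chdfs_hadoop_plugin_network-2.0.jar
DEST=/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/hadoop-hdfs/
NODES="cdh-node-1 cdh-node-2 cdh-node-3"
CMDS=""
for host in $NODES; do
  CMDS="$CMDS
scp $JAR root@$host:$DEST"
done
# Printed as a dry run; run each line (or pipe to sh) to actually copy.
echo "$CMDS"
```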

Data Migration

Use Hadoop DistCp to migrate your data from CDH HDFS to a COS bucket. For details, see Migrating Data Between HDFS and COS.
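A typical DistCp invocation looks like the sketch below. The namenode address and paths are placeholder assumptions; adjust them to your cluster and bucket (the script only prints the command as a dry run):

```shell
# Sketch: migrate a directory from cluster HDFS to the COS bucket with DistCp.
# The namenode address and paths are placeholders for illustration.
SRC="hdfs://namenode:8020/user/hive/warehouse/report.db"
DST="ofs://examplebucket-1250000000/user/hive/warehouse/report.db"
# -update copies only missing or changed files; -skipcrccheck skips CRC
# comparison between the source and target file systems.
CMD="hadoop distcp -update -skipcrccheck $SRC $DST"
echo "$CMD"   # printed as a dry run; run it on a cluster gateway node
```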

Big Data Suite using CHDFS

MapReduce

Operation Steps
1. Configure HDFS as instructed in Configuring storage environment and place the COS client installation package in the correct directory of the HDFS service.
2. On the Cloudera Manager homepage, find YARN and restart the NodeManager service (recommended). The TeraGen command can run without a restart, but the TeraSort command requires one because of its internal logic.
Example
The example below runs TeraGen and TeraSort from the Hadoop standard benchmark:
hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar teragen -Dmapred.map.tasks=4 1099 ofs://examplebucket-1250000000/teragen_5/

hadoop jar ./hadoop-mapreduce-examples-2.7.3.jar terasort -Dmapred.map.tasks=4 ofs://examplebucket-1250000000/teragen_5/ ofs://examplebucket-1250000000/result14
Note
Replace the path after the ofs:// scheme with your actual CHDFS mount point path.

Hive

MR engine
Operation Steps
1. Configure HDFS as instructed in Configuring storage environment and place the COS client installation package in the correct directory of the HDFS service.
2. On the Cloudera Manager homepage, find HIVE and restart the HiveServer2 and HiveMetastore roles.
Example
To query your actual business data, use the Hive CLI to create a partitioned table whose location is on CHDFS:
CREATE TABLE report.report_o2o_pid_credit_detail_grant_daily(
  cal_dt string,
  change_time string,
  merchant_id bigint,
  store_id bigint,
  store_name string,
  wid string,
  member_id bigint,
  member_card string,
  nickname string,
  name string,
  gender string,
  birthday string,
  city string,
  mobile string,
  credit_grant bigint,
  change_reason string,
  available_point bigint,
  date_time string,
  channel_type bigint,
  point_flow_id bigint)
PARTITIONED BY (
  topicdate string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'ofs://examplebucket-1250000000/user/hive/warehouse/report.db/report_o2o_pid_credit_detail_grant_daily'
TBLPROPERTIES (
  'last_modified_by'='work',
  'last_modified_time'='1589310646',
  'transient_lastDdlTime'='1589310646');
Perform a SQL query:
select count(1) from report.report_o2o_pid_credit_detail_grant_daily;
The observed results are as follows:


Tez engine

The COS client installation package needs to be packed into the Tez tar.gz file. The following example uses apache-tez-0.8.5:
Operation Steps
1. Locate and decompress the Tez tar.gz file installed in the CDH cluster, e.g., /usr/local/service/tez/tez-0.8.5.tar.gz.
2. Put the COS client installation package in the directory generated after decompression and recompress it.
3. Upload this new file to the path as specified by tez.lib.uris, or simply replace the existing file with the same name.
4. On the Cloudera Manager homepage, find HIVE and restart the HiveServer2 and HiveMetastore roles.
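Steps 1-3 can be sketched as follows. The tarball here is a locally created stand-in and the file names are illustrative; on a real cluster, start from your installed tez-0.8.5.tar.gz:

```shell
# Sketch: add the CHDFS plugin jar to the Tez tarball and recompress it.
# The tarball and jars below are locally created stand-ins for illustration.
set -e
WORK=$(mktemp -d)
cd "$WORK"
mkdir tez-0.8.5
touch tez-0.8.5/tez-api-0.8.5.jar          # stand-in for the Tez jars
tar -czf tez-0.8.5.tar.gz tez-0.8.5
touch chdfs_hadoop_plugin_network-2.0.jar  # stand-in for the plugin jar

tar -xzf tez-0.8.5.tar.gz                           # 1. decompress
cp chdfs_hadoop_plugin_network-2.0.jar tez-0.8.5/   # 2. add the plugin jar
tar -czf tez-0.8.5.tar.gz tez-0.8.5                 # 3. recompress
# 4. Upload the new tarball to the path specified by tez.lib.uris, e.g.:
#    hadoop fs -put -f tez-0.8.5.tar.gz /apps/tez/
tar -tzf tez-0.8.5.tar.gz
```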

Spark

Operation Steps
1. Configure HDFS as instructed in Configuring storage environment and place the COS client installation package in the correct directory of the HDFS service.
2. Restart NodeManager.
Example
The following runs a word count test using the standard Spark example:
spark-submit --class org.apache.spark.examples.JavaWordCount --executor-memory 4g --executor-cores 4 ./spark-examples-1.6.0-cdh5.16.1-hadoop2.6.0-cdh5.16.1.jar ofs://examplebucket-1250000000/wordcount
The execution result is as follows:


Sqoop

Operation Steps
1. Configure HDFS as instructed in Configuring storage environment and place the COS client installation package in the correct directory of the HDFS service.
2. Put the COS client installation package in the Sqoop directory, for example, /opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/lib/sqoop/.
3. Restart NodeManager.
Example
The following example exports a MySQL table to COS. For details, see Import/Export of Relational Database and HDFS.
sqoop import --connect "jdbc:mysql://IP:PORT/mysql" --table sqoop_test --username root --password 123 --target-dir ofs://examplebucket-1250000000/sqoop_test
The execution result is as follows:


Presto

Operation Steps
1. Configure HDFS as instructed in Configuring storage environment and place the COS client installation package in the correct directory of the HDFS service.
2. Put the COS client installation package in the Presto directory, for example, /usr/local/services/cos_presto/plugin/hive-hadoop2.
3. Since Presto does not load gson-2...jar from the Hadoop common directory (only COS depends on gson), you also need to place gson-2...jar in the Presto directory, e.g., /usr/local/services/cos_presto/plugin/hive-hadoop2.
4. Restart HiveServer, HiveMetaStore, and Presto.
Example
The example below queries a Hive table whose location uses the COS (ofs) scheme:
select * from chdfs_test_table where bucket is not null limit 1;
Note
chdfs_test_table is a table with a location using the ofs scheme.
The query results are as follows: