Overview Kafka is a frequently-used message queue in open-source communities. Although Kafka (Confluent) officially provides plug-ins to import data directly from Kafka to HDFS's connector, Alibaba Cloud provides no official support for the file storage system OSS. This article will give a simple example to implement data writes from Kafka to Alibaba Cloud OSS. Because Alibaba Cloud E-MapReduce service integrates a large number of open-source components and docking tools for Alibaba Cloud, in this article, the example is directly run in the E-MapReduce cluster. This example uses the open-source Flume tool as a transit to connect Kafka and OSS. Flume open-source components may also appear on the E-MapReduce platform in the future. Scenario example Next we will name a simple example. If you already have an online Kafka cluster, you can directly jump to Step 4. 1. In the Kafka Home directory, start the Kafka service process. Configure the Zookeeper address in the configuration file to the service address emr-header-1:2181 bin/kafka-server-start.sh config/server.properties 2. Create a Kafka topic with a name of test bin/kafka-topics.sh --create --zookeeper emr-header-1:2181 \ --replication-factor 1 --partitions 1 --topic test 3. Write data to Kafka test topic and the data content is the performance monitoring data of the local machine vmstat 1 | bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test 4. Configure and start the Flume service in the Flume Home directory Create a new configuration file: conf/kafka-example.conf. In specific, specify the source as the corresponding topic for Kafka, and use sink as the HDFS Sinker. Specify the path as the OSS path. Because the E-MapReduce service implements an efficient OSS FileSystem (compatible with Hadoop FileSystem) for us, the OSS path can be specified directly, and the HDFS Sinker data will be automatically written to OSS. # Name the components on this agent a1.sources = source1 a1.sinks = oss1 a1.channels = c1 # Describe/configure the source a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource a1.sources.source1.zookeeperConnect = localhost:2181 a1.sources.source1.topic = test a1.sources.source1.groupId = flume a1.sources.source1.channels = c1 a1.sources.source1.interceptors = i1 a1.sources.source1.interceptors.i1.type = timestamp a1.sources.source1.kafka.consumer.timeout.ms = 100 # Describe the sink a1.sinks.oss1.type = hdfs a1.sinks.oss1.hdfs.path = oss://emr-examples/kafka/%{topic}/%y-%m-%d a1.sinks.oss1.hdfs.rollInterval = 10 a1.sinks.oss1.hdfs.rollSize = 0 a1.sinks.oss1.hdfs.rollCount = 0 a1.sinks.oss1.hdfs.fileType = DataStream # Use a channel which buffers events in memory a1.channels.c1.type = memory a1.channels.c1.capacity = 100000 a1.channels.c1.transactionCapacity = 10000 # Bind the source and sink to the channel a1.sources.r1.channels = c1 a1.sinks.oss1.channel = c1 In the configuration of Hadoop core-site.xml, you need to modify /etc/emr/hadoop-conf/core-site.xml, and add OSS-related configuration. <property> <name>fs.oss.endpoint</name> <value>http://oss-cn-hangzhou.aliyuncs.com/</value> </property> <property> <name>fs.oss.accessKeyId</name> <value>set-access-key-id</value> </property> <property> <name>fs.oss.accessKeySecret</name> <value>set-access-key-secret</value> </property> Start the Flume service: bin/flume-ng agent --conf conf --conf-file conf/kafka-example.conf --name a1 \ -Dflume.root.logger=INFO,console In the log you can see that Flume HDFS sinker writes data to the OSS and the writes rotate once every ten seconds. 2016-12-05 18:41:04,794 (hdfs-oss1-call-runner-1) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:618)] Renaming oss://emr-perform/kafka/test/16-12-05/Flume Data.1480934454657.tmp to oss://emr-perform/kafka/test/16-12-05/FlumeData.1480934454657 2016-12-05 18:41:04,852 (hdfs-oss1-roll-timer-0) [INFO - org.apache.flume.sink.hdfs.HDFSEventSink$1.run(HDFSEventSink.java:382)] Writer callback called. View results on OSS $ hadoop fs -ls oss://emr-examples/kafka/test/16-12-05/ Found 6 items -rw-rw-rw- 1 162691 2016-12-05 18:40 oss://emr-examples/kafka/test/16-12-05/FlumeData.1480934394566 -rw-rw-rw- 1 925 2016-12-05 18:40 oss://emr-examples/kafka/test/16-12-05/FlumeData.1480934407580 -rw-rw-rw- 1 1170 2016-12-05 18:40 oss://emr-examples/kafka/test/16-12-05/FlumeData.1480934418597 -rw-rw-rw- 1 1092 2016-12-05 18:40 oss://emr-examples/kafka/test/16-12-05/FlumeData.1480934430613 -rw-rw-rw- 1 1254 2016-12-05 18:40 oss://emr-examples/kafka/test/16-12-05/FlumeData.1480934443638 -rw-rw-rw- 1 588 2016-12-05 18:41 oss://emr-examples/kafka/test/16-12-05/FlumeData.1480934454657 $ hadoop fs -cat oss://emr-examples/kafka/test/16-12-05/FlumeData.1480934443638 0 0 0 1911216 50036 1343828 0 0 0 0 1341 2396 1 1 98 0 0 0 0 0 1896964 50052 1343824 0 0 0 112 1982 2511 15 1 84 0 0 1 0 0 1896552 50052 1343828 0 0 0 76 2314 3329 3 4 94 0 0 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 5 0 0 1903016 50052 1343828 0 0 0 0 2277 3249 2 4 94 0 0 0 0 0 1902892 50052 1343828 0 0 0 0 1417 2366 5 0 95 0 0 0 0 0 1902892 50052 1343828 0 0 0 0 1072 2243 0 0 99 0 0 0 0 0 1902892 50068 1343824 0 0 0 144 1275 2283 1 0 99 0 0 1 0 0 1903024 50068 1343828 0 0 0 24 1099 2071 1 1 99 0 0 0 0 0 1903272 50068 1343832 0 0 0 0 1294 2238 1 1 99 0 0 1 0 0 1903412 50068 1343832 0 0 0 0 1024 2094 1 0 99 0 0 2 0 0 1903148 50076 1343836 0 0 0 68 1879 2766 1 1 98 0 0 1 0 0 1903288 50092 1343840 0 0 0 92 1147 2240 1 0 99 0 0 0 0 0 1902792 50092 1343844 0 0 0 28 1456 2388 1 1 98 0 0 References 1. http://kafka.apache.org/quickstart 2. https://www.cloudera.com/documentation/kafka/latest/topics/kafka_flume.html