Spark Tips 3: Reading Kafka Messages and Their OffsetRanges in a Spark Streaming Job

When a Spark Streaming job reads messages from one or more Kafka topics, we sometimes need to record the OffsetRange of each batch of messages as it is read. Both of the following snippets (Code 1 and Code 2) achieve this correctly, and they are equivalent in effect.

Code 1 (correct):

-----------------------

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet);

messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> rdd) throws Exception {
        // The RDD handed to foreachRDD is still the KafkaRDD produced by the
        // direct stream, so the cast to HasOffsetRanges succeeds.
        OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        JavaRDD<String> valueRDD = rdd.values();
        long msgNum = processEachRDD(valueRDD, outputFolderPath, definedDuration);
        if (msgNum > 0 && zkPathRoot != null) {
            writeOffsetToZookeeper(zkClient, zkPathRoot, offsets);
        }
        return null;
    }
});

Code 2 (correct). Here the offsets are captured inside a transformToPair() step, while each batch's RDD is still the original KafkaRDD, and stashed in an AtomicReference on the driver for later use in foreachRDD:

-----------------------

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet);

final AtomicReference<OffsetRange[]> offsetRanges = new AtomicReference<OffsetRange[]>();

JavaDStream<String> lines = messages.transformToPair(
        new Function<JavaPairRDD<String, String>, JavaPairRDD<String, String>>() {
            @Override
            public JavaPairRDD<String, String> call(JavaPairRDD<String, String> rdd) throws Exception {
                // Capture the offsets while the RDD is still the original KafkaRDD.
                OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                offsetRanges.set(offsets);
                return rdd;
            }
        }).map(new Function<Tuple2<String, String>, String>() {
            @Override
            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });

lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        long msgNum = processEachRDD(rdd, outputFolderPath, definedDuration);
        if (msgNum > 0 && zkPathRoot != null) {
            OffsetRange[] offsets = offsetRanges.get();
            writeOffsetToZookeeper(zkClient, zkPathRoot, offsets);
        }
        return null;
    }
});
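
Both snippets above delegate the actual persistence to a writeOffsetToZookeeper() helper that the post does not show. Below is a minimal sketch of what such a helper might look like; the use of the I0Itec ZkClient (the client bundled with Kafka 0.8.x) and the <zkPathRoot>/<topic>/<partition> path layout are assumptions, not the author's actual implementation.

-----------------------

import org.I0Itec.zkclient.ZkClient;
import org.apache.spark.streaming.kafka.OffsetRange;

// Hypothetical sketch: persist the until-offset of each OffsetRange under
// <zkPathRoot>/<topic>/<partition>; path layout and serialization are assumptions.
public static void writeOffsetToZookeeper(ZkClient zkClient, String zkPathRoot, OffsetRange[] offsets) {
    for (OffsetRange o : offsets) {
        String path = zkPathRoot + "/" + o.topic() + "/" + o.partition();
        if (!zkClient.exists(path)) {
            zkClient.createPersistent(path, true); // also creates missing parent nodes
        }
        zkClient.writeData(path, String.valueOf(o.untilOffset()));
    }
}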

Note, however, that the following two snippets (Code 3 and Code 4) are wrong: both throw an exception, java.lang.ClassCastException: org.apache.spark.rdd.MapPartitionsRDD cannot be cast to org.apache.spark.streaming.kafka.HasOffsetRanges. Only the RDDs produced directly by the Kafka direct stream implement HasOffsetRanges; once any transformation such as transform() or map() has been applied, the resulting MapPartitionsRDD no longer carries the offset information. That is why the cast succeeds in Code 1 (done in foreachRDD before any transformation) and in Code 2 (done inside transformToPair before the map), but fails below, where the cast happens after the stream has already been transformed.

Code 3 (incorrect):

-----------------------

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet);

messages.transform(new Function<JavaPairRDD<String, String>, JavaRDD<String>>() {
    @Override
    public JavaRDD<String> call(JavaPairRDD<String, String> rdd) throws Exception {
        return rdd.values();
    }
}).foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        long msgNum = processEachRDD(rdd, outputFolderPath, definedDuration);
        if (msgNum > 0 && zkPathRoot != null) {
            // rdd here is a MapPartitionsRDD produced by transform(), not the
            // original KafkaRDD, so this cast throws the ClassCastException.
            OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            writeOffsetToZookeeper(zkClient, zkPathRoot, offsets);
        }
        return null;
    }
});

Code 4 (incorrect):

-----------------------

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet);

messages.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
}).foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        long msgNum = processEachRDD(rdd, outputFolderPath, definedDuration);
        if (msgNum > 0 && zkPathRoot != null) {
            // rdd here is a MapPartitionsRDD produced by map(), not the
            // original KafkaRDD, so this cast throws the ClassCastException.
            OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            writeOffsetToZookeeper(zkClient, zkPathRoot, offsets);
        }
        return null;
    }
});
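
The point of persisting the offsets is usually to resume from them after a restart. As a companion sketch (an assumption, not part of the original post), the spark-streaming-kafka 0.8 API provides a createDirectStream overload that accepts an explicit fromOffsets map (kafka.common.TopicAndPartition to Long) plus a messageHandler; readOffsetsFromZookeeper() below is a hypothetical helper mirroring writeOffsetToZookeeper().

-----------------------

// Hypothetical helper mirroring writeOffsetToZookeeper(): reads each
// <zkPathRoot>/<topic>/<partition> node back into a fromOffsets map.
Map<TopicAndPartition, Long> fromOffsets = readOffsetsFromZookeeper(zkClient, zkPathRoot);

// Resume the direct stream from the saved offsets; the messageHandler keeps
// only the message value, like the map() step in Code 2.
JavaInputDStream<String> restored = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        String.class,
        kafkaParams,
        fromOffsets,
        new Function<MessageAndMetadata<String, String>, String>() {
            @Override
            public String call(MessageAndMetadata<String, String> mmd) {
                return mmd.message();
            }
        });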

Originally published on the WeChat public account 悦思悦读 (yuesiyuedu) on 2015-11-26.
