Today is week 48 of 2017, day 331 of 2017.
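Problem description:
After the Storm system restarts, it still reads records from historical Kafka data.
Problem category: KafkaSpout duplicate consumption
Resolution steps:
1 Checked the code and found no problem
Storm reads data from Kafka
Code involved: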
public class SpoutConfig extends KafkaConfig implements Serializable
public class KafkaSpout extends BaseRichSpout
How KafkaSpout stores offsets of a Kafka topic and recovers in case of failures
As shown in the above KafkaConfig properties, you can control from where in the Kafka topic the spout begins to read by setting KafkaConfig.startOffsetTime as follows:
1. kafka.api.OffsetRequest.EarliestTime(): read from the beginning of the topic (i.e. from the oldest messages onwards)
2. kafka.api.OffsetRequest.LatestTime(): read from the end of the topic (i.e. any new messages that are being written to the topic)
3. A Unix timestamp in milliseconds since the epoch (e.g. via System.currentTimeMillis()): see "How do I accurately get offsets of messages for a certain timestamp using OffsetRequest?" in the Kafka FAQ
Usage:
// Handling out-of-range offsets
spoutConf.ignoreZkOffsets = false; // false (the default): resume from the offset stored in ZooKeeper
spoutConf.useStartOffsetTimeIfOffsetOutOfRange = true;
spoutConf.startOffsetTime = kafka.api.OffsetRequest.LatestTime();
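The interaction of these three settings can be sketched as a small decision function. This is a simplified model written for illustration, not the storm-kafka source: the class StartOffsetModel, the method chooseStartOffset and its parameters are hypothetical, and it collapses the restart-time check and the fetch-time out-of-range check into one step. Only the two sentinel values (-1 for LatestTime, -2 for EarliestTime) are taken from Kafka.

```java
// Simplified, hypothetical model of how the spout settings pick a start offset.
// It is NOT the storm-kafka implementation.
final class StartOffsetModel {
    static final long LATEST_TIME = -1L;   // value of kafka.api.OffsetRequest.LatestTime()
    static final long EARLIEST_TIME = -2L; // value of kafka.api.OffsetRequest.EarliestTime()

    /**
     * @param committed  offset stored in ZooKeeper, or null if none was stored
     * @param earliest   oldest offset still available in the partition
     * @param latest     offset of the next message to be written to the partition
     */
    static long chooseStartOffset(Long committed,
                                  boolean ignoreZkOffsets,
                                  boolean useStartOffsetTimeIfOffsetOutOfRange,
                                  long startOffsetTime,
                                  long earliest,
                                  long latest) {
        long fromStartTime;
        if (startOffsetTime == EARLIEST_TIME) {
            fromStartTime = earliest;
        } else if (startOffsetTime == LATEST_TIME) {
            fromStartTime = latest;
        } else {
            // Timestamp lookups are outside the scope of this sketch.
            throw new IllegalArgumentException("only the two sentinel values are modelled");
        }

        // No stored offset, or the stored offset is deliberately ignored.
        if (committed == null || ignoreZkOffsets) {
            return fromStartTime;
        }
        // Stored offset no longer exists in the partition (e.g. deleted by retention).
        boolean outOfRange = committed < earliest || committed > latest;
        if (outOfRange && useStartOffsetTimeIfOffsetOutOfRange) {
            return fromStartTime;
        }
        return committed;
    }
}
```

Under this model, with the settings above, a restarted topology resumes from whatever offset was last written to ZooKeeper. If that stored offset never advanced, the same historical records are read again.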
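The offset was checked repeatedly and had not been modified within half an hour.
The final conclusion was that tuple tree tracking was hurting ack performance.
The code was adjusted as follows:
conf.setNumAckers(0); // the tuple tree is no longer tracked; tuples are acked as soon as the spout emits them, so failed tuples are not replayed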
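Why slow acks keep the stored offset from moving can be sketched with a small model. It assumes, as a simplification, that the spout may only commit up to the smallest offset that is still waiting for an ack. The class OffsetCommitModel and its methods are hypothetical names for illustration, not storm-kafka code.

```java
import java.util.TreeSet;

// Hypothetical model: the committable offset is bounded by the oldest un-acked offset.
final class OffsetCommitModel {
    private final TreeSet<Long> pending = new TreeSet<>(); // emitted but not yet acked
    private long emittedTo;                                // next offset to emit

    OffsetCommitModel(long startOffset) {
        this.emittedTo = startOffset;
    }

    /** Emit the next message and remember it as pending. */
    long emit() {
        long offset = emittedTo++;
        pending.add(offset);
        return offset;
    }

    /** Called when the tuple for this offset has been fully processed. */
    void ack(long offset) {
        pending.remove(offset);
    }

    /** The offset that is safe to write to ZooKeeper. */
    long committableOffset() {
        return pending.isEmpty() ? emittedTo : pending.first();
    }
}
```

In this model a single un-acked tuple pins the committable offset, however many later tuples have been acked. With zero ackers every tuple is acked right after emit, so the pending set drains immediately and the offset advances, at the cost of losing replay on failure.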
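Storm processing flow review:
The flow chart is as follows:
ACK: ensures that data in a distributed topology is not lost. Tuples that fail or time out are replayed, so the guarantee is at-least-once and a replayed tuple can be processed again.
Drawback: the pending data for acks is kept in a LinkedList, and traversing it sequentially still costs performance.
Acks are confirmed asynchronously.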
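For reference, Storm's "Guaranteeing Message Processing" documentation describes the acker task as tracking each spout tuple's tree with a single 64-bit value that is XORed with every tuple id as it is created and again as it is acked. The tree is complete when the value returns to zero. A minimal sketch of that idea follows; the class XorAckerModel and the method update are hypothetical names, and the real acker also handles timeouts and failures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of XOR-based tuple tree tracking, one value per spout tuple.
final class XorAckerModel {
    private final Map<Long, Long> ackVals = new HashMap<>();

    /**
     * XOR a value into the entry for the given spout tuple.
     * Returns true when the entry reaches zero, i.e. every tuple id that
     * was XORed in at creation has also been XORed in at ack time.
     */
    boolean update(long rootId, long xorValue) {
        long value = ackVals.getOrDefault(rootId, 0L) ^ xorValue;
        if (value == 0L) {
            ackVals.remove(rootId); // tree fully processed, spout can be notified
            return true;
        }
        ackVals.put(rootId, value);
        return false;
    }
}
```

Example: the spout emits tuple id A, a bolt acks A while emitting a child B (sending A ^ B), and a second bolt acks B. The three updates XOR to zero, so the third call reports the tree as complete.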
This write-up is rather disorganized; a follow-up is planned.
References