Kafka log analysis

Original

用户9135661

Last modified: 2023-12-15 23:52:19

This article is part of the column: 深入理解Kafka (Understanding Kafka in Depth)

Log directory overview

Listing the log directory of one partition shows the following files:

ls -alth device-bsj-2
total 1.8G
-rw-r--r--   1 root root 771M Dec 15 23:29 00000000001396172243.log
drwxr-xr-x 150 root root  12K Dec 15 23:28 ..
-rw-r--r--   1 root root  10M Dec 15 23:28 00000000001396172243.index
-rw-r--r--   1 root root  10M Dec 15 23:28 00000000001396172243.timeindex
drwxr-xr-x   2 root root 4.0K Dec 12 17:37 .
-rw-r--r--   1 root root   64 Dec 12 17:36 leader-epoch-checkpoint
-rw-r--r--   1 root root   10 Dec 12 14:57 00000000001396172243.snapshot
-rw-r--r--   1 root root 2.0M Dec 12 14:57 00000000001390103904.index
-rw-r--r--   1 root root 3.0M Dec 12 14:57 00000000001390103904.timeindex
-rw-r--r--   1 root root 1.0G Dec 12 14:57 00000000001390103904.log
-rw-r--r--   1 root root   10 Dec 12 00:30 00000000001395027430.snapshot
-rw-r--r--   1 root root   10 Dec  8 14:11 00000000001390103904.snapshot
-rw-r--r--   1 root root   43 Sep  9 11:29 partition.metadata

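Each segment has three files:

.log stores the raw messages.

.index stores the mapping from a message's offset to its position in the .log file, that is, from logical offset to physical offset.

.timeindex stores the mapping from timestamp to logical offset. Whether that timestamp is the message creation time depends on the topic's message.timestamp.type setting: CreateTime (the default) uses the time set by the producer, LogAppendTime uses the time at which the broker appended the message.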
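To make the two mappings concrete, here is a minimal Python sketch that decodes the raw bytes of both index files. It assumes the on-disk layout used by current Kafka versions: each .index entry is 8 bytes (a 4-byte offset relative to the segment's base offset, then a 4-byte position in the .log file), and each .timeindex entry is 12 bytes (an 8-byte timestamp in milliseconds, then a 4-byte relative offset), all big-endian. The function names are made up for this illustration, and treating an all-zero entry as the start of the preallocated padding is a simplifying assumption.

```python
import struct

# Assumed entry layouts (big-endian):
OFFSET_ENTRY = struct.Struct(">ii")  # relative offset, position in the .log file
TIME_ENTRY = struct.Struct(">qi")    # timestamp in ms, relative offset


def base_offset_from_name(filename):
    """The segment's base offset is the zero-padded number in the file name."""
    return int(filename.split(".")[0])


def _parse(data, layout):
    """Yield decoded entries, stopping at the zero padding of a preallocated file."""
    usable = len(data) - len(data) % layout.size
    for i in range(0, usable, layout.size):
        fields = layout.unpack_from(data, i)
        if i > 0 and not any(fields):  # assumption: an all-zero entry is padding
            break
        yield fields


def parse_offset_index(data, base_offset):
    """Return a list of (absolute offset, position in .log)."""
    return [(base_offset + rel, pos) for rel, pos in _parse(data, OFFSET_ENTRY)]


def parse_time_index(data, base_offset):
    """Return a list of (timestamp in ms, absolute offset)."""
    return [(ts, base_offset + rel) for ts, rel in _parse(data, TIME_ENTRY)]


if __name__ == "__main__":
    # Synthetic bytes standing in for a real file: two entries plus zero padding.
    base = base_offset_from_name("00000000001396172243.index")
    raw = OFFSET_ENTRY.pack(35, 4100) + OFFSET_ENTRY.pack(71, 8230) + bytes(16)
    print(parse_offset_index(raw, base))
```

For a real segment, the bytes would come from reading the file in binary mode, for example open(path, "rb").read().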

Index files

With the two index files, .index and .timeindex, we can do the following:

1. Consume from a specified offset

 ./kafka-console-consumer.sh  --bootstrap-server 127.0.0.1:9092 --topic test2 --partition 0 --offset 100

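2. Consume from a specified point in time. For example, to make the consumer group group1 re-consume data from a given time (such as 10:00 yesterday), reset its offsets to that timestamp; the command below uses 06:00 on 2023-06-01 (UTC+8).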

./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group group1  --reset-offsets --all-topics --to-datetime '2023-06-01T06:00:00.000+08:00' --execute
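The time index is keyed by timestamps in epoch milliseconds, so a date string like the one passed to --to-datetime has to be converted to that unit before anything can be looked up. The following is a small Python sketch of that conversion; to_epoch_millis is a hypothetical helper written for this article, not part of any Kafka tool, and it assumes the string carries an explicit UTC offset.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text):
    """Convert an ISO-8601 string with an explicit UTC offset to epoch milliseconds."""
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        # Assumption of this sketch: refuse ambiguous input instead of guessing a zone.
        raise ValueError("timestamp must include a UTC offset, e.g. +08:00")
    # Integer division by one millisecond avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(to_epoch_millis("2023-06-01T06:00:00.000+08:00"))
```

The resulting number is the kind of value that is compared against the timestamps stored in .timeindex.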

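Question: is any special algorithm used when searching these two index files? How is the offset for a given point in time found quickly?

Let's try dumping the contents of the index files: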

/usr/local/kafka2.8/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /mnt/kafka-disk/kaa-logs/device-bsj-3/00000000000152270769.timeindex

/usr/local/kafka2.8/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /mnt/kafka-disk/kaa-logs/device-bsj-3/00000000000152270769.index

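The output shows that the index files do not store every offset, which is the key point: the index is sparse. In the directory listing at the top, the .index file of the active segment is 10M against a 771M .log file. That 10M is the size preallocated for the active segment's index (segment.index.bytes, 10 MB by default), not the amount of index data actually written; the segment that has already been rolled has only a 2.0M .index for a 1.0G .log. Either way the index is a small fraction of the log, so locating an offset through the index reads far less data than scanning the raw log file, by roughly two orders of magnitude or more.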

As an answer on Stack Overflow (https://stackoverflow.com/questions/19394669/why-do-index-files-exist-in-the-kafka-log-directory) puts it: not every message within a log has its corresponding message within the index. The configuration parameter index.interval.bytes, which is 4096 bytes by default, sets an index interval which basically describes how frequently (after how many bytes) an index entry will be added.
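The effect of index.interval.bytes can be illustrated with a short simulation. This is a simplified Python sketch, not the broker's code: build_sparse_index is a made-up name, each batch is reduced to a (first offset, size in bytes) pair, and an entry is added only once more than index_interval_bytes have been written since the previous entry.

```python
def build_sparse_index(batches, index_interval_bytes=4096):
    """Simulate sparse indexing.

    batches: iterable of (first_offset, size_in_bytes) in log order.
    Returns a list of (offset, position in the log file).
    """
    index = []
    position = 0                # byte position where the next batch starts
    bytes_since_last_entry = 0  # bytes appended since the last index entry
    for first_offset, size in batches:
        if bytes_since_last_entry > index_interval_bytes:
            index.append((first_offset, position))
            bytes_since_last_entry = 0
        position += size
        bytes_since_last_entry += size
    return index


if __name__ == "__main__":
    # Twelve batches of 1000 bytes, 10 messages each (made-up numbers).
    batches = [(i * 10, 1000) for i in range(12)]
    print(build_sparse_index(batches))
```

With 1000-byte batches, only about one batch in five gets an index entry, which is why the index stays so much smaller than the log.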

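So looking up an offset does not mean walking through every message, and the index itself is not traversed sequentially either. Index entries are fixed-size and sorted, so the broker binary-searches the memory-mapped index for the largest entry whose offset (or timestamp, for .timeindex) is not greater than the target, and then scans the .log file sequentially only from the position that entry points to.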
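Because index entries are fixed-size and sorted by key, the largest entry at or below a target can be found by binary search, which is how the Kafka broker locates a starting position. The Python sketch below illustrates the idea on in-memory lists; the function names and sample numbers are invented for this example, and the real broker works on memory-mapped files rather than Python lists.

```python
def largest_entry_at_or_below(entries, target):
    """Binary search: index of the last entry whose key (first field) <= target, or -1."""
    lo, hi = 0, len(entries) - 1
    result = -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if entries[mid][0] <= target:
            result = mid   # candidate found, look for a later one
            lo = mid + 1
        else:
            hi = mid - 1
    return result


def position_for_offset(offset_index, base_offset, target_offset):
    """Return (offset, position) from which a sequential scan of the .log should start."""
    slot = largest_entry_at_or_below(offset_index, target_offset)
    if slot == -1:
        return (base_offset, 0)  # target precedes every entry: start of the segment
    return offset_index[slot]


def position_for_timestamp(time_index, offset_index, base_offset, target_ts):
    """Two steps: timestamp -> offset via the time index, then offset -> position."""
    slot = largest_entry_at_or_below(time_index, target_ts)
    start_offset = base_offset if slot == -1 else time_index[slot][1]
    return position_for_offset(offset_index, base_offset, start_offset)


if __name__ == "__main__":
    # Made-up sample data: (offset, position) and (timestamp in ms, offset).
    offset_index = [(1050, 5000), (1100, 10000), (1150, 15000)]
    time_index = [(1685570400000, 1050), (1685570460000, 1150)]
    print(position_for_offset(offset_index, 1000, 1120))
    print(position_for_timestamp(time_index, offset_index, 1000, 1685570430000))
```

The returned position is only a starting point: since the index is sparse, the exact message is found by reading the .log forward from there.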

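Original work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.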
