How to Monitor Zookeeper

Monitoring Zookeeper: Metrics and Alerts

As in previous articles, our general rule of thumb is “collect all possible/reasonable metrics that can help when troubleshooting, alert only on those that require an action from you”. The list of Zookeeper metrics that satisfies these criteria is not that long.

Zookeeper process is running

Metric: Zookeeper process
Comments: Is the right binary daemon process running?
Suggested alert: When the process list contains no match for the regexp /usr/bin/java*org.apache.zookeeper$.

You can also use the following script to check if the server is running:

$INSTALL_PREFIX/zk-server-3/bin/zkServer.sh status

Or, if you run Zookeeper via supervisord (recommended), you can alert on the supervisord resource instead.
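The process check described above can be sketched in a few lines of Python. This is a minimal illustration, not the check Server Density runs: the pattern is an assumption adapted from the regexp in the text (relaxed so that it matches a typical command line with JVM flags between the binary and the class name), and the helper names are hypothetical.

```python
import re
import subprocess

# Assumed pattern: "java", then anything, then the Zookeeper package name.
ZK_PROCESS_PATTERN = re.compile(r"java.*org\.apache\.zookeeper")


def zookeeper_running(process_lines):
    """Return True if any command line looks like a Zookeeper server."""
    return any(ZK_PROCESS_PATTERN.search(line) for line in process_lines)


def list_process_lines():
    """Return one command line per running process (POSIX ps, no header)."""
    result = subprocess.run(
        ["ps", "-A", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling zookeeper_running(list_process_lines()) from a cron job or an agent plugin and alerting when it returns False gives the behaviour suggested above.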

System Metrics

Metric: Memory usage
Comments: Zookeeper should run entirely in RAM. To avoid swapping, the JVM heap size shouldn’t be bigger than your available RAM.
Suggested alert: None

Metric: Swap usage
Comments: Watch for swap usage, as it will degrade Zookeeper performance and lead to operations timing out (set vm.swappiness = 0).
Suggested alert: When used swap is > 128MB.

Metric: Network bandwidth
Comments: Zookeeper servers can incur high network usage. Keep an eye on this, especially if you notice any performance degradation. Also look out for dropped packet errors. A typical Zookeeper workload is about 20% writes and 80% reads. More nodes result in more writes and higher overall traffic.
Suggested alert: None

Metric: Disk usage
Comments: Zookeeper data is usually ephemeral and small. Still, we recommend putting dataLogDir on a dedicated partition and watching its disk usage. Use the purge task to clean up dataDir and dataLogDir.
Suggested alert: When disk usage is > 85%.

Zookeeper writes snapshots asynchronously, but it syncs each transaction to the transaction log before acknowledging it, so write latency suffers when the disk is busy. Keep an eye on disk IO, especially if your server is shared with other services, say Kafka.
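As a rough illustration of the two system alerts suggested above (used swap > 128MB, disk usage > 85%), here is a minimal Python sketch. It assumes a Linux host where swap figures come from /proc/meminfo; the function names are hypothetical and the path you pass in is whatever partition holds your dataLogDir.

```python
import shutil

SWAP_ALERT_MB = 128       # suggested swap threshold
DISK_ALERT_PERCENT = 85   # suggested disk usage threshold


def swap_used_mb(meminfo_text):
    """Compute used swap in MB from the text of /proc/meminfo (values in kB)."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return (values["SwapTotal"] - values["SwapFree"]) / 1024


def disk_used_percent(path):
    """Return the used percentage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def system_alerts(meminfo_text, data_log_dir):
    """Return a list of alert messages for the thresholds above."""
    alerts = []
    if swap_used_mb(meminfo_text) > SWAP_ALERT_MB:
        alerts.append("used swap above %d MB" % SWAP_ALERT_MB)
    if disk_used_percent(data_log_dir) > DISK_ALERT_PERCENT:
        alerts.append("disk usage above %d%%" % DISK_ALERT_PERCENT)
    return alerts
```

On a real host you would read the first argument with open("/proc/meminfo").read() and pass your own dataLogDir path as the second.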

[Graph: disk usage and memory usage as plotted by Server Density. Note the up and down curves created by the purge task.]

[Screenshot: Zookeeper alerts configured in Server Density.]

Zookeeper Metrics

Metric: Request Avg/Max Latency
Comments: Amount of time it takes for the server to respond to a client request (since the server was started).
Suggested alert: When latency > 10 (ticks).

Metric: Outstanding Requests
Comments: Number of queued requests in the server. This goes up when the server receives more requests than it can process.
Suggested alert: When count > 10.

Metric: Received
Comments: Number of client requests (typically operations) received.
Suggested alert: None

Metric: Sent
Comments: Number of client packets sent (responses and notifications).
Suggested alert: None

Metric: File Descriptors
Comments: Number of file descriptors in use, relative to the limit.
Suggested alert: When FD percentage > 85%.

Metric: Mode
Comments: Serving mode: leader or follower, or standalone if not running in an ensemble.
Suggested alert: None

Metric: Pending syncs
Comments: (Only exposed by the leader) Number of pending syncs from the followers.
Suggested alert: When pending > 10.

Metric: Followers
Comments: (Only exposed by the leader) Number of followers within the ensemble. You can deduce the number of servers from the MBean Quorum Size.
Suggested alert: When followers != (number of ensemble servers - 1).

Metric: Node count
Comments: Number of znodes in the Zookeeper namespace.
Suggested alert: None

Metric: Watch count
Comments: Number of watchers set up over Zookeeper nodes.
Suggested alert: None

Metric: Heap Memory Usage
Comments: Memory allocated dynamically by the Java process, Zookeeper in this case.
Suggested alert: None
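The suggested alerts above translate fairly directly into code. The following Python sketch is one possible reading, not a definitive implementation: it assumes the metrics arrive as a dictionary keyed by the names that the mntr command reports (zk_avg_latency, zk_outstanding_requests and so on), it checks average rather than maximum latency, and the function name and the ensemble_size parameter are hypothetical.

```python
def evaluate_alerts(metrics, ensemble_size):
    """Return alert messages for a dict of mntr-style metrics."""
    alerts = []

    if metrics.get("zk_avg_latency", 0) > 10:
        alerts.append("average latency above 10")

    if metrics.get("zk_outstanding_requests", 0) > 10:
        alerts.append("outstanding requests above 10")

    # File descriptor counts are only reported on Unix platforms.
    open_fds = metrics.get("zk_open_file_descriptor_count")
    max_fds = metrics.get("zk_max_file_descriptor_count")
    if open_fds is not None and max_fds:
        if 100.0 * open_fds / max_fds > 85:
            alerts.append("file descriptor usage above 85%")

    # Pending syncs and follower counts are only exposed by the leader.
    if metrics.get("zk_server_state") == "leader":
        if metrics.get("zk_pending_syncs", 0) > 10:
            alerts.append("pending syncs above 10")
        if metrics.get("zk_followers") != ensemble_size - 1:
            alerts.append("follower count differs from ensemble size - 1")

    return alerts
```

Keeping the thresholds in one function makes it easy to tune them per ensemble, since values such as 10 outstanding requests are starting points rather than fixed rules.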

[Graph: Zookeeper monitoring graph showing average latency and outstanding requests.]

Zookeeper Monitoring Tools

The simplest way to monitor Zookeeper and collect these metrics is by using the commands known as “4 letter words” within the ZK community. You can run these using telnet or netcat directly:

$ echo ruok | nc 127.0.0.1 5111
imok
 
$ echo mntr | nc localhost 5111
zk_version  3.4.0
zk_avg_latency  0
zk_max_latency  0
zk_min_latency  0
zk_packets_received 70
zk_packets_sent 69
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count   4
zk_watch_count  0
zk_ephemerals_count 0
zk_approximate_data_size    27
zk_followers    4                   - only exposed by the Leader
zk_synced_followers 4               - only exposed by the Leader
zk_pending_syncs    0               - only exposed by the Leader
zk_open_file_descriptor_count 23    - only available on Unix platforms
zk_max_file_descriptor_count 1024   - only available on Unix platforms
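If you would rather collect these values from a script than from netcat, the same exchange can be done over a plain TCP socket. The Python sketch below is an illustration under a few assumptions: the helper names are hypothetical, the server closes the connection after replying, and real mntr output is a key and a value separated by whitespace (the "only exposed by the Leader" notes in the listing above are annotations, not part of the output). Newer Zookeeper releases may also require the command to be enabled through the 4lw.commands.whitelist setting.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four letter word to a Zookeeper server and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection, reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn mntr output into a dict, converting integer values to int."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts[0], parts[1].strip()
        try:
            metrics[key] = int(value)
        except ValueError:
            metrics[key] = value  # e.g. zk_version, zk_server_state
    return metrics
```

With the port used in the examples above, parse_mntr(four_letter_word("localhost", 5111, "mntr")) would return a dictionary such as {"zk_server_state": "leader", ...} that you can feed to your alerting logic.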

We’ve looked at mytop for MySQL, and memcache-top for Memcached. Well, Zookeeper has one too, zktop:

$ ./zktop.py --servers "localhost:2181,localhost:2182,localhost:2183"
Ensemble -- nodecount:10 zxid:0x1300000001 sessions:4
SERVER           PORT M      OUTST    RECVD     SENT CONNS MINLAT AVGLAT MAXLAT
localhost        2181 F          0       93       92     2      2      7     13
localhost        2182 F          0       37       36     1      0      0      0
localhost        2183 L          0       36       35     1      0      0      0

CLIENT           PORT I   QUEUE RECVD  SENT
127.0.0.1       34705 1       0    56    56
127.0.0.1       35943 1       0     1     0
127.0.0.1       33999 1       0     1     0
127.0.0.1       37988 1       0     1     0

If you are after more detailed metrics, you can access those through JMX. You could also take the DIY road and go for JMXTrans and Graphite, or use Nagios/Cacti/Ganglia with check_zookeeper.py. Alternatively, you can save time (and preserve your sanity) by choosing a hosted service like Server Density (that’s us!).

If you want to test the quality and performance of your Zookeeper ensemble, then zk-smoketest with zk-smoketest.py and zk-latencies.py are great tools to check out.

Zookeeper Management tools

There are not too many management options out there. The folks at Netflix have released Exhibitor, a tool that provides some basic monitoring, log cleanup (for old versions), backup/restore, ensemble configuration and node visualization. There is also zookeeper_dashboard, but it hasn’t been updated in years.

Further reading

Did this article pique your interest in Zookeeper? Nice, keep reading. We found Scott Leberknight’s Zookeeper series of blog posts to be worthwhile. We also like these presentations:

  • Building an Impenetrable Zookeeper (includes video).
  • Apache Zookeeper is a long presentation that covers some required concepts of distributed systems.
  • Zookeeper in the Wild goes straight to the point on operating a Zookeeper ensemble.

Originally published on the WeChat official account 云计算与大数据 (heidcloud).

Original publication date: 2018-08-08
