How to Monitor Zookeeper

Monitoring Zookeeper: Metrics and Alerts

As per previous articles, our general rule of thumb is “collect all possible/reasonable metrics that can help when troubleshooting, alert only on those that require an action from you”. Well, the list of Zookeeper metrics that satisfy these criteria is not that long.

Zookeeper process is running

| Metric | Comments | Suggested Alert |
| --- | --- | --- |
| Zookeeper process | Is the right binary daemon process running? | When no process in the process list matches the regexp /usr/bin/java.*org\.apache\.zookeeper. |

You can also use the following script to check if the server is running:

$INSTALL_PREFIX/zk-server-3/bin/zkServer.sh status

Or, if you run Zookeeper via supervisord (recommended), you can alert on the supervisord resource instead.
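
The process check can also be scripted. Here is a minimal sketch in Python, assuming a POSIX `ps` is available. The pattern and the function names are illustrative assumptions (your Java path and main class may differ), not part of any Zookeeper tooling.

```python
import re
import subprocess

# Assumed pattern: a Java process whose command line names a Zookeeper class.
ZK_PROCESS_RE = re.compile(r"java.*org\.apache\.zookeeper")


def zookeeper_running(command_lines):
    """Return True if any command line looks like a Zookeeper JVM."""
    return any(ZK_PROCESS_RE.search(line) for line in command_lines)


def list_command_lines():
    """Return the full command line of every process, using POSIX ps."""
    result = subprocess.run(
        ["ps", "-A", "-o", "args="],  # "args=" suppresses the header line
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

An alerting script would call `zookeeper_running(list_command_lines())` and raise the alert when it returns False.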

System Metrics

| Metric | Comments | Suggested Alert |
| --- | --- | --- |
| Memory usage | Zookeeper should run entirely in RAM. To avoid swapping, the JVM heap size shouldn’t be bigger than your available RAM. | None |
| Swap usage | Watch for swap usage, as it will degrade Zookeeper performance and lead to operations timing out (set vm.swappiness = 0). | When used swap is > 128MB. |
| Network bandwidth | Zookeeper servers can incur high network usage. Keep an eye on this, especially if you notice any performance degradation. Also look out for dropped packet errors. A typical Zookeeper workload is 20% writes, 80% reads. More nodes result in more writes and higher overall traffic. | None |
| Disk usage | Zookeeper data is usually ephemeral and small. Still, we recommend putting dataLogDir on a dedicated partition and watching disk usage. Use the purge task to clean up dataDir and dataLogDir. | When disk is > 85% usage. |

Zookeeper writes snapshots asynchronously, but by default it syncs each transaction to the log on disk before acknowledging it, so slow disks show up as request latency. Keep an eye on disk IO, especially if your server is shared with other services, say Kafka.
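
The swap and disk thresholds above are easy to check by hand. The following is a rough sketch, assuming a Linux host where /proc/meminfo is available. The function names are made up for illustration. Note that `shutil.disk_usage` does not account for blocks reserved for root, so its percentage can differ slightly from what `df` reports.

```python
import shutil

SWAP_LIMIT_MB = 128      # suggested swap threshold from the table
DISK_LIMIT_PERCENT = 85  # suggested disk threshold from the table


def swap_used_mb(meminfo_text):
    """Compute used swap in MB from the text of /proc/meminfo (values in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return (fields["SwapTotal"] - fields["SwapFree"]) / 1024


def disk_used_percent(path):
    """Return the used percentage of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def system_alerts(meminfo_text, data_log_dir):
    """Return which of the two suggested system alerts currently fire."""
    alerts = []
    if swap_used_mb(meminfo_text) > SWAP_LIMIT_MB:
        alerts.append("swap")
    if disk_used_percent(data_log_dir) > DISK_LIMIT_PERCENT:
        alerts.append("disk")
    return alerts
```

A caller would read /proc/meminfo into a string and pass it in together with the dataLogDir path from its own configuration.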

Here is how Server Density graphs disk usage and memory usage. Note the up and down curves created by the purge task:

And here are some Zookeeper alerts configured in Server Density:

Zookeeper Metrics

| Metric | Comments | Suggested Alert |
| --- | --- | --- |
| Request Avg/Max Latency | Amount of time it takes for the server to respond to a client request, measured since the server was started. | When latency > 10 (ticks). |
| Outstanding Requests | Number of queued requests in the server. This goes up when the server receives more requests than it can process. | When count > 10. |
| Received | Number of client requests (typically operations) received. | None |
| Sent | Number of client packets sent (responses and notifications). | None |
| File Descriptors | Number of open file descriptors relative to the limit. | When FD percentage > 85%. |
| Mode | Serving mode: leader or follower, or standalone if not running in an ensemble. | None |
| Pending syncs | (Only exposed by the leader) Number of pending syncs from the followers. | When pending > 10. |
| Followers | (Only exposed by the leader) Number of followers within the ensemble. You can deduce the number of servers from the MBean Quorum Size. | When followers != (number of ensemble servers - 1). |
| Node count | Number of znodes in the Zookeeper namespace. | None |
| Watch count | Number of watches set on Zookeeper nodes. | None |
| Heap Memory Usage | Memory allocated dynamically by the Java process, Zookeeper in this case. | None |
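
As an illustration, the suggested alerts can be expressed as one small function. This sketch assumes the metrics have already been collected into a dict keyed by the names the `mntr` command reports. The function name and alert labels are hypothetical, and you should check whether your latency figures are in ticks or milliseconds before reusing the threshold of 10.

```python
def zookeeper_alerts(metrics, ensemble_size):
    """Return the names of the suggested alerts that fire for one server.

    `metrics` maps mntr-style keys to integers (the server state is a string).
    `ensemble_size` is the total number of servers in the ensemble.
    """
    alerts = []
    if metrics.get("zk_avg_latency", 0) > 10:
        alerts.append("latency")
    if metrics.get("zk_outstanding_requests", 0) > 10:
        alerts.append("outstanding_requests")

    # File descriptor counts are only reported on Unix platforms.
    open_fds = metrics.get("zk_open_file_descriptor_count")
    max_fds = metrics.get("zk_max_file_descriptor_count")
    if open_fds is not None and max_fds:
        if 100.0 * open_fds / max_fds > 85:
            alerts.append("file_descriptors")

    # These two metrics are only exposed by the leader.
    if metrics.get("zk_server_state") == "leader":
        if metrics.get("zk_pending_syncs", 0) > 10:
            alerts.append("pending_syncs")
        if metrics.get("zk_followers") != ensemble_size - 1:
            alerts.append("followers")
    return alerts
```

Keeping the leader-only checks behind the server state test avoids false alarms on followers, which do not report those values.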

Here is a Zookeeper monitoring graph including Latency average and Outstanding requests:

Zookeeper Monitoring Tools

The simplest way to monitor Zookeeper and collect these metrics is by using the commands known as “4 letter words” within the ZK community. You can run these using telnet or netcat directly:

$ echo ruok | nc 127.0.0.1 5111
imok
 
$ echo mntr | nc localhost 5111
zk_version  3.4.0
zk_avg_latency  0
zk_max_latency  0
zk_min_latency  0
zk_packets_received 70
zk_packets_sent 69
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count   4
zk_watch_count  0
zk_ephemerals_count 0
zk_approximate_data_size    27
zk_followers    4                   - only exposed by the Leader
zk_synced_followers 4               - only exposed by the Leader
zk_pending_syncs    0               - only exposed by the Leader
zk_open_file_descriptor_count 23    - only available on Unix platforms
zk_max_file_descriptor_count 1024   - only available on Unix platforms
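
The same commands can be issued from a script. Below is a minimal Python sketch that sends a four-letter word over a plain TCP socket and turns `mntr` output into a dict. The function names are assumptions for illustration. On newer Zookeeper releases the command may need to be enabled through the 4lw.commands.whitelist setting before the server will answer it.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four-letter command and return the server's full reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn mntr output into a dict, converting purely numeric values to int."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # key and value are whitespace separated
        if len(parts) != 2:
            continue
        key, value = parts
        value = value.strip()
        metrics[key] = int(value) if value.isdigit() else value
    return metrics
```

With the server from the example above, `parse_mntr(four_letter_word("127.0.0.1", 5111, "mntr"))` would return the metrics as a dict, and `four_letter_word("127.0.0.1", 5111, "ruok")` would return the health reply.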

We’ve looked at mytop for MySQL, and memcache-top for Memcached. Well, Zookeeper has one too, zktop:

$ ./zktop.py --servers "localhost:2181,localhost:2182,localhost:2183"
Ensemble -- nodecount:10 zxid:0x1300000001 sessions:4
SERVER           PORT M      OUTST    RECVD     SENT CONNS MINLAT AVGLAT MAXLAT
localhost        2181 F          0       93       92     2      2      7     13
localhost        2182 F          0       37       36     1      0      0      0
localhost        2183 L          0       36       35     1      0      0      0

CLIENT           PORT I   QUEUE RECVD  SENT
127.0.0.1       34705 1       0    56    56
127.0.0.1       35943 1       0     1     0
127.0.0.1       33999 1       0     1     0
127.0.0.1       37988 1       0     1     0

If you are after more detailed metrics, you can access those through JMX. You could also take the DIY road and go for JMXTrans and Graphite, or use Nagios/Cacti/Ganglia with check_zookeeper.py. Alternatively, you can save time (and preserve your sanity) by choosing a hosted service like Server Density (that’s us!).

If you want to test the quality and performance of your Zookeeper ensemble, then zk-smoketest with zk-smoketest.py and zk-latencies.py are great tools to check out.

Zookeeper Management Tools

There are not too many management options out there. The folks at Netflix have released Exhibitor, a tool that provides some basic monitoring, log cleanup (for old versions), backup/restore, ensemble configuration and node visualization. There is also zookeeper_dashboard, but it hasn’t been updated in years.

Further reading

Did this article pique your interest in Zookeeper? Nice, keep reading. We found Scott Leberknight’s Zookeeper series of blog posts to be worthwhile. We also like these presentations:

  • Building an Impenetrable Zookeeper (includes video).
  • Apache Zookeeper is a long presentation covering some required concepts of distributed systems.
  • Zookeeper in the Wild goes straight to the point on operating a Zookeeper ensemble.

Originally published on the WeChat official account 云计算与大数据 (heidcloud)

Original publication date: 2018-08-08
