HBase client访问ZooKeeper获取root-region-server DeadLock问题(zookeeper.ClientCnxn Unable to get data of zn

2012年11月28日 出现故障," Unable to get data of znode /hbase/root-region-server"

问题比较诡异,两个机房,只有一个机房故障,5台服务器相续故障,错误日志相同。使用的HBase客户端版本为0.94.0

1)分析步骤:

1 jstack jmap 查看是否有死锁、block或内存溢出

jmap 看内存回收状况没有什么异常,内存和CPU占用都不多

 jstack pid > test.log pid: Unable to open socket file: target process not responding or HotSpot VM not loaded The -F option can be used when the target process is not responding

出现这种错误,使用-F参数 jstack -l -F pid >jstack.log

jstack -l pid >jstack.log

这时jstack.log并不会有有用信息,要去catalina.out里看,注意在服务下线后使用,容易造成tomcat僵死 catalina.out里发现deadlock死锁

Found one Java-level deadlock:
=============================
"catalina-exec-800":
  waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
  which is held by "catalina-exec-710"
"catalina-exec-710":
  waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
  which is held by "catalina-exec-29-EventThread"
"catalina-exec-29-EventThread":
  waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
  which is held by "catalina-exec-710"
Java stack information for the threads listed above:
===================================================
"catalina-exec-800":
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
        - waiting to lock <0x0000000731902200> (a java.lang.Object)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
        at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
        at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
        at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:171)
        at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:160)
        at com.weibo.api.commons.hbase.CustomHBase.get(CustomHBase.java:150)

"catalina-exec-710":
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599)
        - waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158)
        - locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
        at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
        ......
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
        - locked <0x0000000731902200> (a java.lang.Object)
  
 "catalina-exec-29-EventThread":
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98)
        - waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604)
        - locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
Found 1 deadlock.

 warn.log中报告Interrupted异常 是由上述死锁引起的

2012-11-28 20:06:17 [WARN] hconnection-0x4384d0a47f41c63 Unable to get data of znode /hbase/root-region-server
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:264)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:522)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:498)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:156)
        at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
        ......
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1042)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
        at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
        at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
      
2012-11-28 20:06:17 [ERROR] [hbase_error]
java.io.IOException: Giving up after tries=1
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:192)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
        at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
 
Caused by: java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:189)
        ... 13 more

https://issues.apache.org/jira/browse/HBASE-5060 HBase client is blocked forever ,跟这个问题有点相似,但没有解决这问题

2 同进程另启动线程访问root-region-server

                    // check if zk is ok
                    ZooKeeper zk = null;
                    Watcher watch = new Watcher() {
                        public void process(WatchedEvent event) {
                        }
                    };
                    String zookeeperQuorum = wbobjectHBaseConfMap.get("hbase.zookeeper.quorum");
                    
                    if (StringUtils.isNotBlank(zookeeperQuorum)) {
                        try {
                            zk = new ZooKeeper(zookeeperQuorum, 30000, watch);
                            byte[] data = zk.getData("/hbase/root-region-server", watch, null);
                            ApiLogger.info(" get root-region-server success! ip:"
                                    + Util.toStr(data));
                        } catch (Exception e) {
                            ApiLogger.error(" get root-region-server error!" + e.getMessage());
                        } finally {
                            try {
                                zk.close();
                            } catch (InterruptedException e) {
                                ApiLogger.error("close zk error!");
                            }
                        }
                    }
                    

发现独立线程在整个进程死锁时还能正常工作,HBase的zookeeper实例异常死锁后就不能恢复,导致scan操作超时到30s+,并且不能恢复,正常应该在ms级别。

因此认为是HBase客户端连接ZooKeeper时出问题,流程:

出现网络抖动或者root表迁移,缓存表未命中,重新去获取root-region-server,结果获取失败,进行ZooKeeper重置操作

经过研究死锁代码

HConnectionImplementation 发现 get root-region-server异常,锁定RootRegionTracker,试图锁定自身,执行resetZooKeeperTrackers并调用RootRegionTracker.stop 去重置 ZooKeeper

而与此同时RootRegionTracker在执行stop方法时,已经锁定了HConnectionImplementation,还需要锁定它本身(synchronized方法),但其已经在HConnectionImplementation中被锁定。

HConnectionImplementation代码中:

 this.rootRegionTracker = new RootRegionTracker(this.zooKeeper, this);

调用时也没有超时cancel代码 这样就导致了两个资源循环等待的死锁

2)仿真模拟 

以下是个小程序,简单模拟这种死锁问题:

import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hbase.ServerName;

class ZooKeeperNodeTracker {
    private boolean stopped = false;
    private AbortAble hConnectionImplementation;
    public ZooKeeperNodeTracker(AbortAble hConnectionImplementation) {
        this.hConnectionImplementation = hConnectionImplementation;
    }

    public synchronized void stop()  throws InterruptedException {
        this.stopped = true;
        System.out.println(Thread.currentThread()+"|"+Thread.currentThread().getId()+"stop zknode");
        TimeUnit.MICROSECONDS.sleep(100);
        notifyAll();
    }

    public boolean condition() {
        return stopped;
    }

    public boolean start() {
         stopped=false;
         return true;
    }
    
    public synchronized boolean getData(int i) throws InterruptedException {
        //error in get root region server
        if (i %100 ==0){
            hConnectionImplementation.resetZooKeeperTrackers();
            throw new InterruptedException("interrupted");            
        }
        return true;
    }

}

public class HConnectionManagerTest {

    static class HConnectionImplementation implements AbortAble{
        public HConnectionImplementation() {
            rootRegionTracker = new ZooKeeperNodeTracker(this);
        }

        private volatile ZooKeeperNodeTracker rootRegionTracker;
        @Override
        public synchronized void resetZooKeeperTrackers() {
            try{
                if (rootRegionTracker != null) {
                    rootRegionTracker.stop();
                    rootRegionTracker = null;
                    System.out.println(Thread.currentThread()+"|"+Thread.currentThread().getId()+"resetZooKeeperTrackers");
                }
            }catch(InterruptedException e){
                System.out.println(Thread.currentThread()+"----------resetZooKeeperTrackers Interrupted-----------");
                
            }
        }


        public void testGetData(String name) {
            int i = 1;
            while (i >0) {
                i++;
                try {
                    rootRegionTracker.getData(i);
                } catch (Exception e) {
                    resetZooKeeperTrackers();
                }
                if(i %100 ==0){
                    rootRegionTracker = new ZooKeeperNodeTracker(this);
                    System.out.println(name+" restart test");
                }
            }
        }
    }

    public static void main(String[] args)  {

        final HConnectionImplementation hcon = new HConnectionImplementation();
        Thread scan1 = new Thread(new Runnable() {
            public void run() {
                hcon.testGetData("[scan1]");
            }
        });
        
        Thread scan2 = new Thread(new Runnable() {
            public void run() {
                hcon.testGetData("[scan2]");
            }
        });
        try{
            scan1.start();
            scan2.start();
            TimeUnit.SECONDS.sleep(2);
        }catch (InterruptedException e){
            System.out.println("----------testgetdata -------interrupt");
        }
    }

}

类名构造函数都模拟HBase Client,并且放大getData error的情形,当同时并发两个scan操作时,前一个scan过程中,获取不到root-region-server,在ZooKeeperNodeTracker中做stop() 时,后一个scan也开始在HConnectionImplementation中执行resetZooKeeperTrackers(),两个资源ZooKeeperNodeTracker和HConnectionImplementation被各自分别占用等待,导致死锁。模拟程序的死锁解除可以更改

public synchronized boolean getData
方法为
public boolean getData
public synchronized void resetZooKeeperTrackers() {
            try{
                if (rootRegionTracker != null) {
                    rootRegionTracker.stop();
public void resetZooKeeperTrackers() {
            try{
                if (rootRegionTracker != null) {
            synchronized(rootRegionTracker){
              rootRegionTracker.stop();
解除互斥条件解决问题

 3)最终解决方案

通过Hadoop大会现场跟HBase开发者Ted Yu咨询,称0.94.0有很多bug不稳定,建议升级到0.94.2,通过查看relase note, 官方的两个patch地址已在0.94.1中修复 (通过对hbase源码进行分析找对问题对应点,再看对应源码svn详细修改记录) 1 通过避免嵌套重试循环来解决rpc线程卡死: https://issues.apache.org/jira/browse/HBASE-6326  . 2 通过等待-root-的region地址设置到root region tracker 来避免deadlock问题:https://issues.apache.org/jira/browse/HBASE-6115

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏Crossin的编程教室

【Git 第63课】python 2到3的新手坑

昨天挖了个坑,论坛上已经有不少解答了,还有c语言的版本。今天先不填坑,让题目再飞一会儿,没做的同学可以周末试着写写玩儿。 周三的时候去参加“编程一小时”活动,过...

34470
来自专栏码洞

全栈虚拟机GraalVM初体验

近日Oracle开源了一个实验性的产品GraalVM,官方称之为Universal GraalVM。它打通了不同语言之间的鸿沟,让我们可以进行混合式多语言编程。...

22220
来自专栏BeJavaGod

BeJavaGod - 如何正确使用数据字典进行分类统一操作(一)

先说说什么是数据字典,这个玩意一般不太会解释,举个栗子吧~ 每个系统都会有用户表,性别:男(1)女(0) 另外我们做物流的会涉及到车型:卡车(1),轿车(2),...

37370
来自专栏互联网杂技

如何去了解JavaScript引擎的工作原理

1. 什么是JavaScript解析引擎? 简单地说,JavaScript解析引擎就是能够“读懂”JavaScript代码,并准确地给出代码运行结果的一段程序。...

40670
来自专栏aCloudDeveloper

防御性编程

Author:bakari       Date:2012.8.25 本篇是我根据网上的一些陈述经过整理和总结而得。其中详细的内容我会标注出处。看不懂的可以查看...

30280
来自专栏企鹅号快讯

ForeSpider教程连载之链接抽取

自从来到前嗅,小编从一个爬虫小白到现在能够熟练的采集各种网站各种数据真的是有很大的成长,当然,成长过程中肯定少不了踩坑(很多网站都有防爬措施),为了让各位用户能...

23770
来自专栏C语言及其他语言

【工具资源】迈进C世界的第一步

开始学c的小伙伴 肯定对两个问题焦头烂额 如何选择编译器 到哪里去下载想要的编译器 下面就让小编来帮大家解决这两个问题 ? 细心的小伙伴其实已经发现 咱们C语言...

37070
来自专栏知识分享

12-MQTT介绍

看到这个项目第一想法肯定需要一个服务器,所有的wifi设备和手机都去连接这个服务器,然后服务器进行信息的中转

26740
来自专栏企鹅号快讯

实战:从Python分析17-18赛季NBA胜率超70%球队数据开始…

干货 观点 案例 资讯 我们 ? 撸主: Casey 岂安业务风险分析师 主要负责岂安科技RED.Q的数据分析和运营工作。 就在昨天,12月19日,科比再...

28470
来自专栏H2Cloud

Event Store框架探究

摘要:   游戏开发中,经常会越到千奇百怪的Bug。后台程序都是以demon 方式运行,要么GDB,要么Log。一些确定性的bug可以直接使用GDB调试,比如特...

41870

扫码关注云+社区

领取腾讯云代金券