前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >【Flink】第二十篇:HBase GC 调优实战

【Flink】第二十篇:HBase GC 调优实战

作者头像
章鱼carl
发布2022-03-31 11:13:06
5690
发布2022-03-31 11:13:06
举报
文章被收录于专栏:章鱼carl的专栏

【Flink】第四篇:【迷思】对update语义拆解D-、I+后造成update原子性丢失

【Flink】第五篇:checkpoint【1】

【Flink】第五篇:checkpoint【2】

【Flink】第八篇:Flink 内存管理

【Flink】第九篇:Flink SQL 性能优化实战

【Flink】第十篇:join 之 regular join

【Flink】第十三篇:JVM思维导图

【Flink】第十四篇:LSM-Tree一般性总结

【Flink】第十五篇:Redis Connector 数据保序思考

【Flink】第十六篇:源码角度分析 sink 端的数据一致性

【Flink】第十七篇:记一次牛轰轰的OOM故障排查

【Flink】第十八篇:Direct Memory 一箩筐

【Flink】第十九篇:从一个批量写HBase性能问题到一个Flink issue的距离

上一篇提到在用Flink SQL批量写HBase,遇到了三个坑,

  • HBase 写热点
  • HBase gc 调优
  • HBase Canary报警,Slow Read

这一篇就来谈谈HBase的JVM GC调优。

经过测试环境的基本验证后,在线上试跑我们的Flink SQL批量从Hive抽数到HBase程序,出现了GC时间超阈值的警告,

对于我来说,之前的JVM经验多是排查java的进程和线程或者JVM内存等问题,我的GC知识储备虽然是有的,但是GC经验比较缺,只能循序渐进尽力而为了。

CMS日志

对于这个问题,问题表象似乎不是很直观,所以我们直接从事发现场说起,即先去找GC的日志。由于是RegionServer的GC问题,所以我们就去这个节点上的RS进程找它的GC日志。

代码语言:javascript
复制
日志位置:HBase-实例-RegionServer-进程-Stdout

找到对应GC报警时段的日志:

代码语言:javascript
复制
2021-08-10T14:51:47.953+0800: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2911327K(4109120K)] 2919091K(4185792K), 0.0095153 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 
2021-08-10T14:51:47.963+0800: [CMS-concurrent-mark-start]
2021-08-10T14:51:48.146+0800: [GC (Allocation Failure) 2021-08-10T14:51:48.146+0800: [ParNew: 75662K->8512K(76672K), 0.0806900 secs] 2986990K->2923498K(4185792K), 0.0807806 secs] [Times: user=0.06 sys=0.02, real=0.08 secs] 
2021-08-10T14:51:48.296+0800: [CMS-concurrent-mark: 0.248/0.333 secs] [Times: user=1.49 sys=0.06, real=0.33 secs] 
2021-08-10T14:51:48.296+0800: [CMS-concurrent-preclean-start]
2021-08-10T14:51:48.332+0800: [CMS-concurrent-preclean: 0.034/0.036 secs] [Times: user=0.21 sys=0.06, real=0.04 secs] 
2021-08-10T14:51:48.332+0800: [CMS-concurrent-abortable-preclean-start]
2021-08-10T14:51:48.402+0800: [GC (Allocation Failure) 2021-08-10T14:51:48.402+0800: [ParNew: 76672K->8512K(76672K), 0.1289755 secs] 2991658K->2937785K(4185792K), 0.1290695 secs] [Times: user=0.11 sys=0.02, real=0.13 secs] 
2021-08-10T14:51:48.608+0800: [CMS-concurrent-abortable-preclean: 0.141/0.275 secs] [Times: user=1.61 sys=0.05, real=0.27 secs] 
2021-08-10T14:51:48.609+0800: [GC (CMS Final Remark) [YG occupancy: 51522 K (76672 K)]2021-08-10T14:51:48.609+0800: [Rescan (parallel) , 0.0422101 secs]2021-08-10T14:51:48.651+0800: [weak refs processing, 0.0000795 secs]2021-08-10T14:51:48.651+0800: [class unloading, 0.0492355 secs]2021-08-10T14:51:48.700+0800: [scrub symbol table, 0.0123413 secs]2021-08-10T14:51:48.713+0800: [scrub string table, 0.0009098 secs][1 CMS-remark: 2929273K(4109120K)] 2980795K(4185792K), 0.1049206 secs] [Times: user=0.10 sys=0.00, real=0.11 secs] 
2021-08-10T14:51:48.714+0800: [CMS-concurrent-sweep-start]
2021-08-10T14:51:48.779+0800: [GC (Allocation Failure) 2021-08-10T14:51:48.779+0800: [ParNew: 76637K->8512K(76672K), 0.2079816 secs] 2996766K->2961111K(4185792K), 0.2080966 secs] [Times: user=0.21 sys=0.00, real=0.21 secs] 
2021-08-10T14:51:49.140+0800: [GC (Allocation Failure) 2021-08-10T14:51:49.140+0800: [ParNew: 76672K->8512K(76672K), 0.1998496 secs] 3003891K->2965198K(4185792K), 0.1999370 secs] [Times: user=0.20 sys=0.00, real=0.20 secs] 
2021-08-10T14:51:49.434+0800: [GC (Allocation Failure) 2021-08-10T14:51:49.434+0800: [ParNew: 76672K->8512K(76672K), 0.1821081 secs] 3021530K->2978797K(4185792K), 0.1822234 secs] [Times: user=0.18 sys=0.00, real=0.18 secs] 
2021-08-10T14:51:49.733+0800: [GC (Allocation Failure) 2021-08-10T14:51:49.733+0800: [ParNew: 76672K->8512K(76672K), 0.1504635 secs] 3024458K->2974818K(4185792K), 0.1505647 secs] [Times: user=0.15 sys=0.00, real=0.15 secs] 
2021-08-10T14:51:49.993+0800: [GC (Allocation Failure) 2021-08-10T14:51:49.993+0800: [ParNew: 76672K->8511K(76672K), 0.1936745 secs] 3016307K->2973449K(4185792K), 0.1937949 secs] [Times: user=0.19 sys=0.00, real=0.19 secs] 
2021-08-10T14:51:50.301+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.302+0800: [ParNew: 76671K->8511K(76672K), 0.1576835 secs] 3013717K->2964476K(4185792K), 0.1578101 secs] [Times: user=0.16 sys=0.00, real=0.15 secs] 
2021-08-10T14:51:50.480+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.480+0800: [ParNew: 76671K->8511K(76672K), 0.0860739 secs] 3030531K->2973237K(4185792K), 0.0861817 secs] [Times: user=0.07 sys=0.02, real=0.08 secs] 
2021-08-10T14:51:50.584+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.584+0800: [ParNew: 76671K->8510K(76672K), 0.0512029 secs] 3040549K->2978473K(4185792K), 0.0513130 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:50.647+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.647+0800: [ParNew: 76670K->7183K(76672K), 0.0521286 secs] 3045304K->2982489K(4185792K), 0.0522169 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:50.712+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.712+0800: [ParNew: 75343K->6360K(76672K), 0.0444439 secs] 3049340K->2986350K(4185792K), 0.0445512 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:50.771+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.771+0800: [ParNew: 74520K->8512K(76672K), 0.0515823 secs] 3052375K->2995976K(4185792K), 0.0516707 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:50.834+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.834+0800: [ParNew: 76506K->6074K(76672K), 0.0452806 secs] 3063146K->3000358K(4185792K), 0.0453631 secs] [Times: user=0.03 sys=0.01, real=0.05 secs] 
2021-08-10T14:51:50.902+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.902+0800: [ParNew: 74234K->8512K(76672K), 0.0504306 secs] 3063328K->3005131K(4185792K), 0.0505333 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:50.974+0800: [GC (Allocation Failure) 2021-08-10T14:51:50.974+0800: [ParNew: 76672K->8512K(76672K), 0.0627139 secs] 3070082K->3009574K(4185792K), 0.0628423 secs] [Times: user=0.05 sys=0.02, real=0.06 secs] 
2021-08-10T14:51:51.060+0800: [GC (GCLocker Initiated GC) 2021-08-10T14:51:51.060+0800: [ParNew: 76672K->8512K(76672K), 0.0589934 secs] 3074349K->3012650K(4185792K), 0.0590668 secs] [Times: user=0.06 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:51.133+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.133+0800: [ParNew: 76672K->7963K(76672K), 0.0530061 secs] 3079476K->3018288K(4185792K), 0.0530952 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:51.197+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.197+0800: [ParNew: 76123K->7852K(76672K), 0.0478507 secs] 3086001K->3025091K(4185792K), 0.0479499 secs] [Times: user=0.04 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:51.255+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.255+0800: [ParNew: 76012K->8512K(76672K), 0.0424995 secs] 3092647K->3031110K(4185792K), 0.0425745 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:51.310+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.310+0800: [ParNew: 76672K->8512K(76672K), 0.0484108 secs] 3097352K->3039524K(4185792K), 0.0484893 secs] [Times: user=0.04 sys=0.01, real=0.04 secs] 
2021-08-10T14:51:51.370+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.370+0800: [ParNew: 76672K->8397K(76672K), 0.0437014 secs] 3106640K->3046419K(4185792K), 0.0437746 secs] [Times: user=0.05 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:51.422+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.422+0800: [ParNew: 76557K->7104K(76672K), 0.0400171 secs] 3112953K->3049574K(4185792K), 0.0400884 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:51.473+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.473+0800: [ParNew: 75264K->8512K(76672K), 0.0368263 secs] 3116982K->3054561K(4185792K), 0.0369065 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:51.523+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.523+0800: [ParNew: 76672K->7656K(76672K), 0.0460787 secs] 3121546K->3065556K(4185792K), 0.0461540 secs] [Times: user=0.05 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:51.580+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.580+0800: [ParNew: 75816K->7498K(76672K), 0.0468142 secs] 3131558K->3069979K(4185792K), 0.0469190 secs] [Times: user=0.05 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:51.636+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.636+0800: [ParNew: 75658K->8512K(76672K), 0.0425241 secs] 3137002K->3074813K(4185792K), 0.0426331 secs] [Times: user=0.04 sys=0.01, real=0.04 secs] 
2021-08-10T14:51:51.697+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.697+0800: [ParNew: 76672K->8512K(76672K), 0.0428190 secs] 3142038K->3080283K(4185792K), 0.0429126 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:51.749+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.750+0800: [ParNew: 76672K->8512K(76672K), 0.0386611 secs] 3147631K->3085888K(4185792K), 0.0387424 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 
2021-08-10T14:51:51.799+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.799+0800: [ParNew: 76672K->8512K(76672K), 0.0412278 secs] 3151520K->3092529K(4185792K), 0.0413260 secs] [Times: user=0.04 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:51.853+0800: [GC (GCLocker Initiated GC) 2021-08-10T14:51:51.853+0800: [ParNew: 76672K->8512K(76672K), 0.0452774 secs] 3158908K->3099394K(4185792K), 0.0453738 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 
2021-08-10T14:51:51.909+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.909+0800: [ParNew: 76672K->8474K(76672K), 0.0494126 secs] 3164339K->3104267K(4185792K), 0.0494995 secs] [Times: user=0.04 sys=0.01, real=0.05 secs] 
2021-08-10T14:51:51.970+0800: [GC (Allocation Failure) 2021-08-10T14:51:51.970+0800: [ParNew: 76634K->8299K(76672K), 0.0509132 secs] 3169915K->3109519K(4185792K), 0.0510104 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:52.029+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.029+0800: [ParNew: 76459K->8208K(76672K), 0.0373005 secs] 3176254K->3115580K(4185792K), 0.0373899 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.074+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.074+0800: [ParNew: 76368K->7312K(76672K), 0.0367450 secs] 3181244K->3115986K(4185792K), 0.0368579 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.120+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.120+0800: [ParNew: 75472K->8512K(76672K), 0.0401914 secs] 3183208K->3121999K(4185792K), 0.0403127 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.170+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.170+0800: [ParNew: 76672K->8512K(76672K), 0.0426470 secs] 3188910K->3130015K(4185792K), 0.0427650 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.230+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.230+0800: [ParNew: 76672K->8512K(76672K), 0.0503207 secs] 3192776K->3134704K(4185792K), 0.0504159 secs] [Times: user=0.03 sys=0.02, real=0.06 secs] 
2021-08-10T14:51:52.293+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.293+0800: [ParNew: 76672K->7979K(76672K), 0.0390631 secs] 3195745K->3135076K(4185792K), 0.0391420 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.342+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.342+0800: [ParNew: 76139K->7490K(76672K), 0.0349573 secs] 3197410K->3136267K(4185792K), 0.0350359 secs] [Times: user=0.03 sys=0.00, real=0.03 secs] 
2021-08-10T14:51:52.386+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.386+0800: [ParNew: 75650K->6531K(76672K), 0.0309209 secs] 3197520K->3133630K(4185792K), 0.0309972 secs] [Times: user=0.02 sys=0.01, real=0.03 secs] 
2021-08-10T14:51:52.425+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.425+0800: [ParNew: 74691K->8512K(76672K), 0.0336670 secs] 3197654K->3136810K(4185792K), 0.0337406 secs] [Times: user=0.03 sys=0.01, real=0.03 secs] 
2021-08-10T14:51:52.471+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.471+0800: [ParNew: 76672K->8512K(76672K), 0.0397881 secs] 3198152K->3138106K(4185792K), 0.0398787 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.521+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.521+0800: [ParNew: 76672K->8512K(76672K), 0.0328564 secs] 3202124K->3141177K(4185792K), 0.0329334 secs] [Times: user=0.03 sys=0.00, real=0.03 secs] 
2021-08-10T14:51:52.568+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.568+0800: [ParNew: 76672K->8512K(76672K), 0.0413346 secs] 3205766K->3145864K(4185792K), 0.0414457 secs] [Times: user=0.03 sys=0.01, real=0.05 secs] 
2021-08-10T14:51:52.624+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.624+0800: [ParNew: 76672K->7401K(76672K), 0.0427615 secs] 3212268K->3150330K(4185792K), 0.0428648 secs] [Times: user=0.05 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.686+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.686+0800: [ParNew: 75561K->6884K(76672K), 0.0400685 secs] 3215185K->3152788K(4185792K), 0.0401657 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 
2021-08-10T14:51:52.737+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.737+0800: [ParNew: 75044K->5857K(76672K), 0.0513699 secs] 3218967K->3154707K(4185792K), 0.0514839 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:52.797+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.797+0800: [ParNew: 74017K->5053K(76672K), 0.0406744 secs] 3221868K->3156472K(4185792K), 0.0407750 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.847+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.847+0800: [ParNew: 73213K->8512K(76672K), 0.0397704 secs] 3223979K->3161810K(4185792K), 0.0398779 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:52.899+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.899+0800: [ParNew: 76672K->7849K(76672K), 0.0438958 secs] 3229360K->3168201K(4185792K), 0.0440062 secs] [Times: user=0.03 sys=0.01, real=0.05 secs] 
2021-08-10T14:51:52.957+0800: [GC (Allocation Failure) 2021-08-10T14:51:52.957+0800: [ParNew: 76009K->7763K(76672K), 0.0364444 secs] 3233300K->3176812K(4185792K), 0.0365427 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:53.006+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.006+0800: [ParNew: 75923K->6555K(76672K), 0.0344456 secs] 3241790K->3177269K(4185792K), 0.0345565 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:53.057+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.057+0800: [ParNew: 74715K->8194K(76672K), 0.0315442 secs] 3240466K->3180486K(4185792K), 0.0316234 secs] [Times: user=0.02 sys=0.00, real=0.03 secs] 
2021-08-10T14:51:53.101+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.101+0800: [ParNew: 76354K->6693K(76672K), 0.0402031 secs] 3246292K->3183405K(4185792K), 0.0403126 secs] [Times: user=0.03 sys=0.02, real=0.04 secs] 
2021-08-10T14:51:53.151+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.151+0800: [ParNew: 74853K->8512K(76672K), 0.0368653 secs] 3248774K->3186488K(4185792K), 0.0369517 secs] [Times: user=0.04 sys=0.00, real=0.03 secs] 
2021-08-10T14:51:53.201+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.201+0800: [ParNew: 76672K->7846K(76672K), 0.0483054 secs] 3251478K->3189903K(4185792K), 0.0484085 secs] [Times: user=0.05 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:53.260+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.260+0800: [ParNew: 76006K->8512K(76672K), 0.0358117 secs] 3256451K->3193882K(4185792K), 0.0359015 secs] [Times: user=0.03 sys=0.01, real=0.03 secs] 
2021-08-10T14:51:53.320+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.320+0800: [ParNew: 76672K->7734K(76672K), 0.0464199 secs] 3253264K->3193290K(4185792K), 0.0465162 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 
2021-08-10T14:51:53.378+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.378+0800: [ParNew: 75894K->6746K(76672K), 0.0378897 secs] 3259489K->3194830K(4185792K), 0.0379893 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:53.455+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.455+0800: [ParNew: 74906K->8512K(76672K), 0.0650141 secs] 3258682K->3197584K(4185792K), 0.0651756 secs] [Times: user=0.05 sys=0.00, real=0.07 secs] 
2021-08-10T14:51:53.552+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.552+0800: [ParNew: 76672K->5695K(76672K), 0.0929924 secs] 3260641K->3196173K(4185792K), 0.0931994 secs] [Times: user=0.05 sys=0.01, real=0.09 secs] 
2021-08-10T14:51:53.684+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.684+0800: [ParNew: 73855K->5042K(76672K), 0.0615285 secs] 3259842K->3194613K(4185792K), 0.0617298 secs] [Times: user=0.06 sys=0.00, real=0.06 secs] 
2021-08-10T14:51:53.768+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.768+0800: [ParNew: 73202K->8512K(76672K), 0.0583413 secs] 3260122K->3196967K(4185792K), 0.0585096 secs] [Times: user=0.06 sys=0.00, real=0.06 secs] 
2021-08-10T14:51:53.856+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.856+0800: [ParNew: 76672K->5750K(76672K), 0.0667699 secs] 3255177K->3189423K(4185792K), 0.0669035 secs] [Times: user=0.05 sys=0.01, real=0.07 secs] 
2021-08-10T14:51:53.941+0800: [GC (Allocation Failure) 2021-08-10T14:51:53.941+0800: [ParNew: 73910K->8512K(76672K), 0.0709935 secs] 3255437K->3198192K(4185792K), 0.0711401 secs] [Times: user=0.06 sys=0.00, real=0.07 secs] 
2021-08-10T14:51:54.041+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.041+0800: [ParNew: 76672K->6788K(76672K), 0.0488584 secs] 3264395K->3202314K(4185792K), 0.0492206 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:54.116+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.116+0800: [ParNew: 74948K->8512K(76672K), 0.0605065 secs] 3269525K->3208524K(4185792K), 0.0606492 secs] [Times: user=0.04 sys=0.00, real=0.06 secs] 
2021-08-10T14:51:54.199+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.199+0800: [ParNew: 76672K->8512K(76672K), 0.0437344 secs] 3274882K->3211719K(4185792K), 0.0439084 secs] [Times: user=0.04 sys=0.01, real=0.05 secs] 
2021-08-10T14:51:54.298+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.298+0800: [ParNew: 76672K->8512K(76672K), 0.0620067 secs] 3276579K->3215324K(4185792K), 0.0621579 secs] [Times: user=0.05 sys=0.00, real=0.07 secs] 
2021-08-10T14:51:54.425+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.425+0800: [ParNew: 76629K->8054K(76672K), 0.0450103 secs] 3276955K->3216113K(4185792K), 0.0451683 secs] [Times: user=0.03 sys=0.01, real=0.05 secs] 
2021-08-10T14:51:54.508+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.508+0800: [ParNew: 76214K->8512K(76672K), 0.0432830 secs] 3281493K->3214555K(4185792K), 0.0434589 secs] [Times: user=0.05 sys=0.00, real=0.05 secs] 
2021-08-10T14:51:54.580+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.580+0800: [ParNew: 76672K->5880K(76672K), 0.0419076 secs] 3276456K->3209229K(4185792K), 0.0420531 secs] [Times: user=0.05 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:54.709+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.709+0800: [ParNew: 73999K->8512K(76672K), 0.0342041 secs] 3266325K->3200838K(4185792K), 0.0343394 secs] [Times: user=0.03 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:54.832+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.832+0800: [ParNew: 76670K->5527K(76672K), 0.0423911 secs] 3257665K->3192776K(4185792K), 0.0425222 secs] [Times: user=0.03 sys=0.01, real=0.04 secs] 
2021-08-10T14:51:54.947+0800: [GC (Allocation Failure) 2021-08-10T14:51:54.947+0800: [ParNew: 73687K->4075K(76672K), 0.0339701 secs] 3248652K->3179040K(4185792K), 0.0341424 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
2021-08-10T14:51:55.107+0800: [GC (Allocation Failure) 2021-08-10T14:51:55.107+0800: [ParNew: 72235K->2146K(76672K), 0.0572441 secs] 3225565K->3155476K(4185792K), 0.0574210 secs] [Times: user=0.04 sys=0.01, real=0.06 secs] 
2021-08-10T14:51:55.417+0800: [GC (Allocation Failure) 2021-08-10T14:51:55.417+0800: [ParNew: 70306K->8512K(76672K), 0.1606864 secs] 3211889K->3154957K(4185792K), 0.1608645 secs] [Times: user=0.14 sys=0.00, real=0.16 secs] 
2021-08-10T14:51:56.005+0800: [GC (Allocation Failure) 2021-08-10T14:51:56.006+0800: [ParNew: 76672K->8512K(76672K), 0.3085277 secs] 3180847K->3134649K(4185792K), 0.3089095 secs] [Times: user=0.31 sys=0.00, real=0.31 secs] 
2021-08-10T14:51:56.527+0800: [GC (Allocation Failure) 2021-08-10T14:51:56.527+0800: [ParNew: 76672K->8512K(76672K), 0.1977908 secs] 3164005K->3113200K(4185792K), 0.1979444 secs] [Times: user=0.16 sys=0.03, real=0.20 secs] 
2021-08-10T14:51:56.991+0800: [GC (Allocation Failure) 2021-08-10T14:51:56.991+0800: [ParNew: 76672K->8512K(76672K), 0.2410614 secs] 3129032K->3085993K(4185792K), 0.2411975 secs] [Times: user=0.24 sys=0.00, real=0.24 secs] 
2021-08-10T14:51:57.384+0800: [GC (Allocation Failure) 2021-08-10T14:51:57.384+0800: [ParNew: 76672K->8512K(76672K), 0.2582527 secs] 3102988K->3056474K(4185792K), 0.2584169 secs] [Times: user=0.26 sys=0.01, real=0.26 secs] 
2021-08-10T14:51:57.865+0800: [GC (Allocation Failure) 2021-08-10T14:51:57.865+0800: [ParNew: 76672K->8512K(76672K), 0.2842812 secs] 3103749K->3052060K(4185792K), 0.2844181 secs] [Times: user=0.19 sys=0.00, real=0.28 secs] 
2021-08-10T14:51:58.260+0800: [CMS-concurrent-sweep: 3.436/9.546 secs] [Times: user=31.79 sys=1.32, real=9.55 secs] 
2021-08-10T14:51:58.260+0800: [CMS-concurrent-reset-start]
2021-08-10T14:51:58.285+0800: [CMS-concurrent-reset: 0.026/0.026 secs] [Times: user=0.10 sys=0.01, real=0.02 secs] 

从日志就可以看出使用了CMS垃圾回收器。

CMS

CMS(Concurrent Mark Sweep)收集器是一种以获取最短回收停顿时间为目标的收集器。这是因为CMS收集器工作时,GC工作线程与用户线程可以并发执行,以此来达到降低收集停顿时间的目的。

CMS收集器仅作用于老年代的收集,是基于标记-清除算法的,它的运作过程分为以下步骤:

  • 初始标记(CMS initial mark)
  • 并发标记(CMS concurrent mark)
  • 预清理(CMS-concurrent-preclean)
  • 可被终止的预清理(CMS-concurrent-abortable-preclean)
  • 重新标记(CMS remark)
  • 并发清除(CMS concurrent sweep)
  • 并发重置状态等待下次CMS的触发(CMS-concurrent-reset),与用户线程同时运行;

其中,初始标记、重新标记这两个步骤仍然需要Stop-the-world。其中主要的四个阶段如下,

1. 初始标记

这是CMS中两次stop-the-world事件中的一次。这一步的作用是标记存活的对象,有两部分:

  • 标记老年代中所有的GC Roots对象,如下图节点1;
  • 标记年轻代中活着的对象引用到的老年代的对象(指的是年轻带中还存活的引用类型对象,引用指向老年代中的对象)如下图节点2、3;

2. 并发标记

  • 并发标记阶段 就是进行GC Roots Tracing的过程

由于这个阶段是和用户线程并发的,可能会导致concurrent mode failure。

3. 重新标记

  • 重新标记阶段 则是为了修正并发标记期间因用户程序继续运作而导致标记产生变动的那一部分对象的标记记录,这个阶段的停顿时间一般会比初始阶段稍长一些,但远比并发标记的时间短,会导致swt。

CMS大部分时间都花费在重新标记阶段,可以让虚拟机先进行一次Young GC,减少停顿时间。

代码语言:javascript
复制
-XX:+CMSScavengeBeforeRemark

4. 并发清理

  • 这个阶段主要是清除那些没有标记的对象并且回收空间;由于CMS并发清理阶段用户线程还在运行着,伴随程序运行自然就还会有新的垃圾不断产生,这一部分垃圾出现在标记过程之后,CMS无法在当次收集中处理掉它们,只好留待下一次GC时再清理掉。这一部分垃圾就称为“浮动垃圾”。

可能在收集过程中出现"concurrent mode failure",解决方法是让CMS尽早GC。在一定次数的Full GC之后让CMS对内存做一次压缩,减少内存碎片,防止年轻代对象晋升到老年代时因为内存碎片问题导致晋升失败。

代码语言:javascript
复制
-XX:CMSFullGCsBeforeCompaction=n

CMS收集器优点:并发收集、低停顿。

CMS收集器缺点

  • CMS收集器对CPU资源非常敏感。
  • CMS收集器无法处理浮动垃圾(Floating Garbage)。
  • CMS收集器是基于标记-清除算法,该算法的缺点都有(内存碎片)。
  • 停顿时间是不可预期的。

CMS收集器之所以能够做到并发,根本原因在于采用基于“标记-清除”的算法,并对算法过程进行了细粒度的分解

分析RS GC日志

结合以上CMS的阶段的介绍,分析我们得到的RS的GC日志:

代码语言:javascript
复制
// 阶段1:初始标记--Initial Mark 
// CMS-initial-mark:初始标记阶段,这是第一个 STW 阶段,它会收集所有 GC Roots 以及其直接引用的对象;
// 2911327K:当前老年代使用的容量;
// (4109120K):老年代可用的最大容量;3.915g 年轻代:75mb
// 2919091K:整个堆目前使用的容量;
// (4185792K):堆可用的容量;3.99g
// 0.0095153 secs:这个阶段的持续时间;
// [Times: user=0.00 sys=0.01, real=0.01 secs] :相应 user(用户态,累计并发)、system(内核态) 、 real(现实世界) 的时间统计。
2021-08-10T14:51:47.953+0800: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2911327K(4109120K)] 2919091K(4185792K), 0.0095153 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 

// 阶段2:并发标记--concurrent-mark
// CMS-concurrent-mark:并发收集阶段,这个阶段会遍历老年代,并标记所有存活的对象;
// 0.248/0.333 secs:这个阶段的持续时间与时钟时间;
// [Times: user=1.49 sys=0.06, real=0.33 secs]:同上,但是这部的时间意义不大,因为它是从并发标记的开始时间开始计算,这期间因为是并发进行,不仅仅包含 GC 线程的工作。
2021-08-10T14:51:47.963+0800: [CMS-concurrent-mark-start]
2021-08-10T14:51:48.296+0800: [CMS-concurrent-mark: 0.248/0.333 secs] [Times: user=1.49 sys=0.06, real=0.33 secs] 

// 阶段3:Concurrent Preclean
// CMS-concurrent-preclean:并发预清理-标记阶段,对在前面并发标记阶段中引用发生变化的对象进行标记;
// [Times: user=0.21 sys=0.06, real=0.04 secs] 同上
2021-08-10T14:51:48.296+0800: [CMS-concurrent-preclean-start]
2021-08-10T14:51:48.332+0800: [CMS-concurrent-preclean: 0.034/0.036 secs] [Times: user=0.21 sys=0.06, real=0.04 secs] 

// 阶段4:Concurrent Abortable Preclean
// Concurrent Abortable Preclean:并发预清理-清除阶段,这个阶段是为了尽量承担 STW(stop-the-world)中最终标记阶段的工作。
2021-08-10T14:51:48.332+0800: [CMS-concurrent-abortable-preclean-start]
2021-08-10T14:51:48.608+0800: [CMS-concurrent-abortable-preclean: 0.141/0.275 secs] [Times: user=1.61 sys=0.05, real=0.27 secs] 

// 阶段5:Final Remark
// Final Remark:重标记阶段,这是第二个 STW 阶段,也是 CMS 中的最后一个,这个阶段的目标是标记所有老年代所有的存活对象。
// YG occupancy: 51522 K (76672 K):年轻代当前占用量及容量
// [Rescan (parallel) , 0.0422101 secs]:这个 Rescan 是当应用暂停的情况下完成对所有存活对象的标记,这个阶段是并行处理的。
// [weak refs processing, 0.0027800 secs]:第一个子阶段,它的工作是处理弱引用;
// [class unloading, 0.0492355 secs]:第二个子阶段,它的工作是:unloading the unused classes;
// [scrub symbol table, 0.0123413 secs]2021-08-10T14:51:48.713+0800: [scrub string table, 0.0009098 secs]:最后一个子阶段,它的目的是:cleaning up symbol and string tables which hold class-level metadata and internalized string respectively,时钟的暂停也包含在这里;
// [1 CMS-remark: 2929273K(4109120K)] 2980795K(4185792K), 0.1049206 secs]:这个阶段之后,老年代的使用量与总量,堆的使用量与总量,时间0.1049206 secs
// [Times: user=0.10 sys=0.00, real=0.11 secs]:同上
2021-08-10T14:51:48.609+0800: [GC (CMS Final Remark) [YG occupancy: 51522 K (76672 K)]2021-08-10T14:51:48.609+0800: [Rescan (parallel) , 0.0422101 secs]2021-08-10T14:51:48.651+0800: [weak refs processing, 0.0000795 secs]2021-08-10T14:51:48.651+0800: [class unloading, 0.0492355 secs]2021-08-10T14:51:48.700+0800: [scrub symbol table, 0.0123413 secs]2021-08-10T14:51:48.713+0800: [scrub string table, 0.0009098 secs][1 CMS-remark: 2929273K(4109120K)] 2980795K(4185792K), 0.1049206 secs] [Times: user=0.10 sys=0.00, real=0.11 secs] 

// 阶段6:Concurrent Sweep
// concurrent-sweep:并发清除阶段,这里不需要 STW,它是与用户的应用程序并发运行,这个阶段是:清除那些不再使用的对象,回收它们的占用空间为将来使用。
2021-08-10T14:51:48.714+0800: [CMS-concurrent-sweep-start]
2021-08-10T14:51:58.260+0800: [CMS-concurrent-sweep: 3.436/9.546 secs] [Times: user=31.79 sys=1.32, real=9.55 secs] 

// 阶段7:Concurrent Reset.
// concurrent-reset:并发充值阶段,这个阶段也是并发执行的,它会重设 CMS 内部的数据结构,为下次的 GC 做准备,对应的日志信息如下:
2021-08-10T14:51:58.260+0800: [CMS-concurrent-reset-start]
2021-08-10T14:51:58.285+0800: [CMS-concurrent-reset: 0.026/0.026 secs] [Times: user=0.10 sys=0.01, real=0.02 secs]

这里我们发现日志在2021-08-10T14:51:48.714+0800的并发清除CMS-concurrent-sweep阶段,花费掉的现实时间是9.55秒,并且在它附近的CMS并发清除花费的时间基本都很久,这个时间也和GC报警时间03:00吻合。

所以,我们调优的基调应该是以降低并发清除阶段的时间、或者减少在高峰时段的CMS次数为主。

分析RS进程

1. 找到pid

代码语言:javascript
复制
jps -lvm | grep regionserver

2. 查看VM参数

代码语言:javascript
复制
jinfo -flags pid
jinfo -flag vmargs pid

既然是并发清除的阶段时间较长,我们是否可以让GC在这个阶段充分的并发?以此缩短这个阶段的耗时。沿着这个思路,我先查看了RS的一下两个并发控制参数,

代码语言:javascript
复制
-XX:ParallelGCThreads
-XX:ConcGCThreads

结果:

3. 打印堆栈的空间布局

代码语言:javascript
复制
jmap -heap pid

这里发现MaxNewSize非常小,我们知道NewSize小可能会引起频繁的old GC,而在日志中这一点在批量写HBase时体现的也很明显,在03:00的时候基本10秒钟一次CMS!

综合以上,本次CMS的调优基本以增大新生代+增加并发GC线程数为主,其他的CMS调优参数主要有:

代码语言:javascript
复制
-XX:+CMSParallelRemarkEnabled        
-XX:+CMSParallelInitialMarkEnabled   
-XX:ConcGCThreads                   
-XX:ParallelCMSThreads               
-XX:ParallelGCThreads                
-XX:MaxTenuringThreshold
-XX:+UseCMSCompactAtFullCollection   
-XX:CMSFullGCsBeforeCompaction     
-XX:+UseCMSInitiatingOccupancyOnly   
-XX:CMSInitiatingOccupancyFraction
-XX:+DisableExplicitGC
-XX:+CMSScavengeBeforeRemark   

但是在测试集群上测试后发现,RS的调优效果不一致。已经在确保主要前置条件一致的情况下,测试依旧不一致,我考虑到的前置条件主要有:

  • HBase表的Region做预分区,在各个RS上分布均衡
  • 测试数据的RowKey做了reverse以免热点,并且观察测试数据的RowKey在各个Region上的分布散列程度是基本一致的
  • RS所在节点的物理资源基本一致

随后,观察效果不太理想的节点发现负载明显比其他节点的负载高很多,

随后,我降低了Flink SQL的并行度(Hive表作Source默认的是以block块数作为并行度,要设置参数进行控制)。负载均衡效果明显,但是这里我还是没找到在环境基本一致的情况下,第三个节点比其他节点的负载高很多的原因?有思路的同学请一定告诉我,感谢^^

经过在测试集群测试:增大新生代、提高新生代和CMS老年代的GC并发线程数两种措施,将CMS的并发清除时间由6-9秒降低为1-2秒。

缘分调优法

当还想进一步突破,在我有限的认知范围内,已经无法相处好的办法。所以,只能用缘分试错法,借鉴警尽可能比较权威的测试验证结论进行试错。

总结了两个比较重要的思路:

  • 调整Survivor区大小(SurvivorRatio)
  • 调整MaxTenuringThreshold

一次Minor GC会将存活对象从Eden区(以及survivor from区)复制到Survivor区(to区),因此增大Survivor区可以容纳更多的存活对象。这样就会防止因为Survivor区太小导致很对存活对象还没有达到MaxTenuringThreshold阈值就直接进入老生代,潜在增大old gc的触发频率;但是Survivor区设置太大也会有一定的问题,Survivor设置较大会使得对象可以在Young区’待’的时间很长,但是对于一些长寿对象较多的场景下(比如HBase),大量长寿对象长时间待在Young区做很多’无谓’的复制,一定程度上增加Minor GC开销。

增加MaxTenuringThreshold相当于提高了进入老年代的门槛,可以有效限制进入老年代的对象数。和Survivor设置相似,调整MaxTenuringThreshold也需要做一个取舍,设置太小会增加CMS GC的触发频率以及耗时,而设置太大则会在长寿对象较多场景下增加Minor GC开销。一般情况下,默认MaxTenuringThreshold=15已经相对比较大,不需要做任何调整。

这里直接给出链接,里面有作者对这两种思路的测试结论,比较完整:http://hbasefly.com/2016/08/09/hbase-cms-gc/

令我欣喜若狂的是,在调整了SurvivorRatio这个参数后,即使在测试集群超高负荷的情况下,GC日志中的CMS几乎很少触发了!大部分都是minor gc,而由于我设置了较高的gc 并行度,minor gc的Parnew就像flink一样,执行效率提升效果很明显,基本在10-150ms左右!

那么,这是为什么呢?我猜测是批量写HBase的对象的新生代年龄恰好在升级为old gen之前,在新生代就被回收掉了,而不是原来那样由于survivor区很小,许多对象还没来得及升级为老年对象就被迫放到了old gen,导致频繁的CMS GC!

在不改变堆内存大小的资源配置前提下,我的调优方案如下:

代码语言:javascript
复制
{{JAVA_GC_ARGS}} -XX:ReservedCodeCacheSize=256m
 -Xms4g 
 -Xmx4g 
 -XX:NewSize=1g 
 -XX:MaxNewSize=1g
 -XX:SurvivorRatio=2
 -XX:+PrintCommandLineFlags 
 -XX:+PrintFlagsFinal 
 -XX:+UseParNewGC 
 -XX:+UseConcMarkSweepGC 
 -XX:+CMSParallelRemarkEnabled  
 -XX:ConcGCThreads=5 
 -XX:ParallelGCThreads=10 
 -XX:MaxTenuringThreshold=15 
 -XX:-UseCMSInitiatingOccupancyOnly 
 -XX:CMSInitiatingOccupancyFraction=70 
 -XX:+UseCMSCompactAtFullCollection 
 -XX:CMSFullGCsBeforeCompaction=0

当然,具体的参数设置要根据自己的硬件情况进行调整。

总结

此次GC调优,我的思路主要有两个:

  1. 降低CMS的并发收集阶段的耗时
  2. 减少CMS GC的触发频率

针对1,我采取了,

  1. 提高CMS并发收集的线程数

针对2,我采取了,

  1. 增大新生代大小
  2. 增大survivor大小
  3. 增大MaxTenuringThreshold

其他,

  1. 增加丰富的gc日志便于调优
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2021-08-26,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 章鱼沉思录 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 【Flink】第四篇:【迷思】对update语义拆解D-、I+后造成update原子性丢失
  • 【Flink】第五篇:checkpoint【1】
  • 【Flink】第五篇:checkpoint【2】
  • 【Flink】第八篇:Flink 内存管理
  • 【Flink】第九篇:Flink SQL 性能优化实战
  • 【Flink】第十篇:join 之 regular join
  • 【Flink】第十三篇:JVM思维导图
  • 【Flink】第十四篇:LSM-Tree一般性总结
  • 【Flink】第十五篇:Redis Connector 数据保序思考
  • 【Flink】第十六篇:源码角度分析 sink 端的数据一致性
  • 【Flink】第十七篇:记一次牛轰轰的OOM故障排查
  • 【Flink】第十八篇:Direct Memory 一箩筐
  • 【Flink】第十九篇:从一个批量写HBase性能问题到一个Flink issue的距离
  • 4. 并发清理
相关产品与服务
TDSQL MySQL 版
TDSQL MySQL 版(TDSQL for MySQL)是腾讯打造的一款分布式数据库产品,具备强一致高可用、全球部署架构、分布式水平扩展、高性能、企业级安全等特性,同时提供智能 DBA、自动化运营、监控告警等配套设施,为客户提供完整的分布式数据库解决方案。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档