Losing an ES Node Makes Real-Time Data Ingestion Extremely Slow

Published 2018-05-23 17:17:32


Included in the column: YG小书屋

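One node crashed and could not restart automatically. Data was being ingested through Logstash, and because the current day's index had 0 replicas, losing the node meant that some shards were lost, which left the cluster in red status. After the node was lost, the ingestion speed of that index dropped sharply. Testing showed that Logstash was the cause. In Logstash, the input stage runs in one thread, while filter and output share another thread, with a synchronous queue buffering data between them. If something goes wrong during output, the failed data is put back into the synchronous queue without limit. The queued data is then assigned to shards and sent again, the data routed to the lost shards fails again, and it goes back into the queue again. The data therefore keeps cycling between the synchronous queue and the ES bulk requests, which slows down ingestion for the whole index.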
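The retry loop described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Logstash's actual implementation: the function name `simulate`, the queue capacity, the batch size, and the `event_id % num_shards` routing are all hypothetical choices made for illustration. It shows how events that target lost shards gradually crowd out new events in a bounded queue.

```python
from collections import deque


def simulate(num_shards=8, dead_shards=frozenset({1, 5}),
             queue_capacity=64, batch_size=16, rounds=30):
    """Toy model of the retry loop; returns documents indexed in each round.

    Assumptions (not taken from Logstash itself): a bounded FIFO queue,
    one bulk batch per round, and routing by event_id % num_shards.
    """
    queue = deque()
    next_event_id = 0
    indexed_per_round = []
    for _ in range(rounds):
        # Input stage: blocked whenever the queue is full.
        while len(queue) < queue_capacity:
            queue.append(next_event_id)
            next_event_id += 1
        # Output stage: send one bulk batch taken from the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        indexed = 0
        for event_id in batch:
            if event_id % num_shards in dead_shards:
                queue.append(event_id)  # failed action goes back into the queue
            else:
                indexed += 1            # action reached a live shard
        indexed_per_round.append(indexed)
    return indexed_per_round


if __name__ == "__main__":
    print("healthy :", simulate(dead_shards=frozenset())[:5])
    print("degraded:", simulate()[:5])
```

In this simplified model the failed events are never dropped, so they accumulate until the queue holds nothing else and throughput reaches zero. The model only illustrates the direction of the effect; it does not reproduce the actual rates observed in the test below.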

The results from a test machine are as follows:

1 Normal data ingestion

xxx-20170925              1     p      STARTED   24713  24.7mb xxx.7.67   node-xxx.7.67-performance_test
xxx-20170925              5     p      STARTED   24256  33.7mb xxx.7.67   node-xxx.7.67-performance_test
xxx-20170925              2     p      STARTED   24702  24.2mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              3     p      STARTED   24626  24.2mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              7     p      STARTED   24916  34.2mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              4     p      STARTED   23970  38.2mb xxx.6.105  node-xxx.6.105-performance_test
xxx-20170925              6     p      STARTED   24786    24mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              0     p      STARTED   24824  34.4mb xxx.6.105  node-xxx.6.105-performance_test

2 Shut down one node

xxx-20170925              6     p      STARTED     128179 110.8mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              1     p      UNASSIGNED                                
xxx-20170925              4     p      STARTED     128263 108.1mb xxx.6.105  node-xxx.6.105-performance_test
xxx-20170925              7     p      STARTED     128593 109.3mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              2     p      STARTED     128613 112.8mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              5     p      UNASSIGNED                                
xxx-20170925              3     p      STARTED     127969 115.6mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              0     p      STARTED     128322 110.3mb xxx.6.105  node-xxx.6.105-performance_test

3 After some time, checking the shards shows that the remaining shards are growing very slowly

xxx-20170925              6     p      STARTED     128436 111.1mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              5     p      UNASSIGNED                                
xxx-20170925              3     p      STARTED     128231 110.9mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              7     p      STARTED     128814 109.6mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              1     p      UNASSIGNED                                
xxx-20170925              2     p      STARTED     128871 182.6mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              4     p      STARTED     128502 108.5mb xxx.6.105  node-xxx.6.105-performance_test
xxx-20170925              0     p      STARTED     128568 109.1mb xxx.6.105  node-xxx.6.105-performance_test
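To put a number on how slowly the surviving shards grew between steps 2 and 3, the two listings can be compared shard by shard. The sketch below assumes the column layout shown above (index, shard, prirep, state, docs, store, ip, node); the helper names `parse_shards` and `doc_growth` are hypothetical, and only three rows from each listing are included to keep it short.

```python
def parse_shards(text):
    """Map shard number -> document count (None when the shard is not STARTED)."""
    docs = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip lines that do not look like shard rows
        shard, state = int(fields[1]), fields[3]
        has_count = state == "STARTED" and len(fields) > 4
        docs[shard] = int(fields[4]) if has_count else None
    return docs


def doc_growth(before, after):
    """Documents added per shard, for shards that are started in both listings."""
    return {
        shard: after[shard] - before[shard]
        for shard in sorted(before)
        if before[shard] is not None and after.get(shard) is not None
    }


# Rows copied from the step 2 and step 3 listings above.
STEP_2 = """
xxx-20170925 6 p STARTED 128179 110.8mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 1 p UNASSIGNED
xxx-20170925 4 p STARTED 128263 108.1mb xxx.6.105 node-xxx.6.105-performance_test
"""
STEP_3 = """
xxx-20170925 6 p STARTED 128436 111.1mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 1 p UNASSIGNED
xxx-20170925 4 p STARTED 128502 108.5mb xxx.6.105 node-xxx.6.105-performance_test
"""

if __name__ == "__main__":
    print(doc_growth(parse_shards(STEP_2), parse_shards(STEP_3)))
```

For comparison, each shard gained roughly 100,000 documents between the step 1 and step 2 listings, while between steps 2 and 3 the started shards gained only a few hundred each.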

The Logstash log is as follows:

[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [15] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,784][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
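The log above names the shards that are rejecting bulk requests. A minimal sketch for tallying them is shown below; the regular expression is an assumption based only on the message format visible in these log lines, and the helper name `count_unavailable_shards` is hypothetical.

```python
import re
from collections import Counter

# Matches "[index][shard] primary shard is not active" as it appears above.
UNAVAILABLE = re.compile(r"\[([^\[\]]+)\]\[(\d+)\] primary shard is not active")


def count_unavailable_shards(log_text):
    """Count retry messages per (index, shard) pair."""
    counts = Counter()
    for match in UNAVAILABLE.finditer(log_text):
        index_name, shard = match.group(1), int(match.group(2))
        counts[(index_name, shard)] += 1
    return counts


# Three lines copied from the log above.
SAMPLE_LOG = r"""
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [15] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
"""

if __name__ == "__main__":
    for (index_name, shard), n in sorted(count_unavailable_shards(SAMPLE_LOG).items()):
        print(f"{index_name} shard {shard}: {n} retry message(s)")
```

In the sample, only shards 1 and 5 appear, which matches the two UNASSIGNED shards in the listings from steps 2 and 3.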

4 After the data is recovered

xxx-20170925              4     p      STARTED     154764 125.3mb xxx.6.105  node-xxx.6.105-performance_test
xxx-20170925              5     p      STARTED     157936 126.4mb xxx.7.67   node-xxx.7.67-performance_test
xxx-20170925              2     p      STARTED     154945 138.9mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              7     p      STARTED     155224 156.8mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              1     p      STARTED     158080 124.8mb xxx.7.67   node-xxx.7.67-performance_test
xxx-20170925              3     p      STARTED     154243 153.8mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              6     p      STARTED     154909 146.9mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              0     p      STARTED     154681   127mb xxx.6.105  node-xxx.6.105-performance_test


Originally published: 2017.11.22
