Elasticsearch 7.4 incorrectly reports that a snapshot is already running

Stack Overflow user
Asked on 2020-11-27 18:15:37
2 answers · 501 views · 0 followers · 2 votes

After resolving Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts now and then, I still have some issues in my cluster. Whenever I run the snapshot command it gives me a 503, and when I run it again once or twice it suddenly starts and creates a snapshot. The opster.com online tool hinted at snapshots not being configured, but when I run the verification command it suggests, everything looks fine.

$ curl -s -X POST 'http://127.0.0.1:9201/_snapshot/elastic_backup/_verify?pretty'
{
  "nodes" : {
    "JZHgYyCKRyiMESiaGlkITA" : {
      "name" : "elastic7-1"
    },
    "jllZ8mmTRQmsh8Sxm8eDYg" : {
      "name" : "elastic7-4"
    },
    "TJJ_eHLIRk6qKq_qRWmd3w" : {
      "name" : "elastic7-3"
    },
    "cI-cn4V3RP65qvE3ZR8MXQ" : {
      "name" : "elastic7-2"
    }
  }
}

But then:

curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.11.27]  a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.11.27]  a snapshot is already running"
  },
  "status" : 503
}

Could it be that one of the 4 nodes thinks a snapshot is already running, and that this task is assigned to a random node, so that when I run it a few times it eventually produces a snapshot? If so, how can I find out which node claims a snapshot is already running?
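As a side note on how this might be narrowed down: in 7.x the snapshot-in-progress entry is part of the cluster state, which is coordinated by the elected master rather than a randomly picked node. A minimal sketch of commands that could show where that state lives (the filter_path value and the _status call are assumptions about what is useful, not taken from the original post):

# Which node is currently the elected master (the snapshot coordinator)?
curl -s 'http://127.0.0.1:9201/_cat/master?v'

# Any snapshot-in-progress entries recorded in the cluster state?
curl -s 'http://127.0.0.1:9201/_cluster/state?filter_path=snapshots&pretty'

# Status of snapshots currently running against this repository
curl -s 'http://127.0.0.1:9201/_snapshot/elastic_backup/_status?pretty'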

Also, I noticed that the heap usage on one of the nodes is much higher than on the others. What is normal heap usage?

$ curl -s http://127.0.0.1:9201/_cat/nodes?v
ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.1.215           59          99   7    0.38    0.38     0.36 dilm      -      elastic7-1
10.0.1.218           32          99   1    0.02    0.17     0.22 dilm      *      elastic7-4
10.0.1.212           11          99   1    0.04    0.17     0.21 dilm      -      elastic7-3
10.0.1.209           36          99   3    0.42    0.40     0.36 dilm      -      elastic7-2
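For reference, a sketch of how to look at heap in a bit more detail (the extra _cat columns and the jvm stats call are assumptions about what is useful here, not from the original post):

# Heap per node with absolute values, not just a percentage
curl -s 'http://127.0.0.1:9201/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'

# Per-node JVM stats (memory pools, GC counts) for a closer look
curl -s 'http://127.0.0.1:9201/_nodes/stats/jvm?pretty'

As a rule of thumb, heap that saw-tooths anywhere below roughly 75% is normal; it is sustained usage above that level that tends to indicate GC pressure.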

Last night it happened again, and I am certain nothing was being snapshotted already, so now I ran the following to confirm the strange response; at the very least I would not expect this error at this point.

$ curl http://127.0.0.1:9201/_snapshot/elastic_backup/_current?pretty
{
  "snapshots" : [ ]
}
$ curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.03]  a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.03]  a snapshot is already running"
  },
  "status" : 503
}

When I run it a second (and sometimes third) time, it suddenly creates a snapshot.

Note that if I did not run it that second or third time, no snapshot would ever appear, so I am 100% sure that no snapshot is running at the moment this error occurs.

As far as I know, SLM has not been configured:

{ }
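Presumably that empty object came from listing the SLM policies; a minimal sketch of that check, assuming the _slm/policy endpoint (available as of 7.4) is what produced it:

# List snapshot lifecycle management policies; an empty object means none are configured
curl -s 'http://127.0.0.1:9201/_slm/policy?pretty'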

The repository is configured correctly AFAICT:

$ curl http://127.0.0.1:9201/_snapshot/elastic_backup?pretty
{
  "elastic_backup" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "location" : "elastic_backup"
    }
  }
}

Also, in the configuration it maps to an NFS-mounted folder on Amazon EFS. The folder is available and accessible, and new data shows up in it after a successful snapshot.
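A quick sketch of sanity checks on that shared mount, run on each node; the mount path /mnt/elastic_backup is hypothetical and should be adjusted to the real EFS mount point:

# Hypothetical mount point; confirm it is mounted and writable on every node
mount | grep elastic_backup
df -h /mnt/elastic_backup
touch /mnt/elastic_backup/.write-test && rm /mnt/elastic_backup/.write-test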

As part of the cronjob I have added a query to _cat/tasks?v, so hopefully tonight we will see more, because just now when I ran the command manually it worked without a problem:

$ curl localhost:9201/_cat/tasks?v ; curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty' ; curl localhost:9201/_cat/tasks?v     
action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip         node                                                        
cluster:monitor/tasks/lists    JZHgYyCKRyiMESiaGlkITA:15885091 -                               transport 1607068277045 07:51:17  209.6micros  10.0.1.215 elastic7-1                                                  
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24278976 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277044 07:51:17  62.7micros   10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15885092 JZHgYyCKRyiMESiaGlkITA:15885091 direct    1607068277045 07:51:17  57.4micros   10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23773565 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277045 07:51:17  84.7micros   10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3418325  JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277046 07:51:17  56.9micros   10.0.1.209 elastic7-2                                               
{                                                                                                                                                                  
  "snapshot" : {                                                                                                                                                   
    "snapshot" : "snapshot-2020.12.04",                                                                                                                            
    "uuid" : "u2yQB40sTCa8t9BqXfj_Hg",                                                                                                                                                                          
    "version_id" : 7040099,                                                                                                                                        
    "version" : "7.4.0",                                                                                                                                           
    "indices" : [                                                                                                                                                  
        "log-db-1-2020.06.18-000003",
        "log-db-2-2020.02.19-000002",
        "log-db-1-2019.10.25-000001",
        "log-db-3-2020.11.23-000002",
        "log-db-3-2019.10.25-000001",
        "log-db-2-2019.10.25-000001",
        "log-db-1-2019.10.27-000002"                                                                                                                              
    ],                                                                                                                                                             
    "include_global_state" : true,                                                                                                                                                                              
    "state" : "SUCCESS",                                                                                                                                           
    "start_time" : "2020-12-04T07:51:17.085Z",                                                                                                                                                                  
    "start_time_in_millis" : 1607068277085,                                                                                                                        
    "end_time" : "2020-12-04T07:51:48.537Z",                                                                                                                        
    "end_time_in_millis" : 1607068308537,                                                                                                                                 
    "duration_in_millis" : 31452,                                                                                                                                         
    "failures" : [ ],                                                                                                                                                     
    "shards" : {                                                                                                                                                          
      "total" : 28,                                                                                                                                                       
      "failed" : 0,                                                                                                                                                       
      "successful" : 28                                                                                                                                                   
    }                                                                                                                                                                     
  }                                                                                                                                                                       
}                                                                                                                                                                                                               
action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip         node                                                     
indices:data/read/search       JZHgYyCKRyiMESiaGlkITA:15888939 -                               transport 1607068308987 07:51:48  2.7ms        10.0.1.215 elastic7-1
cluster:monitor/tasks/lists    JZHgYyCKRyiMESiaGlkITA:15888942 -                               transport 1607068308990 07:51:48  223.2micros  10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24282763 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308989 07:51:48  61.5micros   10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15888944 JZHgYyCKRyiMESiaGlkITA:15888942 direct    1607068308990 07:51:48  78.2micros   10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23777841 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308990 07:51:48  63.3micros   10.0.1.218 elastic7-4                                             
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3422139  JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308991 07:51:48  60micros     10.0.1.209 elastic7-2

Last night (2020-12-12) during the cron run I had it execute the following commands:

curl localhost:9201/_cat/tasks?v
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
curl localhost:9201/_cat/tasks?v
sleep 1 
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'

The output was as follows:

action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip         node
cluster:monitor/tasks/lists    JZHgYyCKRyiMESiaGlkITA:78016838 -                               transport 1607736001255 01:20:01  314.4micros  10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228580 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001254 01:20:01  66micros     10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806094 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01  74micros     10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016839 JZHgYyCKRyiMESiaGlkITA:78016838 direct    1607736001255 01:20:01  94.3micros   10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582174 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01  73.6micros   10.0.1.209 elastic7-2
node_name  name     active queue rejected
elastic7-2 snapshot      0     0        0
elastic7-4 snapshot      0     0        0
elastic7-1 snapshot      0     0        0
elastic7-3 snapshot      0     0        0
{            
  "error" : {       
    "root_cause" : [
      {                                            
        "type" : "concurrent_snapshot_execution_exception",                                                                                      
        "reason" : "[elastic_backup:snapshot-2020.12.12]  a snapshot is already running"
      }
    ],                                         
    "type" : "concurrent_snapshot_execution_exception",                                                                                      
    "reason" : "[elastic_backup:snapshot-2020.12.12]  a snapshot is already running"
  },            
  "status" : 503
}
action                         task_id                         parent_task_id                  type      start_time    timestamp running_time ip         node
cluster:monitor/nodes/stats    JZHgYyCKRyiMESiaGlkITA:78016874 -                               transport 1607736001632 01:20:01  39.6ms       10.0.1.215 elastic7-1
cluster:monitor/nodes/stats[n] TJJ_eHLIRk6qKq_qRWmd3w:82228603 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001631 01:20:01  39.2ms       10.0.1.212 elastic7-3
cluster:monitor/nodes/stats[n] jllZ8mmTRQmsh8Sxm8eDYg:55806114 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01  39.5ms       10.0.1.218 elastic7-4
cluster:monitor/nodes/stats[n] cI-cn4V3RP65qvE3ZR8MXQ:63582204 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01  39.4ms       10.0.1.209 elastic7-2
cluster:monitor/nodes/stats[n] JZHgYyCKRyiMESiaGlkITA:78016875 JZHgYyCKRyiMESiaGlkITA:78016874 direct    1607736001632 01:20:01  39.5ms       10.0.1.215 elastic7-1
cluster:monitor/tasks/lists    JZHgYyCKRyiMESiaGlkITA:78016880 -                               transport 1607736001671 01:20:01  348.9micros  10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016881 JZHgYyCKRyiMESiaGlkITA:78016880 direct    1607736001671 01:20:01  188.6micros  10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228608 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001671 01:20:01  106.2micros  10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582209 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01  96.3micros   10.0.1.209 elastic7-2
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806120 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01  97.8micros   10.0.1.218 elastic7-4
node_name  name     active queue rejected
elastic7-2 snapshot      0     0        0
elastic7-4 snapshot      0     0        0
elastic7-1 snapshot      0     0        0
elastic7-3 snapshot      0     0        0
{
  "snapshot" : {
    "snapshot" : "snapshot-2020.12.12",
    "uuid" : "DgwuBxC7SWirjyVlFxBnng",
    "version_id" : 7040099,
    "version" : "7.4.0",
    "indices" : [
      "log-db-sbr-2020.06.18-000003",
      "log-db-other-2020.02.19-000002",
      "log-db-sbr-2019.10.25-000001",
      "log-db-trace-2020.11.23-000002",
      "log-db-trace-2019.10.25-000001",
      "log-db-sbr-2019.10.27-000002",
      "log-db-other-2019.10.25-000001"
    ],
    "include_global_state" : true,
    "state" : "SUCCESS",
    "start_time" : "2020-12-12T01:20:02.544Z",
    "start_time_in_millis" : 1607736002544,
    "end_time" : "2020-12-12T01:20:27.776Z",
    "end_time_in_millis" : 1607736027776,
    "duration_in_millis" : 25232,
    "failures" : [ ],
    "shards" : {
      "total" : 28,
      "failed" : 0,
      "successful" : 28
    }
  }
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_snapshot_name_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
      }
    ],
    "type" : "invalid_snapshot_name_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
  },
  "status" : 400
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_snapshot_name_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
      }
    ],
    "type" : "invalid_snapshot_name_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
  },
  "status" : 400
}

Also, the cluster is currently green, the management queues are not filling up, and everything else seems fine.
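For completeness, a sketch of the checks behind that statement (assuming cluster health, pending cluster-state tasks and the management thread pool are what is meant):

# Overall cluster health and number of pending tasks
curl -s 'http://127.0.0.1:9201/_cluster/health?pretty'

# Queued cluster-state update tasks
curl -s 'http://127.0.0.1:9201/_cat/pending_tasks?v'

# Management thread pool usage per node
curl -s 'http://127.0.0.1:9201/_cat/thread_pool/management?v'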

Also, there is only one repository:

curl http://127.0.0.1:9201/_cat/repositories?v
id             type
elastic_backup   fs

2 Answers

Stack Overflow user

Answered on 2020-12-14 21:38:04

So the problems started with a recent upgrade to Docker 19.03.6, going from 1x Docker Swarm manager + 4x Docker Swarm workers to 5x Docker Swarm managers + 4x Docker Swarm workers. In both setups Elastic runs on the workers. Because of this upgrade/change we saw the number of network interfaces inside the containers change, and as a result we had to use 'publish_host' in Elastic to get it working again.

To fix this we had to stop publishing the Elastic ports on the ingress network, so that the extra network interface went away. After that we could remove the 'publish_host' setting, which already made things better. But to really fix our problem we had to change the Docker Swarm deploy endpoint_mode to dnsrr, so that traffic no longer goes through the Docker Swarm routing mesh.
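A minimal sketch of what that change can look like on an existing service; the service name 'elastic' is hypothetical, and whether you drop the published port entirely or republish it in host mode depends on your setup:

# dnsrr bypasses the Swarm routing mesh (the VIP on the ingress network),
# but it is not compatible with ingress-published ports, so remove the
# published port (or republish it in host mode) first.
docker service update \
  --publish-rm 9200 \
  --endpoint-mode dnsrr \
  elastic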

We have always had 'connection reset by peer' issues, but since that change they got worse and caused strange problems in Elasticsearch. I guess running Elasticsearch in Docker Swarm (or Kubernetes or anything else) can be a hard thing to debug.

Using tcpdump inside the containers and conntrack on the hosts, we could see perfectly normal connections being reset for no apparent reason. Another solution would be to have the kernel drop the mismatched packets (instead of sending resets), but in our case avoiding DNAT/SNAT as much as possible also seems to have fixed the problem.
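For reference, a commonly documented form of that 'drop instead of reset' approach; this is an assumption about what was meant, not necessarily the exact rules used here:

# Drop packets that conntrack marks INVALID instead of letting them
# provoke TCP resets on NATed connections
iptables -I INPUT -m conntrack --ctstate INVALID -j DROP

# Optionally relax conntrack's TCP window tracking so fewer packets are
# classified as INVALID in the first place
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1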

Votes: 1

Stack Overflow user

Answered on 2020-11-30 16:42:47

Elasticsearch 7.4 supports only one snapshot operation at a time.

From the error it looks like a previously triggered snapshot is still running when you trigger a new one, and Elasticsearch throws concurrent_snapshot_execution_exception.

You can check the list of currently running snapshots with GET /_snapshot/elastic_backup/_current.

I would suggest first using the API above to check whether a snapshot operation is running in your Elasticsearch cluster, and only triggering a new snapshot if no snapshot operation is currently running.
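A minimal sketch of that guard in a cron script, reusing the endpoints from the question and assuming jq is available on the host:

# Only trigger a new snapshot when nothing is currently running
running=$(curl -s 'http://127.0.0.1:9201/_snapshot/elastic_backup/_current' | jq '.snapshots | length')
if [ "$running" -eq 0 ]; then
  curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
fi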

P.S.: Starting with Elasticsearch 7.7, Elasticsearch also supports concurrent snapshots, so if you plan to run concurrent snapshot operations in your cluster you should upgrade to ES 7.7 or later.

Votes: 0
Original question: https://stackoverflow.com/questions/65035556
