After solving "Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts now and then", I still have issues in my cluster. Every time I run the snapshot command it gives me a 503, and when I run it again once or twice it suddenly kicks in and creates a snapshot. The opster.com online tool hints at something about a snapshot not being configured, but when I run the verify command it suggests, everything looks fine:
$ curl -s -X POST 'http://127.0.0.1:9201/_snapshot/elastic_backup/_verify?pretty'
{
"nodes" : {
"JZHgYyCKRyiMESiaGlkITA" : {
"name" : "elastic7-1"
},
"jllZ8mmTRQmsh8Sxm8eDYg" : {
"name" : "elastic7-4"
},
"TJJ_eHLIRk6qKq_qRWmd3w" : {
"name" : "elastic7-3"
},
"cI-cn4V3RP65qvE3ZR8MXQ" : {
"name" : "elastic7-2"
}
}
}
But then:
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
"error" : {
"root_cause" : [
{
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.11.27] a snapshot is already running"
}
],
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.11.27] a snapshot is already running"
},
"status" : 503
}
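(As an aside for readers of the commands above: the odd-looking snapshot name in the URL is Elasticsearch date math. `<snapshot-{now/d}>` resolves to a name like `snapshot-2020.11.27`, and the characters `<`, `>`, `{`, `}`, `/` must be percent-encoded in the URL path. A minimal sketch of that encoding:)

```shell
# Date-math snapshot name as used in the PUT requests in this question.
name='<snapshot-{now/d}>'

# Percent-encode the characters that may not appear raw in a URL path.
encoded=$(printf '%s' "$name" \
  | sed -e 's/</%3C/g' -e 's/>/%3E/g' -e 's/{/%7B/g' -e 's/}/%7D/g' -e 's|/|%2F|g')

echo "$encoded"   # %3Csnapshot-%7Bnow%2Fd%7D%3E
```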
Could it be that one of the 4 nodes thinks a snapshot is already running, and that this task is randomly assigned to one of the nodes, so that after I run it a few times it eventually produces a snapshot? If so, how can I find out which node claims that a snapshot is already running?
Also, I noticed that one of the nodes has a much higher heap usage; what is normal heap usage?
$ curl -s http://127.0.0.1:9201/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.1.215 59 99 7 0.38 0.38 0.36 dilm - elastic7-1
10.0.1.218 32 99 1 0.02 0.17 0.22 dilm * elastic7-4
10.0.1.212 11 99 1 0.04 0.17 0.21 dilm - elastic7-3
10.0.1.209 36 99 3 0.42 0.40 0.36 dilm - elastic7-2
Last night it happened again, and I am sure nothing was already snapshotting, so this time I ran the following commands to document the strange responses; at this point, at least, I would not expect to get this error.
$ curl http://127.0.0.1:9201/_snapshot/elastic_backup/_current?pretty
{
"snapshots" : [ ]
}
$ curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
"error" : {
"root_cause" : [
{
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.12.03] a snapshot is already running"
}
],
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.12.03] a snapshot is already running"
},
"status" : 503
}
When I run it a second (and sometimes a third) time, it suddenly creates a snapshot.
Note that if I do not re-run it, that second or third snapshot never appears on its own, so I am 100% certain that no snapshot is running at the moment this error occurs.
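Until the root cause is found, this "second or third run works" behaviour could be scripted as a plain retry. This is a hypothetical stopgap wrapper around the exact PUT used above, not a fix; `ES_URL` is an assumption matching the endpoint used elsewhere in this question:

```shell
# Hypothetical stopgap: retry the snapshot PUT a few times, since the 503
# appears to be transient in this cluster.
ES_URL=${ES_URL:-http://127.0.0.1:9201}

snapshot_with_retry() {
  tries=${1:-3}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -w '%{http_code}' makes curl print only the HTTP status code
    code=$(curl -s -o /dev/null -w '%{http_code}' -X PUT \
      "$ES_URL/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true")
    if [ "$code" = "200" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}
```

From cron this would be invoked as e.g. `snapshot_with_retry 3` instead of the bare curl.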
As far as I can tell, SLM has not been configured:
{ }
The repository is configured correctly AFAICT:
$ curl http://127.0.0.1:9201/_snapshot/elastic_backup?pretty
{
"elastic_backup" : {
"type" : "fs",
"settings" : {
"compress" : "true",
"location" : "elastic_backup"
}
}
}
Also, in the configuration the repository is mapped to an NFS-mounted folder on Amazon EFS. It is available and accessible, and new data shows up in it after a successful snapshot.
As part of the cronjob I have added a query of `_cat/tasks?v`, so hopefully tonight we will see more. Because just now, when I ran the command manually, it went through without a problem:
$ curl localhost:9201/_cat/tasks?v ; curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty' ; curl localhost:9201/_cat/tasks?v
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:15885091 - transport 1607068277045 07:51:17 209.6micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24278976 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277044 07:51:17 62.7micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15885092 JZHgYyCKRyiMESiaGlkITA:15885091 direct 1607068277045 07:51:17 57.4micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23773565 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277045 07:51:17 84.7micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3418325 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277046 07:51:17 56.9micros 10.0.1.209 elastic7-2
{
"snapshot" : {
"snapshot" : "snapshot-2020.12.04",
"uuid" : "u2yQB40sTCa8t9BqXfj_Hg",
"version_id" : 7040099,
"version" : "7.4.0",
"indices" : [
"log-db-1-2020.06.18-000003",
"log-db-2-2020.02.19-000002",
"log-db-1-2019.10.25-000001",
"log-db-3-2020.11.23-000002",
"log-db-3-2019.10.25-000001",
"log-db-2-2019.10.25-000001",
"log-db-1-2019.10.27-000002"
],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2020-12-04T07:51:17.085Z",
"start_time_in_millis" : 1607068277085,
"end_time" : "2020-12-04T07:51:48.537Z",
"end_time_in_millis" : 1607068308537,
"duration_in_millis" : 31452,
"failures" : [ ],
"shards" : {
"total" : 28,
"failed" : 0,
"successful" : 28
}
}
}
action task_id parent_task_id type start_time timestamp running_time ip node
indices:data/read/search JZHgYyCKRyiMESiaGlkITA:15888939 - transport 1607068308987 07:51:48 2.7ms 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:15888942 - transport 1607068308990 07:51:48 223.2micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24282763 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308989 07:51:48 61.5micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15888944 JZHgYyCKRyiMESiaGlkITA:15888942 direct 1607068308990 07:51:48 78.2micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23777841 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308990 07:51:48 63.3micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3422139 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308991 07:51:48 60micros 10.0.1.209 elastic7-2
Last night (2020-12-12) I had the cron run the following commands:
curl localhost:9201/_cat/tasks?v
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
curl localhost:9201/_cat/tasks?v
sleep 1
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
Their output was as follows:
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:78016838 - transport 1607736001255 01:20:01 314.4micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228580 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001254 01:20:01 66micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806094 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01 74micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016839 JZHgYyCKRyiMESiaGlkITA:78016838 direct 1607736001255 01:20:01 94.3micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582174 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01 73.6micros 10.0.1.209 elastic7-2
node_name name active queue rejected
elastic7-2 snapshot 0 0 0
elastic7-4 snapshot 0 0 0
elastic7-1 snapshot 0 0 0
elastic7-3 snapshot 0 0 0
{
"error" : {
"root_cause" : [
{
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] a snapshot is already running"
}
],
"type" : "concurrent_snapshot_execution_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] a snapshot is already running"
},
"status" : 503
}
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/nodes/stats JZHgYyCKRyiMESiaGlkITA:78016874 - transport 1607736001632 01:20:01 39.6ms 10.0.1.215 elastic7-1
cluster:monitor/nodes/stats[n] TJJ_eHLIRk6qKq_qRWmd3w:82228603 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001631 01:20:01 39.2ms 10.0.1.212 elastic7-3
cluster:monitor/nodes/stats[n] jllZ8mmTRQmsh8Sxm8eDYg:55806114 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01 39.5ms 10.0.1.218 elastic7-4
cluster:monitor/nodes/stats[n] cI-cn4V3RP65qvE3ZR8MXQ:63582204 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01 39.4ms 10.0.1.209 elastic7-2
cluster:monitor/nodes/stats[n] JZHgYyCKRyiMESiaGlkITA:78016875 JZHgYyCKRyiMESiaGlkITA:78016874 direct 1607736001632 01:20:01 39.5ms 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:78016880 - transport 1607736001671 01:20:01 348.9micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016881 JZHgYyCKRyiMESiaGlkITA:78016880 direct 1607736001671 01:20:01 188.6micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228608 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001671 01:20:01 106.2micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582209 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01 96.3micros 10.0.1.209 elastic7-2
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806120 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01 97.8micros 10.0.1.218 elastic7-4
node_name name active queue rejected
elastic7-2 snapshot 0 0 0
elastic7-4 snapshot 0 0 0
elastic7-1 snapshot 0 0 0
elastic7-3 snapshot 0 0 0
{
"snapshot" : {
"snapshot" : "snapshot-2020.12.12",
"uuid" : "DgwuBxC7SWirjyVlFxBnng",
"version_id" : 7040099,
"version" : "7.4.0",
"indices" : [
"log-db-sbr-2020.06.18-000003",
"log-db-other-2020.02.19-000002",
"log-db-sbr-2019.10.25-000001",
"log-db-trace-2020.11.23-000002",
"log-db-trace-2019.10.25-000001",
"log-db-sbr-2019.10.27-000002",
"log-db-other-2019.10.25-000001"
],
"include_global_state" : true,
"state" : "SUCCESS",
"start_time" : "2020-12-12T01:20:02.544Z",
"start_time_in_millis" : 1607736002544,
"end_time" : "2020-12-12T01:20:27.776Z",
"end_time_in_millis" : 1607736027776,
"duration_in_millis" : 25232,
"failures" : [ ],
"shards" : {
"total" : 28,
"failed" : 0,
"successful" : 28
}
}
}
{
"error" : {
"root_cause" : [
{
"type" : "invalid_snapshot_name_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
}
],
"type" : "invalid_snapshot_name_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
},
"status" : 400
}
{
"error" : {
"root_cause" : [
{
"type" : "invalid_snapshot_name_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
}
],
"type" : "invalid_snapshot_name_exception",
"reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
},
"status" : 400
}
Also, the cluster is currently green, the management queue is not filling up, and everything else seems fine.
Also, there is only one repository:
curl http://127.0.0.1:9201/_cat/repositories?v
id type
elastic_backup fs
Posted on 2020-12-14 21:38:04
So the problem started with a recent upgrade to Docker 19.03.6, going from 1x Docker Swarm manager + 4x Docker Swarm worker to 5x Docker Swarm manager + 4x Docker Swarm worker. In both setups Elastic runs on the workers. With this upgrade/change we saw the number of network interfaces inside the containers change, and because of that we had to use 'publish_host' in Elastic to get it working again.
To fix things we had to stop publishing the Elastic port on the ingress network, so the extra network interface disappeared. Next we could remove the 'publish_host' setting. That made things better. But to really fix our problems we had to change the Docker Swarm deployment endpoint_mode to dnsrr, so that traffic is not routed through the Docker Swarm routing mesh.
We have always had 'connection reset by peer' problems, but since the change they got worse and caused weird problems in Elasticsearch. I guess running Elasticsearch in Docker Swarm (or Kubernetes, or anything similar) can be a hard thing to debug.
Using tcpdump inside the containers and conntrack on the hosts, we were able to see perfectly normal connections being reset for no reason. Another solution would have been to have the kernel drop mismatching packets (instead of answering them with resets), but preventing the use of DNAT/SNAT as much as possible also seems to have solved the problem in our case.
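For reference, the two changes described above could be sketched roughly as follows. The service name is hypothetical, and the sysctl/iptables lines are the commonly used mitigation for conntrack-induced resets behind NAT, not something taken verbatim from this deployment:

```shell
# 1. In the Swarm stack file, take Elasticsearch off the ingress routing
#    mesh so its traffic is no longer DNAT/SNAT-ed (compose fragment):
#
#      services:
#        elasticsearch:
#          deploy:
#            endpoint_mode: dnsrr
#
# 2. On the worker hosts, have the kernel tolerate/drop conntrack-invalid
#    packets instead of answering them with a TCP RST (requires root):
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
iptables -I INPUT -m conntrack --ctstate INVALID -j DROP
```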
Posted on 2020-11-30 16:42:47
Elasticsearch version 7.4 supports only one snapshot operation at a time. From the error it looks like a previously triggered snapshot is still running when you trigger a new one, and Elasticsearch throws a concurrent_snapshot_execution_exception.
You can check the list of currently running snapshots with GET /_snapshot/elastic_backup/_current.
I suggest you first use the API above to check whether the cluster is running any snapshot operation. Only if no snapshot operation is currently running should you trigger a new snapshot.
P.S.: From version 7.7 onwards, Elasticsearch also supports concurrent snapshots. So if you plan to perform concurrent snapshot operations in your cluster, you should upgrade to ES 7.7 or later.
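The check-then-trigger advice can be sketched as a small shell guard. The repository name and URL are taken from the question; comparing against the whitespace-stripped response is a jq-free simplification, so treat this as an illustration rather than a robust JSON check:

```shell
ES_URL=${ES_URL:-http://127.0.0.1:9201}

snapshot_if_idle() {
  # _current lists the snapshots still in progress in this repository
  current=$(curl -s "$ES_URL/_snapshot/elastic_backup/_current")
  # strip whitespace so pretty-printed and compact JSON compare equally
  compact=$(printf '%s' "$current" | tr -d ' \n\r\t')
  if [ "$compact" = '{"snapshots":[]}' ]; then
    # nothing running -> safe to trigger a new snapshot
    curl -s -X PUT "$ES_URL/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty"
  else
    echo 'a snapshot is already running; skipping' >&2
    return 1
  fi
}
```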
https://stackoverflow.com/questions/65035556