I want VMs to migrate automatically when a node dies. I created a Proxmox cluster, set up replication, and installed the IPMI watchdog, but after losing one node nothing happens. I followed https://pve.proxmox.com/pve-docs/chapter-ha-manager.html and https://pve.proxmox.com/wiki/High_Availability_Cluster_4.x#Hardware_Watchdogs
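For reference, the setup above was created roughly with the following commands (the group HA and resource ct:100 match the config shown below; the cluster name and IP are placeholders, and exact options may differ by version):

pvecm create <clustername>                         # on node1
pvecm add <IP-of-node1>                            # on node2, to join the cluster
ha-manager groupadd HA --nodes "node1,node2"       # HA group spanning both nodes
ha-manager add ct:100 --group HA --max_restart 0   # put the container under HA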
ha-manager config
ct:100
group HA
max_restart 0
state started
ha-manager status
quorum OK
master node1 (active, Mon May 18 09:18:59 2020)
lrm node1 (idle, Mon May 18 09:19:00 2020)
lrm node2 (active, Mon May 18 09:19:02 2020)
service ct:100 (node2, started)
When I shut down node2, I captured the following log on node1:
May 18 08:12:37 node1 pve-ha-crm[2222]: lost lock 'ha_manager_lock - cfs lock update failed - Operation not permitted
May 18 08:12:38 node1 pmxcfs[2008]: [dcdb] notice: start cluster connection
May 18 08:12:38 node1 pmxcfs[2008]: [dcdb] crit: cpg_join failed: 14
May 18 08:12:38 node1 pmxcfs[2008]: [dcdb] crit: can't initialize service
May 18 08:12:42 node1 pve-ha-crm[2222]: status change master => lost_manager_lock
May 18 08:12:42 node1 pve-ha-crm[2222]: watchdog closed (disabled)
May 18 08:12:42 node1 pve-ha-crm[2222]: status change lost_manager_lock => wait_for_quorum
May 18 08:12:44 node1 pmxcfs[2008]: [dcdb] notice: members: 1/2008
May 18 08:12:44 node1 pmxcfs[2008]: [dcdb] notice: all data is up to date
May 18 08:13:00 node1 systemd[1]: Starting Proxmox VE replication runner...
May 18 08:13:01 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:02 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:03 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:04 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:05 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:05 node1 pveproxy[39495]: proxy detected vanished client connection
May 18 08:13:06 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:07 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:08 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:09 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:10 node1 pvesr[40781]: error with cfs lock 'file-replication_cfg': no quorum!
Answered 2020-10-13 23:08:03
The problem is quorum, and it is neither trivial nor intuitive. As soon as you set up a Proxmox cluster, the quorum mechanism is enabled: to do anything on the cluster it needs votes from the nodes that know what is going on, and it needs 50% of the configured nodes + 1 to reach a decision.
When you create a 2-node cluster there is an unfortunate default: it needs 50% + 1 = 2 nodes to do anything. So despite being a "cluster", if one node dies you cannot even start a VM/container until both nodes are working again.
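You can see the vote math with pvecm status on the surviving node; a sketch of the relevant part of the output after one of two nodes has died (values illustrative, the exact layout varies slightly between versions):

pvecm status
...
Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked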
There is a workaround: in corosync.conf (/etc/corosync/corosync.conf) you have to enable two parameters: two_node: 1 and wait_for_all: 0.
The first one means that in a two-node cluster a single vote is enough to operate. But there is another trap for young players: setting two_node automatically enables wait_for_all, which blocks cluster operations after power-on until all nodes have appeared, so it effectively breaks the cluster all over again. That is why you have to override it as well.
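A sketch of how the quorum section of corosync.conf looks with both overrides applied (everything else stays as Proxmox generated it):

quorum {
  provider: corosync_votequorum
  two_node: 1
  wait_for_all: 0
}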
Read this man page carefully: https://www.systutorials.com/docs/linux/man/5-votequorum/
But there is one more catch: there are two copies of corosync.conf, /etc/corosync/corosync.conf and /etc/pve/corosync.conf.
When the second one is changed it overwrites the first, so you have to edit the latter. But while your second node is down, you first have to relax quorum for a while before you can edit the file.
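A sketch of that procedure on the surviving node, assuming the other node is already down (remember to increment config_version inside the file, otherwise the change is not propagated):

pvecm expected 1                                    # temporarily tell votequorum that 1 vote is enough, so /etc/pve becomes writable
cp /etc/pve/corosync.conf /root/corosync.conf.new   # edit a copy, as the Proxmox docs recommend
nano /root/corosync.conf.new                        # add two_node/wait_for_all, bump config_version
cp /root/corosync.conf.new /etc/pve/corosync.conf   # writing it back distributes the new config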
https://stackoverflow.com/questions/61863694