Previous posts covered HeartBeat and its HA cluster deployment, DRBD principles and practice, DRBD configuration and management, the corosync+pacemaker high-availability architecture, LVS+Keepalived, and haproxy+keepalived+nginx load balancing for Kubernetes clusters. This post is a detailed, hands-on walkthrough of a Pacemaker+Corosync high-availability setup.
Disable firewalld and SELinux (server1 and server2).
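On RHEL 7 this amounts to the following, run on both nodes (the same steps appear again in the PostgreSQL section later):
systemctl stop firewalld
systemctl disable firewalld
setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config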
On RHEL, the installation image ships with additional High Availability and Resilient Storage packages, stored under the addons directory of the image.
So a yum repository pointing at them needs to be configured (server1 and server2).
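A minimal sketch of such a repo file, assuming the installation image is mounted at /media/rhel (the mount point is an assumption; the addons subdirectories match RHEL 7 media):
cat > /etc/yum.repos.d/rhel-addons.repo <<EOF
[HighAvailability]
name=HighAvailability
baseurl=file:///media/rhel/addons/HighAvailability
gpgcheck=0
[ResilientStorage]
name=ResilientStorage
baseurl=file:///media/rhel/addons/ResilientStorage
gpgcheck=0
EOF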
Then install the packages (server1 and server2):
yum install pacemaker corosync -y
Install the cluster management tool pcs along with its dependencies psmisc and policycoreutils-python (on both nodes):
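yum install -y pcs psmisc policycoreutils-python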
The pcs command talks to the pcsd daemon, so start pcsd and enable it at boot:
[root@server1 ~]# systemctl enable pcsd --now
Created symlink from /etc/systemd/system/multi-user.target.wants/pcsd.service to /usr/lib/systemd/system/pcsd.service.
Set up authentication (server1 and server2) by creating a password for the hacluster user:
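For example (run on both nodes, using the same password):
passwd hacluster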
[root@server1 ~]# pcs cluster auth server1 server2
server1: Already authorized
server2: Already authorized
Create the cluster (run the setup on one node):
pcs cluster setup --name mycluster server1 server2
[root@server1 ~]# pcs cluster start --all
server1: Starting Cluster (corosync)...
server2: Starting Cluster (corosync)...
server1: Starting Cluster (pacemaker)...
server2: Starting Cluster (pacemaker)...
Check the corosync ring status:
[root@server1 ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 172.25.5.1
status = ring 0 active with no faults
[root@server1 ~]# pcs cluster status
Cluster Status:
Stack: corosync
Current DC: server1 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug 6 12:00:02 2020
Last change: Wed Aug 5 09:10:47 2020 by root via cibadmin on server2
2 nodes configured
2 resources configured
PCSD Status:
server1: Online
server2: Online
The high-availability cluster is up! Next, add a virtual IP (vip) resource:
pcs resource create vip ocf:heartbeat:IPaddr2 ip=172.25.5.99 op monitor interval=30s
This adds a resource named vip (172.25.5.99) to the cluster with a 30-second monitor interval; ocf:heartbeat:IPaddr2 is the resource agent (startup script) that manages the address.
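The resource's configured parameters can be inspected with:
pcs resource show vip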
[root@server1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: server1 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug 6 13:35:45 2020
Last change: Thu Aug 6 13:35:27 2020 by root via cibadmin on server1
2 nodes configured
1 resource configured
Online: [ server1 server2 ]
Full list of resources:
vip (ocf::heartbeat:IPaddr2): Started server2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
As shown, the vip has been placed on server2:
[root@server2 ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:12:90:b5 brd ff:ff:ff:ff:ff:ff
inet 172.25.5.2/16 brd 172.25.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet 172.25.5.99/16 brd 172.25.255.255 scope global secondary eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe12:90b5/64 scope link
valid_lft forever preferred_lft forever
Now stop the cluster services on server2; server1 takes over the vip:
[root@server1 ~]# pcs cluster stop server2
server2: Stopping Cluster (pacemaker)...
server2: Stopping Cluster (corosync)...
[root@server1 ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:49:4e:8f brd ff:ff:ff:ff:ff:ff
inet 172.25.5.1/16 brd 172.25.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet 172.25.5.99/16 brd 172.25.255.255 scope global secondary eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe49:4e8f/64 scope link
valid_lft forever preferred_lft forever
[root@server1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: server1 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug 6 13:43:13 2020
Last change: Thu Aug 6 13:35:27 2020 by root via cibadmin on server1
2 nodes configured
1 resource configured
Online: [ server1 ]
OFFLINE: [ server2 ]
Full list of resources:
vip (ocf::heartbeat:IPaddr2): Started server1
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
The vip is now gone from server2:
[root@server2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:12:90:b5 brd ff:ff:ff:ff:ff:ff
inet 172.25.5.2/16 brd 172.25.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe12:90b5/64 scope link
valid_lft forever preferred_lft forever
If the vip is deleted by hand, it is recreated automatically:
[root@server1 ~]# ip addr del 172.25.5.99 dev eth0
Warning: Executing wildcard deletion to stay compatible with old scripts.
Explicitly specify the prefix length (172.25.5.99/32) to avoid this warning.
This special behaviour is likely to disappear in further releases,
fix your scripts!
[root@server1 ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:49:4e:8f brd ff:ff:ff:ff:ff:ff
inet 172.25.5.1/16 brd 172.25.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe49:4e8f/64 scope link
valid_lft forever preferred_lft forever
A few seconds later the monitor notices the missing address and the agent re-adds it:
[root@server1 ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:49:4e:8f brd ff:ff:ff:ff:ff:ff
inet 172.25.5.1/16 brd 172.25.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet 172.25.5.99/16 brd 172.25.255.255 scope global secondary eth0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe49:4e8f/64 scope link
valid_lft forever preferred_lft forever
When the cluster finds the vip missing, it calls the resource agent ocf:heartbeat:IPaddr2 to recreate it.
Network failure test (bring down the NIC on the host currently holding the vip):
ifdown eth0
The vip migrates to server2:
[root@server2 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: server2 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug 6 14:01:20 2020
Last change: Thu Aug 6 13:35:27 2020 by root via cibadmin on server1
2 nodes configured
1 resource configured
Online: [ server2 ]
OFFLINE: [ server1 ]
Full list of resources:
vip (ocf::heartbeat:IPaddr2): Started server2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
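To bring server1 back after this test (assumed recovery steps: restore the link from the console, then rejoin the node):
ifup eth0
pcs cluster start server1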
If the required resources include both httpd and a vip, creating them individually as above can leave httpd and the vip on different servers. A resource group should be used instead.
The available resource standards (ways a resource agent can be invoked):
[root@server1 ~]# pcs resource standards
lsb
ocf
service
systemd
When adding Apache as a resource, use the systemd standard:
pcs resource create apache systemd:httpd op monitor interval=1min
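This assumes httpd is already installed on both nodes and left disabled in systemd, so that only Pacemaker starts it; judging by the curl output below, each node's test page simply contains its hostname:
yum install -y httpd
echo $(hostname) > /var/www/html/index.html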
[root@server1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: server2 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug 6 14:31:20 2020
Last change: Thu Aug 6 14:31:06 2020 by root via cibadmin on server1
2 nodes configured
2 resources configured
Online: [ server1 server2 ]
Full list of resources:
vip (ocf::heartbeat:IPaddr2): Started server2
apache (systemd:httpd): Started server1
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
As you can see, apache and the vip are not bound to the same server, so httpd cannot be reached through the vip.
Put both resources into a group:
pcs resource group add webgroup vip apache
Note: resources in a group start in the listed order, so the vip comes up before apache.
[root@server1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: server2 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug 6 14:34:23 2020
Last change: Thu Aug 6 14:33:26 2020 by root via cibadmin on server1
2 nodes configured
2 resources configured
Online: [ server1 server2 ]
Full list of resources:
Resource Group: webgroup
vip (ocf::heartbeat:IPaddr2): Started server2
apache (systemd:httpd): Started server2
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@server1 ~]# curl 172.25.5.99
server2
Now stop the cluster on server2; the vip and apache migrate to server1 as a unit:
[root@server1 ~]# pcs cluster stop server2
server2: Stopping Cluster (pacemaker)...
server2: Stopping Cluster (corosync)...
[root@server1 ~]# pcs status
Cluster name: mycluster
Stack: corosync
Current DC: server1 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
Last updated: Thu Aug 6 14:39:00 2020
Last change: Thu Aug 6 14:33:26 2020 by root via cibadmin on server1
2 nodes configured
2 resources configured
Online: [ server1 ]
OFFLINE: [ server2 ]
Full list of resources:
Resource Group: webgroup
vip (ocf::heartbeat:IPaddr2): Started server1
apache (systemd:httpd): Started server1
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@server1 ~]# curl 172.25.5.99
server1
The second half of this walkthrough builds a PostgreSQL high-availability cluster on Pacemaker+Corosync, using PostgreSQL 10 as the example. Unless stated otherwise, run the commands below on all nodes. Install some basic tools first:
yum -y install vim lrzsz bash-completion
Add host entries for name resolution:
echo 192.168.0.11 pgsql1 >> /etc/hosts
echo 192.168.0.12 pgsql2 >> /etc/hosts
echo 192.168.0.13 pgsql3 >> /etc/hosts
Install and enable time synchronization with chrony:
yum -y install chrony
systemctl start chronyd
systemctl enable chronyd
systemctl status chronyd
chronyc sources
Disable firewalld and SELinux:
systemctl stop firewalld
systemctl disable firewalld
setenforce 0
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
Install Pacemaker and Corosync:
yum -y install pacemaker corosync pcs ipvsadm
Start pcsd and enable it at boot:
systemctl start pcsd
systemctl enable pcsd
systemctl status pcsd
Set the hacluster user's password (the same on all nodes):
echo hacluster | passwd hacluster --stdin
Authenticate the cluster nodes (run on any one node):
pcs cluster auth -u hacluster -p hacluster pgsql1 pgsql2 pgsql3
Create the cluster and sync its configuration (run on any one node):
pcs cluster setup --last_man_standing=1 --name pgcluster pgsql1 pgsql2 pgsql3
Start the cluster (run on any one node):
pcs cluster start --all
Enable Pacemaker and Corosync at boot:
systemctl enable pacemaker
systemctl enable corosync
Check which PostgreSQL versions the bundled pgsql resource agent supports:
cat /usr/lib/ocf/resource.d/heartbeat/pgsql | grep ocf_version_cmp
Install PostgreSQL 10 from the PGDG repository (see https://www.postgresql.org/download):
yum -y install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum -y install postgresql10-server
Configure the postgres user's environment:
su - postgres
Edit .bash_profile and append:
export PATH=$PATH:/usr/pgsql-10/bin
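Then re-read the profile (or log in again):
source ~/.bash_profile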
Initialize PostgreSQL on the pgsql1 node:
/usr/pgsql-10/bin/postgresql-10-setup initdb
Configure remote login and replication access on pgsql1. In /var/lib/pgsql/10/data/postgresql.conf, set:
listen_addresses = '*'
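On PostgreSQL 10 the streaming-replication defaults (wal_level = replica, max_wal_senders = 10, hot_standby = on) already suffice; to be explicit, or on older releases, you would also set something like the following (values are assumptions, tune them to your environment):
wal_level = replica
max_wal_senders = 10
wal_keep_segments = 64
hot_standby = on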
In /var/lib/pgsql/10/data/pg_hba.conf, add:
# IPv4 local connections:
host all all 0.0.0.0/0 md5
# replication privilege.
host replication repluser 192.168.0.0/24 md5
Start PostgreSQL on pgsql1:
su - postgres
pg_ctl start
pg_ctl status
Set the postgres database password on pgsql1:
su - postgres
psql -U postgres
ALTER USER postgres WITH ENCRYPTED PASSWORD '111111';
\du
\q
Create the replication user on pgsql1:
su - postgres
psql
CREATE USER repluser WITH REPLICATION PASSWORD '111111';
\du
\q
On pgsql2 and pgsql3, clone the data from pgsql1 with a base backup (the target data directory must be empty):
su - postgres
pg_basebackup -h pgsql1 -U repluser -D /var/lib/pgsql/10/data -P -v
Stop PostgreSQL on pgsql1; from here on, the cluster will manage it:
su - postgres
pg_ctl stop
pg_ctl status
Configure the PCS resources on pgsql1. First create a working copy of the CIB; all changes are staged in this file and pushed to the live cluster at the end:
pcs cluster cib pgsql_cfg
Ignore loss of quorum at the Pacemaker level:
pcs -f pgsql_cfg property set no-quorum-policy=ignore
Disable STONITH:
pcs -f pgsql_cfg property set stonith-enabled=false
Set resource stickiness so resources stay put after a failed node recovers:
pcs -f pgsql_cfg resource defaults resource-stickiness=INFINITY
Fail a resource over after 3 failures:
pcs -f pgsql_cfg resource defaults migration-threshold=3
Create the master virtual IP:
pcs -f pgsql_cfg resource create vip-master IPaddr2 ip=192.168.0.10 cidr_netmask=24 \
   op start timeout=60s interval=0s on-fail=restart \
   op monitor timeout=60s interval=10s on-fail=restart \
   op stop timeout=60s interval=0s on-fail=block
Create the slave virtual IP:
pcs -f pgsql_cfg resource create vip-slave IPaddr2 ip=192.168.0.20 cidr_netmask=24 \
   op start timeout=60s interval=0s on-fail=restart \
   op monitor timeout=60s interval=10s on-fail=restart \
   op stop timeout=60s interval=0s on-fail=block
Create the pgsql resource itself. Here rep_mode=sync enables synchronous replication, node_list names the cluster members, master_ip is the address standbys stream WAL from, and restart_on_promote=true restarts PostgreSQL when a node is promoted; the two monitor ops give the Master role its own check interval:
pcs -f pgsql_cfg resource create pgsql pgsql \
   pgctl=/usr/pgsql-10/bin/pg_ctl psql=/usr/pgsql-10/bin/psql \
   pgdata=/var/lib/pgsql/10/data config=/var/lib/pgsql/10/data/postgresql.conf \
   rep_mode=sync node_list="pgsql1 pgsql2 pgsql3" master_ip=192.168.0.10 repuser=repluser \
   primary_conninfo_opt="password=111111 keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
   restart_on_promote=true \
   op start timeout=60s interval=0s on-fail=restart \
   op monitor timeout=60s interval=4s on-fail=restart \
   op monitor timeout=60s interval=3s on-fail=restart role=Master \
   op promote timeout=60s interval=0s on-fail=restart \
   op demote timeout=60s interval=0s on-fail=stop \
   op stop timeout=60s interval=0s on-fail=block
Turn pgsql into a master/slave set:
pcs -f pgsql_cfg resource master pgsql-cluster pgsql master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
Create the master IP group:
pcs -f pgsql_cfg resource group add master-group vip-master
Create the slave IP group:
pcs -f pgsql_cfg resource group add slave-group vip-slave
Colocate the master group with the master role of pgsql-cluster:
pcs -f pgsql_cfg constraint colocation add master-group with master pgsql-cluster INFINITY
Start master-group only after pgsql-cluster is promoted:
pcs -f pgsql_cfg constraint order promote pgsql-cluster then start master-group symmetrical=false score=INFINITY
Stop master-group when pgsql-cluster is demoted:
pcs -f pgsql_cfg constraint order demote pgsql-cluster then stop master-group symmetrical=false score=0
Colocate the slave group with the slave role of pgsql-cluster:
pcs -f pgsql_cfg constraint colocation add slave-group with slave pgsql-cluster INFINITY
Start slave-group after pgsql-cluster is promoted:
pcs -f pgsql_cfg constraint order promote pgsql-cluster then start slave-group symmetrical=false score=INFINITY
Stop slave-group when pgsql-cluster is demoted:
pcs -f pgsql_cfg constraint order demote pgsql-cluster then stop slave-group symmetrical=false score=0
Push the staged configuration to the live CIB:
pcs cluster cib-push pgsql_cfg
To modify the cluster configuration later, edit the CIB directly:
cibadmin --query > tmp.xml
vim tmp.xml
cibadmin --replace --xml-file tmp.xml
Check the cluster status:
pcs status corosync
pcs status
The pgsql1 node is the master.
Connect to the database through the master virtual IP:
psql -U postgres -h 192.168.0.10
Create a database and a table:
CREATE DATABASE db;
\c db
CREATE TABLE tb (
id int NOT NULL,
name varchar(255) NULL,
PRIMARY KEY (id)
);
Insert a row:
INSERT INTO tb (id,name) VALUES (1,'MySQL');
Query the data:
SELECT * FROM tb;
\q
After simulating a failure of the master (for example, by stopping the cluster services on pgsql1), check the cluster status on any healthy node:
pcs status corosync
pcs status
The pgsql3 node is now the master.
Connect through the virtual IP again:
psql -U postgres -h 192.168.0.10
Insert another row:
\c db
INSERT INTO tb (id,name) VALUES (2,'Redis');
Query the data:
SELECT * FROM tb;
\q
Reads and writes both work.
After the original master recovers, it fails to rejoin because of a leftover lock file (the pgsql agent keeps PGSQL.lock on the promoted node to guard against starting with stale data). Delete the lock file and clear the resource state and failure counts:
rm -rf /var/lib/pgsql/tmp/PGSQL.lock
pcs resource cleanup
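After cleanup the recovered node should rejoin as a standby, which can be verified with:
pcs status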
That wraps up this hands-on build of a high-availability cluster on the Pacemaker+Corosync architecture.