PG高可用之repmgr（1主2从+1witness）+Pgpool-II实现主从切换+读写分离+负载均衡

小麦苗DBA宝典

发布于 2022-11-07 19:22:08

1.6K0

发布于 2022-11-07 19:22:08

一、总体架构
二、环境申请
三、配置4台主机之间的互信
四、配置repmgr
  4.1、主库修改pg_hba.conf参数文件
  4.2、主库修改postgresql.conf参数文件
  4.3、主库创建相关用户和数据库
  4.4、4个节点分别修改repmgr.conf
  4.5、在主库注册主库服务
  4.6、克隆备库1和备库2
    4.6.1、4个节点都配~/.pgpass密码文件
    4.6.2、克隆备库1
    4.6.3、克隆备库2
    4.6.4、注册从库服务
  4.7、配置witness
    4.7.1、初始化witness数据库
      4.7.1.1、初始化并启动数据库
      4.7.1.2、创建相关用户和数据库
    4.7.2、注册为witness节点
五、检查repmgr集群
六、主从切换
  6.1、switchover正常主从切换
    6.1.1、把备库62变为主库
    6.1.2、把新备库61切换为主库
  6.2、failover切换
    6.2.1、failover异常手工主从切换
    6.2.2、failover异常自动主从切换
七、配置Pgpool-II做读写分离+负载均衡
  7.1、安装Pgpool-II
  7.2、配置pgpool.conf
  7.3、在主库创建用户
  7.4、启动Pgpool-II
  7.5、查询
  7.6、switchover故障转移后，查询读写分离
八、总结

一、总体架构

本文最终需要实现的目标：

1、1主2从流复制

2、读写分离+负载均衡（pgpool-II）

3、主从自动切换（repmgr）

所有的节点：安装操作系统、创建PG用户目录、安装PG软件、安装repmgr

主库：只初始化主库、启动主库归档

二、环境申请

先申请4台机器，6.66节点最后再申请，反正docker容器都是即开即用的。

-- 主库
docker rm -f lhrrepmgr64361
docker run -d --name lhrrepmgr64361 -h lhrrepmgr64361 \
  --net=pg-network --ip 172.72.6.61 \
  -p 64361:5432  \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true lhrbest/lhrpgall:1.0 \
  /usr/sbin/init

docker network connect bridge lhrrepmgr64361


-- 从库1
docker rm -f lhrrepmgr64362
docker run -d --name lhrrepmgr64362 -h lhrrepmgr64362 \
  --net=pg-network --ip 172.72.6.62 \
  -p 64362:5432  \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true lhrbest/lhrpgall:1.0 \
  /usr/sbin/init

docker network connect bridge lhrrepmgr64362




-- 从库2
docker rm -f lhrrepmgr64363
docker run -d --name lhrrepmgr64363 -h lhrrepmgr64363 \
  --net=pg-network --ip 172.72.6.63 \
  -p 64363:5432  \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true lhrbest/lhrpgall:1.0 \
  /usr/sbin/init

docker network connect bridge lhrrepmgr64363



-- Witness库
docker rm -f lhrrepmgr64364
docker run -d --name lhrrepmgr64364 -h lhrrepmgr64364 \
  --net=pg-network --ip 172.72.6.64 \
  -p 64364:5432  \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true lhrbest/lhrpgall:1.0 \
  /usr/sbin/init

docker network connect bridge lhrrepmgr64364



-- 重启4台机器
docker restart lhrrepmgr64361 lhrrepmgr64362 lhrrepmgr64363 lhrrepmgr64364


[root@docker35 ~]# docker ps | grep repmgr   
31d3d31c1073        lhrbest/lhrpgall:1.0                       "/usr/sbin/init"         41 hours ago        Up 41 hours         5433-5435/tcp, 0.0.0.0:64364->5432/tcp                                                               lhrrepmgr64364
572d4ea2c072        lhrbest/lhrpgall:1.0                       "/usr/sbin/init"         41 hours ago        Up 41 hours         5433-5435/tcp, 0.0.0.0:64363->5432/tcp                                                               lhrrepmgr64363
6ded416b2016        lhrbest/lhrpgall:1.0                       "/usr/sbin/init"         41 hours ago        Up 41 hours         5433-5435/tcp, 0.0.0.0:64362->5432/tcp                                                               lhrrepmgr64362
fa38b58b8f3d        lhrbest/lhrpgall:1.0                       "/usr/sbin/init"         41 hours ago        Up 41 hours         5433-5435/tcp, 0.0.0.0:64361->5432/tcp                                                               lhrrepmgr64361
[root@docker35 ~]#

👉 注意： 1、镜像lhrbest/lhrpgall:1.0中，申请的4台主机均已安装好PG 13和repmgr软件 2、PG安装方式为编译安装，数据库已初始化完成，用户为pg13 3、该镜像里已安装了PG13、PG12、PG11、PG10和PG9.6这几个版本，均为编译安装，本文使用PG13来操作

三、配置4台主机之间的互信

-- 只在主库61节点上，以root用户执行：
chmod +x sshUserSetup.sh
./sshUserSetup.sh -user pg13  -hosts "172.72.6.61 172.72.6.62 172.72.6.63 172.72.6.64" -advanced exverify -confirm

-- 在每台主机修改权限，否则报错：Bad owner or permissions on /home/pg13/.ssh/config
chmod 600 /home/pg13/.ssh/config

依次输入：yes，回车和密码即可。

脚本下载：https://share.weiyun.com/57HUxNi

验证：

ssh pg13@lhrrepmgr64361 date
ssh pg13@lhrrepmgr64362 date
ssh pg13@lhrrepmgr64363 date
ssh pg13@lhrrepmgr64364 date

第二次执行时不再提示输入yes，并且可以成功执行命令，则表示SSH对等性配置成功。

执行过程：

[pg13@lhrrepmgr64361 /]# ssh pg13@lhrrepmgr64362 date
Tue Apr 27 17:15:04 CST 2021
[root@lhrrepmgr64361 /]# 
[pg13@lhrrepmgr64361 /]# ssh pg13@lhrrepmgr64363 date
Tue Apr 27 17:15:08 CST 2021
[root@lhrrepmgr64361 /]# 
[pg13@lhrrepmgr64361 /]# ssh pg13@lhrrepmgr64364 date
Tue Apr 27 17:15:13 CST 2021
[pg13@lhrrepmgr64361 /]#

四、配置repmgr

4.1、主库修改pg_hba.conf参数文件

-- 进入主库
docker exec -it lhrrepmgr64361 bash

su - pg13

cat >> /pg13/pgdata/pg_hba.conf <<"EOF"

local repmgr repmgr md5
host repmgr repmgr 127.0.0.1/32 md5
host repmgr repmgr 172.72.6.0/24 md5

local replication repmgr md5
host replication repmgr 127.0.0.1/32 md5
host replication repmgr 172.72.6.0/24 md5

EOF

4.2、主库修改postgresql.conf参数文件

-- 修改参数
cat >> /pg13/pgdata/postgresql.conf <<"EOF"

# 归档参数
wal_level='replica'
archive_mode='on'
archive_command='test ! -f /pg13/archive/%f && cp %p /pg13/archive/%f'
restore_command='cp /pg13/archive/%f %p'

# 主从流复制
hot_standby=on
max_wal_senders=10
wal_sender_timeout=60s
wal_keep_size=16MB

# 主从切换参数，启用PG数据库的复制槽，PG12不需要"use_replication_slots=true"这个参数了。
max_replication_slots=10
wal_log_hints=on

# 自动切换
shared_preload_libraries ='repmgr'

EOF


-- 重启
pg_ctl start

-- 查询
psql -U postgres -h 192.168.66.35 -p 64361
select * from pg_settings where name in ('wal_level','archive_mode','archive_command');


-- 切换归档
select pg_switch_wal();

4.3、主库创建相关用户和数据库

-- 创建相关用户和数据库
su - pg13
createuser -s repmgr
createdb repmgr -O repmgr

psql -h 127.0.0.1 -c "alter user repmgr with password 'lhr';"
psql -h 127.0.0.1 -c "alter user repmgr set search_path to repmgr, \"\$user\",public;"

结果：

[root@lhrrepmgr64361 /]# systemctl start pg13
[root@lhrrepmgr64361 /]# systemctl status pg13
● pg13.service - PostgreSQL database server
   Loaded: loaded (/etc/systemd/system/pg13.service; disabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-04-27 16:25:24 CST; 6s ago
     Docs: man:postgres(1)
  Process: 769 ExecStart=/pg13/pg13/bin/pg_ctl start -D ${PGDATA} -s -o -p ${PGPORT} -w -t 300 (code=exited, status=0/SUCCESS)
 Main PID: 771 (postgres)
   CGroup: /docker/a777ef12d5ff83a9d47be51a98531bd45b42d2b008f7a25f894f3244ce9cc0d4/system.slice/pg13.service
           ├─771 /pg13/pg13/bin/postgres -D /pg13/pgdata -p 5432
           ├─772 postgres: logger 
           ├─774 postgres: checkpointer 
           ├─775 postgres: background writer 
           ├─776 postgres: walwriter 
           ├─777 postgres: autovacuum launcher 
           ├─778 postgres: stats collector 
           └─779 postgres: logical replication launcher 

Apr 27 16:25:23 lhrrepmgr64361 systemd[1]: Starting PostgreSQL database server...
Apr 27 16:25:24 lhrrepmgr64361 pg_ctl[769]: 2021-04-27 16:25:24.024 CST [771] LOG:  redirecting log output to logging collector process
Apr 27 16:25:24 lhrrepmgr64361 pg_ctl[769]: 2021-04-27 16:25:24.024 CST [771] HINT:  Future log output will appear in directory "pg_log".
Apr 27 16:25:24 lhrrepmgr64361 systemd[1]: Started PostgreSQL database server.

[root@lhrrepmgr64361 /]# su - pg13
Last login: Tue Apr 27 16:24:50 CST 2021 on pts/0
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ createuser -s repmgr
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ createdb repmgr -O repmgr
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ psql -h 127.0.0.1 -c "alter user repmgr with password 'lhr'"
ALTER ROLE
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ psql -h 127.0.0.1 -c "alter user repmgr set search_path to repmgr, \"\$user\",public";
ALTER ROLE


C:\Users\lhrxxt>psql -U postgres -h 192.168.66.35 -p 64361
Password for user postgres:
psql (13.2)
Type "help" for help.

postgres=# \du
                                   List of roles
 Role name |                         Attributes                         | Member of
-----------+------------------------------------------------------------+-----------
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
 repmgr    | Superuser, Create role, Create DB                          | {}


postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 repmgr    | repmgr   | UTF8     | en_US.utf8 | en_US.utf8 |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(4 rows)

4.4、4个节点分别修改repmgr.conf

-- 以pg13用户修改
su - pg13

-- 主库
cat > /pg13/pg13/repmgr.conf << "EOF"
node_id=1
node_name=lhrrepmgr64361
conninfo='host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2'
data_directory='/pg13/pgdata'
pg_bindir='/pg13/pg13/bin'
EOF


-- 从库1
cat > /pg13/pg13/repmgr.conf << "EOF"
node_id=2
node_name=lhrrepmgr64362
conninfo='host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2'
data_directory='/pg13/pgdata'
pg_bindir='/pg13/pg13/bin'
EOF

-- 从库2
cat > /pg13/pg13/repmgr.conf << "EOF"
node_id=3
node_name=lhrrepmgr64363
conninfo='host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2'
data_directory='/pg13/pgdata'
pg_bindir='/pg13/pg13/bin'
EOF


-- witness节点
cat > /pg13/pg13/repmgr.conf << "EOF"
node_id=4
node_name=lhrrepmgr64364
conninfo='host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2'
data_directory='/pg13/pgdata'
pg_bindir='/pg13/pg13/bin'
EOF




-- 测试
psql 'host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2'
psql 'host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2'
psql 'host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2'
psql 'host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2'

4.5、在主库注册主库服务

-- 注册服务
repmgr -f /pg13/pg13/repmgr.conf primary register

-- 查询
repmgr -f /pg13/pg13/repmgr.conf cluster show
psql -U repmgr -h 192.168.66.35 -p 64361 -d repmgr

执行过程：

[pg13@lhrrepmgr64361 pg13]$ repmgr -f /pg13/pg13/repmgr.conf primary register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
[pg13@lhrrepmgr64361 pg13]$ 
[pg13@lhrrepmgr64361 pg13]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | * running |          | default  | 100      | 1        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2


C:\Users\lhrxxt>psql -U repmgr -h 192.168.66.35 -p 64361 -d repmgr
Password for user repmgr:
psql (13.2)
Type "help" for help.

repmgr=# select * from repmgr.nodes;
 node_id | upstream_node_id | active |   node_name    |  type   | location | priority |                                 conninfo                                  | repluser | slot_name |      config_file
---------+------------------+--------+----------------+---------+----------+----------+---------------------------------------------------------------------------+----------+-----------+------------------------
       1 |                  | t      | lhrrepmgr64361 | primary | default  |      100 | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr   |           | /pg13/pg13/repmgr.conf
(1 row)

4.6、克隆备库1和备库2

4.6.1、4个节点都配~/.pgpass密码文件

su - pg13
echo "*:*:*:repmgr:lhr" > ~/.pgpass
chmod 0600 ~/.pgpass



-- 测试，若不需要密码，那就证明配置正确
psql 'host=172.72.6.61 user=repmgr dbname=repmgr connect_timeout=2'
psql 'host=172.72.6.62 user=repmgr dbname=repmgr connect_timeout=2'
psql 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2'
psql 'host=172.72.6.64 user=repmgr dbname=repmgr connect_timeout=2'

4.6.2、克隆备库1

-- 其中--dry-run表示不真实执行克隆过程，只是检查是否有问题
repmgr -h 172.72.6.61 -U repmgr -d repmgr -f /pg13/pg13/repmgr.conf standby clone --force --dry-run

-- 执行克隆命令，其实还是调用pg_basebackup
repmgr -h 172.72.6.61 -U repmgr -d repmgr -f /pg13/pg13/repmgr.conf standby clone --force 


-- 启动
pg_ctl -D /pg13/pgdata start

-- 查询
psql
select * from pg_stat_wal_receiver;

执行过程：

[pg13@lhrrepmgr64362 ~]$ repmgr -h 172.72.6.61 -U repmgr -d repmgr -f /pg13/pg13/repmgr.conf standby clone --force --dry-run
NOTICE: destination directory "/pg13/pgdata" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.72.6.61 user=repmgr dbname=repmgr
DETAIL: current installation size is 29 MB
INFO: "repmgr" extension is installed in database "repmgr"
WARNING: target data directory appears to be a PostgreSQL data directory
DETAIL: target data directory is "/pg13/pgdata"
HINT: use -F/--force to overwrite the existing data directory
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
WARNING: data checksums are not enabled and "wal_log_hints" is "off"
DETAIL: pg_rewind requires "wal_log_hints" to be enabled
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met
[pg13@lhrrepmgr64362 ~]$ repmgrrepmgr -h 172.72.6.61 -U repmgr -d repmgr -f /pg13/pg13/repmgr.conf standby clone^C
[pg13@lhrrepmgr64362 ~]$ repmgr -h 172.72.6.61 -U repmgr -d repmgr -f /pg13/pg13/repmgr.conf standby clone
NOTICE: destination directory "/pg13/pgdata" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.72.6.61 user=repmgr dbname=repmgr
DETAIL: current installation size is 29 MB
ERROR: target data directory appears to be a PostgreSQL data directory
DETAIL: target data directory is "/pg13/pgdata"
HINT: use -F/--force to overwrite the existing data directory
[pg13@lhrrepmgr64362 ~]$ repmgr -h 172.72.6.61 -U repmgr -d repmgr -f /pg13/pg13/repmgr.conf standby clone --force
NOTICE: destination directory "/pg13/pgdata" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.72.6.61 user=repmgr dbname=repmgr
DETAIL: current installation size is 29 MB
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: data checksums are not enabled and "wal_log_hints" is "off"
DETAIL: pg_rewind requires "wal_log_hints" to be enabled
WARNING: directory "/pg13/pgdata" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/pg13/pgdata"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /pg13/pg13/bin/pg_basebackup -l "repmgr base backup"  -D /pg13/pgdata -h 172.72.6.61 -p 5432 -U repmgr -X stream 
Password: 
WARNING:  skipping special file "./.s.PGSQL.5432"
WARNING:  skipping special file "./.s.PGSQL.5432"
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /pg13/pgdata start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

[pg13@lhrrepmgr64362 ~]$ pg_ctl -D /pg13/pgdata start
waiting for server to start....2021-04-27 19:14:08.750 CST [9298] LOG:  redirecting log output to logging collector process
2021-04-27 19:14:08.750 CST [9298] HINT:  Future log output will appear in directory "pg_log".
 done
server started

[pg13@lhrrepmgr64362 ~]$ psql
psql (13.2)
Type "help" for help.


postgres=# select * from pg_stat_wal_receiver;
 pid  |  status   | receive_start_lsn | receive_start_tli | written_lsn | flushed_lsn | received_tli |      last_msg_send_time       |     last_msg_receipt_time     | latest_end_lsn |        latest_end_time        | slot_name | sender_host | sender_port |                                                                                                                                                           conninfo                                                                                                                                                            
------+-----------+-------------------+-------------------+-------------+-------------+--------------+-------------------------------+-------------------------------+----------------+-------------------------------+-----------+-------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 9304 | streaming | 0/3000000         |                 1 | 0/30001F0   | 0/30001F0   |            1 | 2021-04-27 19:15:08.862599+08 | 2021-04-27 19:15:08.862719+08 | 0/30001F0      | 2021-04-27 19:14:08.830865+08 |           | 172.72.6.61 |        5432 | user=repmgr password=******** channel_binding=disable connect_timeout=2 dbname=replication host=172.72.6.61 port=5432 application_name=lhrrepmgr64362 fallback_application_name=walreceiver sslmode=disable sslcompression=0 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres target_session_attrs=any
(1 row)

信息“NOTICE: standby clone (using pg_basebackup) complete”表示测试完成。

4.6.3、克隆备库2

-- 其中--dry-run表示不真实执行克隆过程，只是检查是否有问题
repmgr -h 172.72.6.61 -U repmgr -d repmgr -f /pg13/pg13/repmgr.conf standby clone --force --dry-run
repmgr -h 172.72.6.61 -U repmgr -d repmgr -f /pg13/pg13/repmgr.conf standby clone --force 


-- 启动
pg_ctl -D /pg13/pgdata start

-- 查询
psql
select * from pg_stat_wal_receiver;

4.6.4、注册从库服务

-- 分别在2个从库注册
repmgr -f /pg13/pg13/repmgr.conf standby register --force
repmgr -f /pg13/pg13/repmgr.conf cluster show

执行过程：

-- 从库2
[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby register
INFO: connecting to local node "lhrrepmgr64362" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "lhrrepmgr64362" (ID: 2) successfully registered
[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | * running |                | default  | 100      | 1        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64361 | default  | 100      | 1        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2


-- 从库3
[pg13@lhrrepmgr64363 pgdata]$ repmgr -f /pg13/pg13/repmgr.conf standby register
INFO: connecting to local node "lhrrepmgr64363" (ID: 3)
repmgr -f /pg13/pg13/repmgr.conf cluster showINFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "lhrrepmgr64363" (ID: 3) successfully registered
[pg13@lhrrepmgr64363 pgdata]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | * running |                | default  | 100      | 1        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64361 | default  | 100      | 1        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64361 | default  | 100      | 1        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2

4.7、配置witness

4.7.1、初始化witness数据库

witness节点不能克隆，只能初始化数据库。

4.7.1.1、初始化并启动数据库

-- 需要重新初始化环境
rm -rf /pg13/pgdata
/pg13/pg13/bin/initdb -D /pg13/pgdata -E UTF8 --locale=en_US.utf8 -U postgres


-- 修改参数postgresql.conf
cat >> /pg13/pgdata/postgresql.conf <<"EOF"

listen_addresses = '*'
port=5432
unix_socket_directories='/pg13/pgdata'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on

# 归档参数
wal_level='replica'
archive_mode='on'
archive_command='test ! -f /pg13/archive/%f && cp %p /pg13/archive/%f'
restore_command='cp /pg13/archive/%f %p'

# 主从流复制
hot_standby=on
max_wal_senders=10
wal_sender_timeout=60s
wal_keep_size=16MB

# 主从切换参数，启用PG数据库的复制槽，PG12不需要"use_replication_slots=true"这个参数了。
max_replication_slots=10
wal_log_hints=on

# 自动切换
shared_preload_libraries ='repmgr'

EOF


-- 修改参数pg_hba.conf
cat >> /pg13/pgdata/pg_hba.conf <<"EOF"

# TYPE  DATABASE    USER    ADDRESS       METHOD
local     all       all                    trust
host      all       all   127.0.0.1/32     trust
host      all       all    0.0.0.0/0        md5
host   replication  all    0.0.0.0/0        md5

local repmgr repmgr md5
host repmgr repmgr 127.0.0.1/32 md5
host repmgr repmgr 172.72.6.0/24 md5

local replication repmgr md5
host replication repmgr 127.0.0.1/32 md5
host replication repmgr 172.72.6.0/24 md5

EOF


-- 启动数据库
/pg13/pg13/bin/pg_ctl -D /pg13/pgdata -l logfile start

4.7.1.2、创建相关用户和数据库

su - pg13
createuser -s repmgr
createdb repmgr -O repmgr


psql -h 127.0.0.1 -c "alter user repmgr with password 'lhr';"
psql -h 127.0.0.1 -c "alter user repmgr set search_path to repmgr, \"\$user\",public;"

4.7.2、注册为witness节点

repmgr -f /pg13/pg13/repmgr.conf witness register -h 172.72.6.61 -U repmgr -d repmgr  --force
repmgr -f /pg13/pg13/repmgr.conf cluster show

执行过程：

[pg13@lhrrepmgr64364 ~]$ repmgr -f /pg13/pg13/repmgr.conf witness register -h 172.72.6.61 -U repmgr -d repmgr  --force
INFO: connecting to witness node "lhrrepmgr64364" (ID: 4)
INFO: connecting to primary node
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
INFO: witness registration complete
NOTICE: witness node "lhrrepmgr64364" (ID: 4) successfully registered

-- 所有节点查询都可以获取如下的结果
[pg13@lhrrepmgr64364 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | * running |                | default  | 100      | 1        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64361 | default  | 100      | 1        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64361 | default  | 100      | 1        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64361 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2


repmgr=# select * from pg_stat_replication;
 pid  | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
------+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 7270 |    16385 | repmgr  | lhrrepmgr64362   | 172.72.6.62 |                 |       54710 | 2021-04-28 19:07:18.248872+08 |              | streaming | 0/7DF703C0 | 0/7DF703C0 | 0/7DF703C0 | 0/7DF703C0 |           |           |            |             0 | async      | 2021-04-29 12:06:07.924676+08
 7276 |    16385 | repmgr  | lhrrepmgr64363   | 172.72.6.63 |                 |       54888 | 2021-04-28 19:07:18.678135+08 |              | streaming | 0/7DF703C0 | 0/7DF703C0 | 0/7DF703C0 | 0/7DF703C0 |           |           |            |             0 | async      | 2021-04-29 12:06:07.956466+08
(2 rows)

五、检查repmgr集群

repmgr -f /pg13/pg13/repmgr.conf  cluster matrix 
repmgr -f /pg13/pg13/repmgr.conf  cluster crosscheck 

repmgr -f /pg13/pg13/repmgr.conf node status
repmgr -f /pg13/pg13/repmgr.conf node check

一些结果：

[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf  cluster matrix 
INFO: connecting to database
 Name           | ID | 1 | 2 | 3 | 4
----------------+----+---+---+---+---
 lhrrepmgr64361 | 1  | * | * | * | * 
 lhrrepmgr64362 | 2  | * | * | * | * 
 lhrrepmgr64363 | 3  | * | * | * | * 
 lhrrepmgr64364 | 4  | * | * | * | * 

[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf node status
Node "lhrrepmgr64361":
        PostgreSQL version: 13.2
        Total data size: 37 MB
        Conninfo: host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
        Role: primary
        WAL archiving: enabled
        Archive command: test ! -f /pg13/archive/%f && cp %p /pg13/archive/%f
        WALs pending archiving: 0 pending files
        Replication connections: 2 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Replication lag: n/a

[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf node check
Node "lhrrepmgr64361":
        Server role: OK (node is primary)
        Replication lag: OK (N/A - node is primary)
        WAL archiving: OK (0 pending archive ready files)
        Upstream connection: OK (N/A - node is primary)
        Downstream servers: OK (2 of 2 downstream nodes attached)
        Replication slots: OK (node has no physical replication slots)
        Missing physical replication slots: OK (node has no missing physical replication slots)
        Configured data directory: OK (configured "data_directory" is "/pg13/pgdata")



[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf node status
Node "lhrrepmgr64362":
        PostgreSQL version: 13.2
        Total data size: 37 MB
        Conninfo: host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
        Role: standby
        WAL archiving: disabled (on standbys "archive_mode" must be set to "always" to be effective)
        Archive command: test ! -f /pg13/archive/%f && cp %p /pg13/archive/%f
        WALs pending archiving: 0 pending files
        Replication connections: 0 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Upstream node: lhrrepmgr64361 (ID: 1)
        Replication lag: 0 seconds
        Last received LSN: 0/8001480
        Last replayed LSN: 0/8001480

[pg13@lhrrepmgr64364 ~]$ repmgr -f /pg13/pg13/repmgr.conf node status
Node "lhrrepmgr64364":
        PostgreSQL version: 13.2
        Total data size: 29 MB
        Conninfo: host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2
        Role: witness
        WAL archiving: enabled
        Archive command: test ! -f /pg13/archive/%f && cp %p /pg13/archive/%f
        WALs pending archiving: 0 pending files
        Replication connections: 0 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Replication lag: n/a

六、主从切换

6.1、switchover正常主从切换

switchover切换即主变备，备变主。

官网：https://repmgr.org/docs/current/repmgr-standby-switchover.html

6.1.1、把备库62变为主库

在62上操作：

repmgr -f /pg13/pg13/repmgr.conf cluster show
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
repmgr -f /pg13/pg13/repmgr.conf cluster show


-- 可以debug打印详细的切换过程
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind --log-level DEBUG --verbose

--siblings-follow 表示所有的从库同步源自动改成最新的主库节点 --force-rewind 如果repmgr检测到需要执行pg_rewind(同步)的时候，在执行pg_rewind之前，在新主节点执行checkpoint

执行过程：

[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | * running |                | default  | 100      | 3        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64361 | default  | 100      | 3        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64361 | default  | 100      | 3        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64361 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2
[pg13@lhrrepmgr64362 ~]$ 
[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
NOTICE: checking switchover on node "lhrrepmgr64362" (ID: 2) in --dry-run mode
INFO: prerequisites for using pg_rewind are met
INFO: SSH connection to host "172.72.6.61" succeeded
INFO: able to execute "repmgr" on remote host "172.72.6.61"
INFO: all sibling nodes are reachable via SSH
INFO: 3 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: local node "lhrrepmgr64362" (ID: 2) would be promoted to primary; current primary "lhrrepmgr64361" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "lhrrepmgr64361":
  "/pg13/pg13/bin/pg_ctl  -D '/pg13/pgdata' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
[pg13@lhrrepmgr64362 ~]$ 
[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
NOTICE: executing switchover on node "lhrrepmgr64362" (ID: 2)
NOTICE: local node "lhrrepmgr64362" (ID: 2) will be promoted to primary; current primary "lhrrepmgr64361" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "lhrrepmgr64361" (ID: 1)
NOTICE: issuing CHECKPOINT on node "lhrrepmgr64361" (ID: 1) 
DETAIL: executing server command "/pg13/pg13/bin/pg_ctl  -D '/pg13/pgdata' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/9000028
NOTICE: promoting standby to primary
DETAIL: promoting server "lhrrepmgr64362" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "lhrrepmgr64362" (ID: 2) was successfully promoted to primary
NOTICE: issuing CHECKPOINT on node "lhrrepmgr64362" (ID: 2) 
ERROR: unable to execute CHECKPOINT
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/9000028; rejoin target node's fork point: 0/90000A0
NOTICE: setting node 1  upstream to node 2
WARNING: unable to ping "host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/pg13/pg13/bin/pg_ctl  -w -D '/pg13/pgdata' start"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
NOTICE: node  "lhrrepmgr64362" (ID: 2) promoted to primary, node "lhrrepmgr64361" (ID: 1) demoted to standby
NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "lhrrepmgr64362" is now primary and node "lhrrepmgr64361" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
[pg13@lhrrepmgr64362 ~]$ 
[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | standby |   running | lhrrepmgr64362 | default  | 100      | 3        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | primary | * running |                | default  | 100      | 4        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64362 | default  | 100      | 3        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64362 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2


C:\Users\lhrxxt>psql -U repmgr -h192.168.66.35 -p64361 -d repmgr
Password for user repmgr:
psql (13.2)
Type "help" for help.

repmgr=# select * from nodes;
 node_id | upstream_node_id | active |   node_name    |  type   | location | priority |                                 conninfo                                  | repluser | slot_name |      config_file
---------+------------------+--------+----------------+---------+----------+----------+---------------------------------------------------------------------------+----------+-----------+------------------------
       2 |                  | t      | lhrrepmgr64362 | primary | default  |      100 | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr   |           | /pg13/pg13/repmgr.conf
       1 |                2 | t      | lhrrepmgr64361 | standby | default  |      100 | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr   |           | /pg13/pg13/repmgr.conf
       3 |                2 | t      | lhrrepmgr64363 | standby | default  |      100 | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr   |           | /pg13/pg13/repmgr.conf
       4 |                2 | t      | lhrrepmgr64364 | witness | default  |        0 | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2 | repmgr   |           | /pg13/pg13/repmgr.conf
(4 rows)

可以看到，主库变为lhrrepmgr64362，而lhrrepmgr64361变为了从库。

6.1.2、把新备库61切换为主库

-- 在61上做操作
repmgr -f /pg13/pg13/repmgr.conf cluster show
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind

-- 可以debug打印详细的切换过程
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind --log-level DEBUG --verbose

执行过程：

[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | standby |   running | lhrrepmgr64362 | default  | 100      | 4        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | primary | * running |                | default  | 100      | 4        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64362 | default  | 100      | 4        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64362 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --dry-run --force-rewind
NOTICE: checking switchover on node "lhrrepmgr64361" (ID: 1) in --dry-run mode
INFO: prerequisites for using pg_rewind are met
INFO: SSH connection to host "172.72.6.62" succeeded
INFO: able to execute "repmgr" on remote host "172.72.6.62"
INFO: all sibling nodes are reachable via SSH
INFO: 3 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: local node "lhrrepmgr64361" (ID: 1) would be promoted to primary; current primary "lhrrepmgr64362" (ID: 2) would be demoted to standby
INFO: following shutdown command would be run on node "lhrrepmgr64362":
  "/pg13/pg13/bin/pg_ctl  -D '/pg13/pgdata' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
NOTICE: executing switchover on node "lhrrepmgr64361" (ID: 1)
NOTICE: local node "lhrrepmgr64361" (ID: 1) will be promoted to primary; current primary "lhrrepmgr64362" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "lhrrepmgr64362" (ID: 2)
NOTICE: issuing CHECKPOINT on node "lhrrepmgr64362" (ID: 2) 
DETAIL: executing server command "/pg13/pg13/bin/pg_ctl  -D '/pg13/pgdata' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/A000028
NOTICE: promoting standby to primary
DETAIL: promoting server "lhrrepmgr64361" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "lhrrepmgr64361" (ID: 1) was successfully promoted to primary
NOTICE: issuing CHECKPOINT on node "lhrrepmgr64361" (ID: 1) 
ERROR: unable to execute CHECKPOINT
INFO: local node 2 can attach to rejoin target node 1
DETAIL: local node's recovery point: 0/A000028; rejoin target node's fork point: 0/A0000A0
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/pg13/pg13/bin/pg_ctl  -w -D '/pg13/pgdata' start"
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
NOTICE: node  "lhrrepmgr64361" (ID: 1) promoted to primary, node "lhrrepmgr64362" (ID: 2) demoted to standby
NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "lhrrepmgr64361" is now primary and node "lhrrepmgr64362" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | * running |                | default  | 100      | 5        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64361 | default  | 100      | 4        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64361 | default  | 100      | 4        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64361 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2

6.2、failover切换

6.2.1、failover异常手工主从切换

主库出现故障，然后直接将从库提升为主库：

-- 主库61宕机
pg_ctl -m fast stop

-- 从库63提升为主库
repmgr -f /pg13/pg13/repmgr.conf cluster show
repmgr -f /pg13/pg13/repmgr.conf --siblings-follow standby promote
repmgr -f /pg13/pg13/repmgr.conf cluster show

-- 等原主库61修复之后，由于新提升的主库63数据已经很新了，所以只能把61作为备库来运行
pg_ctl stop
repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --dry-run --verbose
repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose

执行过程：

-- 主库61宕机
[pg13@lhrrepmgr64361 ~]$ pg_ctl -m fast stop
waiting for server to shut down.... done
server stopped
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
ERROR: connection to database failed
DETAIL: 
could not connect to server: Connection refused
        Is the server running on host "172.72.6.61" and accepting
        TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=repmgr password=lhr connect_timeout=2 dbname=repmgr host=172.72.6.61 fallback_application_name=repmgr options=-csearch_path=



-- 从库63提升为主库  
[pg13@lhrrepmgr64363 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status        | Upstream         | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+---------------+------------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | ? unreachable | ?                | default  | 100      |          | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running     | ? lhrrepmgr64361 | default  | 100      | 5        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running     | ? lhrrepmgr64361 | default  | 100      | 5        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running     | ? lhrrepmgr64361 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "lhrrepmgr64361" (ID: 1)
  - node "lhrrepmgr64361" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "lhrrepmgr64362" (ID: 2)'s upstream node "lhrrepmgr64361" (ID: 1)
  - unable to determine if node "lhrrepmgr64362" (ID: 2) is attached to its upstream node "lhrrepmgr64361" (ID: 1)
  - unable to connect to node "lhrrepmgr64363" (ID: 3)'s upstream node "lhrrepmgr64361" (ID: 1)
  - unable to determine if node "lhrrepmgr64363" (ID: 3) is attached to its upstream node "lhrrepmgr64361" (ID: 1)
  - unable to connect to node "lhrrepmgr64364" (ID: 4)'s upstream node "lhrrepmgr64361" (ID: 1)'

HINT: execute with --verbose option to see connection error messages
[pg13@lhrrepmgr64363 ~]$ repmgr -f /pg13/pg13/repmgr.conf --siblings-follow standby promote
NOTICE: promoting standby to primary
DETAIL: promoting server "lhrrepmgr64363" (ID: 3) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "lhrrepmgr64363" (ID: 3) was successfully promoted to primary
NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
[pg13@lhrrepmgr64363 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show                     
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | - failed  | ?              | default  | 100      |          | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64363 | default  | 100      | 5        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | primary | * running |                | default  | 100      | 6        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64363 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "lhrrepmgr64361" (ID: 1)

HINT: execute with --verbose option to see connection error messages

修复好61后，再启动61数据库，作为63的备库：

[pg13@lhrrepmgr64361 pg13]$  repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind  --verbose 
NOTICE: using provided configuration file "/pg13/pg13/repmgr.conf"
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 3
DETAIL: rejoin target server  timeline 6 forked off current database system timeline 5 before current recovery point 1/1C000028
INFO: prerequisites for using pg_rewind are met
INFO: 0 files copied to "/tmp/repmgr-config-archive-lhrrepmgr64361"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/pg13/pg13/bin/pg_rewind -D '/pg13/pgdata' --source-server='host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2'"
ERROR: pg_rewind execution failed
DETAIL: pg_rewind: servers diverged at WAL location 1/190000A0 on timeline 5
pg_rewind: rewinding from last common checkpoint at 1/19000028 on timeline 5
pg_rewind: error: could not open file "/pg13/pgdata/pg_wal/00000005000000010000001A": No such file or directory
pg_rewind: fatal: could not read WAL record at 1/1A000000

[pg13@lhrrepmgr64361 pg13]$ cp ./archive/00000005000000010000001A /pg13/pgdata/pg_wal/00000005000000010000001A
[pg13@lhrrepmgr64361 pg13]$ repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.63 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind  --verbose 
NOTICE: using provided configuration file "/pg13/pg13/repmgr.conf"
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 3
DETAIL: rejoin target server timeline 6 forked off current database system timeline 5 before current recovery point 1/1C000028
INFO: prerequisites for using pg_rewind are met
INFO: 0 files copied to "/tmp/repmgr-config-archive-lhrrepmgr64361"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/pg13/pg13/bin/pg_rewind -D '/pg13/pgdata' --source-server='host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /pg13/pgdata
INFO: directory "/tmp/repmgr-config-archive-lhrrepmgr64361" deleted
NOTICE: setting node 1's upstream to node 3'
WARNING: unable to ping "host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/pg13/pg13/bin/pg_ctl  -w -D '/pg13/pgdata' start"
INFO: node "lhrrepmgr64361" (ID: 1) is pingable
INFO: node "lhrrepmgr64361" (ID: 1) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 3
[pg13@lhrrepmgr64361 pg13]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | standby |   running | lhrrepmgr64363 | default  | 100      | 5        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64363 | default  | 100      | 6        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | primary | * running |                | default  | 100      | 6        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64363 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2

6.2.2、failover异常自动主从切换

生产建议配置自动failover切换，要配置自动failover，则需要做以下事情，在所有节点均配置：

1、在所有节点都配置：echo "shared_preload_libraries='repmgr' " >>/pg13/pgdata/postgresq.conf，配置后重启PG。该步骤我之前已配置过。

2、在所有节点都配置文件/pg13/pg13/repmgr.conf，增加内容包括：

cat >> /pg13/pg13/repmgr.conf << "EOF"

monitoring_history=yes
monitor_interval_secs=5
failover=automatic
reconnect_attempts=6
reconnect_interval=5
promote_command='repmgr standby promote -f /pg13/pg13/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /pg13/pg13/repmgr.conf --log-to-file --upstream-node-id=%n'
log_level=INFO
log_status_interval=10
log_file='/pg13/pg13/repmgr.log'
EOF



cat >> /etc/logrotate.conf <<"EOF"
/pg13/pg13/repmgr.log {
missingok
compress
rotate 30
daily
dateext
create 0600 pg13 pg13
}
EOF

3、在所有节点都启动repmgrd进程

-- 启动
repmgrd -f /pg13/pg13/repmgr.conf --pid-file /tmp/repmgrd.pid  --daemonize

-- 建议加到开机自动启动:/etc/rc.local
echo "repmgrd -f /pg13/pg13/repmgr.conf --pid-file /tmp/repmgrd.pid --daemonize" >> /etc/rc.local
chmod +x /etc/rc.d/rc.local

-- 停止
kill -9 `cat /tmp/repmgrd.pid`

日志：

[pg13@lhrrepmgr64363 ~]$ tailf /pg13/pg13/repmgr.log
[2021-04-29 16:34:07] [NOTICE] repmgrd (repmgrd 5.2.1) starting up
[2021-04-29 16:34:07] [INFO] connecting to database "host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2"
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2021-04-29 16:34:07] [NOTICE] starting monitoring of node "lhrrepmgr64363" (ID: 3)
[2021-04-29 16:34:07] [INFO] "connection_check_type" set to "ping"
[2021-04-29 16:34:07] [INFO] monitoring connection to upstream node "lhrrepmgr64361" (ID: 1)
[2021-04-29 16:34:17] [INFO] node "lhrrepmgr64363" (ID: 3) monitoring upstream node "lhrrepmgr64361" (ID: 1) in normal state
[2021-04-29 16:34:17] [DETAIL] last monitoring statistics update was 5 seconds ago
[2021-04-29 16:34:27] [INFO] node "lhrrepmgr64363" (ID: 3) monitoring upstream node "lhrrepmgr64361" (ID: 1) in normal state
[2021-04-29 16:34:27] [DETAIL] last monitoring statistics update was 5 seconds ago
[2021-04-29 16:34:37] [INFO] node "lhrrepmgr64363" (ID: 3) monitoring upstream node "lhrrepmgr64361" (ID: 1) in normal state
[2021-04-29 16:34:37] [DETAIL] last monitoring statistics update was 5 seconds ago

接下来，我们宕掉主库，来观察能否发生自动切换：

[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | * running |                | default  | 100      | 9        | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64361 | default  | 100      | 9        | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64361 | default  | 100      | 9        | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64361 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2
[pg13@lhrrepmgr64361 ~]$ pg_ctl stop
waiting for server to shut down.... done
server stopped

观察日志输出：

[2021-04-29 17:03:08] [INFO] witness node "lhrrepmgr64364" (ID: 4) monitoring primary node "lhrrepmgr64361" (ID: 1) in normal state
[2021-04-29 17:03:19] [INFO] witness node "lhrrepmgr64364" (ID: 4) monitoring primary node "lhrrepmgr64361" (ID: 1) in normal state
[2021-04-29 17:03:29] [INFO] witness node "lhrrepmgr64364" (ID: 4) monitoring primary node "lhrrepmgr64361" (ID: 1) in normal state
[2021-04-29 17:03:34] [WARNING] unable to ping "host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2"
[2021-04-29 17:03:34] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-04-29 17:03:34] [INFO] checking state of node 1, 1 of 6 attempts
[2021-04-29 17:03:34] [WARNING] unable to ping "user=repmgr password=lhr connect_timeout=2 dbname=repmgr host=172.72.6.61 fallback_application_name=repmgr"
[2021-04-29 17:03:34] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-04-29 17:03:34] [INFO] sleeping 5 seconds until next reconnection attempt
[2021-04-29 17:03:39] [INFO] checking state of node 1, 2 of 6 attempts
[2021-04-29 17:03:39] [WARNING] unable to ping "user=repmgr password=lhr connect_timeout=2 dbname=repmgr host=172.72.6.61 fallback_application_name=repmgr"
[2021-04-29 17:03:39] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-04-29 17:03:39] [INFO] sleeping 5 seconds until next reconnection attempt
[2021-04-29 17:03:44] [INFO] checking state of node 1, 3 of 6 attempts
[2021-04-29 17:03:44] [WARNING] unable to ping "user=repmgr password=lhr connect_timeout=2 dbname=repmgr host=172.72.6.61 fallback_application_name=repmgr"
[2021-04-29 17:03:44] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-04-29 17:03:44] [INFO] sleeping 5 seconds until next reconnection attempt
[2021-04-29 17:03:49] [INFO] checking state of node 1, 4 of 6 attempts
[2021-04-29 17:03:49] [WARNING] unable to ping "user=repmgr password=lhr connect_timeout=2 dbname=repmgr host=172.72.6.61 fallback_application_name=repmgr"
[2021-04-29 17:03:49] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-04-29 17:03:49] [INFO] sleeping 5 seconds until next reconnection attempt
[2021-04-29 17:03:54] [INFO] checking state of node 1, 5 of 6 attempts
[2021-04-29 17:03:54] [WARNING] unable to ping "user=repmgr password=lhr connect_timeout=2 dbname=repmgr host=172.72.6.61 fallback_application_name=repmgr"
[2021-04-29 17:03:54] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-04-29 17:03:54] [INFO] sleeping 5 seconds until next reconnection attempt
[2021-04-29 17:03:59] [INFO] checking state of node 1, 6 of 6 attempts
[2021-04-29 17:03:59] [WARNING] unable to ping "user=repmgr password=lhr connect_timeout=2 dbname=repmgr host=172.72.6.61 fallback_application_name=repmgr"
[2021-04-29 17:03:59] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-04-29 17:03:59] [WARNING] unable to reconnect to node 1 after 6 attempts
[2021-04-29 17:03:59] [NOTICE] witness node "lhrrepmgr64364" (ID: 4) now following new primary node "lhrrepmgr64362" (ID: 2)
[2021-04-29 17:03:59] [INFO] resuming witness monitoring mode
[2021-04-29 17:03:59] [DETAIL] following new primary "lhrrepmgr64362" (ID: 2)
[2021-04-29 17:03:59] [INFO] witness monitoring connection to primary node "lhrrepmgr64362" (ID: 2)
[2021-04-29 17:04:09] [INFO] witness node "lhrrepmgr64364" (ID: 4) monitoring primary node "lhrrepmgr64362" (ID: 2) in normal state
[2021-04-29 17:04:19] [INFO] witness node "lhrrepmgr64364" (ID: 4) monitoring primary node "lhrrepmgr64362" (ID: 2) in normal state
[2021-04-29 17:04:29] [INFO] witness node "lhrrepmgr64364" (ID: 4) monitoring primary node "lhrrepmgr64362" (ID: 2) in normal state

可以看到，在尝试6次连接后，发生了主从切换，主库变为了lhrrepmgr64362。

查看状态：

[pg13@lhrrepmgr64362 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | - failed  | ?              | default  | 100      |          | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | primary | * running |                | default  | 100      | 12       | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64362 | default  | 100      | 12       | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64362 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "lhrrepmgr64361" (ID: 1)

HINT: execute with --verbose option to see connection error messages

接下来，等原主库61修复后，我们需要加入到现有的环境中：

[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf node rejoin -d 'host=172.72.6.62 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose
NOTICE: using provided configuration file "/pg13/pg13/repmgr.conf"
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 12 forked off current database system timeline 11 before current recovery point 1/22000028
INFO: prerequisites for using pg_rewind are met
INFO: 0 files copied to "/tmp/repmgr-config-archive-lhrrepmgr64361"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/pg13/pg13/bin/pg_rewind -D '/pg13/pgdata' --source-server='host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2'"
NOTICE: 0 files copied to /pg13/pgdata
INFO: directory "/tmp/repmgr-config-archive-lhrrepmgr64361" deleted
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/pg13/pg13/bin/pg_ctl  -w -D '/pg13/pgdata' start"
INFO: node "lhrrepmgr64361" (ID: 1) is pingable
INFO: node "lhrrepmgr64361" (ID: 1) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | standby |   running | lhrrepmgr64362 | default  | 100      | 11       | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | primary | * running |                | default  | 100      | 12       | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64362 | default  | 100      | 12       | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64362 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2
[pg13@lhrrepmgr64361 ~]$

可以做switchover操作再切换回lhrrepmgr64361为主库的状态。

七、配置Pgpool-II做读写分离+负载均衡

👉 详细配置过程，在《【DB宝61】PostgreSQL使用Pgpool-II实现读写分离+负载均衡》中已经讲解过了，这里简单配置即可。

7.1、安装Pgpool-II

-- 下载镜像
docker pull registry.cn-hangzhou.aliyuncs.com/lhrbest/lhrpgpool:4.2.2_01


-- 创建Pgpool-II的容器
docker rm -f lhrpgpool01
docker run -d --name lhrpgpool01 -h lhrpgpool \
  --net=pg-network --ip 172.72.6.66 \
  -p 19999:9999 -p 19898:9898 \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true lhrbest/lhrpgpool:4.2.2_01 \
  /usr/sbin/init

docker network connect bridge lhrpgpool01

docker restart lhrpgpool01
docker exec -it lhrpgpool01 bash

以下所有操作均在pgsql用户执行。

7.2、配置pgpool.conf

cat >> /postgresql/pgpool/etc/pgpool.conf <<"EOF"
# - Backend Connection Settings -
backend_hostname0 = '172.72.6.61'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/pg13/pgdata'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'lhrrepmgr64361'


backend_hostname1 = '172.72.6.62'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/pg13/pgdata'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = 'lhrrepmgr64362'


backend_hostname2 = '172.72.6.63'
backend_port2 = 5432
backend_weight2 = 1
backend_data_directory2 = '/pg13/pgdata'
backend_flag2 = 'ALLOW_TO_FAILOVER'
backend_application_name2 = 'lhrrepmgr64363'

EOF

7.3、在主库创建用户

psql -U postgres -h 192.168.66.35 -p64361
create role nobody login encrypted password 'lhr';
create role lhr login encrypted password 'lhr';
create role pgpool login encrypted password 'lhr';


grant postgres to nobody,lhr,pgpool;

在 PostgreSQL 里没有区分用户和角色的概念，CREATE USER 为 CREATE ROLE 的别名，这两个命令几乎是完全相同的，唯一的区别是”CREATE USER”命令创建的用户的 LOGIN 属性默认为 on , 而 CREATE ROLE 命令创建的用户的 NOLOGIN 属性默认为 on。

7.4、启动Pgpool-II

systemctl restart pgpool
systemctl status pgpool

7.5、查询

[pgsql@lhrpgpool ~]$ psql -U pgpool -h 192.168.66.35  -p 19999 -d postgres
psql (13.2)
Type "help" for help.

postgres=> show pool_nodes;
 node_id |  hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.72.6.61 | 5432 | up     | 0.333333  | standby | 0          | false             | 0                 |                   |                        | 2021-04-29 17:40:28
 1       | 172.72.6.62 | 5432 | up     | 0.333333  | primary | 0          | false             | 0                 |                   |                        | 2021-04-29 17:40:28
 2       | 172.72.6.63 | 5432 | up     | 0.333333  | standby | 0          | true              | 0                 |                   |                        | 2021-04-29 17:40:28
(3 rows)

7.6、switchover故障转移后，查询读写分离

repmgr -f /pg13/pg13/repmgr.conf cluster show
repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
repmgr -f /pg13/pg13/repmgr.conf cluster show

执行过程：

[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | standby |   running | lhrrepmgr64362 | default  | 100      | 12       | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | primary | * running |                | default  | 100      | 12       | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64362 | default  | 100      | 12       | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64362 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf standby switchover --siblings-follow --force-rewind
NOTICE: executing switchover on node "lhrrepmgr64361" (ID: 1)
NOTICE: local node "lhrrepmgr64361" (ID: 1) will be promoted to primary; current primary "lhrrepmgr64362" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "lhrrepmgr64362" (ID: 2)
NOTICE: issuing CHECKPOINT on node "lhrrepmgr64362" (ID: 2) 
DETAIL: executing server command "/pg13/pg13/bin/pg_ctl  -D '/pg13/pgdata' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 1/22000028
NOTICE: promoting standby to primary
DETAIL: promoting server "lhrrepmgr64361" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "lhrrepmgr64361" (ID: 1) was successfully promoted to primary
NOTICE: issuing CHECKPOINT on node "lhrrepmgr64361" (ID: 1) 
ERROR: unable to execute CHECKPOINT
INFO: local node 2 can attach to rejoin target node 1
DETAIL: local node's recovery point: 1/22000028; rejoin target node's fork point: 1/220000A0
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/pg13/pg13/bin/pg_ctl  -w -D '/pg13/pgdata' start"
WARNING: node "lhrrepmgr64362" attached in state "startup"
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
WARNING: node "lhrrepmgr64362" attached in state "startup"
INFO: waiting for node "lhrrepmgr64362" (ID: 2) to connect to new primary; 1 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: node "lhrrepmgr64361" (ID: 2) is currrently attached to its upstream node in state "startup"
NOTICE: node  "lhrrepmgr64361" (ID: 1) promoted to primary, node "lhrrepmgr64362" (ID: 2) demoted to standby
NOTICE: executing STANDBY FOLLOW on 2 of 2 siblings
INFO:  node 4 received notification to follow node 1
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "lhrrepmgr64361" is now primary and node "lhrrepmgr64362" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
[pg13@lhrrepmgr64361 ~]$ 
[pg13@lhrrepmgr64361 ~]$ repmgr -f /pg13/pg13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream       | Location | Priority | Timeline | Connection string                                                        
----+----------------+---------+-----------+----------------+----------+----------+----------+---------------------------------------------------------------------------
 1  | lhrrepmgr64361 | primary | * running |                | default  | 100      | 13       | host=172.72.6.61 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 2  | lhrrepmgr64362 | standby |   running | lhrrepmgr64361 | default  | 100      | 12       | host=172.72.6.62 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 3  | lhrrepmgr64363 | standby |   running | lhrrepmgr64361 | default  | 100      | 12       | host=172.72.6.63 user=repmgr password=lhr dbname=repmgr connect_timeout=2
 4  | lhrrepmgr64364 | witness | * running | lhrrepmgr64361 | default  | 0        | n/a      | host=172.72.6.64 user=repmgr password=lhr dbname=repmgr connect_timeout=2
[pg13@lhrrepmgr64361 ~]$

使用读写分离测试：

C:\Users\lhrxxt>psql -U pgpool -h 192.168.66.35  -p 19999 -d postgres
Password for user pgpool:
psql (13.2)
Type "help" for help.

postgres=> show pool_nodes;
 node_id |  hostname   | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.72.6.61 | 5432 | up     | 0.333333  | primary | 0          | true              | 0                 |                   |                        | 2021-04-30 11:17:30
 1       | 172.72.6.62 | 5432 | up     | 0.333333  | standby | 0          | false             | 0                 |                   |                        | 2021-04-30 11:17:30
 2       | 172.72.6.63 | 5432 | up     | 0.333333  | standby | 0          | false             | 0                 |                   |                        | 2021-04-30 11:14:39
(3 rows)

可以看到，状态是正确的。

其它情况不再做测试，想学习更多内容，可以私聊麦老师。

八、总结

1、由于主从切换需要启动数据库，所以，在整个环境中，所有的PG数据库均不能使用systemctl来管理，否则切换会出现问题。

2、若要清除节点的注册信息，也可以进入主库中，操作repmgr.nodes表信息。

本文结束。

本文参与腾讯云自媒体分享计划，分享自微信公众号。

原始发表：2022-10-25，如有侵权请联系 cloudcommunity@tencent.com 删除

数据库

sql

本文分享自 DB宝微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体分享计划，欢迎热爱写作的你一起参与！

数据库

sql