专栏首页数据库架构之美使用repmgrd实现postgresql failover和auto failover

使用repmgrd实现postgresql failover和auto failover

前面的文章介绍了postgresql基于repmgr的高可用及切换方案,这篇文章主要聊聊通过repmgrd实现failover及auto failover。

前提是部署好postgresql主从,同时部署好repmgr。

[postgres@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

failover

停止主库,模拟主库故障

[postgres@node1 ~]$ pg_ctl stop -D /pgdata/
waiting for server to shut down..... done
server stopped

备库查看是unreachable状态

[postgres@node2 .ssh]$ repmgr cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+---------------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | ? unreachable |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

备库提升为主库

[postgres@node2 ~]$ repmgr standby promote
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using "pg_ctl  -w -D '/pgdata' promote"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary

新主库查看集群状态

[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2
 
WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)

原主库执行rejoin操作重新加入集群

[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 2 files copied to /pgdata
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl  -w -D '/pgdata' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

查看集群状态

[postgres@node1 pgdata]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

auto failover

可以利用repmgrd进程实现自动的failover,首先要在repmgr.conf文件中将location参数设置为一致,不设置的话默认也是一致的。同时启动repmgrd必须在postgres.conf配置文件中设置shared_preload_libraries='repmgr'

修改主备库repmgr.conf文件

failover=automatic
promote_command='/pgsql/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/pgsql/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file=/home/postgres/repmgrd.log
monitoring_history=true (启用监控参数)                    
monitor_interval_secs=5(定义监视数据间隔写入时间参数)
reconnect_attempts=10(故障转移之前,尝试重新连接主库次数(默认为6)参数)
reconnect_interval=5(每间隔5s尝试重新连接一次参数)

重启主备库使修改生效

[postgres@node1 ~]$ repmgr node service --action=restart
DETAIL: executing server command "pg_ctl  -w -D '/pgdata' restart"

主备库启动repmgrd

[postgres@node1 ~]$ repmgrd –f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
[2019-09-20 11:51:23] [NOTICE] redirecting logging output to "/home/postgres/repmgrd.log"

模拟主库故障

[postgres@node1 ~]$ pg_ctl stop -D /pgdata/
waiting for server to shut down..... done
server stopped

查看备库日志,发现已经升为主库

[2019-09-20 12:02:52] [NOTICE] promoting standby to primary
[2019-09-20 12:02:52] [DETAIL] promoting server "node2" (ID: 2) using "pg_ctl  -w -D '/pgdata' promote"
[2019-09-20 12:02:52] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2019-09-20 12:02:52] [NOTICE] STANDBY PROMOTE successful
[2019-09-20 12:02:52] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2019-09-20 12:02:52] [INFO] 0 followers to notify
[2019-09-20 12:02:52] [INFO] switching to primary monitoring mode
[2019-09-20 12:02:52] [NOTICE] monitoring cluster primary "node2" (ID: 2)

查看cluster状态,备库已经升主

[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 5        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2
 
WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)

原主库执行rejoin加入集群

[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 2 files copied to /pgdata
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl  -w -D '/pgdata' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

查看集群状态

[postgres@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string                                            
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 5        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 6        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

本文分享自微信公众号 - 数据库架构之美(databasekernel),作者:数据库架构之美

原文出处及转载信息见文内详细说明,如有侵权,请联系 yunjia_community@tencent.com 删除。

原始发表时间:2019-09-24

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

我来说两句

0 条评论
登录 后参与评论

相关文章

  • 基于repmgr的postgresql主备高可用方案

    本文比较基础,主要介绍postgresql开源高可用工具repmgr的部署和使用,初学者可以根据本文步骤一步一步做下去,废话不多说,直接进入主题,本文以两台机器...

    数据库架构之美
  • 使用pg_repack实现在线vacuum

    Postgresql通过数据多版本实现mvcc,删除数据并不会真正删除数据,而是修改标识,更新是通过删除+插入的方式进行,所以在频繁更新的系统,数据膨胀是个头疼...

    数据库架构之美
  • 聊聊PostgreSQL中的几种索引类型

    索引是增强数据库性能的利器,在检索某些特定行的时候效率会有很大提升,postgresql中索引类型丰富,每种索引有着不同的应用场景,下面简单介绍一下。

    数据库架构之美
  • 如何打造一个高效适配的H5

    以往拿到一份视觉稿要做页面适配,通常都是针对不同分辨率尺寸做断点写 media query,然后看效果再微调。开发过程中视觉稿要是做了修改,修改的地方就得再走一...

    前朝楚水
  • 【DB笔试面试548】在Oracle中,索引有哪3大特性?

    一般来说索引有3大特性,索引高度比较低、索引存储列值及索引本身有序,对这3大特性的应用如下表所示:

    小麦苗DBA宝典
  • 如何打造一个高效适配的H5

    腾讯ISUX
  • 盘点与云计算密切相关的市场热捧产品

    云计算,即一种基于因特网的超级计算模式。随着国家宽带提速的战略的提出,云计算得到更进一步推动。毫无疑问,21世纪中,云计算被视为科技业的一场新的革命。而云计算的...

    静一
  • LeeCX - 开源后台管理系统简单介绍

    我们在github上开源了一个后台管理系统,使用了前端css框架并且简单的封装了一下,技术的将会不间断更新,详细可以点击https://github.com/l...

    风间影月
  • 快速学习-架构师成长之路

    cwl_java
  • 亿级流量网站架构核心技术【笔记】(一)

    3.在有限资源的情况下,一定是先解决当下最核心的问题,预测并发现未来可能出现的问题,一步步解决最痛点的问题,即满足需求的系统是不断迭代优化出来的 A.高并发原...

    硬核项目经理

扫码关注云+社区

领取腾讯云代金券