PostgreSQL高可用工具-repmgr-手动故障切换

2023-05-29

PostgreSQL高可用工具-repmgr-手动故障切换

集群安装配置：http://www.linuxmysql.com/14/2023/1194.htm

当主服务器发生宕机或服务异常不同用时，就需要让备服务器接管故障的主服务器，以确保整个repmgr集群对外可用，这时可以通过repmgr standby promotion完成备份服务器对主服务器的按管

查看现在集群的状态：
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | * running | | default | 3 | 1 | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | standby | running | node13 | default | 2 | 1 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | node13 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node13 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456

现在模拟node13主库停机：
[root@node13 12]# systemctl stop postgresql-12

此时，集群处理部分禁用状态，集群不能对外提供写服务，需要手动恢复
在node14上查看集群的状态：
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+---------------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | ? unreachable | ? | default | 3 | | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | standby | running | ? node13 | default | 2 | 1 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | ? node13 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | ? node13 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456

WARNING: following issues were detected
- unable to connect to node "node13" (ID: 1)
- node "node13" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "node14" (ID: 2)'s upstream node "node13" (ID: 1)
- unable to determine if node "node14" (ID: 2) is attached to its upstream node "node13" (ID: 1)
- unable to connect to node "node15" (ID: 3)'s upstream node "node13" (ID: 1)
- unable to determine if node "node15" (ID: 3) is attached to its upstream node "node13" (ID: 1)
- unable to connect to node "node16" (ID: 4)'s upstream node "node13" (ID: 1)

HINT: execute with --verbose option to see connection error messages

现在使用以下方法在node14上升级第一个备用数据库：
-bash-4.2$ repmgr standby promote
WARNING: 2 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
node15 (node ID: 3)
node16 (node ID: 4, witness server)
NOTICE: promoting standby to primary
DETAIL: promoting server "node14" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node14" (ID: 2) was successfully promoted to primary

查看当前集群的状态：
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | - failed | ? | default | 3 | | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | primary | * running | | default | 2 | 2 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | ? node13 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node14 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456

WARNING: following issues were detected
- unable to connect to node "node13" (ID: 1)
- unable to connect to node "node15" (ID: 3)'s upstream node "node13" (ID: 1)
- unable to determine if node "node15" (ID: 3) is attached to its upstream node "node13" (ID: 1)

HINT: execute with --verbose option to see connection error messages

通过查看集群状态可知，主库不可用的告警已自动清除，因为备库已接管成主库，但是仅仅是接管主库服务
目前node15节点的上级仍为node13,需要运行以下命令使未跟随新主服务器的备库重新加入新的集群中

在node15上，执行以下命令：
-bash-4.2$ repmgr standby follow
NOTICE: attempting to find and follow current primary
INFO: local node 3 can attach to follow target node 2
DETAIL: local node's recovery point: 0/5014D70; follow target node's fork point: 0/5014D70
NOTICE: setting node 3's upstream to node 2
NOTICE: stopping server using "/usr/bin/pg_ctl -D '/var/lib/pgsql/12/data' -w -m fast stop"
NOTICE: starting server using "/usr/bin/pg_ctl -w -D '/var/lib/pgsql/12/data' start"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "node14" (ID: 2)

查看集群状态：
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | - failed | ? | default | 3 | | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | primary | * running | | default | 2 | 2 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | node14 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node14 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456

WARNING: following issues were detected
- unable to connect to node "node13" (ID: 1)

HINT: execute with --verbose option to see connection error messages

恢复node13服务器的服务，需要将备库重新加入集群中，在node13上执行
启动node13的服务：
systemctl start postgresql-12
此时的集群状态如下：
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | ! running | | default | 3 | 1 | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | primary | * running | | default | 2 | 2 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | node14 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node14 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456

WARNING: following issues were detected
- node "node13" (ID: 1) is running but the repmgr node record is inactive

将node13重新加入集群，这时只能以备库的身份加入:
在node13服务器上：
停止数据库：
[root@node13 12]# systemctl stop postgresql-12
-bash-4.2$ repmgr node rejoin -d 'host=172.16.1.14 dbname=repmgr user=repmgr password=123456' --force-rewind
--force-rewind：如果测试环境主库没有数据变化，所以可以不加--force-rewind这个参数

-bash-4.2$ repmgr node rejoin -d 'host=172.16.1.14 dbname=repmgr user=repmgr password=123456'
NOTICE: rejoin target is node "node14" (ID: 2)
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/503BA38, rejoin target lsn is 0/503BE08
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=node13 port=5432 dbname=repmgr user=repmgr password=123456"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/bin/pg_ctl -w -D '/var/lib/pgsql/12/data' start"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

查看集群状态：
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | standby | running | node14 | default | 3 | 2 | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | primary | * running | | default | 2 | 2 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | node14 | default | 1 | 2 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node14 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456

以上说明集群恢复正常

分割线

感谢打赏

江西数库信息技术有限公司

YWSOS.COM 平台代运维解决方案

评论

发表评论

姓名：

站内搜索

联系我

微信：18859224109-
邮箱：18859224109-@163.com
Q Q：1326992959-

相关文章

人气文章

Powered by AKCMS