集群安装配置:http://www.linuxmysql.com/14/2023/1194.htm
当主服务器发生宕机或服务异常不同用时,就需要让备服务器接管故障的主服务器,以确保整个repmgr集群对外可用,这时可以通过repmgr standby promotion完成备份服务器对主服务器的按管
查看现在集群的状态:
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | * running | | default | 3 | 1 | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | standby | running | node13 | default | 2 | 1 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | node13 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node13 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456
现在模拟node13主库停机:
[root@node13 12]# systemctl stop postgresql-12
此时,集群处理部分禁用状态,集群不能对外提供写服务,需要手动恢复
在node14上查看集群的状态:
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+---------------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | ? unreachable | ? | default | 3 | | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | standby | running | ? node13 | default | 2 | 1 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | ? node13 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | ? node13 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456
WARNING: following issues were detected
- unable to connect to node "node13" (ID: 1)
- node "node13" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "node14" (ID: 2)'s upstream node "node13" (ID: 1)
- unable to determine if node "node14" (ID: 2) is attached to its upstream node "node13" (ID: 1)
- unable to connect to node "node15" (ID: 3)'s upstream node "node13" (ID: 1)
- unable to determine if node "node15" (ID: 3) is attached to its upstream node "node13" (ID: 1)
- unable to connect to node "node16" (ID: 4)'s upstream node "node13" (ID: 1)
HINT: execute with --verbose option to see connection error messages
现在使用以下方法在node14上升级第一个备用数据库:
-bash-4.2$ repmgr standby promote
WARNING: 2 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
node15 (node ID: 3)
node16 (node ID: 4, witness server)
NOTICE: promoting standby to primary
DETAIL: promoting server "node14" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node14" (ID: 2) was successfully promoted to primary
查看当前集群的状态:
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | - failed | ? | default | 3 | | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | primary | * running | | default | 2 | 2 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | ? node13 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node14 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456
WARNING: following issues were detected
- unable to connect to node "node13" (ID: 1)
- unable to connect to node "node15" (ID: 3)'s upstream node "node13" (ID: 1)
- unable to determine if node "node15" (ID: 3) is attached to its upstream node "node13" (ID: 1)
HINT: execute with --verbose option to see connection error messages
通过查看集群状态可知,主库不可用的告警已自动清除,因为备库已接管成主库,但是仅仅是接管主库服务
目前node15节点的上级仍为node13,需要运行以下命令使未跟随新主服务器的备库重新加入新的集群中
在node15上,执行以下命令:
-bash-4.2$ repmgr standby follow
NOTICE: attempting to find and follow current primary
INFO: local node 3 can attach to follow target node 2
DETAIL: local node's recovery point: 0/5014D70; follow target node's fork point: 0/5014D70
NOTICE: setting node 3's upstream to node 2
NOTICE: stopping server using "/usr/bin/pg_ctl -D '/var/lib/pgsql/12/data' -w -m fast stop"
NOTICE: starting server using "/usr/bin/pg_ctl -w -D '/var/lib/pgsql/12/data' start"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "node14" (ID: 2)
查看集群状态:
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | - failed | ? | default | 3 | | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | primary | * running | | default | 2 | 2 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | node14 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node14 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456
WARNING: following issues were detected
- unable to connect to node "node13" (ID: 1)
HINT: execute with --verbose option to see connection error messages
恢复node13服务器的服务,需要将备库重新加入集群中,在node13上执行
启动node13的服务:
systemctl start postgresql-12
此时的集群状态如下:
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | primary | ! running | | default | 3 | 1 | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | primary | * running | | default | 2 | 2 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | node14 | default | 1 | 1 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node14 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456
WARNING: following issues were detected
- node "node13" (ID: 1) is running but the repmgr node record is inactive
将node13重新加入集群,这时只能以备库的身份加入:
在node13服务器上:
停止数据库:
[root@node13 12]# systemctl stop postgresql-12
-bash-4.2$ repmgr node rejoin -d 'host=172.16.1.14 dbname=repmgr user=repmgr password=123456' --force-rewind
--force-rewind:如果测试环境主库没有数据变化,所以可以不加--force-rewind这个参数
-bash-4.2$ repmgr node rejoin -d 'host=172.16.1.14 dbname=repmgr user=repmgr password=123456'
NOTICE: rejoin target is node "node14" (ID: 2)
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/503BA38, rejoin target lsn is 0/503BE08
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=node13 port=5432 dbname=repmgr user=repmgr password=123456"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/bin/pg_ctl -w -D '/var/lib/pgsql/12/data' start"
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
查看集群状态:
-bash-4.2$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node13 | standby | running | node14 | default | 3 | 2 | host=node13 port=5432 dbname=repmgr user=repmgr password=123456
2 | node14 | primary | * running | | default | 2 | 2 | host=node14 port=5432 dbname=repmgr user=repmgr password=123456
3 | node15 | standby | running | node14 | default | 1 | 2 | host=node15 port=5432 dbname=repmgr user=repmgr password=123456
4 | node16 | witness | * running | node14 | default | 0 | n/a | host=node16 port=5432 dbname=repmgr user=repmgr password=123456
以上说明集群恢复正常