PgSQL Replicated Cluster

This page explains how to install a 2-node multistate(master/slave) cluster on Fedora 19 using PostgreSQL synchronous streaming replication which is implemented since version 9.1.

Features

 * Fail-over of Master
 * If Master breaks down, RA detects this fault and makes Master to stop, and Slave is promoted to new Master(promote).
 * Switching between asynchronous and synchronous
 * If Slave breaks down or LAN has some troubles, the transaction including Write operation will be stopped during the setting of synchronous replication. This means the stop of service. Therefore, RA switches dynamically from the synchronous to the asynchronous replication for prevented from stopping service.
 * Automated discrimination of data old and new when initial starts
 * When Pacemaker is started at the same time in the initial starts, RA compares the data of each node using "last xlog replay location" or "last receive replay location" to check which node has the newest data. The node which has the newest data will be Master.
 * Load-balancing of read
 * Because Slave can process the Read only-transaction, the load-balancing of Read is possible by applying another virtual IP for the Read operation.

Terms
These terms are used in this document.


 * Master,Slave
 * The state of Master/Slave resource of Pacemaker
 * PRI
 * PostgreSQL works as Primary(Master). The request of Read/Write can be processed in PRI as well as usual PostgreSQL, and data for the replication can be transmitted. Master of Pacemaker is basically corresponding to this state. At a synchronous replication, The transaction is stopped when the response from Standby PostgreSQL(HS) is lost.
 * HS
 * PostgreSQL works as Hot Standby. Only the request of Read is available. The data of the replication can be received from PRI. Since PostgreSQL cannot change the state from PRI to HS directly, the state may not be consistent to Slave of Pacemaker after demoting though the state changes to Pacemaker with Master->Slave.
 * async mode
 * Make the asynchronous replication of PostgreSQL HA cluster with RA.
 * sync mode
 * Make the synchronous replication of PostgreSQL HA cluster with RA. If parameter "rep_mode" is sync, HS works as the sync mode as a normal. When Master(PRI) detects HS is breaking down on monitor, Master switches replication mode to async mode automatically.
 * replication mode
 * The async mode and the sync mode are generically called a replication mode.It can be set by "rep_mode" parameter.

Software version
We developed pgsql RA using pacemaker 1.0.11 with heartbeat stack and linux distribution compatible with RHEL6. It will probably work on other distributions but I recommend using newer pacemaker and replacing pgsql RA(https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql) if your resource-agents < 3.9.6.

Network Topology
It's recommended that LANs(for service, pacemaker, replication, stonith) are separated and redundant. But we use these topology in this document to explain simply.


 * node1
 * eth0 : 192.168.0.1 : LAN for service
 * eth1 : 192.168.1.1 : LAN for Pacemaker
 * eth2 : 192.168.2.1 : LAN for replication


 * node2
 * eth0 : 192.168.0.2 : LAN for service
 * eth1 : 192.168.1.2 : LAN for Pacemaker
 * eth2 : 192.168.2.2 : LAN for replication

virtual IP (vip)
 * virtual IP1 : 192.168.0.3 : vip for eth0 (DB client connects this IP to access PostgreSQL(Master))
 * virtual IP2 : 192.168.2.3 : vip for eth2 (Slave connects this IP to replicate)



PostgreSQL
Use PostgreSQL 9.1 or later.

Parameters of pgsql RA
The following parameters are added for replication.


 * rep_mode
 * choice from async or sync to use replication."async" is used for async mode only, "sync" is used for switching between sync mode and async mode.The following parameter node_list master_ip, and restore_command is necessary at async or sync modes(*).
 * node_list(*)
 * The list of PRI and HS node names. Specifies a space-separated list of all node name (result of the uname -n command).
 * master_ip(*)
 * HS connects to this IP. It equals virtual IP2 in this documents.
 * restore_command(*)
 * restore_command specified in recovery.conf file when starting with HS.
 * repuser
 * The user of replication which HS connects to PRI. Default is "postgres". Please use .pgpass file to set password.
 * primary_conninfo_opt
 * RA generates recovery.conf file to start PostgreSQL as HS. host,port,user and application name of primary_conninfo are automatically set by RA. If you would like to set some additional parameters, you can specify them here.
 * tmpdir
 * the rep_mode_conf and xlog_note.* and PGSQL.lock files are created in this directory. Default is /var/lib/pgsql/tmp directory. If the directory doesn't exist, RA makes it automatically.
 * xlog_check_count
 * The count of cheking last_xlog_replay_location or last_xlog_receive_location is specified to compare data. Default is 3(times). It is counted at monitor interval.
 * stop_escalate_in_slave
 * Number of shutdown retries (using -m fast) before resorting to -m immediate in Slave state. In Master sate, you can use "stop_escalate" which is provided since early times.
 * restart_on_promote
 * RA restarts PostgreSQL on promote instead of promote to prevent from increasing Timeline ID of PostgreSQL since HS can't connect PRI if Timeline ID is different. Default is false and you should copy data from PRI to align Timeline ID after promoting.

Installation
In this documents, we use Fedora 19.

OS (Fedora 19) (both nodes)
We don't explain details.


 * disable selinux
 * edit /etc/selinux/config
 * disable firewall(iptables)
 * systemctl stop firewalld.service
 * systemctl disable firewalld.service
 * set node name "node1" and "node2".
 * edit /etc/sysconfig/network
 * stop and disable NetworkManager and set up IPs as above.
 * systemctl stop NetworkManager.service
 * systemctl disable NetworkManager.service
 * edit /etc/sysconfig/network-scripts/ifcfg-ethX

Packages (both nodes)

 * 1) yum -y install postgresql-server pacemaker corosync pcs

Replacement of pgsql RA (both nodes)
Replace pgsql RA because the RA of resource-agents-3.9.5-2.fc19.x86_64 has an bug when using it as Master/Slave.


 * 1) wget https://raw.github.com/ClusterLabs/resource-agents/master/heartbeat/pgsql
 * 2) cp pgsql /usr/lib/ocf/resource.d/heartbeat/
 * 3) chmod 755 /usr/lib/ocf/resource.d/heartbeat/pgsql

PostgreSQL (node1 only)
$ mkdir /var/lib/pgsql/pg_archive $ cd /var/lib/pgsql/data $ initdb
 * 1) su - postgres

Edit postgresql.conf.


 * Point of postgresql.conf setting
 * If there is "synchronous_standby_names" parameter, please delete it.
 * Fixed IP cannot be written in listen_address.
 * Replication_timeout is detection time for the replication cuts it, and wal_receiver_status_interval is an interval when HS tries connecting to PRI. To shorten detection, you should set this value to small.
 * When using rep_mode=sync, RA adds "include" into postgresql.conf to switch replication mode. If you want to switch to rep_mode=async from sync, you need to delete it manually.

The main set part as follows. Please refer to the manual of PostgreSQL for other parameter. Check the starting with the PostgreSQL unit, and the replication is possible.

listen_addresses = '*' wal_level = hot_standby synchronous_commit = on archive_mode = on archive_command = 'cp %p /var/lib/pgsql/pg_archive/%f' max_wal_senders=5 wal_keep_segments = 32 hot_standby = on restart_after_crash = off replication_timeout = 5000 wal_receiver_status_interval = 2 max_standby_streaming_delay = -1 max_standby_archive_delay = -1 synchronous_commit = on restart_after_crash = off hot_standby_feedback = on

Edit pg_hba.conf. Be careful. This explanation is not considered about the security.

host   all             all     127.0.0.1/32        trust host   all             all     192.168.0.0/16      trust host   replication     all     192.168.0.0/16      trust

start PostgreSQL on node1

$ pg_ctl -D /var/lib/pgsql/data start

PostgreSQL (node2 only)
Copy data from node1 to node2 $ rm -rf /var/lib/pgsql/data/* $ pg_basebackup -h 192.168.2.1 -U postgres -D /var/lib/pgsql/data -X stream -P $ mkdir /var/lib/pgsql/pg_archive
 * 1) su - postgres

Create /var/lib/pgsql/data/recovery.conf to confirm replication.

standby_mode = 'on' primary_conninfo = 'host=192.168.2.1 port=5432 user=postgres application_name=node2' restore_command = 'cp /var/lib/pgsql/pg_archive/%f %p' recovery_target_timeline = 'latest'

Start PostgreSQL on node2

$ pg_ctl -D /var/lib/pgsql/data/ start

Confirm PostgreSQL replication success (node1 only)
$ psql -c "select client_addr,sync_state from pg_stat_replication;" client_addr | sync_state ---+ 192.168.2.2  | async
 * 1) su - postgres

Stop PostgreSQL (both nodes)
$ pg_ctl -D /var/lib/pgsql/data stop $ exit

Corosync (both nodes)
Create /etc/corosync/corosync.conf

quorum { provider: corosync_votequorum expected_votes: 2 } aisexec { user: root group: root } service { name: pacemaker ver: 0 } totem { version: 2 secauth: off interface { ringnumber: 0 bindnetaddr: 192.168.1.0 mcastaddr: 226.94.1.1 mcastport: 5405 } } logging { to_syslog: yes }

Start corosync
 * 1) systemctl start corosync.service

You can see this log in /var/log/messages when you succeed in starting corosync. Starting Corosync Cluster Engine (corosync): [ OK  ]

Pacemaker (both nodes)
Clear current settings if it exists.


 * 1) rm -f /var/lib/pacemaker/cib/cib*

Start pacemaker


 * 1) systemctl start pacemaker.service

Check status


 * 1) crm_mon -Afr -1

Last updated: Mon Jul 8 09:46:27 2013 Last change: Mon Jul 8 09:46:27 2013 via crmd on node1 Stack: corosync Current DC: node1 (402696384) - partition with quorum Version: 1.1.9-3.fc19-781a388 2 Nodes configured, unknown expected votes 0 Resources configured.

Online: [ node1 node2 ]

Full list of resources:

Node Attributes:
 * Node node1:
 * Node node2:

Migration summary:
 * Node node1:
 * Node node2:

Make configuration file(config.pcs) for pcs command.

In this sample configuration, "vip-master" means virtual IP1 and "vip-rep" means virtual IP2, and we use restart_on_promote="true" to explain operations simply. If you use false, you should start pacemaker on node1 only and laod configuration. After that you should copy data from node1 to node2 and start pacemaker on node2 to align Timeline ID.

pcs cluster cib pgsql_cfg

pcs -f pgsql_cfg property set no-quorum-policy="ignore" pcs -f pgsql_cfg property set stonith-enabled="false" pcs -f pgsql_cfg resource op defaults resource-stickiness="INFINITY" pcs -f pgsql_cfg resource op defaults migration-threshold="1"

pcs -f pgsql_cfg resource create vip-master IPaddr2 \ ip="192.168.0.3" \ nic="eth0" \ cidr_netmask="24" \ op start  timeout="60s" interval="0s"  on-fail="restart" \ op monitor timeout="60s" interval="10s" on-fail="restart" \ op stop   timeout="60s" interval="0s"  on-fail="block"

pcs -f pgsql_cfg resource create vip-rep IPaddr2 \ ip="192.168.2.3" \ nic="eth2" \ cidr_netmask="24" \ meta migration-threshold="0" \ op start  timeout="60s" interval="0s"  on-fail="stop" \ op monitor timeout="60s" interval="10s" on-fail="restart" \ op stop   timeout="60s" interval="0s"  on-fail="ignore"

pcs -f pgsql_cfg resource create pgsql pgsql \ pgctl="/usr/bin/pg_ctl" \ psql="/usr/bin/psql" \ pgdata="/var/lib/pgsql/data/" \ rep_mode="sync" \ node_list="node1 node2" \ restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \ primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \ master_ip="192.168.2.3" \ restart_on_promote='true' \ op start  timeout="60s" interval="0s"  on-fail="restart" \ op monitor timeout="60s" interval="4s" on-fail="restart" \ op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \ op promote timeout="60s" interval="0s" on-fail="restart" \ op demote timeout="60s" interval="0s"  on-fail="stop" \ op stop   timeout="60s" interval="0s"  on-fail="block" \ op notify timeout="60s" interval="0s"

pcs -f pgsql_cfg resource master msPostgresql pgsql \ master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

pcs -f pgsql_cfg resource group add master-group vip-master vip-rep

pcs -f pgsql_cfg constraint colocation add master-group with Master msPostgresql INFINITY pcs -f pgsql_cfg constraint order promote msPostgresql then start master-group symmetrical=false score=INFINITY pcs -f pgsql_cfg constraint order demote msPostgresql then stop  master-group symmetrical=false score=0

pcs cluster push cib pgsql_cfg

Load configuration


 * 1) sh config.pcs

Check status again.


 * 1) crm_mon -Afr -1

Last updated: Mon Jul 8 10:24:21 2013 Last change: Mon Jul 8 10:22:14 2013 via crm_attribute on node1 Stack: corosync Current DC: node2 (419473600) - partition with quorum Version: 1.1.9-3.fc19-781a388 2 Nodes configured, unknown expected votes 4 Resources configured.

Online: [ node1 node2 ]

Full list of resources:

Resource Group: master-group vip-master (ocf::heartbeat:IPaddr2):      Started node1 vip-rep   (ocf::heartbeat:IPaddr2):       Started node2 Master/Slave Set: msPostgresql [pgsql] Masters: [ node1 ] Slaves: [ node2 ]

Node Attributes: + master-pgsql                     : 1000 + pgsql-data-status                : LATEST + pgsql-master-baseline            : 0000000009000080 + pgsql-status                     : PRI + master-pgsql                     : 100 + pgsql-data-status                : STREAMING|SYNC + pgsql-status                     : HS:sync
 * Node node1:
 * Node node2:

Migration summary:
 * Node node1:
 * Node node2:

after fail-over
Kill PostgreSQL process at node1 to occur fail-over.
 * 1) killall -9 postgres

Check status


 * 1) crm_mon -Afr -1

Last updated: Mon Aug 26 19:12:33 2013 Last change: Mon Aug 26 19:12:32 2013 via crm_attribute on node2 Stack: corosync Current DC: node2 (2566957248) - partition with quorum Version: 1.1.9-3.fc19-781a388 2 Nodes configured, unknown expected votes 4 Resources configured.

Online: [ node1 node2 ]

Full list of resources:

Master/Slave Set: msPostgresql [pgsql] Masters: [ node2 ] Stopped: [ node1 ] Resource Group: master-group vip-master (ocf::heartbeat:IPaddr2):      Started node2 vip-rep   (ocf::heartbeat:IPaddr2):       Started node2

Node Attributes: + master-pgsql                     : -INFINITY + pgsql-data-status                : DISCONNECT + pgsql-status                     : STOP + master-pgsql                     : 1000 + pgsql-data-status                : LATEST + pgsql-master-baseline            : 000000000F000000 + pgsql-status                     : PRI
 * Node node1:
 * Node node2:

Migration summary: pgsql: migration-threshold=1000000 fail-count=1000000 last-failure='Mon Aug 26 19:12:26 2013'
 * Node node1:
 * Node node2:

Failed actions: pgsql_start_0 (node=node1, call=157, rc=1, status=complete, last-rc-change=Mon Aug 26 19:12:27 2013, queued=402ms, exec=0ms ): unknown error

Recovery node1 as Slave. Copy all data from node2 because the data of node1 may be inconsistent.

(at node1) $ rm -rf /var/lib/pgsql/data/ $ pg_basebackup -h 192.168.2.3 -U postgres -D /var/lib/pgsql/data -X stream -P $ rm /var/lib/pgsql/tmp/PGSQL.lock $ exit
 * 1) su - postgres
 * 1) pcs resource cleanup msPostgresql

Check status again.
 * 1) crm_mon -Afr -1

Last updated: Mon Aug 26 19:21:17 2013 Last change: Mon Aug 26 19:21:17 2013 via crm_attribute on node2 Stack: corosync Current DC: node2 (2566957248) - partition with quorum Version: 1.1.9-3.fc19-781a388 2 Nodes configured, unknown expected votes 4 Resources configured.

Online: [ node1 node2 ]

Full list of resources:

Master/Slave Set: msPostgresql [pgsql] Masters: [ node2 ] Slaves: [ node1 ] Resource Group: master-group vip-master (ocf::heartbeat:IPaddr2):      Started node2 vip-rep   (ocf::heartbeat:IPaddr2):       Started node2

Node Attributes: + master-pgsql                     : 100 + pgsql-data-status                : STREAMING|SYNC + pgsql-status                     : HS:sync + master-pgsql                     : 1000 + pgsql-data-status                : LATEST + pgsql-master-baseline            : 000000000F000000 + pgsql-status                     : PRI
 * Node node1:
 * Node node2:

Migration summary:
 * Node node1:
 * Node node2:

About node attributes
The RA defines the following states as a node attribute value of Pacemaker. attribute can be seen in "crm_mon -A".

pgsql-status
A present state of PostgreSQL is displayed by the attribute value to which PRI or either HS node is displayed.


 * STOP
 * PostgreSQL has stopped.
 * HS:alone
 * It works as HS but HS doesn't connect PRI
 * HS:connected
 * it works as HS, and connected with PRI but HS isn't normal state of the replication (Data has not caught up with PRI).
 * HS:async
 * It works as HS, and the state is async mode. When RA is used in the rep_mode=async, it is possible to be promoted to Master.
 * HS:sync
 * It works as HS, and the state is sync mode. It is possible to be promoted to Master for the rep_mode=sync.
 * PRI
 * It operates by PRI.

pgsql-data-status
The transitional state of data is displayed. This state remains after stopping pacemaker. When starting pacemaker next time, this state is used to judge whether my data is old or not.


 * DISCONNECT
 * Master changes other node state into DISCONNECT if Master can't detect connection of replication because of LAN failure or breakdown of Slave and so on.
 * {state}|{sync_state}
 * Master changes other node state into {state}|{sync_state} if Master detects connection of replication.
 * {state} and {sync_state} means state of replication which is retrieved using "select state and sync_state from pg_stat_replication" on Master.
 * For example, INIT, CATCHUP, and STREAMING are displayed in {state} and ASYNC, SYNC are displayed in {sync_state}
 * LATEST
 * It's displayed when it's Master.

These states are the transitional state of final data, and it may be not consistent with the state of actual data. For instance, During PRI, the state is "LATEST". But the node is stopped or down, this state "LATEST" is maintained if Master doesn't exist in other nodes. It never changes to "DISCONNECT" for oneself. When other node newly is promoted, this new Master changes the state of old Master to "DISCONNECT". When any node can not become Master, this "LATEST" will be keeped.

pgsql-xlog-replay-loc
There is no Master node, it is displayed. RA decide to promote one node to Master comparing with the value of the last_replay_xlog_location or last_receive_xlog_replay_location among other node.

sample settings for crm command
property \ no-quorum-policy="ignore" \ stonith-enabled="false" \ crmd-transition-delay="0s" rsc_defaults \ resource-stickiness="INFINITY" \ migration-threshold="1" ms msPostgresql pgsql \ meta \ master-max="1" \ master-node-max="1" \ clone-max="2" \ clone-node-max="1" \ notify="true" clone clnPingCheck pingCheck group master-group \ vip-master \ vip-rep primitive vip-master ocf:heartbeat:IPaddr2 \ params \ ip="192.168.0.3" \ nic="eth0" \ cidr_netmask="24" \ op start  timeout="60s" interval="0s"  on-fail="stop" \ op monitor timeout="60s" interval="10s" on-fail="restart" \ op stop   timeout="60s" interval="0s"  on-fail="block" primitive vip-rep ocf:heartbeat:IPaddr2 \ params \ ip="192.168.2.3" \ nic="eth2" \ cidr_netmask="24" \ meta \ migration-threshold="0" \ op start  timeout="60s" interval="0s"  on-fail="restart" \ op monitor timeout="60s" interval="10s" on-fail="restart" \ op stop   timeout="60s" interval="0s"  on-fail="block" primitive pgsql ocf:heartbeat:pgsql \ params \ pgctl="/usr/bin/pg_ctl" \ psql="/usr/bin/psql" \ pgdata="/var/lib/pgsql/data/" \ start_opt="-p 5432" \ rep_mode="sync" \ node_list="node1 node2" \ restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \ primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \ master_ip="192.168.2.3" \ restart_on_promote="true" \ op start  timeout="60s" interval="0s"  on-fail="restart" \ op monitor timeout="60s" interval="10s" on-fail="restart" \ op monitor timeout="60s" interval="9s" on-fail="restart" role="Master" \ op promote timeout="60s" interval="0s" on-fail="restart" \ op demote timeout="60s" interval="0s"  on-fail="stop" \ op stop   timeout="60s" interval="0s"  on-fail="block" \ op notify timeout="60s" interval="0s" colocation rsc_colocation-1 inf: master-group msPostgresql:Master order rsc_order-1 0: msPostgresql:promote master-group:start  symmetrical=false order rsc_order-2 0: msPostgresql:demote  master-group:stop   symmetrical=false

You can load as follows.
 * 1) crm configure load update [filename]

Q&A

 * How do I force start Master although pgsql-data-status is "DISCONNECT"?
 * # crm_attribute -l forever -N {Node Name} -n "pgsql-data-status" -v "LATEST"
 * What's PGSQL.lock file?
 * The file is created on promote. And it's deleted on demote only if Slave does not exist. If this file remains in a node, it means that the data may be inconsistent. Please copy all data from PRI and delete this lock file.
 * How do we stop all servers?
 * First, stop Slave. After that stop Master. If you stop Master in first, PGSQL.lock file remains.