PG_AUTO_FAILOVER(1) | pg_auto_failover | PG_AUTO_FAILOVER(1) |
pg_auto_failover - pg_auto_failover Documentation
The pg_auto_failover project is an Open Source Software project. The development happens at https://github.com/citusdata/pg_auto_failover and is public: everyone is welcome to participate by opening issues or pull requests, giving feedback, etc.
Remember that the first steps are to actually play with the pg_autoctl command, then read the entire available documentation (after all, I took the time to write it), and then to address the community in a kind and polite way — the same way you would expect people to use when addressing you.
NOTE:
For enhancements, improvements, and new features, consider contributing to the project. Pull Requests are reviewed as part of the offered maintenance.
NOTE:
pg_auto_failover is an extension for PostgreSQL that monitors and manages failover for postgres clusters. It is optimised for simplicity and correctness.
pg_auto_failover implements Business Continuity for your PostgreSQL services. pg_auto_failover implements a single PostgreSQL service using multiple nodes with automated failover, and automates PostgreSQL maintenance operations in a way that guarantees availability of the service to its users and applications.
To that end, pg_auto_failover uses three nodes (machines, servers) per PostgreSQL service:
The pg_auto_failover Monitor implements a state machine and relies on in-core PostgreSQL facilities to deliver HA. For example. when the secondary node is detected to be unavailable, or when its lag is reported above a defined threshold (the default is 1 WAL files, or 16MB, see the pgautofailover.promote_wal_log_threshold GUC on the pg_auto_failover monitor), then the Monitor removes it from the synchronous_standby_names setting on the primary node. Until the secondary is back to being monitored healthy, failover and switchover operations are not allowed, preventing data loss.
In the pictured architecture, pg_auto_failover implements Business Continuity and data availability by implementing a single PostgreSQL service using multiple with automated failover and data redundancy. Even after losing any Postgres node in a production system, this architecture maintains two copies of the data on two different nodes.
When using more than one standby, different architectures can be achieved with pg_auto_failover, depending on the objectives and trade-offs needed for your production setup.
When setting the three parameters above, it's possible to design very different Postgres architectures for your production needs.
In this case, the system is setup with two standby nodes participating in the replication quorum, allowing for number_sync_standbys = 1. The system always maintains a minimum of two copies of the data set: one on the primary, another one on one on either node B or node D. Whenever we lose one of those nodes, we can hold to this guarantee of two copies of the data set.
Adding to that, we have the standby server D which has been set up to not participate in the replication quorum. Node D will not be found in the synchronous_standby_names list of nodes. Also, node D is set up in a way to never be a candidate for failover, with candidate-priority = 0.
This architecture would fit a situation where nodes A, B, and C are deployed in the same data center or availability zone, and node D in another. Those three nodes are set up to support the main production traffic and implement high availability of both the Postgres service and the data set.
Node D might be set up for Business Continuity in case the first data center is lost, or maybe for reporting the need for deployment on another application domain.
pg_auto_failover implements Business Continuity for your Citus services. pg_auto_failover implements a single Citus formation service using multiple Citus nodes with automated failover, and automates PostgreSQL maintenance operations in a way that guarantees availability of the service to its users and applications.
In that case, pg_auto_failover knows how to orchestrate a Citus coordinator failover and a Citus worker failover. A Citus worker failover can be achieved with a very minimal downtime to the application, where during a short time window SQL writes may error out.
In this figure we see a single standby node for each Citus node, coordinator and workers. It is possible to implement more standby nodes, and even read-only nodes for load balancing, see Citus Secondaries and read-replica.
pg_auto_failover includes the command line tool pg_autoctl that implements many commands to manage your Postgres nodes. To implement the Postgres architectures described in this documentation, and more, it is generally possible to use only some of the many pg_autoctl commands.
This section of the documentation is a short introduction to the main commands that are useful when getting started with pg_auto_failover. More commands are available and help deal with a variety of situations, see the Manual Pages for the whole list.
To understand which replication settings to use in your case, see Architecture Basics section and then the Multi-node Architectures section.
To follow a step by step guide that you can reproduce on your own Azure subscription and create a production Postgres setup from VMs, see the pg_auto_failover Tutorial section.
To understand how to setup pg_auto_failover in a way that is compliant with your internal security guide lines, read the Security settings for pg_auto_failover section.
As a command line tool pg_autoctl depends on some environment variables. Mostly, the tool re-uses the Postgres environment variables that you might already know.
To manage a Postgres node pg_auto_failover needs to know its data directory location on-disk. For that, some users will find it easier to export the PGDATA variable in their environment. The alternative consists of always using the --pgdata option that is available to all the pg_autoctl commands.
To get started with the simplest Postgres failover setup, 3 nodes are needed: the pg_auto_failover monitor, and 2 Postgres nodes that will get assigned roles by the monitor. One Postgres node will be assigned the primary role, the other one will get assigned the secondary role.
To create the monitor use the command:
$ pg_autoctl create monitor
The create the Postgres nodes use the following command on each node you want to create:
$ pg_autoctl create postgres
While those create commands initialize your nodes, now you have to actually run the Postgres service that are expected to be running. For that you can manually run the following command on every node:
$ pg_autoctl run
It is also possible (and recommended) to integrate the pg_auto_failover service in your usual service management facility. When using systemd the following commands can be used to produce the unit file configuration required:
$ pg_autoctl show systemd INFO HINT: to complete a systemd integration, run the following commands: INFO pg_autoctl -q show systemd --pgdata "/tmp/pgaf/m" | sudo tee /etc/systemd/system/pgautofailover.service INFO sudo systemctl daemon-reload INFO sudo systemctl enable pgautofailover INFO sudo systemctl start pgautofailover [Unit] ...
While it is expected that for a production deployment each node actually is a separate machine (virtual or physical, or even a container), it is also possible to run several Postgres nodes all on the same machine for testing or development purposes.
TIP:
Once your Postgres nodes have been created, and once each pg_autoctl service is running, it is possible to inspect the current state of the formation with the following command:
$ pg_autoctl show state
The pg_autoctl show state commands outputs the current state of the system only once. Sometimes it would be nice to have an auto-updated display such as provided by common tools such as watch(1) or top(1) and the like. For that, the following commands are available (see also pg_autoctl watch):
$ pg_autoctl watch $ pg_autoctl show state --watch
To analyze what's been happening to get to the current state, it is possible to review the past events generated by the pg_auto_failover monitor with the following command:
$ pg_autoctl show events
HINT:
Use pg_autoctl show state --local to have a view of the local state of a given node without connecting to the monitor Postgres instance.
The option --json is available in most pg_autoctl commands and switches the output format from a human readable table form to a program friendly JSON pretty-printed output.
When creating a node it is possible to use the --candidate-priority and the --replication-quorum options to set the replication properties as required by your choice of Postgres architecture.
To review the current replication settings of a formation, use one of the two following commands, which are convenient aliases (the same command with two ways to invoke it):
$ pg_autoctl show settings $ pg_autoctl get formation settings
It is also possible to edit those replication settings at any time while your nodes are in production: you can change your mind or adjust to new elements without having to re-deploy everything. Just use the following commands to adjust the replication settings on the fly:
$ pg_autoctl set formation number-sync-standbys $ pg_autoctl set node replication-quorum $ pg_autoctl set node candidate-priority
IMPORTANT:
The pg_autoctl set command then changes the replication settings on the node registration on the monitor. Then the monitor assigns the APPLY_SETTINGS state to the current primary node in the system for it to apply the new replication settings to its Postgres streaming replication setup.
As a result, the pg_autoctl set commands requires a stable state in the system to be allowed to proceed. Namely, the current primary node in the system must have both its Current State and its Assigned State set to primary, as per the pg_autoctl show state output.
When a Postgres node must be taken offline for a maintenance operation, such as e.g. a kernel security upgrade or a minor Postgres update, it is best to make it so that the pg_auto_failover monitor knows about it.
To implement maintenance operations, use the following commands:
$ pg_autoctl enable maintenance $ pg_autoctl disable maintenance
The main pg_autoctl run service that is expected to be running in the background should continue to run during the whole maintenance operation. When a node is in the maintenance state, the pg_autoctl service is not controlling the Postgres service anymore.
Note that it is possible to enable maintenance on a primary Postgres node, and that operation then requires a failover to happen first. It is possible to have pg_auto_failover orchestrate that for you when using the command:
$ pg_autoctl enable maintenance --allow-failover
IMPORTANT:
In the cases when a failover is needed without having an actual node failure, the pg_auto_failover monitor can be used to orchestrate the operation. Use one of the following commands, which are synonyms in the pg_auto_failover design:
$ pg_autoctl perform failover $ pg_autoctl perform switchover
Finally, it is also possible to “elect” a new primary node in your formation with the command:
$ pg_autoctl perform promotion
IMPORTANT:
This section of the documentation is meant to help users get started by focusing on the main commands of the pg_autoctl tool. Each command has many options that can have very small impact, or pretty big impact in terms of security or architecture. Read the rest of the manual to understand how to best use the many pg_autoctl options to implement your specific Postgres production architecture.
In this guide we’ll create a Postgres setup with two nodes, a primary and a standby. Then we'll add a second standby node. We’ll simulate failure in the Postgres nodes and see how the system continues to function.
This tutorial uses docker-compose in order to separate the architecture design from some of the implementation details. This allows reasonning at the architecture level within this tutorial, and better see which software component needs to be deployed and run on which node.
The setup provided in this tutorial is good for replaying at home in the lab. It is not intended to be production ready though. In particular, no attention have been spent on volume management. After all, this is a tutorial: the goal is to walk through the first steps of using pg_auto_failover to implement Postgres automated failover.
When using docker-compose we describe a list of services, each service may run on one or more nodes, and each service just runs a single isolated process in a container.
Within the context of a tutorial, or even a development environment, this matches very well to provisioning separate physical machines on-prem, or Virtual Machines either on-prem on in a Cloud service.
The docker image used in this tutorial is named pg_auto_failover:tutorial. It can be built locally when using the attached Dockerfile found within the GitHub repository for pg_auto_failover.
To build the image, either use the provided Makefile and run make build, or run the docker build command directly:
$ git clone https://github.com/citusdata/pg_auto_failover $ cd pg_auto_failover/docs/tutorial $ docker build -t pg_auto_failover:tutorial -f Dockerfile ../.. $ docker-compose build
Using docker-compose makes it easy enough to create an architecture that looks like the following diagram:
Such an architecture provides failover capabilities, though it does not provide with High Availability of both the Postgres service and the data. See the Multi-node Architectures chapter of our docs to understand more about this.
To create a cluster we use the following docker-compose definition:
version: "3.9" # optional since v1.27.0 services:
app:
build:
context: .
dockerfile: Dockerfile.app
environment:
PGUSER: tutorial
PGDATABASE: tutorial
PGHOST: node1,node2
PGPORT: 5432
PGAPPNAME: tutorial
PGSSLMODE: require
PGTARGETSESSIONATTRS: read-write
monitor:
image: pg_auto_failover:tutorial
volumes:
- ./monitor:/var/lib/postgres
environment:
PGDATA: /var/lib/postgres/pgaf
PG_AUTOCTL_SSL_SELF_SIGNED: true
expose:
- 5432
command: |
pg_autoctl create monitor --auth trust --run
node1:
image: pg_auto_failover:tutorial
hostname: node1
volumes:
- ./node1:/var/lib/postgres
environment:
PGDATA: /var/lib/postgres/pgaf
PGUSER: tutorial
PGDATABASE: tutorial
PG_AUTOCTL_HBA_LAN: true
PG_AUTOCTL_AUTH_METHOD: "trust"
PG_AUTOCTL_SSL_SELF_SIGNED: true
PG_AUTOCTL_MONITOR: "postgresql://autoctl_node@monitor/pg_auto_failover"
expose:
- 5432
command: |
pg_autoctl create postgres --name node1 --pg-hba-lan --run
node2:
image: pg_auto_failover:tutorial
hostname: node2
volumes:
- ./node2:/var/lib/postgres
environment:
PGDATA: /var/lib/postgres/pgaf
PGUSER: tutorial
PGDATABASE: tutorial
PG_AUTOCTL_HBA_LAN: true
PG_AUTOCTL_AUTH_METHOD: "trust"
PG_AUTOCTL_SSL_SELF_SIGNED: true
PG_AUTOCTL_MONITOR: "postgresql://autoctl_node@monitor/pg_auto_failover"
expose:
- 5432
command: |
pg_autoctl create postgres --name node2 --pg-hba-lan --run
To run the full Citus cluster with HA from this definition, we can use the following command:
$ docker-compose up
The command above starts the services up. The first service is the monitor and is created with the command pg_autoctl create monitor. The options for this command are exposed in the environment, and could have been specified on the command line too:
$ pg_autoctl create postgres --ssl-self-signed --auth trust --pg-hba-lan --run
While the Postgres nodes are being provisionned by docker-compose, you can run the following command and have a dynamic dashboard to follow what's happening. The following command is like top for pg_auto_failover:
$ docker-compose exec monitor pg_autoctl watch
After a little while, you can run the pg_autoctl show state command and see a stable result:
$ docker-compose exec monitor pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ------+-------+------------+----------------+--------------+---------------------+-------------------- node2 | 1 | node2:5432 | 1: 0/3000148 | read-write | primary | primary node1 | 2 | node1:5432 | 1: 0/3000148 | read-only | secondary | secondary
We can review the available Postgres URIs with the pg_autoctl show uri command:
$ docker-compose exec monitor pg_autoctl show uri
Type | Name | Connection String
-------------+---------+-------------------------------
monitor | monitor | postgres://autoctl_node@58053a02af03:5432/pg_auto_failover?sslmode=require
formation | default | postgres://node2:5432,node1:5432/tutorial?target_session_attrs=read-write&sslmode=require
Let's create a database schema with a single table, and some data in there.
$ docker-compose exec app psql
-- in psql CREATE TABLE companies (
id bigserial PRIMARY KEY,
name text NOT NULL,
image_url text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL );
Next download and ingest some sample data, still from within our psql session:
\copy companies from program 'curl -o- https://examples.citusdata.com/mt_ref_arch/companies.csv' with csv ( COPY 75 )
When using pg_auto_failover, it is possible (and easy) to trigger a failover without having to orchestrate an incident, or power down the current primary.
$ docker-compose exec monitor pg_autoctl perform switchover 14:57:16 992 INFO Waiting 60 secs for a notification with state "primary" in formation "default" and group 0 14:57:16 992 INFO Listening monitor notifications about state changes in formation "default" and group 0 14:57:16 992 INFO Following table displays times when notifications are received
Time | Name | Node | Host:Port | Current State | Assigned State ---------+-------+-------+------------+---------------------+-------------------- 14:57:16 | node2 | 1 | node2:5432 | primary | draining 14:57:16 | node1 | 2 | node1:5432 | secondary | prepare_promotion 14:57:16 | node1 | 2 | node1:5432 | prepare_promotion | prepare_promotion 14:57:16 | node1 | 2 | node1:5432 | prepare_promotion | stop_replication 14:57:16 | node2 | 1 | node2:5432 | primary | demote_timeout 14:57:17 | node2 | 1 | node2:5432 | draining | demote_timeout 14:57:17 | node2 | 1 | node2:5432 | demote_timeout | demote_timeout 14:57:19 | node1 | 2 | node1:5432 | stop_replication | stop_replication 14:57:19 | node1 | 2 | node1:5432 | stop_replication | wait_primary 14:57:19 | node2 | 1 | node2:5432 | demote_timeout | demoted 14:57:19 | node2 | 1 | node2:5432 | demoted | demoted 14:57:19 | node1 | 2 | node1:5432 | wait_primary | wait_primary 14:57:19 | node2 | 1 | node2:5432 | demoted | catchingup 14:57:26 | node2 | 1 | node2:5432 | demoted | catchingup 14:57:38 | node2 | 1 | node2:5432 | demoted | catchingup 14:57:39 | node2 | 1 | node2:5432 | catchingup | catchingup 14:57:39 | node2 | 1 | node2:5432 | catchingup | secondary 14:57:39 | node2 | 1 | node2:5432 | secondary | secondary 14:57:40 | node1 | 2 | node1:5432 | wait_primary | primary 14:57:40 | node1 | 2 | node1:5432 | primary | primary
The new state after the failover looks like the following:
$ docker-compose exec monitor pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ------+-------+------------+----------------+--------------+---------------------+-------------------- node2 | 1 | node2:5432 | 2: 0/5002698 | read-only | secondary | secondary node1 | 2 | node1:5432 | 2: 0/5002698 | read-write | primary | primary
And we can verify that we still have the data available:
docker-compose exec app psql -c "select count(*) from companies"
count
-------
75
(1 row)
As mentioned in the first section of this tutorial, the way we use docker-compose here is not meant to be production ready. It's useful to understand and play with a distributed system such as Postgres multiple nodes sytem and failovers.
See the command pg_autoctl do tmux compose session for more details about how to run a docker-compose test environment with docker-compose, including external volumes for each node.
See also the complete Azure VMs Tutorial for a guide on how-to provision an Azure network and then Azure VMs with pg_auto_failover, including systemd coverage and a failover triggered by stopping a full VM.
In this guide we’ll create a primary and secondary Postgres node and set up pg_auto_failover to replicate data between them. We’ll simulate failure in the primary node and see how the system smoothly switches (fails over) to the secondary.
For illustration, we'll run our databases on virtual machines in the Azure platform, but the techniques here are relevant to any cloud provider or on-premise network. We'll use four virtual machines: a primary database, a secondary database, a monitor, and an "application." The monitor watches the other nodes’ health, manages global state, and assigns nodes their roles.
Our database machines need to talk to each other and to the monitor node, so let's create a virtual network.
az group create \
--name ha-demo \
--location eastus az network vnet create \
--resource-group ha-demo \
--name ha-demo-net \
--address-prefix 10.0.0.0/16
We need to open ports 5432 (Postgres) and 22 (SSH) between the machines, and also give ourselves access from our remote IP. We'll do this with a network security group and a subnet.
az network nsg create \
--resource-group ha-demo \
--name ha-demo-nsg az network nsg rule create \
--resource-group ha-demo \
--nsg-name ha-demo-nsg \
--name ha-demo-ssh-and-pg \
--access allow \
--protocol Tcp \
--direction Inbound \
--priority 100 \
--source-address-prefixes `curl ifconfig.me` 10.0.1.0/24 \
--source-port-range "*" \
--destination-address-prefix "*" \
--destination-port-ranges 22 5432 az network vnet subnet create \
--resource-group ha-demo \
--vnet-name ha-demo-net \
--name ha-demo-subnet \
--address-prefixes 10.0.1.0/24 \
--network-security-group ha-demo-nsg
Finally add four virtual machines (ha-demo-a, ha-demo-b, ha-demo-monitor, and ha-demo-app). For speed we background the az vm create processes and run them in parallel:
# create VMs in parallel for node in monitor a b app do az vm create \
--resource-group ha-demo \
--name ha-demo-${node} \
--vnet-name ha-demo-net \
--subnet ha-demo-subnet \
--nsg ha-demo-nsg \
--public-ip-address ha-demo-${node}-ip \
--image debian \
--admin-username ha-admin \
--generate-ssh-keys & done wait
To make it easier to SSH into these VMs in future steps, let's make a shell function to retrieve their IP addresses:
# run this in your local shell as well vm_ip () {
az vm list-ip-addresses -g ha-demo -n ha-demo-$1 -o tsv \
--query '[] [] .virtualMachine.network.publicIpAddresses[0].ipAddress' } # for convenience with ssh for node in monitor a b app do ssh-keyscan -H `vm_ip $node` >> ~/.ssh/known_hosts done
Let's review what we created so far.
az resource list --output table --query \
"[?resourceGroup=='ha-demo'].{ name: name, flavor: kind, resourceType: type, region: location }"
This shows the following resources:
Name ResourceType Region ------------------------------- ----------------------------------------------------- -------- ha-demo-a Microsoft.Compute/virtualMachines eastus ha-demo-app Microsoft.Compute/virtualMachines eastus ha-demo-b Microsoft.Compute/virtualMachines eastus ha-demo-monitor Microsoft.Compute/virtualMachines eastus ha-demo-appVMNic Microsoft.Network/networkInterfaces eastus ha-demo-aVMNic Microsoft.Network/networkInterfaces eastus ha-demo-bVMNic Microsoft.Network/networkInterfaces eastus ha-demo-monitorVMNic Microsoft.Network/networkInterfaces eastus ha-demo-nsg Microsoft.Network/networkSecurityGroups eastus ha-demo-a-ip Microsoft.Network/publicIPAddresses eastus ha-demo-app-ip Microsoft.Network/publicIPAddresses eastus ha-demo-b-ip Microsoft.Network/publicIPAddresses eastus ha-demo-monitor-ip Microsoft.Network/publicIPAddresses eastus ha-demo-net Microsoft.Network/virtualNetworks eastus
This guide uses Debian Linux, but similar steps will work on other distributions. All that differs are the packages and paths. See Installing pg_auto_failover.
The pg_auto_failover system is distributed as a single pg_autoctl binary with subcommands to initialize and manage a replicated PostgreSQL service. We’ll install the binary with the operating system package manager on all nodes. It will help us run and observe PostgreSQL.
for node in monitor a b app do az vm run-command invoke \
--resource-group ha-demo \
--name ha-demo-${node} \
--command-id RunShellScript \
--scripts \
"sudo touch /home/ha-admin/.hushlogin" \
"curl https://install.citusdata.com/community/deb.sh | sudo bash" \
"sudo DEBIAN_FRONTEND=noninteractive apt-get install -q -y postgresql-common" \
"echo 'create_main_cluster = false' | sudo tee -a /etc/postgresql-common/createcluster.conf" \
"sudo DEBIAN_FRONTEND=noninteractive apt-get install -q -y postgresql-11-auto-failover-1.4" \
"sudo usermod -a -G postgres ha-admin" & done wait
The pg_auto_failover monitor is the first component to run. It periodically attempts to contact the other nodes and watches their health. It also maintains global state that “keepers” on each node consult to determine their own roles in the system.
# on the monitor virtual machine ssh -l ha-admin `vm_ip monitor` -- \
pg_autoctl create monitor \
--auth trust \
--ssl-self-signed \
--pgdata monitor \
--pgctl /usr/lib/postgresql/11/bin/pg_ctl
This command initializes a PostgreSQL cluster at the location pointed by the --pgdata option. When --pgdata is omitted, pg_autoctl attempts to use the PGDATA environment variable. If a PostgreSQL instance had already existing in the destination directory, this command would have configured it to serve as a monitor.
pg_auto_failover, installs the pgautofailover Postgres extension, and grants access to a new autoctl_node user.
In the Quick Start we use --auth trust to avoid complex security settings. The Postgres trust authentication method is not considered a reasonable choice for production environments. Consider either using the --skip-pg-hba option or --auth scram-sha-256 and then setting up passwords yourself.
At this point the monitor is created. Now we'll install it as a service with systemd so that it will resume if the VM restarts.
ssh -T -l ha-admin `vm_ip monitor` << CMD
pg_autoctl -q show systemd --pgdata ~ha-admin/monitor > pgautofailover.service
sudo mv pgautofailover.service /etc/systemd/system
sudo systemctl daemon-reload
sudo systemctl enable pgautofailover
sudo systemctl start pgautofailover CMD
We’ll create the primary database using the pg_autoctl create subcommand.
ssh -l ha-admin `vm_ip a` -- \
pg_autoctl create postgres \
--pgdata ha \
--auth trust \
--ssl-self-signed \
--username ha-admin \
--dbname appdb \
--hostname ha-demo-a.internal.cloudapp.net \
--pgctl /usr/lib/postgresql/11/bin/pg_ctl \
--monitor 'postgres://autoctl_node@ha-demo-monitor.internal.cloudapp.net/pg_auto_failover?sslmode=require'
Notice the user and database name in the monitor connection string -- these are what monitor init created. We also give it the path to pg_ctl so that the keeper will use the correct version of pg_ctl in future even if other versions of postgres are installed on the system.
In the example above, the keeper creates a primary database. It chooses to set up node A as primary because the monitor reports there are no other nodes in the system yet. This is one example of how the keeper is state-based: it makes observations and then adjusts its state, in this case from "init" to "single."
Also add a setting to trust connections from our "application" VM:
ssh -T -l ha-admin `vm_ip a` << CMD
echo 'hostssl "appdb" "ha-admin" ha-demo-app.internal.cloudapp.net trust' \
>> ~ha-admin/ha/pg_hba.conf CMD
At this point the monitor and primary node are created and running. Next we need to run the keeper. It’s an independent process so that it can continue operating even if the PostgreSQL process goes terminates on the node. We'll install it as a service with systemd so that it will resume if the VM restarts.
ssh -T -l ha-admin `vm_ip a` << CMD
pg_autoctl -q show systemd --pgdata ~ha-admin/ha > pgautofailover.service
sudo mv pgautofailover.service /etc/systemd/system
sudo systemctl daemon-reload
sudo systemctl enable pgautofailover
sudo systemctl start pgautofailover CMD
Next connect to node B and do the same process. We'll do both steps at once:
ssh -l ha-admin `vm_ip b` -- \
pg_autoctl create postgres \
--pgdata ha \
--auth trust \
--ssl-self-signed \
--username ha-admin \
--dbname appdb \
--hostname ha-demo-b.internal.cloudapp.net \
--pgctl /usr/lib/postgresql/11/bin/pg_ctl \
--monitor 'postgres://autoctl_node@ha-demo-monitor.internal.cloudapp.net/pg_auto_failover?sslmode=require' ssh -T -l ha-admin `vm_ip b` << CMD
pg_autoctl -q show systemd --pgdata ~ha-admin/ha > pgautofailover.service
sudo mv pgautofailover.service /etc/systemd/system
sudo systemctl daemon-reload
sudo systemctl enable pgautofailover
sudo systemctl start pgautofailover CMD
It discovers from the monitor that a primary exists, and then switches its own state to be a hot standby and begins streaming WAL contents from the primary.
For convenience, pg_autoctl modifies each node's pg_hba.conf file to allow the nodes to connect to one another. For instance, pg_autoctl added the following lines to node A:
# automatically added to node A hostssl "appdb" "ha-admin" ha-demo-a.internal.cloudapp.net trust hostssl replication "pgautofailover_replicator" ha-demo-b.internal.cloudapp.net trust hostssl "appdb" "pgautofailover_replicator" ha-demo-b.internal.cloudapp.net trust
For pg_hba.conf on the monitor node pg_autoctl inspects the local network and makes its best guess about the subnet to allow. In our case it guessed correctly:
# automatically added to the monitor hostssl "pg_auto_failover" "autoctl_node" 10.0.1.0/24 trust
If worker nodes have more ad-hoc addresses and are not in the same subnet, it's better to disable pg_autoctl's automatic modification of pg_hba using the --skip-pg-hba command line option during creation. You will then need to edit the hba file by hand. Another reason for manual edits would be to use special authentication methods.
First let’s verify that the monitor knows about our nodes, and see what states it has assigned them:
ssh -l ha-admin `vm_ip monitor` pg_autoctl show state --pgdata monitor Name | Node | Host:Port | LSN | Reachable | Current State | Assigned State -------+-------+--------------------------------------+-----------+-----------+---------------------+-------------------- node_1 | 1 | ha-demo-a.internal.cloudapp.net:5432 | 0/3000060 | yes | primary | primary node_2 | 2 | ha-demo-b.internal.cloudapp.net:5432 | 0/3000060 | yes | secondary | secondary
This looks good. We can add data to the primary, and later see it appear in the secondary. We'll connect to the database from inside our "app" virtual machine, using a connection string obtained from the monitor.
ssh -l ha-admin `vm_ip monitor` pg_autoctl show uri --pgdata monitor
Type | Name | Connection String -----------+---------+-------------------------------
monitor | monitor | postgres://autoctl_node@ha-demo-monitor.internal.cloudapp.net:5432/pg_auto_failover?sslmode=require
formation | default | postgres://ha-demo-b.internal.cloudapp.net:5432,ha-demo-a.internal.cloudapp.net:5432/appdb?target_session_attrs=read-write&sslmode=require
Now we'll get the connection string and store it in a local environment variable:
APP_DB_URI=$( \
ssh -l ha-admin `vm_ip monitor` \
pg_autoctl show uri --formation default --pgdata monitor \ )
The connection string contains both our nodes, comma separated, and includes the url parameter ?target_session_attrs=read-write telling psql that we want to connect to whichever of these servers supports reads and writes. That will be the primary server.
# connect to database via psql on the app vm and # create a table with a million rows ssh -l ha-admin -t `vm_ip app` -- \
psql "'$APP_DB_URI'" \
-c "'CREATE TABLE foo AS SELECT generate_series(1,1000000) bar;'"
Now that we've added data to node A, let's switch which is considered the primary and which the secondary. After the switch we'll connect again and query the data, this time from node B.
# initiate failover to node B ssh -l ha-admin -t `vm_ip monitor` \
pg_autoctl perform switchover --pgdata monitor
Once node B is marked "primary" (or "wait_primary") we can connect and verify that the data is still present:
# connect to database via psql on the app vm ssh -l ha-admin -t `vm_ip app` -- \
psql "'$APP_DB_URI'" \
-c "'SELECT count(*) FROM foo;'"
It shows
count ---------
1000000
This plot is too boring, time to introduce a problem. We’ll turn off VM for node B (currently the primary after our previous failover) and watch node A get promoted.
In one terminal let’s keep an eye on events:
ssh -t -l ha-admin `vm_ip monitor` -- \
watch -n 1 -d pg_autoctl show state --pgdata monitor
In another terminal we’ll turn off the virtual server.
az vm stop \
--resource-group ha-demo \
--name ha-demo-b
After a number of failed attempts to talk to node B, the monitor determines the node is unhealthy and puts it into the "demoted" state. The monitor promotes node A to be the new primary.
Name | Node | Host:Port | LSN | Reachable | Current State | Assigned State -------+-------+--------------------------------------+-----------+-----------+---------------------+-------------------- node_1 | 1 | ha-demo-a.internal.cloudapp.net:5432 | 0/6D4E068 | yes | wait_primary | wait_primary node_2 | 2 | ha-demo-b.internal.cloudapp.net:5432 | 0/6D4E000 | yes | demoted | catchingup
Node A cannot be considered in full "primary" state since there is no secondary present, but it can still serve client requests. It is marked as "wait_primary" until a secondary appears, to indicate that it's running without a backup.
Let's add some data while B is offline.
# notice how $APP_DB_URI continues to work no matter which node # is serving as primary ssh -l ha-admin -t `vm_ip app` -- \
psql "'$APP_DB_URI'" \
-c "'INSERT INTO foo SELECT generate_series(1000001, 2000000);'"
Run this command to bring node B back online:
az vm start \
--resource-group ha-demo \
--name ha-demo-b
Now the next time the keeper retries its health check, it brings the node back. Node B goes through the state "catchingup" while it updates its data to match A. Once that's done, B becomes a secondary, and A is now a full primary again.
Name | Node | Host:Port | LSN | Reachable | Current State | Assigned State -------+-------+--------------------------------------+------------+-----------+---------------------+-------------------- node_1 | 1 | ha-demo-a.internal.cloudapp.net:5432 | 0/12000738 | yes | primary | primary node_2 | 2 | ha-demo-b.internal.cloudapp.net:5432 | 0/12000738 | yes | secondary | secondary
What's more, if we connect directly to the database again, all two million rows are still present.
ssh -l ha-admin -t `vm_ip app` -- \
psql "'$APP_DB_URI'" \
-c "'SELECT count(*) FROM foo;'"
It shows
count ---------
2000000
pg_auto_failover is designed as a simple and robust way to manage automated Postgres failover in production. On-top of robust operations, pg_auto_failover setup is flexible and allows either Business Continuity or High Availability configurations. pg_auto_failover design includes configuration changes in a live system without downtime.
pg_auto_failover is designed to be able to handle a single PostgreSQL service using three nodes. In this setting, the system is resilient to losing any one of three nodes.
It is important to understand that when using only two Postgres nodes then pg_auto_failover is optimized for Business Continuity. In the event of losing a single node, pg_auto_failover is capable of continuing the PostgreSQL service, and prevents any data loss when doing so, thanks to PostgreSQL Synchronous Replication.
That said, there is a trade-off involved in this architecture. The business continuity bias relaxes replication guarantees for asynchronous replication in the event of a standby node failure. This allows the PostgreSQL service to accept writes when there's a single server available, and opens the service for potential data loss if the primary server were also to fail.
Each PostgreSQL node in pg_auto_failover runs a Keeper process which informs a central Monitor node about notable local changes. Some changes require the Monitor to orchestrate a correction across the cluster:
At initialization time, it's necessary to prepare the configuration of each node for PostgreSQL streaming replication, and get the cluster to converge to the nominal state with both a primary and a secondary node in each group. The monitor determines each new node's role
The monitor orchestrates a failover when it detects an unhealthy node. The design of pg_auto_failover allows the monitor to shut down service to a previously designated primary node without causing a "split-brain" situation.
The monitor is the authoritative node that manages global state and makes changes in the cluster by issuing commands to the nodes' keeper processes. A pg_auto_failover monitor node failure has limited impact on the system. While it prevents reacting to other nodes' failures, it does not affect replication. The PostgreSQL streaming replication setup installed by pg_auto_failover does not depend on having the monitor up and running.
pg_auto_failover handles a single PostgreSQL service with the following concepts:
The pg_auto_failover monitor is a service that keeps track of one or several formations containing groups of nodes.
The monitor is implemented as a PostgreSQL extension, so when you run the command pg_autoctl create monitor a PostgreSQL instance is initialized, configured with the extension, and started. The monitor service embeds a PostgreSQL instance.
A formation is a logical set of PostgreSQL services that are managed together.
It is possible to operate many formations with a single monitor instance. Each formation has a group of Postgres nodes and the FSM orchestration implemented by the monitor applies separately to each group.
A group of two PostgreSQL nodes work together to provide a single PostgreSQL service in a Highly Available fashion. A group consists of a PostgreSQL primary server and a secondary server setup with Hot Standby synchronous replication. Note that pg_auto_failover can orchestrate the whole setting-up of the replication for you.
In pg_auto_failover versions up to 1.3, a single Postgres group can contain only two Postgres nodes. Starting with pg_auto_failover 1.4, there's no limit to the number of Postgres nodes in a single group. Note that each Postgres instance that belongs to the same group serves the same dataset in its data directory (PGDATA).
NOTE:
The pg_auto_failover keeper is an agent that must be running on the same server where your PostgreSQL nodes are running. The keeper controls the local PostgreSQL instance (using both the pg_ctl command-line tool and SQL queries), and communicates with the monitor:
Also the keeper maintains local state that includes the most recent communication established with the monitor and the other PostgreSQL node of its group, enabling it to detect Network Partitions.
NOTE:
Starting in pg_auto_failover version 1.4, the keeper process (started with pg_autoctl run) runs the Postgres instance as a sub-process of the main pg_autoctl process, allowing tighter control over the Postgres execution. Running the sub-process also makes the solution work better both in container environments (because it's now a single process tree) and with systemd, because it uses a specific cgroup per service unit.
A node is a server (virtual or physical) that runs PostgreSQL instances and a keeper service. At any given time, any node might be a primary or a secondary Postgres instance. The whole point of pg_auto_failover is to decide this state.
As a result, refrain from naming your nodes with the role you intend for them. Their roles can change. If they didn't, your system wouldn't need pg_auto_failover!
A state is the representation of the per-instance and per-group situation. The monitor and the keeper implement a Finite State Machine to drive operations in the PostgreSQL groups; allowing pg_auto_failover to implement High Availability with the goal of zero data loss.
The keeper main loop enforces the current expected state of the local PostgreSQL instance, and reports the current state and some more information to the monitor. The monitor uses this set of information and its own health-check information to drive the State Machine and assign a goal state to the keeper.
The keeper implements the transitions between a current state and a monitor-assigned goal state.
Implementing client-side High Availability is included in PostgreSQL's driver libpq from version 10 onward. Using this driver, it is possible to specify multiple host names or IP addresses in the same connection string:
$ psql -d "postgresql://host1,host2/dbname?target_session_attrs=read-write" $ psql -d "postgresql://host1:port2,host2:port2/dbname?target_session_attrs=read-write" $ psql -d "host=host1,host2 port=port1,port2 target_session_attrs=read-write"
When using either of the syntax above, the psql application attempts to connect to host1, and when successfully connected, checks the target_session_attrs as per the PostgreSQL documentation of it:
When the connection attempt to host1 fails, or when the target_session_attrs can not be verified, then the psql application attempts to connect to host2.
The behavior is implemented in the connection library libpq, so any application using it can benefit from this implementation, not just psql.
When using pg_auto_failover, configure your application connection string to use the primary and the secondary server host names, and set target_session_attrs=read-write too, so that your application automatically connects to the current primary, even after a failover occurred.
The monitor interacts with the data nodes in 2 ways:
When a data node calls node_active, the state of the node is stored in the pgautofailover.node table and the state machines of both nodes are progressed. The state machines are described later in this readme. The monitor typically only moves one state forward and waits for the node(s) to converge except in failure states.
If a node is not communicating to the monitor, it will either cause a failover (if node is a primary), disabling synchronous replication (if node is a secondary), or cause the state machine to pause until the node comes back (other cases). In most cases, the latter is harmless, though in some cases it may cause downtime to last longer, e.g. if a standby goes down during a failover.
To simplify operations, a node is only considered unhealthy if the monitor cannot connect and it hasn't reported its state through node_active for a while. This allows, for example, PostgreSQL to be restarted without causing a health check failure.
By default, pg_auto_failover uses synchronous replication, which means all writes block until at least one standby node has reported receiving them. To handle cases in which the standby fails, the primary switches between two states called wait_primary and primary based on the health of standby nodes, and based on the replication setting number_sync_standby.
When in the wait_primary state, synchronous replication is disabled by automatically setting synchronous_standby_names = '' to allow writes to proceed. However doing so also disables failover, since the standby might get arbitrarily far behind. If the standby is responding to health checks and within 1 WAL segment of the primary (by default), synchronous replication is enabled again on the primary by setting synchronous_standby_names = '*' which may cause a short latency spike since writes will then block until the standby has caught up.
When using several standby nodes with replication quorum enabled, the actual setting for synchronous_standby_names is set to a list of those standby nodes that are set to participate to the replication quorum.
If you wish to disable synchronous replication, you need to add the following to postgresql.conf:
synchronous_commit = 'local'
This ensures that writes return as soon as they are committed on the primary -- under all circumstances. In that case, failover might lead to some data loss, but failover is not initiated if the secondary is more than 10 WAL segments (by default) behind on the primary. During a manual failover, the standby will continue accepting writes from the old primary. The standby will stop accepting writes only if it's fully caught up (most common), the primary fails, or it does not receive writes for 2 minutes.
In some cases the performance impact on write latency when setting synchronous replication makes the application fail to deliver expected performance. If testing or production feedback shows this to be the case, it is beneficial to switch to using asynchronous replication.
The way to use asynchronous replication in pg_auto_failover is to change the synchronous_commit setting. This setting can be set per transaction, per session, or per user. It does not have to be set globally on your Postgres instance.
One way to benefit from that would be:
alter role fast_and_loose set synchronous_commit to local;
That way performance-critical parts of the application don't have to wait for the standby nodes. Only use this when you can also lower your data durability guarantees.
When bringing a node back after a failover, the keeper (pg_autoctl run) can simply be restarted. It will also restart postgres if needed and obtain its goal state from the monitor. If the failed node was a primary and was demoted, it will learn this from the monitor. Once the node reports, it is allowed to come back as a standby by running pg_rewind. If it is too far behind, the node performs a new pg_basebackup.
Pg_auto_failover allows you to have more than one standby node, and offers advanced control over your production architecture characteristics.
When adding your second standby node with default settings, you get the following architecture:
In this case, three nodes get set up with the same characteristics, achieving HA for both the Postgres service and the production dataset. An important setting for this architecture is number_sync_standbys.
The replication setting number_sync_standbys sets how many standby nodes the primary should wait for when committing a transaction. In order to have a good availability in your system, pg_auto_failover requires number_sync_standbys + 1 standby nodes participating in the replication quorum: this allows any standby node to fail without impact on the system's ability to respect the replication quorum.
When only two nodes are registered in a group on the monitor we have a primary and a single secondary node. Then number_sync_standbys can only be set to zero. When adding a second standby node to a pg_auto_failover group, then the monitor automatically increments number_sync_standbys to one, as we see in the diagram above.
When number_sync_standbys is set to zero then pg_auto_failover implements the Business Continuity setup as seen in Architecture Basics: synchronous replication is then used as a way to guarantee that failover can be implemented without data loss.
In more details:
When one of the standby nodes is unavailable, the second copy of the dataset can still be maintained thanks to the remaining standby.
When both the standby nodes are unavailable, then it's no longer possible to guarantee the replication quorum, and thus writes on the primary are blocked. The Postgres primary node waits until at least one standby node acknowledges the transactions locally committed, thus degrading your Postgres service to read-only.
In that case, when the second standby node becomes unhealthy at the same time as the first standby node, the primary node is assigned the state Wait_primary. In that state, synchronous replication is disabled on the primary by setting synchronous_standby_names to an empty string. Writes are allowed on the primary, even though there's no extra copy of the production dataset available at this time.
Setting number_sync_standbys to zero allows data to be written even when both standby nodes are down. In this case, a single copy of the production data set is kept and, if the primary was then to fail, some data will be lost. How much depends on your backup and recovery mechanisms.
The entire flexibility of pg_auto_failover can be leveraged with the following three replication settings:
This parameter is used by Postgres in the synchronous_standby_names parameter: number_sync_standby is the number of synchronous standbys for whose replies transactions must wait.
This parameter can be set at the formation level in pg_auto_failover, meaning that it applies to the current primary, and "follows" a failover to apply to any new primary that might replace the current one.
To set this parameter to the value <n>, use the following command:
pg_autoctl set formation number-sync-standbys <n>
The default value in pg_auto_failover is zero. When set to zero, the Postgres parameter synchronous_standby_names can be set to either '*' or to '':
pg_autofailover uses synchronous_standby_names = '*' when there's at least one standby that is known to be healthy.
pg_autofailover uses synchronous_standby_names = '' only when number_sync_standbys is set to zero and there's no standby node known healthy by the monitor.
In order to set number_sync_standbys to a non-zero value, pg_auto_failover requires that at least number_sync_standbys + 1 standby nodes be registered in the system.
When the first standby node is added to the pg_auto_failover monitor, the only acceptable value for number_sync_standbys is zero. When a second standby is added that participates in the replication quorum, then number_sync_standbys is automatically set to one.
The command pg_autoctl set formation number-sync-standbys can be used to change the value of this parameter in a formation, even when all the nodes are already running in production. The pg_auto_failover monitor then sets a transition for the primary to update its local value of synchronous_standby_names.
The replication quorum setting is a boolean and defaults to true, and can be set per-node. Pg_auto_failover includes a given node in synchronous_standby_names only when the replication quorum parameter has been set to true. This means that asynchronous replication will be used for nodes where replication-quorum is set to false.
It is possible to force asynchronous replication globally by setting replication quorum to false on all the nodes in a formation. Remember that failovers will happen, and thus to set your replication settings on the current primary node too when needed: it is going to be a standby later.
To set this parameter to either true or false, use one of the following commands:
pg_autoctl set node replication-quorum true pg_autoctl set node replication-quorum false
The candidate priority setting is an integer that can be set to any value between 0 (zero) and 100 (one hundred). The default value is 50. When the pg_auto_failover monitor decides to orchestrate a failover, it uses each node's candidate priority to pick the new primary node.
When setting the candidate priority of a node down to zero, this node will never be selected to be promoted as the new primary when a failover is orchestrated by the monitor. The monitor will instead wait until another node registered is healthy and in a position to be promoted.
To set this parameter to the value <n>, use the following command:
pg_autoctl set node candidate-priority <n>
When nodes have the same candidate priority, the monitor then picks the standby with the most advanced LSN position published to the monitor. When more than one node has published the same LSN position, a random one is chosen.
When the candidate for failover has not published the most advanced LSN position in the WAL, pg_auto_failover orchestrates an intermediate step in the failover mechanism. The candidate fetches the missing WAL bytes from one of the standby with the most advanced LSN position prior to being promoted. Postgres allows this operation thanks to cascading replication: any standby can be the upstream node for another standby.
It is required at all times that at least two nodes have a non-zero candidate priority in any pg_auto_failover formation. Otherwise no failover is possible.
The command pg_autoctl get formation settings (also known as pg_autoctl show settings) can be used to obtain a summary of all the replication settings currently in effect in a formation. Still using the first diagram on this page, we get the following summary:
$ pg_autoctl get formation settings
Context | Name | Setting | Value ----------+---------+---------------------------+------------------------------------------------------------- formation | default | number_sync_standbys | 1
primary | node_A | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_3, pgautofailover_standby_2)'
node | node_A | replication quorum | true
node | node_B | replication quorum | true
node | node_C | replication quorum | true
node | node_A | candidate priority | 50
node | node_B | candidate priority | 50
node | node_C | candidate priority | 50
We can see that the number_sync_standbys has been used to compute the current value of the synchronous_standby_names setting on the primary.
Because all the nodes in that example have the same default candidate priority (50), then pg_auto_failover is using the form ANY 1 with the list of standby nodes that are currently participating in the replication quorum.
The entries in the synchronous_standby_names list are meant to match the application_name connection setting used in the primary_conninfo, and the format used by pg_auto_failover there is the format string "pgautofailover_standby_%d" where %d is replaced by the node id. This allows keeping the same connection string to the primary when the node name is changed (using the command pg_autoctl set metadata --name).
Here we can see the node id of each registered Postgres node with the following command:
$ pg_autoctl show state
Name | Node | Host:Port | LSN | Reachable | Current State | Assigned State -------+-------+----------------+-----------+-----------+---------------------+-------------------- node_A | 1 | localhost:5001 | 0/7002310 | yes | primary | primary node_B | 2 | localhost:5002 | 0/7002310 | yes | secondary | secondary node_C | 3 | localhost:5003 | 0/7002310 | yes | secondary | secondary
When setting pg_auto_failover with per formation number_sync_standby and then per node replication quorum and candidate priority replication settings, those properties are then used to compute the synchronous_standby_names value on the primary node. This value is automatically maintained on the primary by pg_auto_failover, and is updated either when replication settings are changed or when a failover happens.
The other situation when the pg_auto_failover replication settings are used is a candidate election when a failover happens and there is more than two nodes registered in a group. Then the node with the highest candidate priority is selected, as detailed above in the Candidate Priority section.
When setting the three parameters above, it's possible to design very different Postgres architectures for your production needs.
In this case, the system is set up with three standby nodes all set the same way, with default parameters. The default parameters support setting number_sync_standbys = 2. This means that Postgres will maintain three copies of the production data set at all times.
On the other hand, if two standby nodes were to fail at the same time, despite the fact that two copies of the data are still maintained, the Postgres service would be degraded to read-only.
With this architecture diagram, here's the summary that we obtain:
$ pg_autoctl show settings
Context | Name | Setting | Value ----------+---------+---------------------------+--------------------------------------------------------------------------------------- formation | default | number_sync_standbys | 2
primary | node_A | synchronous_standby_names | 'ANY 2 (pgautofailover_standby_2, pgautofailover_standby_4, pgautofailover_standby_3)'
node | node_A | replication quorum | true
node | node_B | replication quorum | true
node | node_C | replication quorum | true
node | node_D | replication quorum | true
node | node_A | candidate priority | 50
node | node_B | candidate priority | 50
node | node_C | candidate priority | 50
node | node_D | candidate priority | 50
In this case, the system is set up with two standby nodes participating in the replication quorum, allowing for number_sync_standbys = 1. The system always maintains at least two copies of the data set, one on the primary, another on either node B or node C. Whenever we lose one of those nodes, we can hold to the guarantee of having two copies of the data set.
Additionally, we have the standby server D which has been set up to not participate in the replication quorum. Node D will not be found in the synchronous_standby_names list of nodes. Also, node D is set up to never be a candidate for failover, with candidate-priority = 0.
This architecture would fit a situation with nodes A, B, and C are deployed in the same data center or availability zone and node D in another one. Those three nodes are set up to support the main production traffic and implement high availability of both the Postgres service and the data set.
Node D might be set up for Business Continuity in case the first data center is lost, or maybe for reporting needs on another application domain.
With this architecture diagram, here's the summary that we obtain:
pg_autoctl show settings
Context | Name | Setting | Value ----------+---------+---------------------------+------------------------------------------------------------- formation | default | number_sync_standbys | 1
primary | node_A | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)'
node | node_A | replication quorum | true
node | node_B | replication quorum | true
node | node_C | replication quorum | true
node | node_D | replication quorum | false
node | node_A | candidate priority | 50
node | node_B | candidate priority | 50
node | node_C | candidate priority | 50
node | node_D | candidate priority | 0
The usual pg_autoctl commands work both with Postgres standalone nodes and with Citus nodes.
When using pg_auto_failover with Citus, a pg_auto_failover formation is composed of a coordinator and a set of worker nodes.
When High-Availability is enabled at the formation level, which is the default, then a minimum of two coordinator nodes are required: a primary and a secondary coordinator to be able to orchestrate a failover when needed.
The same applies to the worker nodes: when using pg_auto_failover for Citus HA, then each worker node is a pg_auto_failover group in the formation, and each worker group is setup with at least two nodes (primary, secondary).
Have a look at our documentation of Citus Cluster Quick Start for more details with a full tutorial setup on a single VM, for testing and QA.
When setting up Citus with pg_auto_failover, the following Citus specific commands are provided. Other pg_autoctl commands work as usual when deploying a Citus formation, so that you can use the rest of this documentation to operate your Citus deployments.
This creates a Citus coordinator, that is to say a Postgres node with the Citus extension loaded and ready to act as a coordinator. The coordinator is always places in the pg_auto_failover group zero of a given formation.
See pg_autoctl create coordinator for details.
IMPORTANT:
Citus does not support multiple databases, you have to use the database where Citus is created. When using Citus, that is essential to the well behaving of worker failover.
This command creates a new Citus worker node, that is to say a Postgres node with the Citus extensions loaded, and registered to the Citus coordinator created with the previous command. Because the Citus coordinator is always given group zero, the pg_auto_failover monitor knows how to reach the Citus coordinator and automate workers registration.
The default worker creation policy is to assign the primary role to the first worker registered, then secondary in the same group, then primary in a new group, etc.
If you want to extend an existing group to have a third worker node in the same group, enabling multiple-standby capabilities in your setup, then make sure to use the --group option to the pg_autoctl create worker command.
See pg_autoctl create worker for details.
This command calls the Citus “activation” API so that a node can be used to host shards for your reference and distributed tables.
When creating a Citus worker, pg_autoctl create worker automatically activates the worker node to the coordinator. You only need this command when something unexpected have happened and you want to manually make sure the worker node has been activated at the Citus coordinator level.
Starting with Citus 10 it is also possible to activate the coordinator itself as a node with shard placement. Use pg_autoctl activate on your Citus coordinator node manually to use that feature.
When the Citus coordinator is activated, an extra step is then needed for it to host shards of distributed tables. If you want your coordinator to have shards, then have a look at the Citus API citus_set_node_property to set the shouldhaveshards property to true.
See pg_autoctl activate for details.
When a failover is orchestrated by pg_auto_failover for a Citus worker node, Citus offers the opportunity to make the failover close to transparent to the application. Because the application connects to the coordinator, which in turn connects to the worker nodes, then it is possible with Citus to _pause_ the SQL write traffic on the coordinator for the shards hosted on a failed worker node. The Postgres failover then happens while the traffic is kept on the coordinator, and resumes as soon as a secondary worker node is ready to accept read-write queries.
This is implemented thanks to Citus smart locking strategy in its citus_update_node API, and pg_auto_failover takes full benefit with a special built set of FSM transitions for Citus workers.
It is possible to setup Citus read-only replicas. This Citus feature allows using a set of dedicated nodes (both coordinator and workers) to serve read-only traffic, such as reporting, analytics, or other parts of your workload that are read-only.
Citus read-replica nodes are a solution for load-balancing. Those nodes can't be used as HA failover targets, and thus have their candidate-priority set to zero. This setting of a read-replica can not be changed later.
This setup is done by setting the Citus property citus.use_secondary_nodes to always (it defaults to never), and the Citus property citus.cluster_name to your read-only cluster name.
Both of those settings are entirely supported and managed by pg_autoctl when using the --citus-secondary --cluster-name cluster_d options to the pg_autoctl create coordinator|worker commands.
Here is an example where we have created a formation with three nodes for HA for the coordinator (one primary and two secondary nodes), then a single worker node with the same three nodes setup for HA, and then one read-replica environment on-top of that, for a total of 8 nodes:
$ pg_autoctl show state
Name | Node | Host:Port | LSN | Reachable | Current State | Assigned State ---------+-------+----------------+-----------+-----------+---------------------+--------------------
coord0a | 0/1 | localhost:5501 | 0/5003298 | yes | primary | primary
coord0b | 0/3 | localhost:5502 | 0/5003298 | yes | secondary | secondary
coord0c | 0/6 | localhost:5503 | 0/5003298 | yes | secondary | secondary
coord0d | 0/7 | localhost:5504 | 0/5003298 | yes | secondary | secondary worker1a | 1/2 | localhost:5505 | 0/4000170 | yes | primary | primary worker1b | 1/4 | localhost:5506 | 0/4000170 | yes | secondary | secondary worker1c | 1/5 | localhost:5507 | 0/4000170 | yes | secondary | secondary reader1d | 1/8 | localhost:5508 | 0/4000170 | yes | secondary | secondary
Nodes named coord0d and reader1d are members of the read-replica cluster cluster_d. We can see that a read-replica cluster needs a dedicated coordinator and then one dedicated worker node per group.
TIP:
$ pg_autoctl create worker --name ... $ pg_autoctl set node metadata --name ...
Here coord0d is the node name for the dedicated coordinator for cluster_d, and reader1d is the node name for the dedicated worker for cluster_d in the worker group 1 (the only worker group in that setup).
Now the usual command to show the connection strings for your application is listing the read-replica setup that way:
$ pg_autoctl show uri
Type | Name | Connection String -------------+-----------+-------------------------------
monitor | monitor | postgres://autoctl_node@localhost:5500/pg_auto_failover?sslmode=prefer
formation | default | postgres://localhost:5503,localhost:5501,localhost:5502/postgres?target_session_attrs=read-write&sslmode=prefer read-replica | cluster_d | postgres://localhost:5504/postgres?sslmode=prefer
Given that setup, your application can now use the formation default Postgres URI to connect to the highly-available read-write service, or to the read-replica cluster_d service to connect to the read-only replica where you can offload some of your SQL workload.
When connecting to the cluster_d connection string, the Citus properties citus.use_secondary_nodes and citus.cluster_name are automatically setup to their expected values, of course.
In this guide we’ll create a Citus cluster with a coordinator node and three workers. Every node will have a secondary for failover. We’ll simulate failure in the coordinator and worker nodes and see how the system continues to function.
This tutorial uses docker-compose in order to separate the architecture design from some of the implementation details. This allows reasonning at the architecture level within this tutorial, and better see which software component needs to be deployed and run on which node.
The setup provided in this tutorial is good for replaying at home in the lab. It is not intended to be production ready though. In particular, no attention have been spent on volume management. After all, this is a tutorial: the goal is to walk through the first steps of using pg_auto_failover to provide HA to a Citus formation.
When using docker-compose we describe a list of services, each service may run on one or more nodes, and each service just runs a single isolated process in a container.
Within the context of a tutorial, or even a development environment, this matches very well to provisioning separate physical machines on-prem, or Virtual Machines either on-prem on in a Cloud service.
The docker image used in this tutorial is named pg_auto_failover:citus. It can be built locally when using the attached Dockerfile found within the GitHub repository for pg_auto_failover.
To build the image, either use the provided Makefile and run make build, or run the docker build command directly:
$ git clone https://github.com/citusdata/pg_auto_failover $ cd pg_auto_failover/docs/cluster $ docker build -t pg_auto_failover:citus -f Dockerfile ../.. $ docker-compose build
To create a cluster we use the following docker-compose definition:
version: "3.9" # optional since v1.27.0 services:
monitor:
image: pg_auto_failover:citus
environment:
PGDATA: /tmp/pgaf
command: |
pg_autoctl create monitor --ssl-self-signed --auth trust --run
expose:
- 5432
coord:
image: pg_auto_failover:citus
environment:
PGDATA: /tmp/pgaf
PGUSER: citus
PGDATABASE: citus
PG_AUTOCTL_MONITOR: "postgresql://autoctl_node@monitor/pg_auto_failover"
expose:
- 5432
command: |
pg_autoctl create coordinator --ssl-self-signed --auth trust --pg-hba-lan --run
worker:
image: pg_auto_failover:citus
environment:
PGDATA: /tmp/pgaf
PGUSER: citus
PGDATABASE: citus
PG_AUTOCTL_MONITOR: "postgresql://autoctl_node@monitor/pg_auto_failover"
expose:
- 5432
command: |
pg_autoctl create worker --ssl-self-signed --auth trust --pg-hba-lan --run
To run the full Citus cluster with HA from this definition, we can use the following command:
$ docker-compose up --scale coord=2 --scale worker=6
The command above starts the services up. The command also specifies a --scale option that is different for each service. We need:
The default policy for the pg_auto_failover monitor is to assign a primary and a secondary per auto failover Group. In our case, every node being provisioned with the same command, we benefit from that default policy:
$ pg_autoctl create worker --ssl-self-signed --auth trust --pg-hba-lan --run
When provisioning a production cluster, it is often required to have a better control over which node participates in which group, then using the --group N option in the pg_autoctl create worker command line.
Within a given group, the first node that registers is a primary, and the other nodes are secondary nodes. The monitor takes care of that in a way that we don't have to. In a High Availability setup, every node should be ready to be promoted primary at any time, so knowing which node in a group is assigned primary first is not very interesting.
While the cluster is being provisionned by docker-compose, you can run the following command and have a dynamic dashboard to follow what's happening. The following command is like top for pg_auto_failover:
$ docker-compose exec monitor pg_autoctl watch
Because the pg_basebackup operation that is used to create the secondary nodes takes some time when using Citus, because of the first CHECKPOINT which is quite slow. So at first when inquiring about the cluster state you might see the following output:
$ docker-compose exec monitor pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+-------------------+----------------+--------------+---------------------+--------------------
coord0a | 0/1 | cd52db444544:5432 | 1: 0/200C4A0 | read-write | wait_primary | wait_primary
coord0b | 0/2 | 66a31034f2e4:5432 | 1: 0/0 | none ! | wait_standby | catchingup worker1a | 1/3 | dae7c062e2c1:5432 | 1: 0/2003B18 | read-write | wait_primary | wait_primary worker1b | 1/4 | 397e6069b09b:5432 | 1: 0/0 | none ! | wait_standby | catchingup worker2a | 2/5 | 5bf86f9ef784:5432 | 1: 0/2006AB0 | read-write | wait_primary | wait_primary worker2b | 2/6 | 23498b801a61:5432 | 1: 0/0 | none ! | wait_standby | catchingup worker3a | 3/7 | c23610380024:5432 | 1: 0/2003B18 | read-write | wait_primary | wait_primary worker3b | 3/8 | 2ea8aac8a04a:5432 | 1: 0/0 | none ! | wait_standby | catchingup
After a while though (typically around a minute or less), you can run that same command again for stable result:
$ docker-compose exec monitor pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+-------------------+----------------+--------------+---------------------+--------------------
coord0a | 0/1 | cd52db444544:5432 | 1: 0/3127AD0 | read-write | primary | primary
coord0b | 0/2 | 66a31034f2e4:5432 | 1: 0/3127AD0 | read-only | secondary | secondary worker1a | 1/3 | dae7c062e2c1:5432 | 1: 0/311B610 | read-write | primary | primary worker1b | 1/4 | 397e6069b09b:5432 | 1: 0/311B610 | read-only | secondary | secondary worker2a | 2/5 | 5bf86f9ef784:5432 | 1: 0/311B610 | read-write | primary | primary worker2b | 2/6 | 23498b801a61:5432 | 1: 0/311B610 | read-only | secondary | secondary worker3a | 3/7 | c23610380024:5432 | 1: 0/311B648 | read-write | primary | primary worker3b | 3/8 | 2ea8aac8a04a:5432 | 1: 0/311B648 | read-only | secondary | secondary
You can see from the above that the coordinator node has a primary and secondary instance for high availability. When connecting to the coordinator, clients should try connecting to whichever instance is running and supports reads and writes.
We can review the available Postgres URIs with the pg_autoctl show uri command:
$ docker-compose exec monitor pg_autoctl show uri
Type | Name | Connection String
-------------+---------+-------------------------------
monitor | monitor | postgres://autoctl_node@552dd89d5d63:5432/pg_auto_failover?sslmode=require
formation | default | postgres://66a31034f2e4:5432,cd52db444544:5432/citus?target_session_attrs=read-write&sslmode=require
To check that Citus worker nodes have been registered to the coordinator, we can run a psql session right from the coordinator container:
$ docker-compose exec coord psql -d citus -c 'select * from citus_get_active_worker_nodes();'
node_name | node_port --------------+-----------
dae7c062e2c1 | 5432
5bf86f9ef784 | 5432
c23610380024 | 5432 (3 rows)
We are now reaching the limits of using a simplified docker-compose setup. When using the --scale option, it is not possible to give a specific hostname to each running node, and then we get a randomly generated string instead or useful node names such as worker1a or worker3b.
In order to implement the following architecture, we need to introduce a more complex docker-compose file than in the previous section.
This time we create a cluster using the following docker-compose definition:
version: "3.9" # optional since v1.27.0 x-coord: &coordinator
image: pg_auto_failover:citus
environment:
PGDATA: /tmp/pgaf
PGUSER: citus
PGDATABASE: citus
PG_AUTOCTL_HBA_LAN: true
PG_AUTOCTL_AUTH_METHOD: "trust"
PG_AUTOCTL_SSL_SELF_SIGNED: true
PG_AUTOCTL_MONITOR: "postgresql://autoctl_node@monitor/pg_auto_failover"
expose:
- 5432 x-worker: &worker
image: pg_auto_failover:citus
environment:
PGDATA: /tmp/pgaf
PGUSER: citus
PGDATABASE: citus
PG_AUTOCTL_HBA_LAN: true
PG_AUTOCTL_AUTH_METHOD: "trust"
PG_AUTOCTL_SSL_SELF_SIGNED: true
PG_AUTOCTL_MONITOR: "postgresql://autoctl_node@monitor/pg_auto_failover"
expose:
- 5432 services:
app:
build:
context: .
dockerfile: Dockerfile.app
environment:
PGUSER: citus
PGDATABASE: citus
PGHOST: coord0a,coord0b
PGPORT: 5432
PGAPPNAME: demo
PGSSLMODE: require
PGTARGETSESSIONATTRS: read-write
monitor:
image: pg_auto_failover:citus
environment:
PGDATA: /tmp/pgaf
PG_AUTOCTL_SSL_SELF_SIGNED: true
expose:
- 5432
command: |
pg_autoctl create monitor --auth trust --run
coord0a:
<<: *coordinator
hostname: coord0a
command: |
pg_autoctl create coordinator --name coord0a --run
coord0b:
<<: *coordinator
hostname: coord0b
command: |
pg_autoctl create coordinator --name coord0b --run
worker1a:
<<: *worker
hostname: worker1a
command: |
pg_autoctl create worker --group 1 --name worker1a --run
worker1b:
<<: *worker
hostname: worker1b
command: |
pg_autoctl create worker --group 1 --name worker1b --run
worker2a:
<<: *worker
hostname: worker2a
command: |
pg_autoctl create worker --group 2 --name worker2a --run
worker2b:
<<: *worker
hostname: worker2b
command: |
pg_autoctl create worker --group 2 --name worker2b --run
worker3a:
<<: *worker
hostname: worker3a
command: |
pg_autoctl create worker --group 3 --name worker3a --run
worker3b:
<<: *worker
hostname: worker3b
command: |
pg_autoctl create worker --group 3 --name worker3b --run
This definition is a little more involved than the previous one. We take benefit from YAML anchors and aliases to define a template for our coordinator nodes and worker nodes, and then apply that template to the actual nodes.
Also this time we provision an application service (named "app") that sits in the backgound and allow us to later connect to our current primary coordinator. See Dockerfile.app for the complete definition of this service.
We start this cluster with a simplified command line this time:
$ docker-compose up
And this time we get the following cluster as a result:
$ docker-compose exec monitor pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+---------------+----------------+--------------+---------------------+--------------------
coord0a | 0/3 | coord0a:5432 | 1: 0/312B040 | read-write | primary | primary
coord0b | 0/4 | coord0b:5432 | 1: 0/312B040 | read-only | secondary | secondary worker1a | 1/1 | worker1a:5432 | 1: 0/311C550 | read-write | primary | primary worker1b | 1/2 | worker1b:5432 | 1: 0/311C550 | read-only | secondary | secondary worker2b | 2/7 | worker2b:5432 | 2: 0/5032698 | read-write | primary | primary worker2a | 2/8 | worker2a:5432 | 2: 0/5032698 | read-only | secondary | secondary worker3a | 3/5 | worker3a:5432 | 1: 0/311C870 | read-write | primary | primary worker3b | 3/6 | worker3b:5432 | 1: 0/311C870 | read-only | secondary | secondary
And then we have the following application connection string to use:
$ docker-compose exec monitor pg_autoctl show uri
Type | Name | Connection String -------------+---------+-------------------------------
monitor | monitor | postgres://autoctl_node@f0135b83edcd:5432/pg_auto_failover?sslmode=require
formation | default | postgres://coord0b:5432,coord0a:5432/citus?target_session_attrs=read-write&sslmode=require
And finally, the nodes being registered as Citus worker nodes also make more sense:
$ docker-compose exec coord0a psql -d citus -c 'select * from citus_get_active_worker_nodes()'
node_name | node_port -----------+-----------
worker1a | 5432
worker3a | 5432
worker2b | 5432 (3 rows)
IMPORTANT:
We see that in the citus_get_active_worker_nodes() output we have worker1a, worker2b, and worker3a. As mentionned before, that should have no impact on the operations of the Citus cluster when nodes are all dimensionned the same.
That said, some readers among you will prefer to have the A nodes as primaries to get started with. So let's implement our first worker failover then. With pg_auto_failover, this is as easy as doing:
$ docker-compose exec monitor pg_autoctl perform failover --group 2 15:40:03 9246 INFO Waiting 60 secs for a notification with state "primary" in formation "default" and group 2 15:40:03 9246 INFO Listening monitor notifications about state changes in formation "default" and group 2 15:40:03 9246 INFO Following table displays times when notifications are received
Time | Name | Node | Host:Port | Current State | Assigned State ---------+----------+-------+---------------+---------------------+-------------------- 22:58:42 | worker2b | 2/7 | worker2b:5432 | primary | draining 22:58:42 | worker2a | 2/8 | worker2a:5432 | secondary | prepare_promotion 22:58:42 | worker2a | 2/8 | worker2a:5432 | prepare_promotion | prepare_promotion 22:58:42 | worker2a | 2/8 | worker2a:5432 | prepare_promotion | wait_primary 22:58:42 | worker2b | 2/7 | worker2b:5432 | primary | demoted 22:58:42 | worker2b | 2/7 | worker2b:5432 | draining | demoted 22:58:42 | worker2b | 2/7 | worker2b:5432 | demoted | demoted 22:58:43 | worker2a | 2/8 | worker2a:5432 | wait_primary | wait_primary 22:58:44 | worker2b | 2/7 | worker2b:5432 | demoted | catchingup 22:58:46 | worker2b | 2/7 | worker2b:5432 | catchingup | catchingup 22:58:46 | worker2b | 2/7 | worker2b:5432 | catchingup | secondary 22:58:46 | worker2b | 2/7 | worker2b:5432 | secondary | secondary 22:58:46 | worker2a | 2/8 | worker2a:5432 | wait_primary | primary 22:58:46 | worker2a | 2/8 | worker2a:5432 | primary | primary
So it took around 5 seconds to do a full worker failover in worker group 2. Now we'll do the same on the group 1 to fix the other situation, and review the resulting cluster state.
$ docker-compose exec monitor pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+---------------+----------------+--------------+---------------------+--------------------
coord0a | 0/3 | coord0a:5432 | 1: 0/312ADA8 | read-write | primary | primary
coord0b | 0/4 | coord0b:5432 | 1: 0/312ADA8 | read-only | secondary | secondary worker1a | 1/1 | worker1a:5432 | 1: 0/311B610 | read-write | primary | primary worker1b | 1/2 | worker1b:5432 | 1: 0/311B610 | read-only | secondary | secondary worker2b | 2/7 | worker2b:5432 | 2: 0/50000D8 | read-only | secondary | secondary worker2a | 2/8 | worker2a:5432 | 2: 0/50000D8 | read-write | primary | primary worker3a | 3/5 | worker3a:5432 | 1: 0/311B648 | read-write | primary | primary worker3b | 3/6 | worker3b:5432 | 1: 0/311B648 | read-only | secondary | secondary
Which seen from the Citus coordinator, looks like the following:
$ docker-compose exec coord0a psql -d citus -c 'select * from citus_get_active_worker_nodes()'
node_name | node_port -----------+-----------
worker1a | 5432
worker3a | 5432
worker2a | 5432 (3 rows)
Let's create a database schema with a single distributed table.
$ docker-compose exec app psql
-- in psql CREATE TABLE companies (
id bigserial PRIMARY KEY,
name text NOT NULL,
image_url text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL ); SELECT create_distributed_table('companies', 'id');
Next download and ingest some sample data, still from within our psql session:
\copy companies from program 'curl -o- https://examples.citusdata.com/mt_ref_arch/companies.csv' with csv # ( COPY 75 )
Now we'll intentionally crash a worker's primary node and observe how the pg_auto_failover monitor unregisters that node in the coordinator and registers the secondary instead.
# the pg_auto_failover keeper process will be unable to resurrect # the worker node if pg_control has been removed $ docker-compose exec worker1a rm /tmp/pgaf/global/pg_control # shut it down $ docker-compose exec worker1a /usr/lib/postgresql/14/bin/pg_ctl stop -D /tmp/pgaf
The keeper will attempt to start worker 1a three times and then report the failure to the monitor, who promotes worker1b to replace worker1a. Citus worker worker1a is unregistered with the coordinator node, and worker1b is registered in its stead.
Asking the coordinator for active worker nodes now shows worker1b, worker2a, and worker3a:
$ docker-compose exec app psql -c 'select * from master_get_active_worker_nodes();'
node_name | node_port -----------+-----------
worker3a | 5432
worker2a | 5432
worker1b | 5432 (3 rows)
Finally, verify that all rows of data are still present:
$ docker-compose exec app psql -c 'select count(*) from companies;'
count -------
75
Meanwhile, the keeper on worker 1a heals the node. It runs pg_basebackup to fetch the current PGDATA from worker1a. This restores, among other things, a new copy of the file we removed. After streaming replication completes, worker1b becomes a full-fledged primary and worker1a its secondary.
$ docker-compose exec monitor pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+---------------+----------------+--------------+---------------------+--------------------
coord0a | 0/3 | coord0a:5432 | 1: 0/3178B20 | read-write | primary | primary
coord0b | 0/4 | coord0b:5432 | 1: 0/3178B20 | read-only | secondary | secondary worker1a | 1/1 | worker1a:5432 | 2: 0/504C400 | read-only | secondary | secondary worker1b | 1/2 | worker1b:5432 | 2: 0/504C400 | read-write | primary | primary worker2b | 2/7 | worker2b:5432 | 2: 0/50FF048 | read-only | secondary | secondary worker2a | 2/8 | worker2a:5432 | 2: 0/50FF048 | read-write | primary | primary worker3a | 3/5 | worker3a:5432 | 1: 0/31CD8C0 | read-write | primary | primary worker3b | 3/6 | worker3b:5432 | 1: 0/31CD8C0 | read-only | secondary | secondary
Because our application connection string includes both coordinator hosts with the option target_session_attrs=read-write, the database client will connect to whichever of these servers supports both reads and writes.
However if we use the same trick with the pg_control file to crash our primary coordinator, we can watch how the monitor promotes the secondary.
$ docker-compose exec coord0a rm /tmp/pgaf/global/pg_control $ docker-compose exec coord0a /usr/lib/postgresql/14/bin/pg_ctl stop -D /tmp/pgaf
After some time, coordinator A's keeper heals it, and the cluster converges in this state:
$ docker-compose exec monitor pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ---------+-------+---------------+----------------+--------------+---------------------+--------------------
coord0a | 0/3 | coord0a:5432 | 2: 0/50000D8 | read-only | secondary | secondary
coord0b | 0/4 | coord0b:5432 | 2: 0/50000D8 | read-write | primary | primary worker1a | 1/1 | worker1a:5432 | 2: 0/504C520 | read-only | secondary | secondary worker1b | 1/2 | worker1b:5432 | 2: 0/504C520 | read-write | primary | primary worker2b | 2/7 | worker2b:5432 | 2: 0/50FF130 | read-only | secondary | secondary worker2a | 2/8 | worker2a:5432 | 2: 0/50FF130 | read-write | primary | primary worker3a | 3/5 | worker3a:5432 | 1: 0/31CD8C0 | read-write | primary | primary worker3b | 3/6 | worker3b:5432 | 1: 0/31CD8C0 | read-only | secondary | secondary
We can check that the data is still available through the new coordinator node too:
$ docker-compose exec app psql -c 'select count(*) from companies;'
count -------
75
As mentioned in the first section of this tutorial, the way we use docker-compose here is not meant to be production ready. It's useful to understand and play with a distributed system such as Citus though, and makes it simple to introduce faults and see how the pg_auto_failover High Availability reacts to those faults.
One obvious missing element to better test the system is the lack of persistent volumes in our docker-compose based test rig. It is possible to create external volumes and use them for each node in the docker-compose definition. This allows restarting nodes over the same data set.
See the command pg_autoctl do tmux compose session for more details about how to run a docker-compose test environment with docker-compose, including external volumes for each node.
Now is a good time to go read Citus Documentation too, so that you know how to use this cluster you just created!
pg_auto_failover uses a state machine for highly controlled execution. As keepers inform the monitor about new events (or fail to contact it at all), the monitor assigns each node both a current state and a goal state. A node's current state is a strong guarantee of its capabilities. States themselves do not cause any actions; actions happen during state transitions. The assigned goal states inform keepers of what transitions to attempt.
A good way to get acquainted with the states is by examining the transitions of a cluster from birth to high availability.
After starting a monitor and running keeper init for the first data node ("node A"), the monitor registers the state of that node as "init" with a goal state of "single." The init state means the monitor knows nothing about the node other than its existence because the keeper is not yet continuously running there to report node health.
Once the keeper runs and reports its health to the monitor, the monitor assigns it the state "single," meaning it is just an ordinary Postgres server with no failover. Because there are not yet other nodes in the cluster, the monitor also assigns node A the goal state of single -- there's nothing that node A's keeper needs to change.
As soon as a new node ("node B") is initialized, the monitor assigns node A the goal state of "wait_primary." This means the node still has no failover, but there's hope for a secondary to synchronize with it soon. To accomplish the transition from single to wait_primary, node A's keeper adds node B's hostname to pg_hba.conf to allow a hot standby replication connection.
At the same time, node B transitions into wait_standby with the goal initially of staying in wait_standby. It can do nothing but wait until node A gives it access to connect. Once node A has transitioned to wait_primary, the monitor assigns B the goal of "catchingup," which gives B's keeper the green light to make the transition from wait_standby to catchingup. This transition involves running pg_basebackup, editing recovery.conf and restarting PostgreSQL in Hot Standby node.
Node B reports to the monitor when it's in hot standby mode and able to connect to node A. The monitor then assigns node B the goal state of "secondary" and A the goal of "primary." Postgres ships WAL logs from node A and replays them on B. Finally B is caught up and tells the monitor (specifically B reports its pg_stat_replication.sync_state and WAL replay lag). At this glorious moment the monitor assigns A the state primary (goal: primary) and B secondary (goal: secondary).
The following diagram shows the pg_auto_failover State Machine. It's missing links to the single state, which can always been reached when removing all the other nodes.
In the previous diagram we can see that we have a list of six states where the application can connect to a read-write Postgres service: single, wait_primary, primary, prepare_maintenance, and apply_settings.
A node is assigned the "init" state when it is first registered with the monitor. Nothing is known about the node at this point beyond its existence. If no other node has been registered with the monitor for the same formation and group ID then this node is assigned a goal state of "single." Otherwise the node has the goal state of "wait_standby."
There is only one node in the group. It behaves as a regular PostgreSQL instance, with no high availability and no failover. If the administrator removes a node the other node will revert to the single state.
Applied to a node intended to be the primary but not yet in that position. The primary-to-be at this point knows the secondary's node name or IP address, and has granted the node hot standby access in the pg_hba.conf file.
The wait_primary state may be caused either by a new potential secondary being registered with the monitor (good), or an existing secondary becoming unhealthy (bad). In the latter case, during the transition from primary to wait_primary, the primary node's keeper disables synchronous replication on the node. It also cancels currently blocked queries.
Applied to a primary node when another standby is joining the group. This allows the primary node to apply necessary changes to its HBA setup before allowing the new node joining the system to run the pg_basebackup command.
IMPORTANT:
A healthy secondary node exists and has caught up with WAL replication. Specifically, the keeper reports the primary state only when it has verified that the secondary is reported "sync" in pg_stat_replication.sync_state, and with a WAL lag of 0.
The primary state is a strong assurance. It's the only state where we know we can fail over when required.
During the transition from wait_primary to primary, the keeper also enables synchronous replication. This means that after a failover the secondary will be fully up to date.
Monitor decides this node is a standby. Node must wait until the primary has authorized it to connect and setup hot standby replication.
The monitor assigns catchingup to the standby node when the primary is ready for a replication connection (pg_hba.conf has been properly edited, connection role added, etc).
The standby node keeper runs pg_basebackup, connecting to the primary's hostname and port. The keeper then edits recovery.conf and starts PostgreSQL in hot standby node.
A node with this state is acting as a hot standby for the primary, and is up to date with the WAL log there. In particular, it is within 16MB or 1 WAL segment of the primary.
The cluster administrator can manually move a secondary into the maintenance state to gracefully take it offline. The primary will then transition from state primary to wait_primary, during which time the secondary will be online to accept writes. When the old primary reaches the wait_primary state then the secondary is safe to take offline with minimal consequences.
The cluster administrator can manually move a primary node into the maintenance state to gracefully take it offline. The primary then transitions to the prepare_maintenance state to make sure the secondary is not missing any writes. In the prepare_maintenance state, the primary shuts down.
The custer administrator can manually move a secondary into the maintenance state to gracefully take it offline. Before reaching the maintenance state though, we want to switch the primary node to asynchronous replication, in order to avoid writes being blocked. In the state wait_maintenance the standby waits until the primary has reached wait_primary.
A state between primary and demoted where replication buffers finish flushing. A draining node will not accept new client writes, but will continue to send existing data to the secondary.
To implement that with Postgres we actually stop the service. When stopping, Postgres ensures that the current replication buffers are flushed correctly to synchronous standbys.
The primary keeper or its database were unresponsive past a certain threshold. The monitor assigns demoted state to the primary to avoid a split-brain scenario where there might be two nodes that don't communicate with each other and both accept client writes.
In that state the keeper stops PostgreSQL and prevents it from running.
If the monitor assigns the primary a demoted goal state but the primary keeper doesn't acknowledge transitioning to that state within a timeout window, then the monitor assigns demote_timeout to the primary.
Most commonly may happen when the primary machine goes silent. The keeper is not reporting to the monitor.
The stop_replication state is meant to ensure that the primary goes to the demoted state before the standby goes to single and accepts writes (in case the primary can’t contact the monitor anymore). Before promoting the secondary node, the keeper stops PostgreSQL on the primary to avoid split-brain situations.
For safety, when the primary fails to contact the monitor and fails to see the pg_auto_failover connection in pg_stat_replication, then it goes to the demoted state of its own accord.
The prepare_promotion state is meant to prepare the standby server to being promoted. This state allows synchronisation on the monitor, making sure that the primary has stopped Postgres before promoting the secondary, hence preventing split brain situations.
The report_lsn state is assigned to standby nodes when a failover is orchestrated and there are several standby nodes. In order to pick the furthest standby in the replication, pg_auto_failover first needs a fresh report of the current LSN position reached on each standby node.
When a node reaches the report_lsn state, the replication stream is stopped, by restarting Postgres without a primary_conninfo. This allows the primary node to detect Network Partitions, i.e. when the primary can't connect to the monitor and there's no standby listed in pg_stat_replication.
The fast_forward state is assigned to the selected promotion candidate during a failover when it won the election thanks to the candidate priority settings, but the selected node is not the most advanced standby node as reported in the report_lsn state.
Missing WAL bytes are fetched from one of the most advanced standby nodes by using Postgres cascading replication features: it is possible to use any standby node in the primary_conninfo.
The dropped state is assigned to a node when the pg_autoctl drop node command is used. This allows the node to implement specific local actions before being entirely removed from the monitor database.
When a node reports reaching the dropped state, the monitor removes its entry. If a node is not reporting anymore, maybe because it's completely unavailable, then it's possible to run the pg_autoctl drop node --force command, and then the node entry is removed from the monitor.
When built in TEST mode, it is then possible to use the following command to get a visual representation of the Keeper's Finite State Machine:
$ PG_AUTOCTL_DEBUG=1 pg_autoctl do fsm gv | dot -Tsvg > fsm.svg
The dot program is part of the Graphviz suite and produces the following output:
At the heart of the pg_auto_failover implementation is a State Machine. The state machine is driven by the monitor, and its transitions are implemented in the keeper service, which then reports success to the monitor.
The keeper is allowed to retry transitions as many times as needed until they succeed, and reports also failures to reach the assigned state to the monitor node. The monitor also implements frequent health-checks targeting the registered PostgreSQL nodes.
When the monitor detects something is not as expected, it takes action by assigning a new goal state to the keeper, that is responsible for implementing the transition to this new state, and then reporting.
The pg_auto_failover monitor is responsible for running regular health-checks with every PostgreSQL node it manages. A health-check is successful when it is able to connect to the PostgreSQL node using the PostgreSQL protocol (libpq), imitating the pg_isready command.
How frequent those health checks are (5s by default), the PostgreSQL connection timeout in use (5s by default), and how many times to retry in case of a failure before marking the node unhealthy (2 by default) are GUC variables that you can set on the Monitor node itself. Remember, the monitor is implemented as a PostgreSQL extension, so the setup is a set of PostgreSQL configuration settings:
SELECT name, setting
FROM pg_settings
WHERE name ~ 'pgautofailover\.health';
name | setting -----------------------------------------+---------
pgautofailover.health_check_max_retries | 2
pgautofailover.health_check_period | 5000
pgautofailover.health_check_retry_delay | 2000
pgautofailover.health_check_timeout | 5000 (4 rows)
The pg_auto_failover keeper also reports if PostgreSQL is running as expected. This is useful for situations where the PostgreSQL server / OS is running fine and the keeper (pg_autoctl run) is still active, but PostgreSQL has failed. Situations might include File System is Full on the WAL disk, some file system level corruption, missing files, etc.
Here's what happens to your PostgreSQL service in case of any single-node failure is observed:
When the primary node is unhealthy, and only when the secondary node is itself in good health, then the primary node is asked to transition to the DRAINING state, and the attached secondary is asked to transition to the state PREPARE_PROMOTION. In this state, the secondary is asked to catch-up with the WAL traffic from the primary, and then report success.
The monitor then continues orchestrating the promotion of the standby: it stops the primary (implementing STONITH in order to prevent any data loss), and promotes the secondary into being a primary now.
Depending on the exact situation that triggered the primary unhealthy, it's possible that the secondary fails to catch-up with WAL from it, in that case after the PREPARE_PROMOTION_CATCHUP_TIMEOUT the standby reports success anyway, and the failover sequence continues from the monitor.
When the secondary node is unhealthy, the monitor assigns to it the state CATCHINGUP, and assigns the state WAIT_PRIMARY to the primary node. When implementing the transition from PRIMARY to WAIT_PRIMARY, the keeper disables synchronous replication.
When the keeper reports an acceptable WAL difference in the two nodes again, then the replication is upgraded back to being synchronous. While a secondary node is not in the SECONDARY state, secondary promotion is disabled.
Then the primary and secondary node just work as if you didn't have setup pg_auto_failover in the first place, as the keeper fails to report local state from the nodes. Also, health checks are not performed. It means that no automated failover may happen, even if needed.
Adding to those simple situations, pg_auto_failover is also resilient to Network Partitions. Here's the list of situation that have an impact to pg_auto_failover behavior, and the actions taken to ensure High Availability of your PostgreSQL service:
Then it could be that either the primary is alone on its side of a network split, or that the monitor has failed. The keeper decides depending on whether the secondary node is still connected to the replication slot, and if we have a secondary, continues to serve PostgreSQL queries.
Otherwise, when the secondary isn't connected, and after the NETWORK_PARTITION_TIMEOUT has elapsed, the primary considers it might be alone in a network partition: that's a potential split brain situation and with only one way to prevent it. The primary stops, and reports a new state of DEMOTE_TIMEOUT.
The network_partition_timeout can be setup in the keeper's configuration and defaults to 20s.
Once all the retries have been done and the timeouts are elapsed, then the primary node is considered unhealthy, and the monitor begins the failover routine. This routine has several steps, each of them allows to control our expectations and step back if needed.
For the failover to happen, the secondary node needs to be healthy and caught-up with the primary. Only if we timeout while waiting for the WAL delta to resorb (30s by default) then the secondary can be promoted with uncertainty about the data durability in the group.
As soon as the secondary is considered unhealthy then the monitor changes the replication setting to asynchronous on the primary, by assigning it the WAIT_PRIMARY state. Also the secondary is assigned the state CATCHINGUP, which means it can't be promoted in case of primary failure.
As the monitor tracks the WAL delta between the two servers, and they both report it independently, the standby is eligible to promotion again as soon as it's caught-up with the primary again, and at this time it is assigned the SECONDARY state, and the replication will be switched back to synchronous.
If a node cannot communicate to the monitor, either because the monitor is down or because there is a problem with the network, it will simply remain in the same state until the monitor comes back.
If there is a network partition, it might be that the monitor and secondary can still communicate and the monitor decides to promote the secondary since the primary is no longer responsive. Meanwhile, the primary is still up-and-running on the other side of the network partition. If a primary cannot communicate to the monitor it starts checking whether the secondary is still connected. In PostgreSQL, the secondary connection automatically times out after 30 seconds. If last contact with the monitor and the last time a connection from the secondary was observed are both more than 30 seconds in the past, the primary concludes it is on the losing side of a network partition and shuts itself down. It may be that the secondary and the monitor were actually down and the primary was the only node that was alive, but we currently do not have a way to distinguish such a situation. As with consensus algorithms, availability can only be correctly preserved if at least 2 out of 3 nodes are up.
In asymmetric network partitions, the primary might still be able to talk to the secondary, while unable to talk to the monitor. During failover, the monitor therefore assigns the secondary the stop_replication state, which will cause it to disconnect from the primary. After that, the primary is expected to shut down after at least 30 and at most 60 seconds. To factor in worst-case scenarios, the monitor waits for 90 seconds before promoting the secondary to become the new primary.
We provide native system packages for pg_auto_failover on most popular Linux distributions.
Use the steps below to install pg_auto_failover on PostgreSQL 11. At the current time pg_auto_failover is compatible with both PostgreSQL 10 and PostgreSQL 11.
Binary packages for debian and derivatives (ubuntu) are available from apt.postgresql.org repository, install by following the linked documentation and then:
$ sudo apt-get install pg-auto-failover-cli $ sudo apt-get install postgresql-14-auto-failover
The Postgres extension named "pgautofailover" is only necessary on the monitor node. To install that extension, you can install the postgresql-14-auto-failover package when using Postgres 14. It's available for other Postgres versions too.
When installing the debian Postgres package, the installation script will initialize a Postgres data directory automatically, and register it to the systemd services. When using pg_auto_failover, it is best to avoid that step.
To avoid automated creation of a Postgres data directory when installing the debian package, follow those steps:
$ curl https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - $ echo "deb http://apt.postgresql.org/pub/repos/apt buster-pgdg main" > /etc/apt/sources.list.d/pgdg.list # bypass initdb of a "main" cluster $ echo 'create_main_cluster = false' | sudo tee -a /etc/postgresql-common/createcluster.conf $ apt-get update $ apt-get install -y --no-install-recommends postgresql-14
That way when it's time to pg_autoctl create monitor or pg_autoctl create postgres there is no confusion about how to handle the default Postgres service created by debian: it has not been created at all.
The following installation method downloads a bash script that automates several steps. The full script is available for review at our package cloud installation instructions page url.
# add the required packages to your system curl https://install.citusdata.com/community/rpm.sh | sudo bash # install pg_auto_failover sudo yum install -y pg-auto-failover14_12 # confirm installation /usr/pgsql-12/bin/pg_autoctl --version
If you'd prefer to install your repo on your system manually, follow the instructions from package cloud manual installation page. This page will guide you with the specific details to achieve the 3 steps:
Then when that's done, you can proceed with installing pg_auto_failover itself as in the previous case:
# install pg_auto_failover sudo yum install -y pg-auto-failover14_12 # confirm installation /usr/pgsql-12/bin/pg_autoctl --version
The command pg_autoctl show systemd outputs a systemd unit file that you can use to setup a boot-time registered service for pg_auto_failover on your machine.
Here's a sample output from the command:
$ export PGDATA=/var/lib/postgresql/monitor $ pg_autoctl show systemd 13:44:34 INFO HINT: to complete a systemd integration, run the following commands: 13:44:34 INFO pg_autoctl -q show systemd --pgdata "/var/lib/postgresql/monitor" | sudo tee /etc/systemd/system/pgautofailover.service 13:44:34 INFO sudo systemctl daemon-reload 13:44:34 INFO sudo systemctl start pgautofailover [Unit] Description = pg_auto_failover [Service] WorkingDirectory = /var/lib/postgresql Environment = 'PGDATA=/var/lib/postgresql/monitor' User = postgres ExecStart = /usr/lib/postgresql/10/bin/pg_autoctl run Restart = always StartLimitBurst = 0 [Install] WantedBy = multi-user.target
Copy/pasting the commands given in the hint output from the command will enable the pgautofailer service on your system, when using systemd.
It is important that PostgreSQL is started by pg_autoctl rather than by systemd itself, as it might be that a failover has been done during a reboot, for instance, and that once the reboot complete we want the local Postgres to re-join as a secondary node where it used to be a primary node.
In order to be able to orchestrate fully automated failovers, pg_auto_failover needs to be able to establish the following Postgres connections:
Postgres Client authentication is controlled by a configuration file: pg_hba.conf. This file contains a list of rules where each rule may allow or reject a connection attempt.
For pg_auto_failover to work as intended, some HBA rules need to be added to each node configuration. You can choose to provision the pg_hba.conf file yourself thanks to pg_autoctl options' --skip-pg-hba, or you can use the following options to control which kind of rules are going to be added for you.
For your application to be able to connect to the current Postgres primary servers, some application specific HBA rules have to be added to pg_hba.conf. There is no provision for doing that in pg_auto_failover.
In other words, it is expected that you have to edit pg_hba.conf to open connections for your application needs.
As its name suggests the trust security model is not enabling any kind of security validation. This setting is popular for testing deployments though, as it makes it very easy to verify that everything works as intended before putting security restrictions in place.
To enable a “trust” security model with pg_auto_failover, use the pg_autoctl option --auth trust when creating nodes:
$ pg_autoctl create monitor --auth trust ... $ pg_autoctl create postgres --auth trust ... $ pg_autoctl create postgres --auth trust ...
When using --auth trust pg_autoctl adds new HBA rules in the monitor and the Postgres nodes to enable connections as seen above.
To setup pg_auto_failover with password for connections, you can use one of the password based authentication methods supported by Postgres, such as password or scram-sha-256. We recommend the latter, as in the following example:
$ pg_autoctl create monitor --auth scram-sha-256 ...
The pg_autoctl does not set the password for you. The first step is to set the database user password in the monitor database thanks to the following command:
$ psql postgres://monitor.host/pg_auto_failover > alter user autoctl_node password 'h4ckm3';
Now that the monitor is ready with our password set for the autoctl_node user, we can use the password in the monitor connection string used when creating Postgres nodes.
On the primary node, we can create the Postgres setup as usual, and then set our replication password, that we will use if we are demoted and then re-join as a standby:
$ pg_autoctl create postgres \
--auth scram-sha-256 \
... \
--monitor postgres://autoctl_node:h4ckm3@monitor.host/pg_auto_failover $ pg_autoctl config set replication.password h4ckm3m0r3
The second Postgres node is going to be initialized as a secondary and pg_autoctl then calls pg_basebackup at create time. We need to have the replication password already set at this time, and we can achieve that the following way:
$ export PGPASSWORD=h4ckm3m0r3 $ pg_autoctl create postgres \
--auth scram-sha-256 \
... \
--monitor postgres://autoctl_node:h4ckm3@monitor.host/pg_auto_failover $ pg_autoctl config set replication.password h4ckm3m0r3
Note that you can use The Password File mechanism as discussed in the Postgres documentation in order to maintain your passwords in a separate file, not in your main pg_auto_failover configuration file. This also avoids using passwords in the environment and in command lines.
Postgres knows how to use SSL to enable network encryption of all communications, including authentication with passwords and the whole data set when streaming replication is used.
To enable SSL on the server an SSL certificate is needed. It could be as simple as a self-signed certificate, and pg_autoctl creates such a certificate for you when using --ssl-self-signed command line option:
$ pg_autoctl create monitor --ssl-self-signed ... \
--auth scram-sha-256 ... \
--ssl-mode require \
... $ pg_autoctl create postgres --ssl-self-signed ... \
--auth scram-sha-256 ... \
... $ pg_autoctl create postgres --ssl-self-signed ... \
--auth scram-sha-256 ... \
...
In that example we setup SSL connections to encrypt the network traffic, and we still have to setup an authentication mechanism exactly as in the previous sections of this document. Here scram-sha-256 has been selected, and the password will be sent over an encrypted channel.
When using the --ssl-self-signed option, pg_autoctl creates a self-signed certificate, as per the Postgres documentation at the Creating Certificates page.
The certificate subject CN defaults to the --hostname parameter, which can be given explicitly or computed by pg_autoctl as either your hostname when you have proper DNS resolution, or your current IP address.
Self-signed certificates provide protection against eavesdropping; this setup does NOT protect against Man-In-The-Middle attacks nor Impersonation attacks. See PostgreSQL documentation page SSL Support for details.
In many cases you will want to install certificates provided by your local security department and signed by a trusted Certificate Authority. In that case one solution is to use --skip-pg-hba and do the whole setup yourself.
It is still possible to give the certificates to pg_auto_failover and have it handle the Postgres setup for you:
$ pg_autoctl create monitor --ssl-ca-file root.crt \
--ssl-crl-file root.crl \
--server-cert server.crt \
--server-key server.key \
--ssl-mode verify-full \
... $ pg_autoctl create postgres --ssl-ca-file root.crt \
--server-cert server.crt \
--server-key server.key \
--ssl-mode verify-full \
... $ pg_autoctl create postgres --ssl-ca-file root.crt \
--server-cert server.crt \
--server-key server.key \
--ssl-mode verify-full \
...
The option --ssl-mode can be used to force connection strings used by pg_autoctl to contain your preferred ssl mode. It defaults to require when using --ssl-self-signed and to allow when --no-ssl is used. Here, we set --ssl-mode to verify-full which requires SSL Certificates Authentication, covered next.
The default --ssl-mode when providing your own certificates (signed by your trusted CA) is then verify-full. This setup applies to the client connection where the server identity is going to be checked against the root certificate provided with --ssl-ca-file and the revocation list optionally provided with the --ssl-crl-file. Both those files are used as the respective parameters sslrootcert and sslcrl in pg_autoctl connection strings to both the monitor and the streaming replication primary server.
Given those files, it is then possible to use certificate based authentication of client connections. For that, it is necessary to prepare client certificates signed by your root certificate private key and using the target user name as its CN, as per Postgres documentation for Certificate Authentication:
For enabling the cert authentication method with pg_auto_failover, you need to prepare a Client Certificate for the user postgres and used by pg_autoctl when connecting to the monitor, to place in ~/.postgresql/postgresql.crt along with its key ~/.postgresql/postgresql.key, in the home directory of the user that runs the pg_autoctl service (which defaults to postgres).
Then you need to create a user name map as documented in Postgres page User Name Maps so that your certificate can be used to authenticate pg_autoctl users.
The ident map in pg_ident.conf on the pg_auto_failover monitor should then have the following entry, to allow postgres to connect as the autoctl_node user for pg_autoctl operations:
# MAPNAME SYSTEM-USERNAME PG-USERNAME # pg_autoctl runs as postgres and connects to the monitor autoctl_node user pgautofailover postgres autoctl_node
To enable streaming replication, the pg_ident.conf file on each Postgres node should now allow the postgres user in the client certificate to connect as the pgautofailover_replicator database user:
# MAPNAME SYSTEM-USERNAME PG-USERNAME # pg_autoctl runs as postgres and connects to the monitor autoctl_node user pgautofailover postgres pgautofailover_replicator
Given that user name map, you can then use the cert authentication method. As with the pg_ident.conf provisioning, it is best to now provision the HBA rules yourself, using the --skip-pg-hba option:
$ pg_autoctl create postgres --skip-pg-hba --ssl-ca-file ...
The HBA rule will use the authentication method cert with a map option, and might then look like the following on the monitor:
# allow certificate based authentication to the monitor hostssl pg_auto_failover autoctl_node 10.0.0.0/8 cert map=pgautofailover
Then your pg_auto_failover nodes on the 10.0.0.0 network are allowed to connect to the monitor with the user autoctl_node used by pg_autoctl, assuming they have a valid and trusted client certificate.
The HBA rule to use on the Postgres nodes to allow for Postgres streaming replication connections looks like the following:
# allow streaming replication for pg_auto_failover nodes hostssl replication pgautofailover_replicator 10.0.0.0/8 cert map=pgautofailover
Because the Postgres server runs as the postgres system user, the connection to the primary node can be made with SSL enabled and will then use the client certificates installed in the postgres home directory in ~/.postgresql/postgresql.{key,cert} locations.
While pg_auto_failover knows how to manage the Postgres HBA rules that are necessary for your stream replication needs and for its monitor protocol, it will not manage the Postgres HBA rules that are needed for your applications.
If you have your own HBA provisioning solution, you can include the rules needed for pg_auto_failover and then use the --skip-pg-hba option to the pg_autoctl create commands.
Whether you upgrade pg_auto_failover from a previous version that did not have support for the SSL features, or when you started with --no-ssl and later change your mind, it is possible with pg_auto_failover to add SSL settings on system that has already been setup without explicit SSL support.
In this section we detail how to upgrade to SSL settings.
Installing Self-Signed certificates on-top of an already existing pg_auto_failover setup is done with one of the following pg_autoctl command variants, depending if you want self-signed certificates or fully verified ssl certificates:
$ pg_autoctl enable ssl --ssl-self-signed --ssl-mode required $ pg_autoctl enable ssl --ssl-ca-file root.crt \
--ssl-crl-file root.crl \
--server-cert server.crt \
--server-key server.key \
--ssl-mode verify-full
The pg_autoctl enable ssl command edits the postgresql-auto-failover.conf Postgres configuration file to match the command line arguments given and enable SSL as instructed, and then updates the pg_autoctl configuration.
The connection string to connect to the monitor is also automatically updated by the pg_autoctl enable ssl command. You can verify your new configuration with:
$ pg_autoctl config get pg_autoctl.monitor
Note that an already running pg_autoctl daemon will try to reload its configuration after pg_autoctl enable ssl has finished. In some cases this is not possible to do without a restart. So be sure to check the logs from a running daemon to confirm that the reload succeeded. If it did not you may need to restart the daemon to ensure the new connection string is used.
The HBA settings are not edited, irrespective of the --skip-pg-hba that has been used at creation time. That's because the host records match either SSL or non-SSL connection attempts in Postgres HBA file, so the pre-existing setup will continue to work. To enhance the SSL setup, you can manually edit the HBA files and change the existing lines from host to hostssl to dissallow unencrypted connections at the server side.
In summary, to upgrade an existing pg_auto_failover setup to enable SSL:
The pg_autoctl tool hosts many commands and sub-commands. Each of them have their own manual page.
pg_autoctl - control a pg_auto_failover node
pg_autoctl provides the following commands:
pg_autoctl + create Create a pg_auto_failover node, or formation + drop Drop a pg_auto_failover node, or formation + config Manages the pg_autoctl configuration + show Show pg_auto_failover information + enable Enable a feature on a formation + disable Disable a feature on a formation + get Get a pg_auto_failover node, or formation setting + set Set a pg_auto_failover node, or formation setting + perform Perform an action orchestrated by the monitor
activate Activate a Citus worker from the Citus coordinator
run Run the pg_autoctl service (monitor or keeper)
stop signal the pg_autoctl service for it to stop
reload signal the pg_autoctl for it to reload its configuration
status Display the current status of the pg_autoctl service
help print help message
version print pg_autoctl version pg_autoctl create
monitor Initialize a pg_auto_failover monitor node
postgres Initialize a pg_auto_failover standalone postgres node
coordinator Initialize a pg_auto_failover citus coordinator node
worker Initialize a pg_auto_failover citus worker node
formation Create a new formation on the pg_auto_failover monitor pg_autoctl drop
monitor Drop the pg_auto_failover monitor
node Drop a node from the pg_auto_failover monitor
formation Drop a formation on the pg_auto_failover monitor pg_autoctl config
check Check pg_autoctl configuration
get Get the value of a given pg_autoctl configuration variable
set Set the value of a given pg_autoctl configuration variable pg_autoctl show
uri Show the postgres uri to use to connect to pg_auto_failover nodes
events Prints monitor's state of nodes in a given formation and group
state Prints monitor's state of nodes in a given formation and group
settings Print replication settings for a formation from the monitor
standby-names Prints synchronous_standby_names for a given group
file List pg_autoctl internal files (config, state, pid)
systemd Print systemd service file for this node pg_autoctl enable
secondary Enable secondary nodes on a formation
maintenance Enable Postgres maintenance mode on this node
ssl Enable SSL configuration on this node
monitor Enable a monitor for this node to be orchestrated from pg_autoctl disable
secondary Disable secondary nodes on a formation
maintenance Disable Postgres maintenance mode on this node
ssl Disable SSL configuration on this node
monitor Disable the monitor for this node pg_autoctl get + node get a node property from the pg_auto_failover monitor + formation get a formation property from the pg_auto_failover monitor pg_autoctl get node
replication-quorum get replication-quorum property from the monitor
candidate-priority get candidate property from the monitor pg_autoctl get formation
settings get replication settings for a formation from the monitor
number-sync-standbys get number_sync_standbys for a formation from the monitor pg_autoctl set + node set a node property on the monitor + formation set a formation property on the monitor pg_autoctl set node
metadata set metadata on the monitor
replication-quorum set replication-quorum property on the monitor
candidate-priority set candidate property on the monitor pg_autoctl set formation
number-sync-standbys set number-sync-standbys for a formation on the monitor pg_autoctl perform
failover Perform a failover for given formation and group
switchover Perform a switchover for given formation and group
promotion Perform a failover that promotes a target node
The pg_autoctl tool is the client tool provided by pg_auto_failover to create and manage Postgres nodes and the pg_auto_failover monitor node. The command is built with many sub-commands that each have their own manual page.
To get the full recursive list of supported commands, use:
pg_autoctl help
To grab the version of pg_autoctl that you're using, use:
pg_autoctl --version pg_autoctl version
A typical output would be:
pg_autoctl version 1.4.2 pg_autoctl extension version 1.4 compiled with PostgreSQL 12.3 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit compatible with Postgres 10, 11, 12, and 13
The version is also available as a JSON document when using the --json option:
pg_autoctl --version --json pg_autoctl version --json
A typical JSON output would be:
{
"pg_autoctl": "1.4.2",
"pgautofailover": "1.4",
"pg_major": "12",
"pg_version": "12.3",
"pg_version_str": "PostgreSQL 12.3 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit",
"pg_version_num": 120003 }
This is for version 1.4.2 of pg_auto_failover. This particular version of the pg_autoctl client tool has been compiled using libpq for PostgreSQL 12.3 and is compatible with Postgres 10, 11, 12, and 13.
PG_AUTOCTL_DEBUG
pg_autoctl create - Create a pg_auto_failover node, or formation
pg_autoctl create monitor - Initialize a pg_auto_failover monitor node
This command initializes a PostgreSQL cluster and installs the pgautofailover extension so that it's possible to use the new instance to monitor PostgreSQL services:
usage: pg_autoctl create monitor [ --pgdata --pgport --pgctl --hostname ] --pgctl path to pg_ctl --pgdata path to data directory --pgport PostgreSQL's port number --hostname hostname by which postgres is reachable --auth authentication method for connections from data nodes --skip-pg-hba skip editing pg_hba.conf rules --run create node then run pg_autoctl service --ssl-self-signed setup network encryption using self signed certificates (does NOT protect against MITM) --ssl-mode use that sslmode in connection strings --ssl-ca-file set the Postgres ssl_ca_file to that file path --ssl-crl-file set the Postgres ssl_crl_file to that file path --no-ssl don't enable network encryption (NOT recommended, prefer --ssl-self-signed) --server-key set the Postgres ssl_key_file to that file path --server-cert set the Postgres ssl_cert_file to that file path
The pg_autoctl tool is the client tool provided by pg_auto_failover to create and manage Postgres nodes and the pg_auto_failover monitor node. The command is built with many sub-commands that each have their own manual page.
The following options are available to pg_autoctl create monitor:
Defaults to the pg_ctl found in the PATH when there is a single entry for pg_ctl in the PATH. Check your setup using which -a pg_ctl.
When using an RPM based distribution such as RHEL or CentOS, the path would usually be /usr/pgsql-13/bin/pg_ctl for Postgres 13.
When using a debian based distribution such as debian or ubuntu, the path would usually be /usr/lib/postgresql/13/bin/pg_ctl for Postgres 13. Those distributions also use the package postgresql-common which provides /usr/bin/pg_config. This tool can be automatically used by pg_autoctl to discover the default version of Postgres to use on your setup.
When not provided, a default value is computed by running the following algorithm.
When the forward DNS lookup response in step 3. is an IP address found in one of our local network interfaces, then pg_autoctl uses the hostname found in step 2. as the default --hostname. Otherwise it uses the IP address found in step 1.
You may use the --hostname command line option to bypass the whole DNS lookup based process and force the local node name to a fixed value.
When --skip-pg-hba is used, pg_autoctl still outputs the HBA entries it needs in the logs, it only skips editing the HBA file.
PGDATA
PG_CONFIG
PATH
PGHOST, PGPORT, PGDATABASE, PGUSER, PGCONNECT_TIMEOUT, ...
TMPDIR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl create postgres - Initialize a pg_auto_failover postgres node
The command pg_autoctl create postgres initializes a standalone Postgres node to a pg_auto_failover monitor. The monitor is then handling auto-failover for this Postgres node (as soon as a secondary has been registered too, and is known to be healthy).
usage: pg_autoctl create postgres
--pgctl path to pg_ctl
--pgdata path to data directory
--pghost PostgreSQL's hostname
--pgport PostgreSQL's port number
--listen PostgreSQL's listen_addresses
--username PostgreSQL's username
--dbname PostgreSQL's database name
--name pg_auto_failover node name
--hostname hostname used to connect from the other nodes
--formation pg_auto_failover formation
--monitor pg_auto_failover Monitor Postgres URL
--auth authentication method for connections from monitor
--skip-pg-hba skip editing pg_hba.conf rules
--pg-hba-lan edit pg_hba.conf rules for --dbname in detected LAN
--ssl-self-signed setup network encryption using self signed certificates (does NOT protect against MITM)
--ssl-mode use that sslmode in connection strings
--ssl-ca-file set the Postgres ssl_ca_file to that file path
--ssl-crl-file set the Postgres ssl_crl_file to that file path
--no-ssl don't enable network encryption (NOT recommended, prefer --ssl-self-signed)
--server-key set the Postgres ssl_key_file to that file path
--server-cert set the Postgres ssl_cert_file to that file path
--candidate-priority priority of the node to be promoted to become primary
--replication-quorum true if node participates in write quorum
--maximum-backup-rate maximum transfer rate of data transferred from the server during initial sync
Three different modes of initialization are supported by this command, corresponding to as many implementation strategies.
This happens when --pgdata (or the environment variable PGDATA) points to an non-existing or empty directory. Then the given --hostname is registered to the pg_auto_failover --monitor as a member of the --formation.
The monitor answers to the registration call with a state to assign to the new member of the group, either SINGLE or WAIT_STANDBY. When the assigned state is SINGLE, then pg_autoctl create postgres proceeds to initialize a new PostgreSQL instance.
This happens when --pgdata (or the environment variable PGDATA) points to an already existing directory that belongs to a PostgreSQL instance. The standard PostgreSQL tool pg_controldata is used to recognize whether the directory belongs to a PostgreSQL instance.
In that case, the given --hostname is registered to the monitor in the tentative SINGLE state. When the given --formation and --group is currently empty, then the monitor accepts the registration and the pg_autoctl create prepares the already existing primary server for pg_auto_failover.
This happens when --pgdata (or the environment variable PGDATA) points to a non-existing or empty directory, and when the monitor registration call assigns the state WAIT_STANDBY in step 1.
In that case, the pg_autoctl create command steps through the initial states of registering a secondary server, which includes preparing the primary server PostgreSQL HBA rules and creating a replication slot.
When the command ends successfully, a PostgreSQL secondary server has been created with pg_basebackup and is now started, catching-up to the primary server.
When the data directory pointed to by the option --pgdata or the environment variable PGDATA already exists, then pg_auto_failover verifies that the system identifier matches the one of the other nodes already existing in the same group.
The system identifier can be obtained with the command pg_controldata. All nodes in a physical replication setting must have the same system identifier, and so in pg_auto_failover all the nodes in a same group have that constraint too.
When the system identifier matches the already registered system identifier of other nodes in the same group, then the node is set-up as a standby and Postgres is started with the primary conninfo pointed at the current primary.
The --auth option allows setting up authentication method to be used when monitor node makes a connection to data node with pgautofailover_monitor user. As with the pg_autoctl create monitor command, you could use --auth trust when playing with pg_auto_failover at first and consider something production grade later. Also, consider using --skip-pg-hba if you already have your own provisioning tools with a security compliance process.
See Security settings for pg_auto_failover for notes on .pgpass
The following options are available to pg_autoctl create postgres:
Defaults to the pg_ctl found in the PATH when there is a single entry for pg_ctl in the PATH. Check your setup using which -a pg_ctl.
When using an RPM based distribution such as RHEL or CentOS, the path would usually be /usr/pgsql-13/bin/pg_ctl for Postgres 13.
When using a debian based distribution such as debian or ubuntu, the path would usually be /usr/lib/postgresql/13/bin/pg_ctl for Postgres 13. Those distributions also use the package postgresql-common which provides /usr/bin/pg_config. This tool can be automatically used by pg_autoctl to discover the default version of Postgres to use on your setup.
The --name option allows using a user-friendly name for your Postgres nodes.
When not provided, a default value is computed by running the following algorithm.
When the forward DNS lookup response in step 3. is an IP address found in one of our local network interfaces, then pg_autoctl uses the hostname found in step 2. as the default --hostname. Otherwise it uses the IP address found in step 1.
You may use the --hostname command line option to bypass the whole DNS lookup based process and force the local node name to a fixed value.
When --skip-pg-hba is used, pg_autoctl still outputs the HBA entries it needs in the logs, it only skips editing the HBA file.
For instance, when the monitor resolves to 192.168.0.1 and your local Postgres node uses an inferface with IP address 192.168.0.2/255.255.255.0 to connect to the monitor, then the LAN CIDR is computed to be 192.168.0.0/24.
PGDATA
PG_AUTOCTL_MONITOR
PG_AUTOCTL_NODE_NAME
PG_AUTOCTL_REPLICATION_QUORUM
PG_AUTOCTL_CANDIDATE_PRIORITY
PG_CONFIG
PATH
PGHOST, PGPORT, PGDATABASE, PGUSER, PGCONNECT_TIMEOUT, ...
TMPDIR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl create coordinator - Initialize a pg_auto_failover coordinator node
The command pg_autoctl create coordinator initializes a pg_auto_failover Coordinator node for a Citus formation. The coordinator is special in a Citus formation: that's where the client application connects to either to manage the formation and the sharding of the tables, or for its normal SQL traffic.
The coordinator also has to register every worker in the formation.
usage: pg_autoctl create coordinator
--pgctl path to pg_ctl
--pgdata path to data directory
--pghost PostgreSQL's hostname
--pgport PostgreSQL's port number
--hostname hostname by which postgres is reachable
--listen PostgreSQL's listen_addresses
--username PostgreSQL's username
--dbname PostgreSQL's database name
--name pg_auto_failover node name
--formation pg_auto_failover formation
--monitor pg_auto_failover Monitor Postgres URL
--auth authentication method for connections from monitor
--skip-pg-hba skip editing pg_hba.conf rules
--citus-secondary when used, this worker node is a citus secondary
--citus-cluster name of the Citus Cluster for read-replicas
--ssl-self-signed setup network encryption using self signed certificates (does NOT protect against MITM)
--ssl-mode use that sslmode in connection strings
--ssl-ca-file set the Postgres ssl_ca_file to that file path
--ssl-crl-file set the Postgres ssl_crl_file to that file path
--no-ssl don't enable network encryption (NOT recommended, prefer --ssl-self-signed)
--server-key set the Postgres ssl_key_file to that file path
--server-cert set the Postgres ssl_cert_file to that file path
This commands works the same as the pg_autoctl create postgres command and implements the following extra steps:
IMPORTANT:
Citus does not support multiple databases, you have to use the database where Citus is created. When using Citus, that is essential to the well behaving of worker failover.
See the manual page for pg_autoctl create postgres for the common options. This section now lists the options that are specific to pg_autoctl create coordinator:
See Citus Secondaries and read-replica for more information.
See Citus Secondaries and read-replica for more information.
pg_autoctl create worker - Initialize a pg_auto_failover worker node
The command pg_autoctl create worker initializes a pg_auto_failover Worker node for a Citus formation. The worker is special in a Citus formation: that's where the client application connects to either to manage the formation and the sharding of the tables, or for its normal SQL traffic.
The worker also has to register every worker in the formation.
usage: pg_autoctl create worker
--pgctl path to pg_ctl
--pgdata path to data director
--pghost PostgreSQL's hostname
--pgport PostgreSQL's port number
--hostname hostname by which postgres is reachable
--listen PostgreSQL's listen_addresses
--proxyport Proxy's port number
--username PostgreSQL's username
--dbname PostgreSQL's database name
--name pg_auto_failover node name
--formation pg_auto_failover formation
--group pg_auto_failover group Id
--monitor pg_auto_failover Monitor Postgres URL
--auth authentication method for connections from monitor
--skip-pg-hba skip editing pg_hba.conf rules
--citus-secondary when used, this worker node is a citus secondary
--citus-cluster name of the Citus Cluster for read-replicas
--ssl-self-signed setup network encryption using self signed certificates (does NOT protect against MITM)
--ssl-mode use that sslmode in connection strings
--ssl-ca-file set the Postgres ssl_ca_file to that file path
--ssl-crl-file set the Postgres ssl_crl_file to that file path
--no-ssl don't enable network encryption (NOT recommended, prefer --ssl-self-signed)
--server-key set the Postgres ssl_key_file to that file path
--server-cert set the Postgres ssl_cert_file to that file path
This commands works the same as the pg_autoctl create postgres command and implements the following extra steps:
This operation is retried when it fails, as the coordinator might appear later than some of the workers when the whole formation is initialized at once, in parallel, on multiple nodes.
This is done in two steps. First, we call the SQL function master_add_inactive_node on the coordinator, and second, we call the SQL function master_activate_node.
This way allows for easier diagnostics in case things go wrong. In the first step, the network and authentication setup needs to allow for nodes to connect to each other. In the second step, the Citus reference tables are distributed to the new node, and this operation has its own set of failure cases to handle.
IMPORTANT:
Citus does not support multiple databases, you have to use the database where Citus is created. When using Citus, that is essential to the well behaving of worker failover.
See the manual page for pg_autoctl create postgres for the common options. This section now lists the options that are specific to pg_autoctl create worker:
See Citus Secondaries and read-replica for more information.
See Citus Secondaries and read-replica for more information.
pg_autoctl create formation - Create a new formation on the pg_auto_failover monitor
This command registers a new formation on the monitor, with the specified kind:
usage: pg_autoctl create formation [ --pgdata --monitor --formation --kind --dbname --with-secondary --without-secondary ] --pgdata path to data directory --monitor pg_auto_failover Monitor Postgres URL --formation name of the formation to create --kind formation kind, either "pgsql" or "citus" --dbname name for postgres database to use in this formation --enable-secondary create a formation that has multiple nodes that can be
used for fail over when others have issues --disable-secondary create a citus formation without nodes to fail over to --number-sync-standbys minimum number of standbys to confirm write
A single pg_auto_failover monitor may manage any number of formations, each composed of at least one Postgres service group. This commands creates a new formation so that it is then possible to register Postgres nodes in the new formation.
The following options are available to pg_autoctl create formation:
Defaults to zero when creating the first two Postgres nodes in a formation in the same group. When set to zero pg_auto_failover uses synchronous replication only when a standby node is available: the idea is to allow failover, this setting does not allow proper HA for Postgres.
When adding a third node that participates in the quorum (one primary, two secondaries), the setting is automatically changed from zero to one.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl drop - Drop a pg_auto_failover node, or formation
pg_autoctl drop monitor - Drop the pg_auto_failover monitor
This command allows to review all the replication settings of a given formation (defaults to 'default' as usual):
usage: pg_autoctl drop monitor [ --pgdata --destroy ] --pgdata path to data directory --destroy also destroy Postgres database
PGDATA
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl drop node - Drop a node from the pg_auto_failover monitor
This command drops a Postgres node from the pg_auto_failover monitor:
usage: pg_autoctl drop node [ [ [ --pgdata ] [ --destroy ] ] | [ --monitor [ [ --hostname --pgport ] | [ --formation --name ] ] ] ] --pgdata path to data directory --monitor pg_auto_failover Monitor Postgres URL --formation pg_auto_failover formation --name drop the node with the given node name --hostname drop the node with given hostname and pgport --pgport drop the node with given hostname and pgport --destroy also destroy Postgres database --force force dropping the node from the monitor --wait how many seconds to wait, default to 60
Two modes of operations are implemented in the pg_autoctl drop node command.
When removing a node that still exists, it is possible to use pg_autoctl drop node --destroy to remove the node both from the monitor and also delete the local Postgres instance entirely.
When removing a node that doesn't exist physically anymore, or when the VM that used to host the node has been lost entirely, use either the pair of options --hostname and --pgport or the pair of options --formation and --name to match the node registration record on the monitor database, and get it removed from the known list of nodes on the monitor.
Then option --force can be used when the target node to remove does not exist anymore. When a node has been lost entirely, it's not going to be able to finish the procedure itself, and it is then possible to instruct the monitor of the situation.
PGDATA
PG_AUTOCTL_MONITOR
PG_AUTOCTL_NODE_NAME
PG_AUTOCTL_REPLICATION_QUORUM
PG_AUTOCTL_CANDIDATE_PRIORITY
PG_CONFIG
PATH
PGHOST, PGPORT, PGDATABASE, PGUSER, PGCONNECT_TIMEOUT, ...
TMPDIR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl drop node --destroy --pgdata ./node3 17:52:21 54201 INFO Reaching assigned state "secondary" 17:52:21 54201 INFO Removing node with name "node3" in formation "default" from the monitor 17:52:21 54201 WARN Postgres is not running and we are in state secondary 17:52:21 54201 WARN Failed to update the keeper's state from the local PostgreSQL instance, see above for details. 17:52:21 54201 INFO Calling node_active for node default/4/0 with current state: PostgreSQL is running is false, sync_state is "", latest WAL LSN is 0/0. 17:52:21 54201 INFO FSM transition to "dropped": This node is being dropped from the monitor 17:52:21 54201 INFO Transition complete: current state is now "dropped" 17:52:21 54201 INFO This node with id 4 in formation "default" and group 0 has been dropped from the monitor 17:52:21 54201 INFO Stopping PostgreSQL at "/Users/dim/dev/MS/pg_auto_failover/tmux/node3" 17:52:21 54201 INFO /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl --pgdata /Users/dim/dev/MS/pg_auto_failover/tmux/node3 --wait stop --mode fast 17:52:21 54201 INFO /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl status -D /Users/dim/dev/MS/pg_auto_failover/tmux/node3 [3] 17:52:21 54201 INFO pg_ctl: no server running 17:52:21 54201 INFO pg_ctl stop failed, but PostgreSQL is not running anyway 17:52:21 54201 INFO Removing "/Users/dim/dev/MS/pg_auto_failover/tmux/node3" 17:52:21 54201 INFO Removing "/Users/dim/dev/MS/pg_auto_failover/tmux/config/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node3/pg_autoctl.cfg"
pg_autoctl drop formation - Drop a formation on the pg_auto_failover monitor
This command drops an existing formation on the monitor:
usage: pg_autoctl drop formation [ --pgdata --formation ] --pgdata path to data directory --monitor pg_auto_failover Monitor Postgres URL --formation name of the formation to drop
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl config - Manages the pg_autoctl configuration
pg_autoctl config get - Get the value of a given pg_autoctl configuration variable
This command prints a pg_autoctl configuration setting:
usage: pg_autoctl config get [ --pgdata ] [ --json ] [ section.option ] --pgdata path to data directory
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
When the argument section.option is used, this is the name of a configuration ooption. The configuration file for pg_autoctl is stored using the INI format.
When no argument is given to pg_autoctl config get the entire configuration file is given in the output. To figure out where the configuration file is stored, see pg_autoctl show file and use pg_autoctl show file --config.
Without arguments, we get the entire file:
$ pg_autoctl config get --pgdata node1 [pg_autoctl] role = keeper monitor = postgres://autoctl_node@localhost:5500/pg_auto_failover?sslmode=prefer formation = default group = 0 name = node1 hostname = localhost nodekind = standalone [postgresql] pgdata = /Users/dim/dev/MS/pg_auto_failover/tmux/node1 pg_ctl = /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl dbname = demo host = /tmp port = 5501 proxyport = 0 listen_addresses = * auth_method = trust hba_level = app [ssl] active = 1 sslmode = require cert_file = /Users/dim/dev/MS/pg_auto_failover/tmux/node1/server.crt key_file = /Users/dim/dev/MS/pg_auto_failover/tmux/node1/server.key [replication] maximum_backup_rate = 100M backup_directory = /Users/dim/dev/MS/pg_auto_failover/tmux/backup/node_1 [timeout] network_partition_timeout = 20 prepare_promotion_catchup = 30 prepare_promotion_walreceiver = 5 postgresql_restart_failure_timeout = 20 postgresql_restart_failure_max_retries = 3
It is possible to pipe JSON formatted output to the jq command line and filter the result down to a specific section of the file:
$ pg_autoctl config get --pgdata node1 --json | jq .pg_autoctl {
"role": "keeper",
"monitor": "postgres://autoctl_node@localhost:5500/pg_auto_failover?sslmode=prefer",
"formation": "default",
"group": 0,
"name": "node1",
"hostname": "localhost",
"nodekind": "standalone" }
Finally, a single configuration element can be listed:
$ pg_autoctl config get --pgdata node1 ssl.sslmode --json require
pg_autoctl config set - Set the value of a given pg_autoctl configuration variable
This command prints a pg_autoctl configuration setting:
usage: pg_autoctl config set [ --pgdata ] [ --json ] section.option [ value ] --pgdata path to data directory
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
This commands allows to set a pg_autoctl configuration setting to a new value. Most settings are possible to change and can be reloaded online.
Some of those commands can then be applied with a pg_autoctl reload command to an already running process.
pg_autoctl.role
pg_autoctl.monitor
To register an existing node to a new monitor, use pg_autoctl disable monitor and then pg_autoctl enable monitor.
pg_autoctl.formation
pg_autoctl.group
pg_autoctl.name
pg_autoctl.hostname
pg_autoctl.nodekind
postgresql.pgdata
postgresql.pg_ctl
Can be changed after a major upgrade of Postgres.
postgresql.dbname
WARNING:
postgresql.host
Can be changed with a reload.
postgresql.port
postgresql.listen_addresses
postgresql.auth_method
Can be changed online with a reload, but actually adding new HBA rules requires a restart of the "node-active" service.
postgresql.hba_level
ssl.active, ssl.sslmode, ssl.cert_file, ssl.key_file, etc
replication.maximum_backup_rate
Limiting the bandwidth used by pg_basebackup makes the operation slower, and still has the advantage of limiting the impact on the disks of the primary server.
replication.backup_directory
The rename(2) system-call is known to be atomic when both the source and the target of the operation are using the same file system / mount point.
Can be changed online with a reload, will not affect already running pg_basebackup sub-processes.
replication.password
Changing the replication.password of a pg_autoctl configuration has no effect on the Postgres database itself. The password must match what the Postgres upstream node expects, which can be set with the following SQL command run on the upstream server (primary or other standby node):
alter user pgautofailover_replicator password 'h4ckm3m0r3';
The replication.password can be changed online with a reload, but requires restarting the Postgres service to be activated. Postgres only reads the primary_conninfo connection string at start-up, up to and including Postgres 12. With Postgres 13 and following, it is possible to reload this Postgres parameter.
timeout.network_partition_timeout
Can be changed with a reload.
timeout.prepare_promotion_catchup
timeout.prepare_promotion_walreceiver
timeout.postgresql_restart_failure_timeout
Can be changed with a reload.
timeout.postgresql_restart_failure_max_retries
Can be changed with a reload.
pg_autoctl config check - Check pg_autoctl configuration
This command implements a very basic list of sanity checks for a pg_autoctl node setup:
usage: pg_autoctl config check [ --pgdata ] [ --json ] --pgdata path to data directory --json output data in the JSON format
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl config check --pgdata node1 18:37:27 63749 INFO Postgres setup for PGDATA "/Users/dim/dev/MS/pg_auto_failover/tmux/node1" is ok, running with PID 5501 and port 99698 18:37:27 63749 INFO Connection to local Postgres ok, using "port=5501 dbname=demo host=/tmp" 18:37:27 63749 INFO Postgres configuration settings required for pg_auto_failover are ok 18:37:27 63749 WARN Postgres 12.1 does not support replication slots on a standby node 18:37:27 63749 INFO Connection to monitor ok, using "postgres://autoctl_node@localhost:5500/pg_auto_failover?sslmode=prefer" 18:37:27 63749 INFO Monitor is running version "1.5.0.1", as expected pgdata: /Users/dim/dev/MS/pg_auto_failover/tmux/node1 pg_ctl: /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl pg_version: 12.3 pghost: /tmp pgport: 5501 proxyport: 0 pid: 99698 is in recovery: no Control Version: 1201 Catalog Version: 201909212 System Identifier: 6941034382470571312 Latest checkpoint LSN: 0/6000098 Postmaster status: ready
pg_autoctl show - Show pg_auto_failover information
pg_autoctl show uri - Show the postgres uri to use to connect to pg_auto_failover nodes
This command outputs the monitor or the coordinator Postgres URI to use from an application to connect to Postgres:
usage: pg_autoctl show uri [ --pgdata --monitor --formation --json ]
--pgdata path to data directory
--monitor monitor uri
--formation show the coordinator uri of given formation
--json output data in the JSON format
Defaults to the value of the environment variable PG_AUTOCTL_MONITOR.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl show uri
Type | Name | Connection String -------------+---------+-------------------------------
monitor | monitor | postgres://autoctl_node@localhost:5500/pg_auto_failover
formation | default | postgres://localhost:5502,localhost:5503,localhost:5501/demo?target_session_attrs=read-write&sslmode=prefer $ pg_autoctl show uri --formation monitor postgres://autoctl_node@localhost:5500/pg_auto_failover $ pg_autoctl show uri --formation default postgres://localhost:5503,localhost:5502,localhost:5501/demo?target_session_attrs=read-write&sslmode=prefer $ pg_autoctl show uri --json [
{
"uri": "postgres://autoctl_node@localhost:5500/pg_auto_failover",
"name": "monitor",
"type": "monitor"
},
{
"uri": "postgres://localhost:5503,localhost:5502,localhost:5501/demo?target_session_attrs=read-write&sslmode=prefer",
"name": "default",
"type": "formation"
} ]
PostgreSQL since version 10 includes support for multiple hosts in its connection driver libpq, with the special target_session_attrs connection property.
This multi-hosts connection string facility allows applications to keep using the same stable connection string over server-side failovers. That's why pg_autoctl show uri uses that format.
pg_autoctl show events - Prints monitor's state of nodes in a given formation and group
This command outputs the events that the pg_auto_failover events records about state changes of the pg_auto_failover nodes managed by the monitor:
usage: pg_autoctl show events [ --pgdata --formation --group --count ] --pgdata path to data directory --monitor pg_auto_failover Monitor Postgres URL --formation formation to query, defaults to 'default' --group group to query formation, defaults to all --count how many events to fetch, defaults to 10 --watch display an auto-updating dashboard --json output data in the JSON format
Depending on the terminal window size, a different set of columns is visible in the state part of the output. See pg_autoctl watch.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl show events --count 2 --json [
{
"nodeid": 1,
"eventid": 15,
"groupid": 0,
"nodehost": "localhost",
"nodename": "node1",
"nodeport": 5501,
"eventtime": "2021-03-18T12:32:36.103467+01:00",
"goalstate": "primary",
"description": "Setting goal state of node 1 \"node1\" (localhost:5501) to primary now that at least one secondary candidate node is healthy.",
"formationid": "default",
"reportedlsn": "0/4000060",
"reportedstate": "wait_primary",
"reportedrepstate": "async",
"candidatepriority": 50,
"replicationquorum": true
},
{
"nodeid": 1,
"eventid": 16,
"groupid": 0,
"nodehost": "localhost",
"nodename": "node1",
"nodeport": 5501,
"eventtime": "2021-03-18T12:32:36.215494+01:00",
"goalstate": "primary",
"description": "New state is reported by node 1 \"node1\" (localhost:5501): \"primary\"",
"formationid": "default",
"reportedlsn": "0/4000110",
"reportedstate": "primary",
"reportedrepstate": "quorum",
"candidatepriority": 50,
"replicationquorum": true
} ]
pg_autoctl show state - Prints monitor's state of nodes in a given formation and group
This command outputs the current state of the formation and groups registered to the pg_auto_failover monitor:
usage: pg_autoctl show state [ --pgdata --formation --group ] --pgdata path to data directory --monitor pg_auto_failover Monitor Postgres URL --formation formation to query, defaults to 'default' --group group to query formation, defaults to all --local show local data, do not connect to the monitor --watch display an auto-updating dashboard --json output data in the JSON format
Depending on the terminal window size, a different set of columns is visible in the state part of the output. See pg_autoctl watch.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
The pg_autoctl show state output includes the following columns:
Only Citus formations allow several groups. When using a Citus formation the Node column contains the groupId and the nodeId, separated by a colon, such as 0:1 for the first coordinator node.
The LSN is the current position in the Postgres WAL stream. This is a hexadecimal number. See pg_lsn for more information.
The current timeline is incremented each time a failover happens, or when doing Point In Time Recovery. A node can only reach the secondary state when it is on the same timeline as its primary node.
$ pg_autoctl show state
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ------+-------+----------------+----------------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 1: 0/4000678 | read-write | primary | primary node2 | 2 | localhost:5502 | 1: 0/4000678 | read-only | secondary | secondary node3 | 3 | localhost:5503 | 1: 0/4000678 | read-only | secondary | secondary $ pg_autoctl show state --local
Name | Node | Host:Port | TLI: LSN | Connection | Reported State | Assigned State ------+-------+----------------+----------------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 1: 0/4000678 | read-write ? | primary | primary $ pg_autoctl show state --json [
{
"health": 1,
"node_id": 1,
"group_id": 0,
"nodehost": "localhost",
"nodename": "node1",
"nodeport": 5501,
"reported_lsn": "0/4000678",
"reported_tli": 1,
"formation_kind": "pgsql",
"candidate_priority": 50,
"replication_quorum": true,
"current_group_state": "primary",
"assigned_group_state": "primary"
},
{
"health": 1,
"node_id": 2,
"group_id": 0,
"nodehost": "localhost",
"nodename": "node2",
"nodeport": 5502,
"reported_lsn": "0/4000678",
"reported_tli": 1,
"formation_kind": "pgsql",
"candidate_priority": 50,
"replication_quorum": true,
"current_group_state": "secondary",
"assigned_group_state": "secondary"
},
{
"health": 1,
"node_id": 3,
"group_id": 0,
"nodehost": "localhost",
"nodename": "node3",
"nodeport": 5503,
"reported_lsn": "0/4000678",
"reported_tli": 1,
"formation_kind": "pgsql",
"candidate_priority": 50,
"replication_quorum": true,
"current_group_state": "secondary",
"assigned_group_state": "secondary"
} ]
pg_autoctl show settings - Print replication settings for a formation from the monitor
This command allows to review all the replication settings of a given formation (defaults to 'default' as usual):
usage: pg_autoctl show settings [ --pgdata ] [ --json ] [ --formation ] --pgdata path to data directory --monitor pg_auto_failover Monitor Postgres URL --json output data in the JSON format --formation pg_auto_failover formation
See also pg_autoctl get formation settings which is a synonym.
The output contains setting and values that apply at different contexts, as shown here with a formation of four nodes, where node_4 is not participating in the replication quorum and also not a candidate for failover:
$ pg_autoctl show settings
Context | Name | Setting | Value
----------+---------+---------------------------+-------------------------------------------------------------
formation | default | number_sync_standbys | 1
primary | node_1 | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_3, pgautofailover_standby_2)'
node | node_1 | replication quorum | true
node | node_2 | replication quorum | true
node | node_3 | replication quorum | true
node | node_4 | replication quorum | false
node | node_1 | candidate priority | 50
node | node_2 | candidate priority | 50
node | node_3 | candidate priority | 50
node | node_4 | candidate priority | 0
Three replication settings context are listed:
This command gives an overview of all the settings that apply to the current formation.
Defaults to the value of the environment variable PG_AUTOCTL_MONITOR.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl show settings
Context | Name | Setting | Value
----------+---------+---------------------------+-------------------------------------------------------------
formation | default | number_sync_standbys | 1
primary | node1 | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)'
node | node1 | candidate priority | 50
node | node2 | candidate priority | 50
node | node3 | candidate priority | 50
node | node1 | replication quorum | true
node | node2 | replication quorum | true
node | node3 | replication quorum | true
pg_autoctl show standby-names - Prints synchronous_standby_names for a given group
This command prints the current value for synchronous_standby_names for the primary Postgres server of the target group (default 0) in the target formation (default default), as computed by the monitor:
usage: pg_autoctl show standby-names [ --pgdata ] --formation --group
--pgdata path to data directory
--monitor pg_auto_failover Monitor Postgres URL
--formation formation to query, defaults to 'default'
--group group to query formation, defaults to all
--json output data in the JSON format
Defaults to the value of the environment variable PG_AUTOCTL_MONITOR.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl show standby-names 'ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)' $ pg_autoctl show standby-names --json {
"formation": "default",
"group": 0,
"synchronous_standby_names": "ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)" }
pg_autoctl show file - List pg_autoctl internal files (config, state, pid)
This command the files that pg_autoctl uses internally for its own configuration, state, and pid:
usage: pg_autoctl show file [ --pgdata --all --config | --state | --init | --pid --contents ] --pgdata path to data directory --all show all pg_autoctl files --config show pg_autoctl configuration file --state show pg_autoctl state file --init show pg_autoctl initialisation state file --pid show pg_autoctl PID file --contents show selected file contents --json output data in the JSON format
The pg_autoctl command follows the XDG Base Directory Specification and places its internal and configuration files by default in places such as ~/.config/pg_autoctl and ~/.local/share/pg_autoctl.
It is possible to change the default XDG locations by using the environment variables XDG_CONFIG_HOME, XDG_DATA_HOME, and XDG_RUNTIME_DIR.
Also, pg_config uses sub-directories that are specific to a given PGDATA, making it possible to run several Postgres nodes on the same machine, which is very practical for testing and development purposes, though not advised for production setups.
The pg_autoctl configuration file for an instance serving the data directory at /data/pgsql is found at ~/.config/pg_autoctl/data/pgsql/pg_autoctl.cfg, written in the INI format.
It is possible to get the location of the configuration file by using the command pg_autoctl show file --config --pgdata /data/pgsql and to output its content by using the command pg_autoctl show file --config --contents --pgdata /data/pgsql.
See also pg_autoctl config get and pg_autoctl config set.
The pg_autoctl state file for an instance serving the data directory at /data/pgsql is found at ~/.local/share/pg_autoctl/data/pgsql/pg_autoctl.state, written in a specific binary format.
This file is not intended to be written by anything else than pg_autoctl itself. In case of state corruption, see the trouble shooting section of the documentation.
It is possible to get the location of the state file by using the command pg_autoctl show file --state --pgdata /data/pgsql and to output its content by using the command pg_autoctl show file --state --contents --pgdata /data/pgsql.
The pg_autoctl init state file for an instance serving the data directory at /data/pgsql is found at ~/.local/share/pg_autoctl/data/pgsql/pg_autoctl.init, written in a specific binary format.
This file is not intended to be written by anything else than pg_autoctl itself. In case of state corruption, see the trouble shooting section of the documentation.
This initialization state file only exists during the initialization of a pg_auto_failover node. In normal operations, this file does not exist.
It is possible to get the location of the state file by using the command pg_autoctl show file --init --pgdata /data/pgsql and to output its content by using the command pg_autoctl show file --init --contents --pgdata /data/pgsql.
The pg_autoctl PID file for an instance serving the data directory at /data/pgsql is found at /tmp/pg_autoctl/data/pgsql/pg_autoctl.pid, written in a specific text format.
The PID file is located in a temporary directory by default, or in the XDG_RUNTIME_DIR directory when this is setup.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
The following examples are taken from a QA environment that has been prepared thanks to the make cluster command made available to the pg_auto_failover contributors. As a result, the XDG environment variables have been tweaked to obtain a self-contained test:
$ tmux show-env | grep XDG XDG_CONFIG_HOME=/Users/dim/dev/MS/pg_auto_failover/tmux/config XDG_DATA_HOME=/Users/dim/dev/MS/pg_auto_failover/tmux/share XDG_RUNTIME_DIR=/Users/dim/dev/MS/pg_auto_failover/tmux/run
Within that self-contained test location, we can see the following examples.
$ pg_autoctl show file --pgdata ./node1
File | Path --------+----------------
Config | /Users/dim/dev/MS/pg_auto_failover/tmux/config/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node1/pg_autoctl.cfg
State | /Users/dim/dev/MS/pg_auto_failover/tmux/share/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node1/pg_autoctl.state
Init | /Users/dim/dev/MS/pg_auto_failover/tmux/share/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node1/pg_autoctl.init
Pid | /Users/dim/dev/MS/pg_auto_failover/tmux/run/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node1/pg_autoctl.pid
'ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)' $ pg_autoctl show file --pgdata node1 --state /Users/dim/dev/MS/pg_auto_failover/tmux/share/pg_autoctl/Users/dim/dev/MS/pg_auto_failover/tmux/node1/pg_autoctl.state $ pg_autoctl show file --pgdata node1 --state --contents Current Role: primary Assigned Role: primary Last Monitor Contact: Thu Mar 18 17:32:25 2021 Last Secondary Contact: 0 pg_autoctl state version: 1 group: 0 node id: 1 nodes version: 0 PostgreSQL Version: 1201 PostgreSQL CatVersion: 201909212 PostgreSQL System Id: 6940955496243696337 pg_autoctl show file --pgdata node1 --config --contents --json | jq .pg_autoctl {
"role": "keeper",
"monitor": "postgres://autoctl_node@localhost:5500/pg_auto_failover?sslmode=prefer",
"formation": "default",
"group": 0,
"name": "node1",
"hostname": "localhost",
"nodekind": "standalone" }
pg_autoctl show systemd - Print systemd service file for this node
This command outputs a configuration unit that is suitable for registering pg_autoctl as a systemd service.
$ pg_autoctl show systemd --pgdata node1 17:38:29 99778 INFO HINT: to complete a systemd integration, run the following commands: 17:38:29 99778 INFO pg_autoctl -q show systemd --pgdata "node1" | sudo tee /etc/systemd/system/pgautofailover.service 17:38:29 99778 INFO sudo systemctl daemon-reload 17:38:29 99778 INFO sudo systemctl enable pgautofailover 17:38:29 99778 INFO sudo systemctl start pgautofailover [Unit] Description = pg_auto_failover [Service] WorkingDirectory = /Users/dim Environment = 'PGDATA=node1' User = dim ExecStart = /Applications/Postgres.app/Contents/Versions/12/bin/pg_autoctl run Restart = always StartLimitBurst = 0 ExecReload = /Applications/Postgres.app/Contents/Versions/12/bin/pg_autoctl reload [Install] WantedBy = multi-user.target
To avoid the logs output, use the -q option:
$ pg_autoctl show systemd --pgdata node1 -q [Unit] Description = pg_auto_failover [Service] WorkingDirectory = /Users/dim Environment = 'PGDATA=node1' User = dim ExecStart = /Applications/Postgres.app/Contents/Versions/12/bin/pg_autoctl run Restart = always StartLimitBurst = 0 ExecReload = /Applications/Postgres.app/Contents/Versions/12/bin/pg_autoctl reload [Install] WantedBy = multi-user.target
pg_autoctl enable - Enable a feature on a formation
pg_autoctl enable secondary - Enable secondary nodes on a formation
This feature makes the most sense when using the Enterprise Edition of pg_auto_failover, which is fully compatible with Citus formations. When secondary are enabled, then Citus workers creation policy is to assign a primary node then a standby node for each group. When secondary is disabled the Citus workers creation policy is to assign only the primary nodes.
usage: pg_autoctl enable secondary [ --pgdata --formation ] --pgdata path to data directory --formation Formation to enable secondary on
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl enable maintenance - Enable Postgres maintenance mode on this node
A pg_auto_failover can be put to a maintenance state. The Postgres node is then still registered to the monitor, and is known to be unreliable until maintenance is disabled. A node in the maintenance state is not a candidate for promotion.
Typical use of the maintenance state include Operating System or Postgres reboot, e.g. when applying security upgrades.
usage: pg_autoctl enable maintenance [ --pgdata --allow-failover ] --pgdata path to data directory
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000760 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/4000760 | read-only | secondary | secondary node3 | 3 | localhost:5503 | 0/4000760 | read-only | secondary | secondary $ pg_autoctl enable maintenance --pgdata node3 12:06:12 47086 INFO Listening monitor notifications about state changes in formation "default" and group 0 12:06:12 47086 INFO Following table displays times when notifications are received
Time | Name | Node | Host:Port | Current State | Assigned State ---------+-------+-------+----------------+---------------------+-------------------- 12:06:12 | node1 | 1 | localhost:5501 | primary | join_primary 12:06:12 | node3 | 3 | localhost:5503 | secondary | wait_maintenance 12:06:12 | node3 | 3 | localhost:5503 | wait_maintenance | wait_maintenance 12:06:12 | node1 | 1 | localhost:5501 | join_primary | join_primary 12:06:12 | node3 | 3 | localhost:5503 | wait_maintenance | maintenance 12:06:12 | node1 | 1 | localhost:5501 | join_primary | primary 12:06:13 | node3 | 3 | localhost:5503 | maintenance | maintenance $ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000810 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/4000810 | read-only | secondary | secondary node3 | 3 | localhost:5503 | 0/4000810 | none | maintenance | maintenance
pg_autoctl enable ssl - Enable SSL configuration on this node
It is possible to manage Postgres SSL settings with the pg_autoctl command, both at pg_autoctl create postgres time and then again to change your mind and update the SSL settings at run-time.
usage: pg_autoctl enable ssl [ --pgdata ] [ --json ] --pgdata path to data directory --ssl-self-signed setup network encryption using self signed certificates (does NOT protect against MITM) --ssl-mode use that sslmode in connection strings --ssl-ca-file set the Postgres ssl_ca_file to that file path --ssl-crl-file set the Postgres ssl_crl_file to that file path --no-ssl don't enable network encryption (NOT recommended, prefer --ssl-self-signed) --server-key set the Postgres ssl_key_file to that file path --server-cert set the Postgres ssl_cert_file to that file path
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl enable monitor - Enable a monitor for this node to be orchestrated from
It is possible to disable the pg_auto_failover monitor and enable it again online in a running pg_autoctl Postgres node. The main use-cases where this operation is useful is when the monitor node has to be replaced, either after a full crash of the previous monitor node, of for migrating to a new monitor node (hardware replacement, region or zone migration, etc).
usage: pg_autoctl enable monitor [ --pgdata --allow-failover ] postgres://autoctl_node@new.monitor.add.ress/pg_auto_failover --pgdata path to data directory
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000760 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/4000760 | read-only | secondary | secondary $ pg_autoctl enable monitor --pgdata node3 'postgres://autoctl_node@localhost:5500/pg_auto_failover?sslmode=require' 12:42:07 43834 INFO Registered node 3 (localhost:5503) with name "node3" in formation "default", group 0, state "wait_standby" 12:42:07 43834 INFO Successfully registered to the monitor with nodeId 3 12:42:08 43834 INFO Still waiting for the monitor to drive us to state "catchingup" 12:42:08 43834 WARN Please make sure that the primary node is currently running `pg_autoctl run` and contacting the monitor. $ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000810 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/4000810 | read-only | secondary | secondary node3 | 3 | localhost:5503 | 0/4000810 | read-only | secondary | secondary
pg_autoctl disable - Disable a feature on a formation
pg_autoctl disable secondary - Disable secondary nodes on a formation
This feature makes the most sense when using the Enterprise Edition of pg_auto_failover, which is fully compatible with Citus formations. When secondary are disabled, then Citus workers creation policy is to assign a primary node then a standby node for each group. When secondary is disabled the Citus workers creation policy is to assign only the primary nodes.
usage: pg_autoctl disable secondary [ --pgdata --formation ] --pgdata path to data directory --formation Formation to disable secondary on
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl disable maintenance - Disable Postgres maintenance mode on this node
A pg_auto_failover can be put to a maintenance state. The Postgres node is then still registered to the monitor, and is known to be unreliable until maintenance is disabled. A node in the maintenance state is not a candidate for promotion.
Typical use of the maintenance state include Operating System or Postgres reboot, e.g. when applying security upgrades.
usage: pg_autoctl disable maintenance [ --pgdata --allow-failover ] --pgdata path to data directory
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000810 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/4000810 | read-only | secondary | secondary node3 | 3 | localhost:5503 | 0/4000810 | none | maintenance | maintenance $ pg_autoctl disable maintenance --pgdata node3 12:06:37 47542 INFO Listening monitor notifications about state changes in formation "default" and group 0 12:06:37 47542 INFO Following table displays times when notifications are received
Time | Name | Node | Host:Port | Current State | Assigned State ---------+-------+-------+----------------+---------------------+-------------------- 12:06:37 | node3 | 3 | localhost:5503 | maintenance | catchingup 12:06:37 | node3 | 3 | localhost:5503 | catchingup | catchingup 12:06:37 | node3 | 3 | localhost:5503 | catchingup | secondary 12:06:37 | node3 | 3 | localhost:5503 | secondary | secondary $ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000848 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/4000848 | read-only | secondary | secondary node3 | 3 | localhost:5503 | 0/4000000 | read-only | secondary | secondary
pg_autoctl disable ssl - Disable SSL configuration on this node
It is possible to manage Postgres SSL settings with the pg_autoctl command, both at pg_autoctl create postgres time and then again to change your mind and update the SSL settings at run-time.
usage: pg_autoctl disable ssl [ --pgdata ] [ --json ] --pgdata path to data directory --ssl-self-signed setup network encryption using self signed certificates (does NOT protect against MITM) --ssl-mode use that sslmode in connection strings --ssl-ca-file set the Postgres ssl_ca_file to that file path --ssl-crl-file set the Postgres ssl_crl_file to that file path --no-ssl don't disable network encryption (NOT recommended, prefer --ssl-self-signed) --server-key set the Postgres ssl_key_file to that file path --server-cert set the Postgres ssl_cert_file to that file path
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl disable monitor - Disable the monitor for this node
It is possible to disable the pg_auto_failover monitor and enable it again online in a running pg_autoctl Postgres node. The main use-cases where this operation is useful is when the monitor node has to be replaced, either after a full crash of the previous monitor node, of for migrating to a new monitor node (hardware replacement, region or zone migration, etc).
usage: pg_autoctl disable monitor [ --pgdata --force ] --pgdata path to data directory --force force unregistering from the monitor
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000148 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/4000148 | read-only | secondary | secondary node3 | 3 | localhost:5503 | 0/4000148 | read-only | secondary | secondary $ pg_autoctl disable monitor --pgdata node3 12:41:21 43039 INFO Found node 3 "node3" (localhost:5503) on the monitor 12:41:21 43039 FATAL Use --force to remove the node from the monitor $ pg_autoctl disable monitor --pgdata node3 --force 12:41:32 43219 INFO Removing node 3 "node3" (localhost:5503) from monitor $ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000760 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/4000760 | read-only | secondary | secondary
pg_autoctl get - Get a pg_auto_failover node, or formation setting
pg_autoctl get formation settings - get replication settings for a formation from the monitor
This command prints a pg_autoctl replication settings:
usage: pg_autoctl get formation settings [ --pgdata ] [ --json ] [ --formation ] --pgdata path to data directory --json output data in the JSON format --formation pg_auto_failover formation
See also pg_autoctl show settings which is a synonym.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl get formation settings
Context | Name | Setting | Value ----------+---------+---------------------------+------------------------------------------------------------- formation | default | number_sync_standbys | 1
primary | node1 | synchronous_standby_names | 'ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)'
node | node1 | candidate priority | 50
node | node2 | candidate priority | 50
node | node3 | candidate priority | 50
node | node1 | replication quorum | true
node | node2 | replication quorum | true
node | node3 | replication quorum | true $ pg_autoctl get formation settings --json {
"nodes": [
{
"value": "true",
"context": "node",
"node_id": 1,
"setting": "replication quorum",
"group_id": 0,
"nodename": "node1"
},
{
"value": "true",
"context": "node",
"node_id": 2,
"setting": "replication quorum",
"group_id": 0,
"nodename": "node2"
},
{
"value": "true",
"context": "node",
"node_id": 3,
"setting": "replication quorum",
"group_id": 0,
"nodename": "node3"
},
{
"value": "50",
"context": "node",
"node_id": 1,
"setting": "candidate priority",
"group_id": 0,
"nodename": "node1"
},
{
"value": "50",
"context": "node",
"node_id": 2,
"setting": "candidate priority",
"group_id": 0,
"nodename": "node2"
},
{
"value": "50",
"context": "node",
"node_id": 3,
"setting": "candidate priority",
"group_id": 0,
"nodename": "node3"
}
],
"primary": [
{
"value": "'ANY 1 (pgautofailover_standby_2, pgautofailover_standby_3)'",
"context": "primary",
"node_id": 1,
"setting": "synchronous_standby_names",
"group_id": 0,
"nodename": "node1"
}
],
"formation": {
"value": "1",
"context": "formation",
"node_id": null,
"setting": "number_sync_standbys",
"group_id": null,
"nodename": "default"
} }
pg_autoctl get formation number-sync-standbys - get number_sync_standbys for a formation from the monitor
This command prints a pg_autoctl replication settings for number sync standbys:
usage: pg_autoctl get formation number-sync-standbys [ --pgdata ] [ --json ] [ --formation ] --pgdata path to data directory --json output data in the JSON format --formation pg_auto_failover formation
See also pg_autoctl show settings for the full list of replication settings.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl get formation number-sync-standbys 1 $ pg_autoctl get formation number-sync-standbys --json {
"number-sync-standbys": 1 }
pg_autoctl get replication-quorum - get replication-quorum property from the monitor
This command prints pg_autoctl replication quorum for a given node:
usage: pg_autoctl get node replication-quorum [ --pgdata ] [ --json ] [ --formation ] [ --name ] --pgdata path to data directory --formation pg_auto_failover formation --name pg_auto_failover node name --json output data in the JSON format
See also pg_autoctl show settings for the full list of replication settings.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl get node replication-quorum --name node1 true $ pg_autoctl get node replication-quorum --name node1 --json {
"name": "node1",
"replication-quorum": true }
pg_autoctl get candidate-priority - get candidate-priority property from the monitor
This command prints pg_autoctl candidate priority for a given node:
usage: pg_autoctl get node candidate-priority [ --pgdata ] [ --json ] [ --formation ] [ --name ] --pgdata path to data directory --formation pg_auto_failover formation --name pg_auto_failover node name --json output data in the JSON format
See also pg_autoctl show settings for the full list of replication settings.
$ pg_autoctl get node candidate-priority --name node1 50 $ pg_autoctl get node candidate-priority --name node1 --json {
"name": "node1",
"candidate-priority": 50 }
pg_autoctl set - Set a pg_auto_failover node, or formation setting
pg_autoctl set formation number-sync-standbys - set number_sync_standbys for a formation from the monitor
This command set a pg_autoctl replication settings for number sync standbys:
usage: pg_autoctl set formation number-sync-standbys [ --pgdata ] [ --json ] [ --formation ] <number_sync_standbys> --pgdata path to data directory --formation pg_auto_failover formation --json output data in the JSON format
The pg_auto_failover monitor ensures that at least N+1 candidate standby nodes are registered when number-sync-standbys is N. This means that to be able to run the following command, at least 3 standby nodes with a non-zero candidate priority must be registered to the monitor:
$ pg_autoctl set formation number-sync-standbys 2
See also pg_autoctl show settings for the full list of replication settings.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl set replication-quorum - set replication-quorum property from the monitor
This command sets pg_autoctl replication quorum for a given node:
usage: pg_autoctl set node replication-quorum [ --pgdata ] [ --json ] [ --formation ] [ --name ] <true|false> --pgdata path to data directory --formation pg_auto_failover formation --name pg_auto_failover node name --json output data in the JSON format
See also pg_autoctl show settings for the full list of replication settings.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl set node replication-quorum --name node1 false 12:49:37 94092 INFO Waiting for the settings to have been applied to the monitor and primary node 12:49:37 94092 INFO New state is reported by node 1 "node1" (localhost:5501): "apply_settings" 12:49:37 94092 INFO Setting goal state of node 1 "node1" (localhost:5501) to primary after it applied replication properties change. 12:49:37 94092 INFO New state is reported by node 1 "node1" (localhost:5501): "primary" false $ pg_autoctl set node replication-quorum --name node1 true --json 12:49:42 94199 INFO Waiting for the settings to have been applied to the monitor and primary node 12:49:42 94199 INFO New state is reported by node 1 "node1" (localhost:5501): "apply_settings" 12:49:42 94199 INFO Setting goal state of node 1 "node1" (localhost:5501) to primary after it applied replication properties change. 12:49:43 94199 INFO New state is reported by node 1 "node1" (localhost:5501): "primary" {
"replication-quorum": true }
pg_autoctl set candidate-priority - set candidate-priority property from the monitor
This command sets the pg_autoctl candidate priority for a given node:
usage: pg_autoctl set node candidate-priority [ --pgdata ] [ --json ] [ --formation ] [ --name ] <priority: 0..100> --pgdata path to data directory --formation pg_auto_failover formation --name pg_auto_failover node name --json output data in the JSON format
See also pg_autoctl show settings for the full list of replication settings.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl set node candidate-priority --name node1 65 12:47:59 92326 INFO Waiting for the settings to have been applied to the monitor and primary node 12:47:59 92326 INFO New state is reported by node 1 "node1" (localhost:5501): "apply_settings" 12:47:59 92326 INFO Setting goal state of node 1 "node1" (localhost:5501) to primary after it applied replication properties change. 12:47:59 92326 INFO New state is reported by node 1 "node1" (localhost:5501): "primary" 65 $ pg_autoctl set node candidate-priority --name node1 50 --json 12:48:05 92450 INFO Waiting for the settings to have been applied to the monitor and primary node 12:48:05 92450 INFO New state is reported by node 1 "node1" (localhost:5501): "apply_settings" 12:48:05 92450 INFO Setting goal state of node 1 "node1" (localhost:5501) to primary after it applied replication properties change. 12:48:05 92450 INFO New state is reported by node 1 "node1" (localhost:5501): "primary" {
"candidate-priority": 50 }
pg_autoctl perform - Perform an action orchestrated by the monitor
pg_autoctl perform failover - Perform a failover for given formation and group
This command starts a Postgres failover orchestration from the pg_auto_failover monitor:
usage: pg_autoctl perform failover [ --pgdata --formation --group ] --pgdata path to data directory --formation formation to target, defaults to 'default' --group group to target, defaults to 0 --wait how many seconds to wait, default to 60
The pg_auto_failover monitor can be used to orchestrate a manual failover, sometimes also known as a switchover. When doing so, split-brain are prevented thanks to intermediary states being used in the Finite State Machine.
The pg_autoctl perform failover command waits until the failover is known complete on the monitor, or until the hard-coded 60s timeout has passed.
The failover orchestration is done in the background by the monitor, so even if the pg_autoctl perform failover stops on the timeout, the failover orchestration continues at the monitor.
PGDATA
PG_AUTOCTL_MONITOR
PGHOST, PGPORT, PGDATABASE, PGUSER, PGCONNECT_TIMEOUT, ...
TMPDIR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl perform failover 12:57:30 3635 INFO Listening monitor notifications about state changes in formation "default" and group 0 12:57:30 3635 INFO Following table displays times when notifications are received
Time | Name | Node | Host:Port | Current State | Assigned State ---------+-------+-------+----------------+---------------------+-------------------- 12:57:30 | node1 | 1 | localhost:5501 | primary | draining 12:57:30 | node1 | 1 | localhost:5501 | draining | draining 12:57:30 | node2 | 2 | localhost:5502 | secondary | report_lsn 12:57:30 | node3 | 3 | localhost:5503 | secondary | report_lsn 12:57:36 | node3 | 3 | localhost:5503 | report_lsn | report_lsn 12:57:36 | node2 | 2 | localhost:5502 | report_lsn | report_lsn 12:57:36 | node2 | 2 | localhost:5502 | report_lsn | prepare_promotion 12:57:36 | node2 | 2 | localhost:5502 | prepare_promotion | prepare_promotion 12:57:36 | node2 | 2 | localhost:5502 | prepare_promotion | stop_replication 12:57:36 | node1 | 1 | localhost:5501 | draining | demote_timeout 12:57:36 | node3 | 3 | localhost:5503 | report_lsn | join_secondary 12:57:36 | node1 | 1 | localhost:5501 | demote_timeout | demote_timeout 12:57:36 | node3 | 3 | localhost:5503 | join_secondary | join_secondary 12:57:37 | node2 | 2 | localhost:5502 | stop_replication | stop_replication 12:57:37 | node2 | 2 | localhost:5502 | stop_replication | wait_primary 12:57:37 | node1 | 1 | localhost:5501 | demote_timeout | demoted 12:57:37 | node1 | 1 | localhost:5501 | demoted | demoted 12:57:37 | node2 | 2 | localhost:5502 | wait_primary | wait_primary 12:57:37 | node3 | 3 | localhost:5503 | join_secondary | secondary 12:57:37 | node1 | 1 | localhost:5501 | demoted | catchingup 12:57:38 | node3 | 3 | localhost:5503 | secondary | secondary 12:57:38 | node2 | 2 | localhost:5502 | wait_primary | primary 12:57:38 | node1 | 1 | localhost:5501 | catchingup | catchingup 12:57:38 | node2 | 2 | localhost:5502 | primary | primary $ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000F50 | read-only | secondary | secondary node2 | 2 | localhost:5502 | 0/4000F50 | read-write | primary | primary node3 | 3 | localhost:5503 | 0/4000F50 | read-only | secondary | secondary
pg_autoctl perform switchover - Perform a switchover for given formation and group
This command starts a Postgres switchover orchestration from the pg_auto_switchover monitor:
usage: pg_autoctl perform switchover [ --pgdata --formation --group ] --pgdata path to data directory --formation formation to target, defaults to 'default' --group group to target, defaults to 0
The pg_auto_switchover monitor can be used to orchestrate a manual switchover, sometimes also known as a switchover. When doing so, split-brain are prevented thanks to intermediary states being used in the Finite State Machine.
The pg_autoctl perform switchover command waits until the switchover is known complete on the monitor, or until the hard-coded 60s timeout has passed.
The switchover orchestration is done in the background by the monitor, so even if the pg_autoctl perform switchover stops on the timeout, the switchover orchestration continues at the monitor.
See also pg_autoctl perform failover, a synonym for this command.
PGDATA
PG_AUTOCTL_MONITOR
PG_CONFIG
PATH
PGHOST, PGPORT, PGDATABASE, PGUSER, PGCONNECT_TIMEOUT, ...
TMPDIR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl perform promotion - Perform a failover that promotes a target node
This command starts a Postgres failover orchestration from the pg_auto_promotion monitor and targets given node:
usage: pg_autoctl perform promotion [ --pgdata --formation --group ] --pgdata path to data directory --formation formation to target, defaults to 'default' --name node name to target, defaults to current node --wait how many seconds to wait, default to 60
The pg_auto_promotion monitor can be used to orchestrate a manual promotion, sometimes also known as a switchover. When doing so, split-brain are prevented thanks to intermediary states being used in the Finite State Machine.
The pg_autoctl perform promotion command waits until the promotion is known complete on the monitor, or until the hard-coded 60s timeout has passed.
The promotion orchestration is done in the background by the monitor, so even if the pg_autoctl perform promotion stops on the timeout, the promotion orchestration continues at the monitor.
PGDATA
PG_AUTOCTL_MONITOR
PG_CONFIG
PATH
PGHOST, PGPORT, PGDATABASE, PGUSER, PGCONNECT_TIMEOUT, ...
TMPDIR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/4000F88 | read-only | secondary | secondary node2 | 2 | localhost:5502 | 0/4000F88 | read-write | primary | primary node3 | 3 | localhost:5503 | 0/4000F88 | read-only | secondary | secondary $ pg_autoctl perform promotion --name node1 13:08:13 15297 INFO Listening monitor notifications about state changes in formation "default" and group 0 13:08:13 15297 INFO Following table displays times when notifications are received
Time | Name | Node | Host:Port | Current State | Assigned State ---------+-------+-------+----------------+---------------------+-------------------- 13:08:13 | node1 | 0/1 | localhost:5501 | secondary | secondary 13:08:13 | node2 | 0/2 | localhost:5502 | primary | draining 13:08:13 | node2 | 0/2 | localhost:5502 | draining | draining 13:08:13 | node1 | 0/1 | localhost:5501 | secondary | report_lsn 13:08:13 | node3 | 0/3 | localhost:5503 | secondary | report_lsn 13:08:19 | node3 | 0/3 | localhost:5503 | report_lsn | report_lsn 13:08:19 | node1 | 0/1 | localhost:5501 | report_lsn | report_lsn 13:08:19 | node1 | 0/1 | localhost:5501 | report_lsn | prepare_promotion 13:08:19 | node1 | 0/1 | localhost:5501 | prepare_promotion | prepare_promotion 13:08:19 | node1 | 0/1 | localhost:5501 | prepare_promotion | stop_replication 13:08:19 | node2 | 0/2 | localhost:5502 | draining | demote_timeout 13:08:19 | node3 | 0/3 | localhost:5503 | report_lsn | join_secondary 13:08:19 | node2 | 0/2 | localhost:5502 | demote_timeout | demote_timeout 13:08:19 | node3 | 0/3 | localhost:5503 | join_secondary | join_secondary 13:08:20 | node1 | 0/1 | localhost:5501 | stop_replication | stop_replication 13:08:20 | node1 | 0/1 | localhost:5501 | stop_replication | wait_primary 13:08:20 | node2 | 0/2 | localhost:5502 | demote_timeout | demoted 13:08:20 | node1 | 0/1 | localhost:5501 | wait_primary | wait_primary 13:08:20 | node3 | 0/3 | localhost:5503 | join_secondary | secondary 13:08:20 | node2 | 0/2 | localhost:5502 | demoted | demoted 13:08:20 | node2 | 0/2 | localhost:5502 | demoted | catchingup 13:08:21 | node3 | 0/3 | localhost:5503 | secondary | secondary 13:08:21 | node1 | 0/1 | localhost:5501 | wait_primary | primary 13:08:21 | node2 | 0/2 | localhost:5502 | catchingup | catchingup 13:08:21 | node1 | 0/1 | localhost:5501 | primary | primary $ pg_autoctl show state
Name | Node | Host:Port | LSN | Connection | Current State | Assigned State ------+-------+----------------+-----------+--------------+---------------------+-------------------- node1 | 1 | localhost:5501 | 0/40012F0 | read-write | primary | primary node2 | 2 | localhost:5502 | 0/40012F0 | read-only | secondary | secondary node3 | 3 | localhost:5503 | 0/40012F0 | read-only | secondary | secondary
pg_autoctl do - Internal commands and internal QA tooling
The debug commands for pg_autoctl are only available when the environment variable PG_AUTOCTL_DEBUG is set (to any value).
When testing pg_auto_failover, it is helpful to be able to play with the local nodes using the same lower-level API as used by the pg_auto_failover Finite State Machine transitions. Some commands could be useful in contexts other than pg_auto_failover development and QA work, so some documentation has been made available.
pg_autoctl do tmux - Set of facilities to handle tmux interactive sessions
pg_autoctl do tmux provides the following commands:
pg_autoctl do tmux + compose Set of facilities to handle docker-compose sessions
script Produce a tmux script for a demo or a test case (debug only)
session Run a tmux session for a demo or a test case
stop Stop pg_autoctl processes that belong to a tmux session
wait Wait until a given node has been registered on the monitor
clean Clean-up a tmux session processes and root dir pg_autoctl do tmux compose
config Produce a docker-compose configuration file for a demo
script Produce a tmux script for a demo or a test case (debug only)
session Run a tmux session for a demo or a test case
An easy way to get started with pg_auto_failover in a localhost only formation with three nodes is to run the following command:
$ PG_AUTOCTL_DEBUG=1 pg_autoctl do tmux session \
--root /tmp/pgaf \
--first-pgport 9000 \
--nodes 4 \
--layout tiled
This requires the command tmux to be available in your PATH. The pg_autoctl do tmux session commands prepares a self-contained root directory where to create pg_auto_failover nodes and their configuration, then prepares a tmux script, and then runs the script with a command such as:
/usr/local/bin/tmux -v start-server ; source-file /tmp/pgaf/script-9000.tmux
The tmux session contains a single tmux window multiple panes:
Usually the first two commands to run in the interactive shell, once the formation is stable (one node is primary, the other ones are all secondary), are the following:
$ pg_autoctl get formation settings $ pg_autoctl perform failover
The same idea can also be implemented with docker-compose to run the nodes, and still using tmux to have three control panels this time:
For this setup, you can use the following command:
PG_AUTOCTL_DEBUG=1 pg_autoctl do tmux compose session \
--root ./tmux/citus \
--first-pgport 5600 \
--nodes 3 \
--async-nodes 0 \
--node-priorities 50,50,0 \
--sync-standbys -1 \
--citus-workers 4 \
--citus-secondaries 0 \
--citus \
--layout even-vertical
The pg_autoctl do tmux compose session command also takes care of creating external docker volumes and referencing them for each node in the docker-compose file.
This command runs a tmux session for a demo or a test case.
usage: pg_autoctl do tmux session [option ...]
--root path where to create a cluster
--first-pgport first Postgres port to use (5500)
--nodes number of Postgres nodes to create (2)
--async-nodes number of async nodes within nodes (0)
--node-priorities list of nodes priorities (50)
--sync-standbys number-sync-standbys to set (0 or 1)
--skip-pg-hba use --skip-pg-hba when creating nodes
--citus start a Citus formation
--citus-workers number of Citus workers to create (2)
--citus-secondaries number of Citus secondaries to create (0)
--layout tmux layout to use (even-vertical)
--binpath path to the pg_autoctl binary (current binary path)
This command runs a tmux session for a demo or a test case. It generates a docker-compose file and then uses docker-compose to drive many nodes.
usage: pg_autoctl do tmux compose session [option ...]
--root path where to create a cluster
--first-pgport first Postgres port to use (5500)
--nodes number of Postgres nodes to create (2)
--async-nodes number of async nodes within nodes (0)
--node-priorities list of nodes priorities (50)
--sync-standbys number-sync-standbys to set (0 or 1)
--skip-pg-hba use --skip-pg-hba when creating nodes
--layout tmux layout to use (even-vertical)
--binpath path to the pg_autoctl binary (current binary path)
pg_autoctl do demo - Use a demo application for pg_auto_failover
pg_autoctl do demo provides the following commands:
pg_autoctl do demo
run Run the pg_auto_failover demo application
uri Grab the application connection string from the monitor
ping Attempt to connect to the application URI
summary Display a summary of the previous demo app run
To run a demo, use pg_autoctl do demo run:
usage: pg_autoctl do demo run [option ...] --monitor Postgres URI of the pg_auto_failover monitor --formation Formation to use (default) --group Group Id to failover (0) --username PostgreSQL's username --clients How many client processes to use (1) --duration Duration of the demo app, in seconds (30) --first-failover Timing of the first failover (10) --failover-freq Seconds between subsequent failovers (45)
The pg_autoctl debug tooling includes a demo application.
The demo prepare its Postgres schema on the target database, and then starts several clients (see --clients) that concurrently connect to the target application URI and record the time it took to establish the Postgres connection to the current read-write node, with information about the retry policy metrics.
$ pg_autoctl do demo run --monitor 'postgres://autoctl_node@localhost:5500/pg_auto_failover?sslmode=prefer' --clients 10 14:43:35 19660 INFO Using application connection string "postgres://localhost:5502,localhost:5503,localhost:5501/demo?target_session_attrs=read-write&sslmode=prefer" 14:43:35 19660 INFO Using Postgres user PGUSER "dim" 14:43:35 19660 INFO Preparing demo schema: drop schema if exists demo cascade 14:43:35 19660 WARN NOTICE: schema "demo" does not exist, skipping 14:43:35 19660 INFO Preparing demo schema: create schema demo 14:43:35 19660 INFO Preparing demo schema: create table demo.tracking(ts timestamptz default now(), client integer, loop integer, retries integer, us bigint, recovery bool) 14:43:36 19660 INFO Preparing demo schema: create table demo.client(client integer, pid integer, retry_sleep_ms integer, retry_cap_ms integer, failover_count integer) 14:43:36 19660 INFO Starting 10 concurrent clients as sub-processes 14:43:36 19675 INFO Failover client is started, will failover in 10s and every 45s after that ... $ pg_autoctl do demo summary --monitor 'postgres://autoctl_node@localhost:5500/pg_auto_failover?sslmode=prefer' --clients 10 14:44:27 22789 INFO Using application connection string "postgres://localhost:5503,localhost:5501,localhost:5502/demo?target_session_attrs=read-write&sslmode=prefer" 14:44:27 22789 INFO Using Postgres user PGUSER "dim" 14:44:27 22789 INFO Summary for the demo app running with 10 clients for 30s
Client | Connections | Retries | Min Connect Time (ms) | max | p95 | p99 ----------------------+-------------+---------+-----------------------+----------+---------+---------
Client 1 | 136 | 14 | 58.318 | 2601.165 | 244.443 | 261.809
Client 2 | 136 | 5 | 55.199 | 2514.968 | 242.362 | 259.282
Client 3 | 134 | 6 | 55.815 | 2974.247 | 241.740 | 262.908
Client 4 | 135 | 7 | 56.542 | 2970.922 | 238.995 | 251.177
Client 5 | 136 | 8 | 58.339 | 2758.106 | 238.720 | 252.439
Client 6 | 134 | 9 | 58.679 | 2813.653 | 244.696 | 254.674
Client 7 | 134 | 11 | 58.737 | 2795.974 | 243.202 | 253.745
Client 8 | 136 | 12 | 52.109 | 2354.952 | 242.664 | 254.233
Client 9 | 137 | 19 | 59.735 | 2628.496 | 235.668 | 253.582
Client 10 | 133 | 6 | 57.994 | 3060.489 | 242.156 | 256.085
All Clients Combined | 1351 | 97 | 52.109 | 3060.489 | 241.848 | 258.450 (11 rows)
Min Connect Time (ms) | max | freq | bar -----------------------+----------+------+-----------------------------------------------
52.109 | 219.105 | 1093 | ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
219.515 | 267.168 | 248 | ▒▒▒▒▒▒▒▒▒▒
2354.952 | 2354.952 | 1 |
2514.968 | 2514.968 | 1 |
2601.165 | 2628.496 | 2 |
2758.106 | 2813.653 | 3 |
2970.922 | 2974.247 | 2 |
3060.489 | 3060.489 | 1 | (8 rows)
pg_autoctl do service restart - Run pg_autoctl sub-processes (services)
pg_autoctl do service restart provides the following commands:
pg_autoctl do service restart
postgres Restart the pg_autoctl postgres controller service
listener Restart the pg_autoctl monitor listener service
node-active Restart the pg_autoctl keeper node-active service
It is possible to restart the pg_autoctl or the Postgres service without affecting the other running service. Typically, to restart the pg_autoctl parts without impacting Postgres:
$ pg_autoctl do service restart node-active --pgdata node1 14:52:06 31223 INFO Sending the TERM signal to service "node-active" with pid 26626 14:52:06 31223 INFO Service "node-active" has been restarted with pid 31230 31230
The Postgres service has not been impacted by the restart of the pg_autoctl process.
pg_autoctl do show - Show some debug level information
The commands pg_autoctl create monitor and pg_autoctl create postgres both implement some level of automated detection of the node network settings when the option --hostname is not used.
Adding to those commands, when a new node is registered to the monitor, other nodes also edit their Postgres HBA rules to allow the new node to connect, unless the option --skip-pg-hba has been used.
The debug sub-commands for pg_autoctl do show can be used to see in details the network discovery done by pg_autoctl.
pg_autoctl do show provides the following commands:
pg_autoctl do show
ipaddr Print this node's IP address information
cidr Print this node's CIDR information
lookup Print this node's DNS lookup information
hostname Print this node's default hostname
reverse Lookup given hostname and check reverse DNS setup
Connects to an external IP address and uses getsockname(2) to retrieve the current address to which the socket is bound.
The external IP address defaults to 8.8.8.8, the IP address of a Google provided public DNS server, or to the monitor IP address or hostname in the context of pg_autoctl create postgres.
$ pg_autoctl do show ipaddr 16:42:40 62631 INFO ipaddr.c:107: Connecting to 8.8.8.8 (port 53) 192.168.1.156
Connects to an external IP address in the same way as the previous command pg_autoctl do show ipaddr and then matches the local socket name with the list of local network interfaces. When a match is found, uses the netmask of the interface to compute the CIDR notation from the IP address.
The computed CIDR notation is then used in HBA rules.
$ pg_autoctl do show cidr 16:43:19 63319 INFO Connecting to 8.8.8.8 (port 53) 192.168.1.0/24
Uses either its first (and only) argument or the result of gethostname(2) as the candidate hostname to use in HBA rules, and then check that the hostname resolves to an IP address that belongs to one of the machine network interfaces.
When the hostname forward-dns lookup resolves to an IP address that is local to the node where the command is run, then a reverse-lookup from the IP address is made to see if it matches with the candidate hostname.
$ pg_autoctl do show hostname DESKTOP-IC01GOOS.europe.corp.microsoft.com $ pg_autoctl -vv do show hostname 'postgres://autoctl_node@localhost:5500/pg_auto_failover' 13:45:00 93122 INFO cli_do_show.c:256: Using monitor hostname "localhost" and port 5500 13:45:00 93122 INFO ipaddr.c:107: Connecting to ::1 (port 5500) 13:45:00 93122 DEBUG cli_do_show.c:272: cli_show_hostname: ip ::1 13:45:00 93122 DEBUG cli_do_show.c:283: cli_show_hostname: host localhost 13:45:00 93122 DEBUG cli_do_show.c:294: cli_show_hostname: ip ::1 localhost
Checks that the given argument is an hostname that resolves to a local IP address, that is an IP address associated with a local network interface.
$ pg_autoctl do show lookup DESKTOP-IC01GOOS.europe.corp.microsoft.com DESKTOP-IC01GOOS.europe.corp.microsoft.com: 192.168.1.156
Implements the same DNS checks as Postgres HBA matching code: first does a forward DNS lookup of the given hostname, and then a reverse-lookup from all the IP addresses obtained. Success is reached when at least one of the IP addresses from the forward lookup resolves back to the given hostname (as the first answer to the reverse DNS lookup).
$ pg_autoctl do show reverse DESKTOP-IC01GOOS.europe.corp.microsoft.com 16:44:49 64910 FATAL Failed to find an IP address for hostname "DESKTOP-IC01GOOS.europe.corp.microsoft.com" that matches hostname again in a reverse-DNS lookup. 16:44:49 64910 INFO Continuing with IP address "192.168.1.156" $ pg_autoctl -vv do show reverse DESKTOP-IC01GOOS.europe.corp.microsoft.com 16:44:45 64832 DEBUG ipaddr.c:719: DESKTOP-IC01GOOS.europe.corp.microsoft.com has address 192.168.1.156 16:44:45 64832 DEBUG ipaddr.c:733: reverse lookup for "192.168.1.156" gives "desktop-ic01goos.europe.corp.microsoft.com" first 16:44:45 64832 DEBUG ipaddr.c:719: DESKTOP-IC01GOOS.europe.corp.microsoft.com has address 192.168.1.156 16:44:45 64832 DEBUG ipaddr.c:733: reverse lookup for "192.168.1.156" gives "desktop-ic01goos.europe.corp.microsoft.com" first 16:44:45 64832 DEBUG ipaddr.c:719: DESKTOP-IC01GOOS.europe.corp.microsoft.com has address 2a01:110:10:40c::2ad 16:44:45 64832 DEBUG ipaddr.c:728: Failed to resolve hostname from address "192.168.1.156": nodename nor servname provided, or not known 16:44:45 64832 DEBUG ipaddr.c:719: DESKTOP-IC01GOOS.europe.corp.microsoft.com has address 2a01:110:10:40c::2ad 16:44:45 64832 DEBUG ipaddr.c:728: Failed to resolve hostname from address "192.168.1.156": nodename nor servname provided, or not known 16:44:45 64832 DEBUG ipaddr.c:719: DESKTOP-IC01GOOS.europe.corp.microsoft.com has address 100.64.34.213 16:44:45 64832 DEBUG ipaddr.c:728: Failed to resolve hostname from address "192.168.1.156": nodename nor servname provided, or not known 16:44:45 64832 DEBUG ipaddr.c:719: DESKTOP-IC01GOOS.europe.corp.microsoft.com has address 100.64.34.213 16:44:45 64832 DEBUG ipaddr.c:728: Failed to resolve hostname from address "192.168.1.156": nodename nor servname provided, or not known 16:44:45 64832 FATAL cli_do_show.c:333: Failed to find an IP address for hostname "DESKTOP-IC01GOOS.europe.corp.microsoft.com" that matches hostname again in a reverse-DNS lookup. 16:44:45 64832 INFO cli_do_show.c:334: Continuing with IP address "192.168.1.156"
pg_autoctl do pgsetup - Manage a local Postgres setup
The main pg_autoctl commands implement low-level management tooling for a local Postgres instance. Some of the low-level Postgres commands can be used as their own tool in some cases.
pg_autoctl do pgsetup provides the following commands:
pg_autoctl do pgsetup
pg_ctl Find a non-ambiguous pg_ctl program and Postgres version
discover Discover local PostgreSQL instance, if any
ready Return true is the local Postgres server is ready
wait Wait until the local Postgres server is ready
logs Outputs the Postgres startup logs
tune Compute and log some Postgres tuning options
In a similar way to which -a, this commands scans your PATH for pg_ctl commands. Then it runs the pg_ctl --version command and parses the output to determine the version of Postgres that is available in the path.
$ pg_autoctl do pgsetup pg_ctl --pgdata node1 16:49:18 69684 INFO Environment variable PG_CONFIG is set to "/Applications/Postgres.app//Contents/Versions/12/bin/pg_config" 16:49:18 69684 INFO `pg_autoctl create postgres` would use "/Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl" for Postgres 12.3 16:49:18 69684 INFO `pg_autoctl create monitor` would use "/Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl" for Postgres 12.3
Given a PGDATA or --pgdata option, the command discovers if a running Postgres service matches the pg_autoctl setup, and prints the information that pg_autoctl typically needs when managing a Postgres instance.
$ pg_autoctl do pgsetup discover --pgdata node1 pgdata: /Users/dim/dev/MS/pg_auto_failover/tmux/node1 pg_ctl: /Applications/Postgres.app/Contents/Versions/12/bin/pg_ctl pg_version: 12.3 pghost: /tmp pgport: 5501 proxyport: 0 pid: 21029 is in recovery: no Control Version: 1201 Catalog Version: 201909212 System Identifier: 6942422768095393833 Latest checkpoint LSN: 0/4059C18 Postmaster status: ready
Similar to the pg_isready command, though uses the Postgres specifications found in the pg_autoctl node setup.
$ pg_autoctl do pgsetup ready --pgdata node1 16:50:08 70582 INFO Postgres status is: "ready"
When pg_autoctl do pgsetup ready would return false because Postgres is not ready yet, this command continues probing every second for 30 seconds, and exists as soon as Postgres is ready.
$ pg_autoctl do pgsetup wait --pgdata node1 16:50:22 70829 INFO Postgres is now serving PGDATA "/Users/dim/dev/MS/pg_auto_failover/tmux/node1" on port 5501 with pid 21029 16:50:22 70829 INFO Postgres status is: "ready"
Outputs the Postgres logs from the most recent log file in the PGDATA/log directory.
$ pg_autoctl do pgsetup logs --pgdata node1 16:50:39 71126 WARN Postgres logs from "/Users/dim/dev/MS/pg_auto_failover/tmux/node1/startup.log": 16:50:39 71126 INFO 2021-03-22 14:43:48.911 CET [21029] LOG: starting PostgreSQL 12.3 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit 16:50:39 71126 INFO 2021-03-22 14:43:48.913 CET [21029] LOG: listening on IPv6 address "::", port 5501 16:50:39 71126 INFO 2021-03-22 14:43:48.913 CET [21029] LOG: listening on IPv4 address "0.0.0.0", port 5501 16:50:39 71126 INFO 2021-03-22 14:43:48.913 CET [21029] LOG: listening on Unix socket "/tmp/.s.PGSQL.5501" 16:50:39 71126 INFO 2021-03-22 14:43:48.931 CET [21029] LOG: redirecting log output to logging collector process 16:50:39 71126 INFO 2021-03-22 14:43:48.931 CET [21029] HINT: Future log output will appear in directory "log". 16:50:39 71126 WARN Postgres logs from "/Users/dim/dev/MS/pg_auto_failover/tmux/node1/log/postgresql-2021-03-22_144348.log": 16:50:39 71126 INFO 2021-03-22 14:43:48.937 CET [21033] LOG: database system was shut down at 2021-03-22 14:43:46 CET 16:50:39 71126 INFO 2021-03-22 14:43:48.937 CET [21033] LOG: entering standby mode 16:50:39 71126 INFO 2021-03-22 14:43:48.942 CET [21033] LOG: consistent recovery state reached at 0/4022E88 16:50:39 71126 INFO 2021-03-22 14:43:48.942 CET [21033] LOG: invalid record length at 0/4022E88: wanted 24, got 0 16:50:39 71126 INFO 2021-03-22 14:43:48.946 CET [21029] LOG: database system is ready to accept read only connections 16:50:39 71126 INFO 2021-03-22 14:43:49.032 CET [21038] LOG: fetching timeline history file for timeline 4 from primary server 16:50:39 71126 INFO 2021-03-22 14:43:49.037 CET [21038] LOG: started streaming WAL from primary at 0/4000000 on timeline 3 16:50:39 71126 INFO 2021-03-22 14:43:49.046 CET [21038] LOG: replication terminated by primary server 16:50:39 71126 INFO 2021-03-22 14:43:49.046 CET [21038] DETAIL: End of WAL reached on timeline 3 at 0/4022E88. 16:50:39 71126 INFO 2021-03-22 14:43:49.047 CET [21033] LOG: new target timeline is 4 16:50:39 71126 INFO 2021-03-22 14:43:49.049 CET [21038] LOG: restarted WAL streaming at 0/4000000 on timeline 4 16:50:39 71126 INFO 2021-03-22 14:43:49.210 CET [21033] LOG: redo starts at 0/4022E88 16:50:39 71126 INFO 2021-03-22 14:52:06.692 CET [21029] LOG: received SIGHUP, reloading configuration files 16:50:39 71126 INFO 2021-03-22 14:52:06.906 CET [21029] LOG: received SIGHUP, reloading configuration files 16:50:39 71126 FATAL 2021-03-22 15:34:24.920 CET [21038] FATAL: terminating walreceiver due to timeout 16:50:39 71126 INFO 2021-03-22 15:34:24.973 CET [21033] LOG: invalid record length at 0/4059CC8: wanted 24, got 0 16:50:39 71126 INFO 2021-03-22 15:34:25.105 CET [35801] LOG: started streaming WAL from primary at 0/4000000 on timeline 4 16:50:39 71126 FATAL 2021-03-22 16:12:56.918 CET [35801] FATAL: terminating walreceiver due to timeout 16:50:39 71126 INFO 2021-03-22 16:12:57.086 CET [38741] LOG: started streaming WAL from primary at 0/4000000 on timeline 4 16:50:39 71126 FATAL 2021-03-22 16:23:39.349 CET [38741] FATAL: terminating walreceiver due to timeout 16:50:39 71126 INFO 2021-03-22 16:23:39.497 CET [41635] LOG: started streaming WAL from primary at 0/4000000 on timeline 4
Outputs the pg_autoclt automated tuning options. Depending on the number of CPU and amount of RAM detected in the environment where it is run, pg_autoctl can adjust some very basic Postgres tuning knobs to get started.
$ pg_autoctl do pgsetup tune --pgdata node1 -vv 13:25:25 77185 DEBUG pgtuning.c:85: Detected 12 CPUs and 16 GB total RAM on this server 13:25:25 77185 DEBUG pgtuning.c:225: Setting autovacuum_max_workers to 3 13:25:25 77185 DEBUG pgtuning.c:228: Setting shared_buffers to 4096 MB 13:25:25 77185 DEBUG pgtuning.c:231: Setting work_mem to 24 MB 13:25:25 77185 DEBUG pgtuning.c:235: Setting maintenance_work_mem to 512 MB 13:25:25 77185 DEBUG pgtuning.c:239: Setting effective_cache_size to 12 GB # basic tuning computed by pg_auto_failover track_functions = pl shared_buffers = '4096 MB' work_mem = '24 MB' maintenance_work_mem = '512 MB' effective_cache_size = '12 GB' autovacuum_max_workers = 3 autovacuum_vacuum_scale_factor = 0.08 autovacuum_analyze_scale_factor = 0.02
The low-level API is made available through the following pg_autoctl do commands, only available in debug environments:
pg_autoctl do + monitor Query a pg_auto_failover monitor + fsm Manually manage the keeper's state + primary Manage a PostgreSQL primary server + standby Manage a PostgreSQL standby server + show Show some debug level information + pgsetup Manage a local Postgres setup + pgctl Signal the pg_autoctl postgres service + service Run pg_autoctl sub-processes (services) + tmux Set of facilities to handle tmux interactive sessions + azure Manage a set of Azure resources for a pg_auto_failover demo + demo Use a demo application for pg_auto_failover pg_autoctl do monitor + get Get information from the monitor
register Register the current node with the monitor
active Call in the pg_auto_failover Node Active protocol
version Check that monitor version is 1.5.0.1; alter extension update if not
parse-notification parse a raw notification message pg_autoctl do monitor get
primary Get the primary node from pg_auto_failover in given formation/group
others Get the other nodes from the pg_auto_failover group of hostname/port
coordinator Get the coordinator node from the pg_auto_failover formation pg_autoctl do fsm
init Initialize the keeper's state on-disk
state Read the keeper's state from disk and display it
list List reachable FSM states from current state
gv Output the FSM as a .gv program suitable for graphviz/dot
assign Assign a new goal state to the keeper
step Make a state transition if instructed by the monitor + nodes Manually manage the keeper's nodes list pg_autoctl do fsm nodes
get Get the list of nodes from file (see --disable-monitor)
set Set the list of nodes to file (see --disable-monitor) pg_autoctl do primary + slot Manage replication slot on the primary server + adduser Create users on primary
defaults Add default settings to postgresql.conf
identify Run the IDENTIFY_SYSTEM replication command on given host pg_autoctl do primary slot
create Create a replication slot on the primary server
drop Drop a replication slot on the primary server pg_autoctl do primary adduser
monitor add a local user for queries from the monitor
replica add a local user with replication privileges pg_autoctl do standby
init Initialize the standby server using pg_basebackup
rewind Rewind a demoted primary server using pg_rewind
promote Promote a standby server to become writable pg_autoctl do show
ipaddr Print this node's IP address information
cidr Print this node's CIDR information
lookup Print this node's DNS lookup information
hostname Print this node's default hostname
reverse Lookup given hostname and check reverse DNS setup pg_autoctl do pgsetup
pg_ctl Find a non-ambiguous pg_ctl program and Postgres version
discover Discover local PostgreSQL instance, if any
ready Return true is the local Postgres server is ready
wait Wait until the local Postgres server is ready
logs Outputs the Postgres startup logs
tune Compute and log some Postgres tuning options pg_autoctl do pgctl
on Signal pg_autoctl postgres service to ensure Postgres is running
off Signal pg_autoctl postgres service to ensure Postgres is stopped pg_autoctl do service + getpid Get the pid of pg_autoctl sub-processes (services) + restart Restart pg_autoctl sub-processes (services)
pgcontroller pg_autoctl supervised postgres controller
postgres pg_autoctl service that start/stop postgres when asked
listener pg_autoctl service that listens to the monitor notifications
node-active pg_autoctl service that implements the node active protocol pg_autoctl do service getpid
postgres Get the pid of the pg_autoctl postgres controller service
listener Get the pid of the pg_autoctl monitor listener service
node-active Get the pid of the pg_autoctl keeper node-active service pg_autoctl do service restart
postgres Restart the pg_autoctl postgres controller service
listener Restart the pg_autoctl monitor listener service
node-active Restart the pg_autoctl keeper node-active service pg_autoctl do tmux
script Produce a tmux script for a demo or a test case (debug only)
session Run a tmux session for a demo or a test case
stop Stop pg_autoctl processes that belong to a tmux session
wait Wait until a given node has been registered on the monitor
clean Clean-up a tmux session processes and root dir pg_autoctl do azure + provision provision azure resources for a pg_auto_failover demo + tmux Run a tmux session with an Azure setup for QA/testing + show show azure resources for a pg_auto_failover demo
deploy Deploy a pg_autoctl VMs, given by name
create Create an azure QA environment
drop Drop an azure QA environment: resource group, network, VMs
ls List resources in a given azure region
ssh Runs ssh -l ha-admin <public ip address> for a given VM name
sync Rsync pg_auto_failover sources on all the target region VMs pg_autoctl do azure provision
region Provision an azure region: resource group, network, VMs
nodes Provision our pre-created VM with pg_autoctl Postgres nodes pg_autoctl do azure tmux
session Create or attach a tmux session for the created Azure VMs
kill Kill an existing tmux session for Azure VMs pg_autoctl do azure show
ips Show public and private IP addresses for selected VMs
state Connect to the monitor node to show the current state pg_autoctl do demo
run Run the pg_auto_failover demo application
uri Grab the application connection string from the monitor
ping Attempt to connect to the application URI
summary Display a summary of the previous demo app run
pg_autoctl run - Run the pg_autoctl service (monitor or keeper)
This commands starts the processes needed to run a monitor node or a keeper node, depending on the configuration file that belongs to the --pgdata option or PGDATA environment variable.
usage: pg_autoctl run [ --pgdata --name --hostname --pgport ] --pgdata path to data directory --name pg_auto_failover node name --hostname hostname used to connect from other nodes --pgport PostgreSQL's port number
When registering Postgres nodes to the pg_auto_failover monitor using the pg_autoctl create postgres command, the nodes are registered with metadata: the node name, hostname and Postgres port.
The node name is used mostly in the logs and pg_autoctl show state commands and helps human administrators of the formation.
The node hostname and pgport are used by other nodes, including the pg_auto_failover monitor, to open a Postgres connection.
Both the node name and the node hostname and port can be changed after the node registration by using either this command (pg_autoctl run) or the pg_autoctl config set command.
The --name option allows using a user-friendly name for your Postgres nodes.
When not provided, a default value is computed by running the following algorithm.
When the forward DNS lookup response in step 3. is an IP address found in one of our local network interfaces, then pg_autoctl uses the hostname found in step 2. as the default --hostname. Otherwise it uses the IP address found in step 1.
You may use the --hostname command line option to bypass the whole DNS lookup based process and force the local node name to a fixed value.
pg_autoctl watch - Display an auto-updating dashboard
This command outputs the events that the pg_auto_failover events records about state changes of the pg_auto_failover nodes managed by the monitor:
usage: pg_autoctl watch [ --pgdata --formation --group ] --pgdata path to data directory --monitor show the monitor uri --formation formation to query, defaults to 'default' --group group to query formation, defaults to all --json output data in the JSON format
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
The pg_autoctl watch output is divided in 3 sections.
The first section is a single header line which includes the name of the currently selected formation, the formation replication setting Number Sync Standbys, and then in the right most position the current time.
The second section displays one line per node, and each line contains a list of columns that describe the current state for the node. This list can includes the following columns, and which columns are part of the output depends on the terminal window size. This choice is dynamic and changes if your terminal window size changes:
Only Citus formations allow several groups. When using a Citus formation the Node column contains the groupId and the nodeId, separated by a colon, such as 0:1 for the first coordinator node.
This value is expected to stay under 2s or abouts, and is known to increment when either the pg_autoctl run service is not running, or when there is a network split.
This value is known to increment when either the Postgres service is not running on the target node, when there is a network split, or when the internal machinery (the health check worker background process) implements jitter.
The LSN is the current position in the Postgres WAL stream. This is a hexadecimal number. See pg_lsn for more information.
The current timeline is incremented each time a failover happens, or when doing Point In Time Recovery. A node can only reach the secondary state when it is on the same timeline as its primary node.
The third and last section lists the most recent events that the monitor has registered, the more recent event is found at the bottom of the screen.
To quit the command hit either the F1 key or the q key.
pg_autoctl stop - signal the pg_autoctl service for it to stop
This commands stops the processes needed to run a monitor node or a keeper node, depending on the configuration file that belongs to the --pgdata option or PGDATA environment variable.
usage: pg_autoctl stop [ --pgdata --fast --immediate ] --pgdata path to data directory --fast fast shutdown mode for the keeper --immediate immediate shutdown mode for the keeper
The pg_autoctl stop commands finds the PID of the running service for the given --pgdata, and if the process is still running, sends a SIGTERM signal to the process.
When pg_autoclt receives a shutdown signal a shutdown sequence is triggered. Depending on the signal received, an operation that has been started (such as a state transition) is either run to completion, stopped as the next opportunity, or stopped immediately even when in the middle of the transition.
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
pg_autoctl reload - signal the pg_autoctl for it to reload its configuration
This commands signals a running pg_autoctl process to reload its configuration from disk, and also signal the managed Postgres service to reload its configuration.
usage: pg_autoctl reload [ --pgdata ] [ --json ] --pgdata path to data directory
The pg_autoctl reload commands finds the PID of the running service for the given --pgdata, and if the process is still running, sends a SIGHUP signal to the process.
pg_autoctl status - Display the current status of the pg_autoctl service
This commands outputs the current process status for the pg_autoctl service running for the given --pgdata location.
usage: pg_autoctl status [ --pgdata ] [ --json ] --pgdata path to data directory --json output data in the JSON format
PGDATA
PG_AUTOCTL_MONITOR
XDG_CONFIG_HOME
XDG_DATA_HOME
$ pg_autoctl status --pgdata node1 11:26:30 27248 INFO pg_autoctl is running with pid 26618 11:26:30 27248 INFO Postgres is serving PGDATA "/Users/dim/dev/MS/pg_auto_failover/tmux/node1" on port 5501 with pid 26725 $ pg_autoctl status --pgdata node1 --json 11:26:37 27385 INFO pg_autoctl is running with pid 26618 11:26:37 27385 INFO Postgres is serving PGDATA "/Users/dim/dev/MS/pg_auto_failover/tmux/node1" on port 5501 with pid 26725 {
"postgres": {
"pgdata": "\/Users\/dim\/dev\/MS\/pg_auto_failover\/tmux\/node1",
"pg_ctl": "\/Applications\/Postgres.app\/Contents\/Versions\/12\/bin\/pg_ctl",
"version": "12.3",
"host": "\/tmp",
"port": 5501,
"proxyport": 0,
"pid": 26725,
"in_recovery": false,
"control": {
"version": 0,
"catalog_version": 0,
"system_identifier": "0"
},
"postmaster": {
"status": "ready"
}
},
"pg_autoctl": {
"pid": 26618,
"status": "running",
"pgdata": "\/Users\/dim\/dev\/MS\/pg_auto_failover\/tmux\/node1",
"version": "1.5.0",
"semId": 196609,
"services": [
{
"name": "postgres",
"pid": 26625,
"status": "running",
"version": "1.5.0",
"pgautofailover": "1.5.0.1"
},
{
"name": "node-active",
"pid": 26626,
"status": "running",
"version": "1.5.0",
"pgautofailover": "1.5.0.1"
}
]
} }
pg_autoctl activate - Activate a Citus worker from the Citus coordinator
This command calls the Citus “activation” API so that a node can be used to host shards for your reference and distributed tables.
usage: pg_autoctl activate [ --pgdata ]
--pgdata path to data directory
When creating a Citus worker, pg_autoctl create worker automatically activates the worker node to the coordinator. You only need this command when something unexpected have happened and you want to manually make sure the worker node has been activated at the Citus coordinator level.
Starting with Citus 10 it is also possible to activate the coordinator itself as a node with shard placement. Use pg_autoctl activate on your Citus coordinator node manually to use that feature.
When the Citus coordinator is activated, an extra step is then needed for it to host shards of distributed tables. If you want your coordinator to have shards, then have a look at the Citus API citus_set_node_property to set the shouldhaveshards property to true.
Several defaults settings of pg_auto_failover can be reviewed and changed depending on the trade-offs you want to implement in your own production setup. The settings that you can change will have an impact of the following operations:
pg_auto_failover decides to implement a failover to the secondary node when it detects that the primary node is unhealthy. Changing the following settings will have an impact on when the pg_auto_failover monitor decides to promote the secondary PostgreSQL node:
pgautofailover.health_check_max_retries pgautofailover.health_check_period pgautofailover.health_check_retry_delay pgautofailover.health_check_timeout pgautofailover.node_considered_unhealthy_timeout
At secondary promotion time, pg_auto_failover waits for the following timeout to make sure that all pending writes on the primary server made it to the secondary at shutdown time, thus preventing data loss.:
pgautofailover.primary_demote_timeout
pg_auto_failover implements a trade-off where data availability trumps service availability. When the primary node of a PostgreSQL service is detected unhealthy, the secondary is only promoted if it was known to be eligible at the moment when the primary is lost.
In the case when synchronous replication was in use at the moment when the primary node is lost, then we know we can switch to the secondary safely, and the wal lag is 0 in that case.
In the case when the secondary server had been detected unhealthy before, then the pg_auto_failover monitor switches it from the state SECONDARY to the state CATCHING-UP and promotion is prevented then.
The following setting allows to still promote the secondary, allowing for a window of data loss:
pgautofailover.promote_wal_log_threshold
The configuration for the behavior of the monitor happens in the PostgreSQL database where the extension has been deployed:
pg_auto_failover=> select name, setting, unit, short_desc from pg_settings where name ~ 'pgautofailover.'; -[ RECORD 1 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.enable_sync_wal_log_threshold setting | 16777216 unit | short_desc | Don't enable synchronous replication until secondary xlog is within this many bytes of the primary's -[ RECORD 2 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.health_check_max_retries setting | 2 unit | short_desc | Maximum number of re-tries before marking a node as failed. -[ RECORD 3 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.health_check_period setting | 5000 unit | ms short_desc | Duration between each check (in milliseconds). -[ RECORD 4 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.health_check_retry_delay setting | 2000 unit | ms short_desc | Delay between consecutive retries. -[ RECORD 5 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.health_check_timeout setting | 5000 unit | ms short_desc | Connect timeout (in milliseconds). -[ RECORD 6 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.node_considered_unhealthy_timeout setting | 20000 unit | ms short_desc | Mark node unhealthy if last ping was over this long ago -[ RECORD 7 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.primary_demote_timeout setting | 30000 unit | ms short_desc | Give the primary this long to drain before promoting the secondary -[ RECORD 8 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.promote_wal_log_threshold setting | 16777216 unit | short_desc | Don't promote secondary unless xlog is with this many bytes of the master -[ RECORD 9 ]---------------------------------------------------------------------------------------------------- name | pgautofailover.startup_grace_period setting | 10000 unit | ms short_desc | Wait for at least this much time after startup before initiating a failover.
You can edit the parameters as usual with PostgreSQL, either in the postgresql.conf file or using ALTER DATABASE pg_auto_failover SET parameter = value; commands, then issuing a reload.
For an introduction to the pg_autoctl commands relevant to the pg_auto_failover Keeper configuration, please see pg_autoctl config.
An example configuration file looks like the following:
[pg_autoctl] role = keeper monitor = postgres://autoctl_node@192.168.1.34:6000/pg_auto_failover formation = default group = 0 hostname = node1.db nodekind = standalone [postgresql] pgdata = /data/pgsql/ pg_ctl = /usr/pgsql-10/bin/pg_ctl dbname = postgres host = /tmp port = 5000 [replication] slot = pgautofailover_standby maximum_backup_rate = 100M backup_directory = /data/backup/node1.db [timeout] network_partition_timeout = 20 postgresql_restart_failure_timeout = 20 postgresql_restart_failure_max_retries = 3
To output, edit and check entries of the configuration, the following commands are provided:
pg_autoctl config check [--pgdata <pgdata>] pg_autoctl config get [--pgdata <pgdata>] section.option pg_autoctl config set [--pgdata <pgdata>] section.option value
The [postgresql] section is discovered automatically by the pg_autoctl command and is not intended to be changed manually.
pg_autoctl.monitor
PostgreSQL service URL of the pg_auto_failover monitor, as given in the output of the pg_autoctl show uri command.
pg_autoctl.formation
A single pg_auto_failover monitor may handle several postgres formations. The default formation name default is usually fine.
pg_autoctl.group
This information is retrieved by the pg_auto_failover keeper when registering a node to the monitor, and should not be changed afterwards. Use at your own risk.
pg_autoctl.hostname
Node hostname used by all the other nodes in the cluster to contact this node. In particular, if this node is a primary then its standby uses that address to setup streaming replication.
replication.slot
Name of the PostgreSQL replication slot used in the streaming replication setup automatically deployed by pg_auto_failover. Replication slots can't be renamed in PostgreSQL.
replication.maximum_backup_rate
When pg_auto_failover (re-)builds a standby node using the pg_basebackup command, this parameter is given to pg_basebackup to throttle the network bandwidth used. Defaults to 100Mbps.
replication.backup_directory
When pg_auto_failover (re-)builds a standby node using the pg_basebackup command, this parameter is the target directory where to copy the bits from the primary server. When the copy has been successful, then the directory is renamed to postgresql.pgdata.
The default value is computed from ${PGDATA}/../backup/${hostname} and can be set to any value of your preference. Remember that the directory renaming is an atomic operation only when both the source and the target of the copy are in the same filesystem, at least in Unix systems.
timeout
This section allows to setup the behavior of the pg_auto_failover keeper in interesting scenarios.
timeout.network_partition_timeout
Timeout in seconds before we consider failure to communicate with other nodes indicates a network partition. This check is only done on a PRIMARY server, so other nodes mean both the monitor and the standby.
When a PRIMARY node is detected to be on the losing side of a network partition, the pg_auto_failover keeper enters the DEMOTE state and stops the PostgreSQL instance in order to protect against split brain situations.
The default is 20s.
timeout.postgresql_restart_failure_timeout
timeout.postgresql_restart_failure_max_retries
When PostgreSQL is not running, the first thing the pg_auto_failover keeper does is try to restart it. In case of a transient failure (e.g. file system is full, or other dynamic OS resource constraint), the best course of action is to try again for a little while before reaching out to the monitor and ask for a failover.
The pg_auto_failover keeper tries to restart PostgreSQL timeout.postgresql_restart_failure_max_retries times in a row (default 3) or up to timeout.postgresql_restart_failure_timeout (defaults 20s) since it detected that PostgreSQL is not running, whichever comes first.
This section is not yet complete. Please contact us with any questions.
pg_auto_failover is a general purpose tool for setting up PostgreSQL replication in order to implement High Availability of the PostgreSQL service.
It is also possible to register pre-existing PostgreSQL instances with a pg_auto_failover monitor. The pg_autoctl create command honors the PGDATA environment variable, and checks whether PostgreSQL is already running. If Postgres is detected, the new node is registered in SINGLE mode, bypassing the monitor's role assignment policy.
The pg_autoctl create postgres command edits the default Postgres configuration file (postgresql.conf) to include pg_auto_failover settings.
The include directive is placed on the top of the postgresql.conf file in a way that you may override any setting by editing it later in the file.
Unless using the --skip-pg-hba option then pg_autoctl edits a minimal set of HBA rules for you, in order for the pg_auto_failover nodes to be able to connect to each other. The HBA rules that are needed for your application to connect to your Postgres nodes still need to be added. As pg_autoctl knows nothing about your applications, then you are responsible for editing the HBA file.
When upgrading a pg_auto_failover setup, the procedure is different on the monitor and on the Postgres nodes:
During the restart of the monitor, the other nodes might have trouble connecting to the monitor. The pg_autoctl command is designed to retry connecting to the monitor and handle errors gracefully.
As a result, here is the standard upgrade plan for pg_auto_failover:
sudo apt-get remove pg-auto-failover-cli-1.4 postgresql-11-auto-failover-1.4 sudo apt-get install -q -y pg-auto-failover-cli-1.5 postgresql-11-auto-failover-1.5
sudo systemctl restart pgautofailover
Then we may use the following commands to make sure that the service is running as expected:
sudo systemctl status pgautofailover sudo journalctl -u pgautofailover
At this point it is expected that the pg_autoctl logs show that an upgrade has been performed by using the ALTER EXTENSION pgautofailover UPDATE TO ... command. The monitor is ready with the new version of pg_auto_failover.
When the Postgres nodes pg_autoctl process connects to the new monitor version, the check for version compatibility fails, and the "node-active" sub-process exits. The main pg_autoctl process supervisor then restart the "node-active" sub-process from its on-disk binary executable file, which has been upgraded to the new version. That's why we first install the new packages for pg_auto_failover on every node, and only then restart the monitor.
IMPORTANT:
When that's not the case, pg_autoctl on the Postgres nodes will still detect a version mismatch with the monitor extension, and the "node-active" sub-process will exit. And when restarted automatically, the same version of the local pg_autoctl binary executable is found on-disk, leading to the same version mismatch with the monitor extension.
After restarting the "node-active" process 5 times, pg_autoctl quits retrying and stops. This includes stopping the Postgres service too, and a service downtime might then occur.
And when the upgrade is done we can use pg_autoctl show state on the monitor to see that eveything is as expected.
The new upgrade procedure described in the previous section is part of pg_auto_failover since version 1.4. When upgrading from a previous version of pg_auto_failover, up to and including version 1.3, then all the pg_autoctl processes have to be restarted fully.
To prevent triggering a failover during the upgrade, it's best to put your secondary nodes in maintenance. The procedure then looks like the following:
pg_autoctl enable maintenance
sudo systemctl restart pgautofailover
Then we may use the following commands to make sure that the service is running as expected:
sudo systemctl status pgautofailover sudo journalctl -u pgautofailover
At this point it is expected that the pg_autoctl logs show that an upgrade has been performed by using the ALTER EXTENSION pgautofailover UPDATE TO ... command. The monitor is ready with the new version of pg_auto_failover.
sudo systemctl restart pgautofailover
As in the previous point in this list, make sure the service is now running as expected.
pg_autoctl disable maintenance
Since version 1.4.0 the pgautofailover extension requires the Postgres contrib extension btree_gist. The pg_autoctl command arranges for the creation of this dependency, and has been buggy in some releases.
As a result, you might have trouble upgrade the pg_auto_failover monitor to a recent version. It is possible to fix the error by connecting to the monitor Postgres database and running the create extension command manually:
# create extension btree_gist;
It is possible to operate pg_auto_failover formations and groups directly from the monitor. All that is needed is an access to the monitor Postgres database as a client, such as psql. It's also possible to add those management SQL function calls in your own ops application if you have one.
For security reasons, the autoctl_node is not allowed to perform maintenance operations. This user is limited to what pg_autoctl needs. You can either create a specific user and authentication rule to expose for management, or edit the default HBA rules for the autoctl user. In the following examples we're directly connecting as the autoctl role.
The main operations with pg_auto_failover are node maintenance and manual failover, also known as a controlled switchover.
It is possible to put a secondary node in any group in a MAINTENANCE state, so that the Postgres server is not doing synchronous replication anymore and can be taken down for maintenance purposes, such as security kernel upgrades or the like.
The command line tool pg_autoctl exposes an API to schedule maintenance operations on the current node, which must be a secondary node at the moment when maintenance is requested.
Here's an example of using the maintenance commands on a secondary node, including the output. Of course, when you try that on your own nodes, dates and PID information might differ:
$ pg_autoctl enable maintenance 17:49:19 14377 INFO Listening monitor notifications about state changes in formation "default" and group 0 17:49:19 14377 INFO Following table displays times when notifications are received
Time | ID | Host | Port | Current State | Assigned State ---------+-----+-----------+--------+---------------------+-------------------- 17:49:19 | 1 | localhost | 5001 | primary | wait_primary 17:49:19 | 2 | localhost | 5002 | secondary | wait_maintenance 17:49:19 | 2 | localhost | 5002 | wait_maintenance | wait_maintenance 17:49:20 | 1 | localhost | 5001 | wait_primary | wait_primary 17:49:20 | 2 | localhost | 5002 | wait_maintenance | maintenance 17:49:20 | 2 | localhost | 5002 | maintenance | maintenance
The command listens to the state changes in the current node's formation and group on the monitor and displays those changes as it receives them. The operation is done when the node has reached the maintenance state.
It is now possible to disable maintenance to allow pg_autoctl to manage this standby node again:
$ pg_autoctl disable maintenance 17:49:26 14437 INFO Listening monitor notifications about state changes in formation "default" and group 0 17:49:26 14437 INFO Following table displays times when notifications are received
Time | ID | Host | Port | Current State | Assigned State ---------+-----+-----------+--------+---------------------+-------------------- 17:49:27 | 2 | localhost | 5002 | maintenance | catchingup 17:49:27 | 2 | localhost | 5002 | catchingup | catchingup 17:49:28 | 2 | localhost | 5002 | catchingup | secondary 17:49:28 | 1 | localhost | 5001 | wait_primary | primary 17:49:28 | 2 | localhost | 5002 | secondary | secondary 17:49:29 | 1 | localhost | 5001 | primary | primary
When a standby node is in maintenance, the monitor sets the primary node replication to WAIT_PRIMARY: in this role, the PostgreSQL streaming replication is now asynchronous and the standby PostgreSQL server may be stopped, rebooted, etc.
A primary node must be available at all times in any formation and group in pg_auto_failover, that is the invariant provided by the whole solution. With that in mind, the only way to allow a primary node to go to a maintenance mode is to first failover and promote the secondary node.
The same command pg_autoctl enable maintenance implements that operation when run on a primary node with the option --allow-failover. Here is an example of such an operation:
$ pg_autoctl enable maintenance 11:53:03 50526 WARN Enabling maintenance on a primary causes a failover 11:53:03 50526 FATAL Please use --allow-failover to allow the command proceed
As we can see the option allow-failover is mandatory. In the next example we use it:
$ pg_autoctl enable maintenance --allow-failover 13:13:42 1614 INFO Listening monitor notifications about state changes in formation "default" and group 0 13:13:42 1614 INFO Following table displays times when notifications are received
Time | ID | Host | Port | Current State | Assigned State ---------+-----+-----------+--------+---------------------+-------------------- 13:13:43 | 2 | localhost | 5002 | primary | prepare_maintenance 13:13:43 | 1 | localhost | 5001 | secondary | prepare_promotion 13:13:43 | 1 | localhost | 5001 | prepare_promotion | prepare_promotion 13:13:43 | 2 | localhost | 5002 | prepare_maintenance | prepare_maintenance 13:13:44 | 1 | localhost | 5001 | prepare_promotion | stop_replication 13:13:45 | 1 | localhost | 5001 | stop_replication | stop_replication 13:13:46 | 1 | localhost | 5001 | stop_replication | wait_primary 13:13:46 | 2 | localhost | 5002 | prepare_maintenance | maintenance 13:13:46 | 1 | localhost | 5001 | wait_primary | wait_primary 13:13:47 | 2 | localhost | 5002 | maintenance | maintenance
When the operation is done we can have the old primary re-join the group, this time as a secondary:
$ pg_autoctl disable maintenance 13:14:46 1985 INFO Listening monitor notifications about state changes in formation "default" and group 0 13:14:46 1985 INFO Following table displays times when notifications are received
Time | ID | Host | Port | Current State | Assigned State ---------+-----+-----------+--------+---------------------+-------------------- 13:14:47 | 2 | localhost | 5002 | maintenance | catchingup 13:14:47 | 2 | localhost | 5002 | catchingup | catchingup 13:14:52 | 2 | localhost | 5002 | catchingup | secondary 13:14:52 | 1 | localhost | 5001 | wait_primary | primary 13:14:52 | 2 | localhost | 5002 | secondary | secondary 13:14:53 | 1 | localhost | 5001 | primary | primary
It is possible to trigger a manual failover, or a switchover, using the command pg_autoctl perform failover. Here's an example of what happens when running the command:
$ pg_autoctl perform failover 11:58:00 53224 INFO Listening monitor notifications about state changes in formation "default" and group 0 11:58:00 53224 INFO Following table displays times when notifications are received
Time | ID | Host | Port | Current State | Assigned State ---------+-----+-----------+--------+--------------------+------------------- 11:58:01 | 1 | localhost | 5001 | primary | draining 11:58:01 | 2 | localhost | 5002 | secondary | prepare_promotion 11:58:01 | 1 | localhost | 5001 | draining | draining 11:58:01 | 2 | localhost | 5002 | prepare_promotion | prepare_promotion 11:58:02 | 2 | localhost | 5002 | prepare_promotion | stop_replication 11:58:02 | 1 | localhost | 5001 | draining | demote_timeout 11:58:03 | 1 | localhost | 5001 | demote_timeout | demote_timeout 11:58:04 | 2 | localhost | 5002 | stop_replication | stop_replication 11:58:05 | 2 | localhost | 5002 | stop_replication | wait_primary 11:58:05 | 1 | localhost | 5001 | demote_timeout | demoted 11:58:05 | 2 | localhost | 5002 | wait_primary | wait_primary 11:58:05 | 1 | localhost | 5001 | demoted | demoted 11:58:06 | 1 | localhost | 5001 | demoted | catchingup 11:58:06 | 1 | localhost | 5001 | catchingup | catchingup 11:58:08 | 1 | localhost | 5001 | catchingup | secondary 11:58:08 | 2 | localhost | 5002 | wait_primary | primary 11:58:08 | 1 | localhost | 5001 | secondary | secondary 11:58:08 | 2 | localhost | 5002 | primary | primary
Again, timings and PID numbers are not expected to be the same when you run the command on your own setup.
Also note in the output that the command shows the whole set of transitions including when the old primary is now a secondary node. The database is available for read-write traffic as soon as we reach the state wait_primary.
It is generally useful to distinguish a controlled switchover to a failover. In a controlled switchover situation it is possible to organise the sequence of events in a way to avoid data loss and lower downtime to a minimum.
In the case of pg_auto_failover, because we use synchronous replication, we don't face data loss risks when triggering a manual failover. Moreover, our monitor knows the current primary health at the time when the failover is triggered, and drives the failover accordingly.
So to trigger a controlled switchover with pg_auto_failover you can use the same API as for a manual failover:
$ pg_autoctl perform switchover
Because the subtelties of orchestrating either a controlled switchover or an unplanned failover are all handled by the monitor, rather than the client side command line, at the client level the two command pg_autoctl perform failover and pg_autoctl perform switchover are synonyms, or aliases.
The following commands display information from the pg_auto_failover monitor tables pgautofailover.node and pgautofailover.event:
$ pg_autoctl show state $ pg_autoctl show events
When run on the monitor, the commands outputs all the known states and events for the whole set of formations handled by the monitor. When run on a PostgreSQL node, the command connects to the monitor and outputs the information relevant to the service group of the local node only.
For interactive debugging it is helpful to run the following command from the monitor node while e.g. initializing a formation from scratch, or performing a manual failover:
$ watch pg_autoctl show state
The monitor reports every state change decision to a LISTEN/NOTIFY channel named state. PostgreSQL logs on the monitor are also stored in a table, pgautofailover.event, and broadcast by NOTIFY in the channel log.
When the monitor node is not available anymore, it is possible to create a new monitor node and then switch existing nodes to a new monitor by using the following commands.
$ pg_autoctl disable monitor --force
$ pg_autoctl create monitor ...
$ pg_autoctl enable monitor postgresql://autoctl_node@.../pg_auto_failover
$ pg_autoctl enable monitor postgresql://autoctl_node@.../pg_auto_failover
See pg_autoctl disable monitor and pg_autoctl enable monitor for details about those commands.
This operation relies on the fact that a pg_autoctl can be operated without a monitor, and when reconnecting to a new monitor, this process reset the parts of the node state that comes from the monitor, such as the node identifier.
pg_auto_failover commands can be run repeatedly. If initialization fails the first time -- for instance because a firewall rule hasn't yet activated -- it's possible to try pg_autoctl create again. pg_auto_failover will review its previous progress and repeat idempotent operations (create database, create extension etc), gracefully handling errors.
Those questions have been asked in GitHub issues for the project by several people. If you have more questions, feel free to open a new issue, and your question and its answer might make it to this FAQ.
In order to avoid spurious failovers when the network connectivity is not stable, pg_auto_failover implements a timeout of 20s before acting on a node that is known unavailable. This needs to be added to the delay between health checks and the retry policy.
See the Configuring pg_auto_failover part for more information about how to setup the different delays and timeouts that are involved in the decision making.
See also pg_autoctl watch to have a dashboard that helps understanding the system and what's going on in the moment.
In the pg_auto_failover design, the following two things are needed for the monitor to be able to orchestrate nodes integration completely:
The monitor runs periodic health checks with all the nodes registered in the system. Those health checks are Postgres connections from the monitor to the registered Postgres nodes, and use the hostname and port as registered.
The pg_autoctl show state commands column Reachable contains "yes" when the monitor could connect to a specific node, "no" when this connection failed, and "unknown" when no connection has been attempted yet, since the last startup time of the monitor.
The Reachable column from pg_autoctl show state command output must show a "yes" entry before a new standby node can be orchestrated up to the "secondary" goal state.
The pg_auto_failover monitor works by assigning goal states to individual Postgres nodes. The monitor will not assign a new goal state until the current one has been reached.
To implement a transition from the current state to the goal state assigned by the monitor, the pg_autoctl service must be running on every node.
When your new standby node stays in the "catchingup" state for a long time, please check that the node is reachable from the monitor given its hostname and port known on the monitor, and check that the pg_autoctl run command is running for this node.
When things are not obvious, the next step is to go read the logs. Both the output of the pg_autoctl command and the Postgres logs are relevant. See the Should I read the logs? Where are the logs? question for details.
Yes. If anything seems strange to you, please do read the logs.
As maintainers of the pg_autoctl tool, we can't foresee everything that may happen to your production environment. Still, a lot of efforts is spent on having a meaningful output. So when you're in a situation that's hard to understand, please make sure to read the pg_autoctl logs and the Postgres logs.
When using systemd integration, the pg_autoctl logs are then handled entirely by the journal facility of systemd. Please then refer to journalctl for viewing the logs.
The Postgres logs are to be found in the $PGDATA/log directory with the default configuration deployed by pg_autoctl create .... When a custom Postgres setup is used, please refer to your actual setup to find Postgres logs.
This question is a general case situation that is similar in nature to the previous situation, reached when adding a new standby to a group of Postgres nodes. Please check the same two elements: the monitor health checks are successful, and the pg_autoctl run command is running.
When things are not obvious, the next step is to go read the logs. Both the output of the pg_autoctl command and the Postgres logs are relevant. See the Should I read the logs? Where are the logs? question for details.
The pg_auto_failover Failover State Machine is great to simplify node management and orchestrate multi-nodes operations such as a switchover or a failover. That said, it might happen that the FSM is unable to proceed in some cases, usually after a hard crash of some components of the system, and mostly due to bugs.
Even if we have an extensive test suite to prevent such bugs from happening, you might have to deal with a situation that the monitor doesn't know how to solve.
The FSM has been designed with a last resort operation mode. It is always possible to unregister a node from the monitor with the pg_autoctl drop node command. This helps the FSM getting back to a simpler situation, the simplest possible one being when only one node is left registered in a given formation and group (state is then SINGLE).
When the monitor is back on its feet again, then you may add your nodes again with the pg_autoctl create postgres command. The command understands that a Postgres service is running and will recover from where you left.
In some cases you might have to also delete the local pg_autoctl state file, error messages will instruct you about the situation.
When using pg_auto_failover, the monitor is needed to make decisions and orchestrate changes in all the registered Postgres groups. Decisions are transmitted to the Postgres nodes by the monitor assigning nodes a goal state which is different from their current state.
Nodes contact the monitor each second and call the node_active stored procedure, which returns a goal state that is possibly different from the current state.
The monitor only assigns Postgres nodes with a new goal state when a cluster wide operation is needed. In practice, only the following operations require the monitor to assign a new goal state to a Postgres node:
When the monitor node is not available, the pg_autoctl processes on the Postgres nodes will fail to contact the monitor every second, and log about this failure. Adding to that, no orchestration is possible.
The Postgres streaming replication does not need the monitor to be available in order to deliver its service guarantees to your application, so your Postgres service is still available when the monitor is not available.
To repair your installation after having lost a monitor, the following scenarios are to be considered.
This is typically the case in Cloud Native environments such as Kubernetes, where you could have a service migrated to another pod and re-attached to its disk volume. This scenario is well supported by pg_auto_failover, and no intervention is needed.
It is also possible to use synchronous archiving with the monitor so that it's possible to recover from the current archives and continue operating without intervention on the Postgres nodes, except for updating their monitor URI. This requires an archiving setup that uses synchronous replication so that any transaction committed on the monitor is known to have been replicated in your WAL archive.
At the moment, you have to take care of that setup yourself. Here's a quick summary of what needs to be done:
Use pg_basebackup every once in a while to have a full copy of the monitor Postgres database available.
Use pg_receivewal --sync ... as a service to keep a WAL archive in sync with the monitor Postgres instance at all time.
Write a utility that knows how to create a new monitor node from your most recent pg_basebackup copy and the WAL files copy.
Bonus points if that tool/script is tested at least once a day, so that you avoid surprises on the unfortunate day that you actually need to use it in production.
A future version of pg_auto_failover will include this facility, but the current versions don't.
If you don't have synchronous archiving for the monitor set-up, then you might not be able to restore a monitor database with the expected up-to-date node metadata. Specifically we need the nodes state to be in sync with what each pg_autoctl process has received the last time they could contact the monitor, before it has been unavailable.
It is possible to register nodes that are currently running to a new monitor without restarting Postgres on the primary. For that, the procedure mentioned in Replacing the monitor online must be followed, using the following commands:
$ pg_autoctl disable monitor $ pg_autoctl enable monitor
Microsoft
Copyright (c) Microsoft Corporation. All rights reserved.
November 6, 2022 | 2.0 |