52.2. Installation and Setup #

52.2.1. Setting up a Multi-Master Cluster
52.2.2. Tuning Configuration Parameters
52.2.3. Setting Timeout for Failure Detection
52.2.4. 2+1 Mode: Setting up a Standalone Referee Node

To use multimaster, you need to install PostgreSQL Enterprise Server on all nodes of your cluster. PostgreSQL Enterprise Server includes all the required dependencies and extensions.

52.2.1. Setting up a Multi-Master Cluster #

Suppose you are setting up a cluster of three nodes, with node1, node2, and node3 host names. After installing PostgreSQL Enterprise Server on all nodes, you need to initialize data directory on each node. If you would like to set up a multi-master cluster for an already existing mydb database, you can load data from mydb to one of the nodes once the cluster is initialized, or you can load data to all new nodes before cluster initialization using any convenient mechanism, such as pg_basebackup or pg_dump.

Once the data directory is set up, complete the following steps on each cluster node:

  1. Modify the postgresql.conf configuration file, as follows:

    • Add multimaster to the shared_preload_libraries variable:

                                                  shared_preload_libraries = 'multimaster'
                                              

    • Set up PostgreSQL parameters related to replication:

                                                  wal_level = logical
                                                  max_connections = 100
                                                  max_prepared_transactions = 300 # max_connections * N
                                                  max_wal_senders = 10            # at least N
                                                  max_replication_slots = 10      # at least 2N
                                                  wal_sender_timeout = 0
                                              

      where N is the number of nodes in your cluster. You must change the replication level to logical as multimaster relies on logical replication. For a cluster of N nodes, enable at least N WAL sender processes and replication slots. Since multimaster implicitly adds a PREPARE phase to each COMMIT transaction, make sure to set the number of prepared transactions to N * max_connections. wal_sender_timeout should be disabled as multimaster uses its custom logic for failure detection.

    • Make sure you have enough background workers allocated for each node:

                                                  max_worker_processes = 250 # (N - 1) * (multimaster.max_workers + 1) + 5
                                              

      For example, for a three-node cluster with multimaster.max_workers = 100, multimaster may need up to 207 background workers at peak times: five always-on workers (monitor, resolver, dmq-sender, campaigner, replier), one walreceiver per each peer node and up to 200 replication dynamic workers. When setting this parameter, remember that other modules may also use background workers at the same time.

    • Depending on your network environment and usage patterns, you may want to tune other multimaster parameters.

  2. Start PostgreSQL on all nodes.

  3. Create database mydb and user mtmuser on each node:

                                    CREATE USER mtmuser WITH SUPERUSER PASSWORD 'mtmuserpassword';
                                    CREATE DATABASE mydb OWNER mtmuser;
                                

    If you are using password-based authentication, you may want to create a password file. You can omit this step if you already have a database you are going to replicate, but you are recommended to create a separate superuser for multi-master replication. The examples below assume that you are going to replicate the mydb database on behalf of mtmuser.

  4. Allow replication of the mydb database to each cluster node on behalf of mtmuser, as explained in pg_hba.conf. Make sure to use the authentication method that satisfies your security requirements. For example, pg_hba.conf might have the following lines on node1:

                                    host replication mtmuser node2 md5
                                    host mydb mtmuser node2 md5
                                    host replication mtmuser node3 md5
                                    host mydb mtmuser node3 md5
                                

  5. Connect to any node on behalf of the mtmuser database user, create the multimaster extension in the mydb database and run mtm.init_cluster(), specifying the connection string to the current node as the first argument and an array of connection strings to the other nodes as the second argument. For example, if you would like to connect to node1, run:

                                    CREATE EXTENSION multimaster;
                                    SELECT mtm.init_cluster('dbname=mydb user=mtmuser host=node1',
                                    '{"dbname=mydb user=mtmuser host=node2", "dbname=mydb user=mtmuser host=node3"}');
                                

  6. To ensure that multimaster is enabled, you can run the mtm.status() and mtm.nodes() functions:

                                    SELECT * FROM mtm.status();
                                    SELECT * FROM mtm.nodes();
                                

    If status is equal to online and all nodes are present in the mtm.nodes output, your cluster is successfully configured and ready to use.

52.2.2. Tuning Configuration Parameters #

While you can use multimaster in the default configuration, you may want to tune several parameters for faster failure detection or more reliable automatic recovery.

52.2.3. Setting Timeout for Failure Detection #

To check availability of the peer nodes, multimaster periodically sends heartbeat packets to all nodes. You can define the timeout for failure detection with the following variables:

  • The multimaster.heartbeat_send_timeout variable defines the time interval between the heartbeats. By default, this variable is set to 200ms.
  • The multimaster.heartbeat_recv_timeout variable sets the timeout for the response. If no heartbeats are received during this time, the node is assumed to be disconnected and is excluded from the cluster. By default, this variable is set to 2000ms.

It's a good idea to set multimaster.heartbeat_send_timeout based on typical ping latencies between the nodes. Small recv/send ratio decreases the time of failure detection, but increases the probability of false-positive failure detection. When setting this parameter, take into account the typical packet loss ratio between your cluster nodes.

52.2.4. 2+1 Mode: Setting up a Standalone Referee Node #

By default, multimaster uses a majority-based algorithm to determine whether the cluster nodes have a quorum: a cluster can only continue working if the majority of its nodes are alive and can access each other. Majority-based approach is pointless for two nodes cluster: if one of them fails, another one becomes unaccessible. There is a special 2+1 or referee mode which trades less harware resources by decreasing availabilty: two nodes hold full copy of data, and separate referee node participates only in voting, acting as a tie-breaker.

If one node goes down, another one requests referee grant (elects referee-approved generation with single node). One the grant is received, it continues to work normally. If offline node gets up, it recovers and elects full generation containing both nodes, essentially removing the grant - this allows the node to get it in its turn later. While the grant is issued, it can't be given to another node until full generation is elected and excluded node recovers. This ensures data loss doesn't happen by the price of availabilty: in this setup two nodes (one normal and one referee) can be alive but cluster might be still unavailable if the referee winner is down, which is impossible with classic three nodes configuration.

The referee node does not store any cluster data, so it is not resource-intensive and can be configured on virtually any system with PostgreSQL installed.

To avoid split-brain problems, you must have only a single referee in your cluster.

To set up a referee for your cluster:

  1. Install PostgreSQL Enterprise Server on the node you are going to make a referee and create the referee extension:

                                    CREATE EXTENSION referee;
                                

  2. Make sure the pg_hba.conf file allows access to the referee node.

  3. Set up the nodes that will hold cluster data following the instructions in the section called "Setting up a Multi-Master Cluster".

  4. On all data nodes, specify the referee connection string in the postgresql.conf file:

                                    multimaster.referee_connstring = connstring
                                

    where connstring holds libpq options required to access the referee.

The first subset of nodes that gets connected to the referee wins the voting and starts working. The other nodes have to go through the recovery process to catch up with them and join the cluster. Under heavy load, the recovery can take unpredictably long, so it is recommended to wait for all data nodes going online before switching on the load when setting up a new cluster. Once all the nodes get online, the referee discards the voting result, and all data nodes start operating together.

In case of any failure, the voting mechanism is triggered again. At this time, all nodes appear to be offline for a short period of time to allow the referee to choose a new winner, so you can see the following error message when trying to access the cluster: [multimaster] node is not online: current status is "disabled".