Since each server in a multi-master cluster can accept writes, any server can abort a transaction because of a concurrent update, in the same way as it can happen on a single server between different backends. To ensure high availability and data consistency on all cluster nodes, multimaster uses logical replication and the two-phase commit protocol, with the transaction outcome determined by the Paxos consensus algorithm.
When PostgreSQL loads the multimaster shared library, multimaster sets up a logical replication producer and consumer for each node, and hooks into the transaction commit pipeline. The typical data replication workflow consists of the following phases:
1. PREPARE phase. multimaster captures each COMMIT statement and implicitly transforms it into a PREPARE statement. All the nodes that receive the transaction via the replication protocol (the cohort nodes) send their votes for approving or declining the transaction to the backend process on the initiating node. This ensures that all the cohort nodes can accept the transaction and that no write conflicts occur.
2. PRECOMMIT phase. If all the cohort nodes approve the transaction, the backend process sends a PRECOMMIT message to all the cohort nodes to express an intention to commit the transaction. The cohort nodes respond to the backend with a PRECOMMITTED message. In case of a failure, all the nodes can use this information to complete the transaction using a quorum-based voting procedure.
3. COMMIT phase. If the PRECOMMIT phase is successful, the transaction is committed on all nodes.
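Because multimaster builds on the standard two-phase commit machinery, a replicated transaction that is in flight should be visible on a cohort node as a prepared transaction. A minimal way to observe this, assuming the implicitly prepared transactions appear in the standard pg_prepared_xacts view:

    -- On a cohort node, list transactions that have been PREPAREd but not yet
    -- committed or aborted; the GIDs generated by multimaster are an
    -- implementation detail and may vary between versions.
    SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts;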
If a node crashes or gets disconnected from the cluster between the PREPARE and COMMIT phases, the PRECOMMIT phase ensures that the surviving nodes have enough information to complete the prepared transaction. The PRECOMMITTED messages help avoid the situation in which the crashed node has already committed or aborted the transaction, but has not notified the other nodes about the transaction status. In a classic two-phase commit (2PC), such a transaction would block resources (hold locks) until the crashed node recovers. Otherwise, data inconsistencies can appear in the database when the failed node is recovered, for example, if the failed node committed the transaction, but a surviving node aborted it.
To complete the transaction, the backend must receive responses from a majority of the nodes. For example, in a cluster of 2N+1 nodes, at least N+1 responses are required. Thus, multimaster ensures that your cluster is available for reads and writes while the majority of the nodes are connected, and that no data inconsistencies occur in case of a node or connection failure.
Since multimaster allows writes to each node, it has to wait for transaction acknowledgment responses from all the other nodes. Without special handling of node failures, each commit would have to wait until the failed node recovers. To deal with such situations, multimaster periodically sends heartbeats to check the node state and the connectivity between nodes. When several heartbeats to a node are lost in a row, this node is excluded from the cluster to allow writes to the remaining alive nodes. You can configure the heartbeat frequency and the response timeout using the multimaster.heartbeat_send_timeout and multimaster.heartbeat_recv_timeout parameters, respectively.
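For example, a postgresql.conf fragment along these lines tunes the heartbeat behavior; the values below are purely illustrative, and the units (assumed here to be milliseconds) should be checked against the documentation for your multimaster version:

    # Illustrative values only; units assumed to be milliseconds.
    multimaster.heartbeat_send_timeout = 200     # how often heartbeats are sent
    multimaster.heartbeat_recv_timeout = 2000    # silence after which a peer is considered unreachable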
For example, suppose a five-node multi-master cluster experienced a network failure that split the network into two isolated subnets, with two and three cluster nodes. Based on heartbeat propagation information, multimaster will continue accepting writes on each node in the bigger partition and deny all writes in the smaller one. Thus, a cluster consisting of 2N+1 nodes can tolerate N node failures and stays alive if any N+1 nodes are alive and connected to each other. You can also set up a two-node cluster plus a lightweight referee node that does not hold any data, but acts as a tie-breaker during symmetric node partitioning.
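A rough sketch of such a two-nodes-plus-referee setup is shown below; the referee extension name and the multimaster.referee_connstring parameter are assumptions here and should be verified against the multimaster documentation for your version:

    -- On the referee host (a small instance that does not store cluster data);
    -- the extension name is an assumption:
    CREATE EXTENSION referee;

    -- On both data nodes (postgresql.conf); the parameter name is an assumption:
    -- multimaster.referee_connstring = 'host=referee-host dbname=postgres'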
In case of a partial network split, when different nodes have different connectivity, multimaster finds a fully connected subset of nodes and disconnects the nodes outside of this subset. For example, in a three-node cluster, if node A can access both B and C, but node B cannot access node C, multimaster isolates node C to ensure that both A and B can work.
To preserve the order of transactions on different nodes, and thus data integrity, the decision to exclude nodes or add them back must be made coherently. Generations, which represent a subset of the nodes currently believed to be alive, serve this purpose. Technically, a generation is a pair <n, members>, where n is a unique number and members is a subset of the configured nodes. A node always lives in some generation and switches to a generation with a higher number as soon as it learns about its existence; generation numbers act as logical clocks (terms, epochs) here. During commit, each transaction is stamped with the current generation of the node on which it is executed. The transaction can be proposed for commit only after it has been PREPAREd on all its generation members. For example, if node 3 of a three-node cluster is excluded, generation <5, {1, 2, 3}> may be followed by generation <6, {1, 2}>, and a transaction stamped with generation 6 must be PREPAREd on nodes 1 and 2 before it can be proposed for commit. This makes it possible to design the recovery protocol so that the order of conflicting committed transactions is the same on all nodes.
A node resides in a generation in one of three states (which can be shown with mtm.status()):
ONLINE: the node is a member of the generation and processes transactions normally;
RECOVERY: the node is a member of the generation, but it must apply transactions from previous generations in recovery mode before it can become ONLINE;
DEAD: the node will never be ONLINE in this generation.
For the alive nodes, there is no way to distinguish between a failed node that stopped serving requests and a network-partitioned node that can still be accessed by database users but is unreachable for the other nodes. If some of the current generation members are disconnected during the commit of a write transaction, the transaction is rolled back according to the generation rules. To avoid futile work, connectivity is also checked at transaction start; if you try to access an isolated node, multimaster returns an error message indicating the current status of the node. To prevent stale reads, read-only queries are also forbidden on such a node. If you would like to continue using a disconnected node outside of the cluster in standalone mode, you have to uninstall the multimaster extension on this node.
Each node maintains a data structure with information about the state of all nodes as seen from this node. You can get this data by calling the mtm.status() and mtm.nodes() functions.
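For instance, you can inspect the local node's view of the cluster from psql; the exact set of columns returned by these functions depends on the multimaster version, so the queries below simply select everything:

    -- Current status and generation information of the local node
    SELECT * FROM mtm.status();

    -- Per-node state as seen from the local node
    SELECT * FROM mtm.nodes();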
When a failed node connects back to the cluster, multimaster starts automatic recovery:
1. The reconnected node selects a cluster node that is ONLINE in the highest generation and starts catching up with the current state of the cluster based on the Write-Ahead Log (WAL).
2. When the node has caught up, it ballots for including itself in the next generation. Once the new generation is elected, commits of new transactions start waiting for the transactions to be applied on the joining node as well.
3. When the remaining transactions from before the switch to the new generation have been applied, the reconnected node is promoted to the ONLINE state and included into the replication scheme.
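Since catch-up relies on logical replication over WAL, one rough way to watch recovery progress is to look at the walsender processes on the donor node; whether multimaster's receivers appear there under recognizable application names is an assumption, but the view itself is standard PostgreSQL:

    -- Run on the node the reconnected node is catching up from.
    -- replay_lsn approaching sent_lsn indicates the peer is nearly caught up.
    SELECT application_name, state, sent_lsn, replay_lsn
    FROM pg_stat_replication;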
Automatic recovery requires all WAL files generated after the node failure to be available. If a node is down for a long time and keeping that much WAL is unacceptable, you may have to exclude this node from the cluster and manually restore it from one of the working nodes using pg_basebackup.
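As a sketch, re-seeding such a node typically starts from a fresh base backup taken from a healthy node; the host name, user, and directory below are placeholders, and rejoining the node to the multimaster cluster may require additional extension-specific steps not shown here:

    # Run on the node being restored; donor-node, replication_user, and
    # /path/to/new/data are placeholders for your environment.
    pg_basebackup -h donor-node -U replication_user -D /path/to/new/data \
        --checkpoint=fast --progress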
multimaster relies on a set of background workers. The mtm-monitor worker starts all other workers for a database managed with multimaster. This is the first worker loaded during multimaster boot, and each multimaster node has a single mtm-monitor worker. When a new node is added, mtm-monitor starts mtm-logrep-receiver and mtm-dmq-receiver workers to enable replication to this node. If a node is dropped, mtm-monitor stops the mtm-logrep-receiver and mtm-dmq-receiver workers that have been serving the dropped node. Each mtm-monitor controls workers on its own node only.
The mtm-logrep-receiver worker receives the logical replication stream from a given peer node. During recovery, all received transactions are applied by mtm-logrep-receiver itself. During normal operation, mtm-logrep-receiver passes transactions to a pool of dynamic workers (see mtm-logrep-receiver-dynworker below). The number of mtm-logrep-receiver workers on each node corresponds to the number of available peer nodes.
The mtm-dmq-receiver worker receives acknowledgments for transactions sent to peers and checks for heartbeat timeouts. The number of such workers on each node also corresponds to the number of available peer nodes.
A dedicated sender worker collects acknowledgments for transactions applied on the current node and sends them to the corresponding mtm-dmq-receiver on the peer node. There is a single such worker per PostgreSQL instance.
The mtm-logrep-receiver-dynworker worker is a dynamic pool worker for a given mtm-logrep-receiver. It applies the replicated transactions received during normal operation. There are up to multimaster.max_workers such workers per peer node.
The mtm-resolver worker performs Paxos to resolve unfinished transactions. This worker is only active during recovery or when the connection with other nodes has been lost. There is a single worker per PostgreSQL instance.
The mtm-campaigner worker ballots for new generations to exclude failed nodes or to add the local node back. There is a single worker per PostgreSQL instance.
A dedicated worker responds to requests from mtm-campaigner and mtm-resolver.
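If you want to verify which of these workers are running on a node, one option is to query pg_stat_activity; the assumption that multimaster workers report their names via backend_type with an mtm- prefix is based on the worker names above and may differ between versions:

    -- List multimaster background workers on the current node
    SELECT pid, backend_type, state
    FROM pg_stat_activity
    WHERE backend_type LIKE 'mtm%';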