30 changes: 30 additions & 0 deletions product_docs/docs/postgres_for_kubernetes/1/certificates.mdx
@@ -267,6 +267,36 @@ the following parameters:
instances, you can add a label with the key `k8s.enterprisedb.io/reload` to it. Otherwise,
you must reload the instances using the `kubectl cnp reload` subcommand.

#### Customizing the `streaming_replica` client certificate

In some environments, it may not be possible to generate a certificate with the
common name `streaming_replica` due to company policies or other security
concerns, such as a CA shared across multiple clusters. In such cases, the user
mapping feature can be used to allow authentication as the `streaming_replica`
user with certificates containing different common names.

To configure this setup, add a `pg_ident.conf` entry for the predefined map
named `cnp_streaming_replica`.

For example, to enable `streaming_replica` authentication using a certificate
with the common name `streaming-replica.cnp.svc.cluster.local`, add the
following to your cluster definition:

```yaml
apiVersion: postgresql.k8s.enterprisedb.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  postgresql:
    pg_ident:
      - cnp_streaming_replica streaming-replica.cnp.svc.cluster.local streaming_replica
```

For further details on how `pg_ident.conf` is managed by the operator, see the
["PostgreSQL Configuration" page](postgresql_conf.md#the-pg_ident-section) in
the documentation.

#### Cert-manager example

This simple example shows how to use [cert-manager](https://cert-manager.io/)
265 changes: 265 additions & 0 deletions product_docs/docs/postgres_for_kubernetes/1/failover.mdx
@@ -96,3 +96,268 @@ expected outage.

Enabling a new configuration option to delay failover provides a mechanism to
prevent premature failover for short-lived network or node instability.

## Failover Quorum (Quorum-based Failover)

!!! Warning
    *Failover quorum* is an experimental feature introduced in version 1.27.0.
    Use with caution in production environments.

Failover quorum is a mechanism that enhances data durability and safety during
failover events in EDB Postgres for Kubernetes-managed PostgreSQL clusters.

Quorum-based failover allows the controller to determine whether to promote a replica
to primary based on the state of a quorum of replicas.
This is useful when you need stronger data durability guarantees than those
offered by [synchronous replication](replication.md#synchronous-replication)
combined with the default automated failover procedure.

When synchronous replication is not enabled, some data loss is expected and
accepted during failover, as a replica may lag behind the primary when
promoted.

With synchronous replication enabled, the application does not receive
acknowledgment of a successful commit until the WAL data is known to have been
safely received by all required synchronous standbys.
However, this alone is not enough to guarantee that the operator is able to
promote the most advanced replica.

For example, in a three-node cluster with synchronous replication set to `ANY 1
(...)`, data is written to the primary and one standby before a commit is
acknowledged. If both the primary and the aligned standby become unavailable
(such as during a network partition), the remaining replica may not have the
latest data. Promoting it could lose some data that the application considered
committed.

Quorum-based failover addresses this risk by ensuring that failover occurs
only if the operator can confirm that the instance to promote contains all
synchronously committed data; otherwise, no failover takes place.

This feature allows users to choose their preferred trade-off between data
durability and data availability.

Failover quorum can be enabled by setting the annotation
`alpha.k8s.enterprisedb.io/failoverQuorum="true"` in the `Cluster` resource.
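
For example, a minimal sketch of a `Cluster` that enables the feature (the
cluster name and storage size are illustrative):

```yaml
apiVersion: postgresql.k8s.enterprisedb.io/v1
kind: Cluster
metadata:
  name: cluster-example
  annotations:
    # Experimental: enables quorum-based failover
    alpha.k8s.enterprisedb.io/failoverQuorum: "true"
spec:
  instances: 3
  storage:
    size: 1Gi
```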

!!! info
    When this feature is out of the experimental phase, the annotation
    `alpha.k8s.enterprisedb.io/failoverQuorum` will be replaced by a
    configuration option in the `Cluster` resource.

### How it works

Before promoting a replica to primary, the operator performs a quorum check,
following the principles of the Dynamo `R + W > N` consistency model[^1].

In quorum failover, these values have the following meaning:

- `R` is the number of *promotable replicas* (read quorum);
- `W` is the number of replicas that must acknowledge the write before the
  `COMMIT` is returned to the client (write quorum);
- `N` is the total number of potentially synchronous replicas.

*Promotable replicas* are replicas that:

- are part of the cluster;
- are able to report their state to the operator;
- are potentially synchronous.

If `R + W > N`, then at least one of the promotable replicas is guaranteed to
contain all synchronously committed data, and the operator can safely promote
it to primary. Otherwise, the controller will not promote any replica to
primary and will wait for the situation to change.

Users can force a promotion of a replica to primary through the
`kubectl cnp promote` command even if the quorum check is failing.

!!! Warning
    Manual promotion should only be used as a last resort. Before proceeding,
    make sure you fully understand the risk of data loss and carefully consider
    the consequences of prioritizing the resumption of write workloads for your
    applications.

An additional CRD is used to track the quorum state of the cluster. A `Cluster`
with quorum failover enabled has a `FailoverQuorum` resource with the same
name as the `Cluster` resource. The `FailoverQuorum` CR is created by the
controller when quorum failover is enabled, updated by the primary instance
during its reconciliation loop, and read by the operator during quorum checks.
It tracks the latest known configuration of synchronous replication.
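
For illustration, a sketch of the resource's identity, assuming the
`FailoverQuorum` CRD is served under the same API group and version as the
`Cluster` (the cluster name is illustrative):

```yaml
apiVersion: postgresql.k8s.enterprisedb.io/v1  # assumed: same group/version as Cluster
kind: FailoverQuorum
metadata:
  # same name (and namespace) as the Cluster it tracks
  name: cluster-example
```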

!!! Important
    Users should not modify the `FailoverQuorum` resource directly. During
    PostgreSQL configuration changes, when the effective synchronous replication
    configuration cannot be determined, the `FailoverQuorum` resource is reset,
    preventing any failover until the new configuration is applied.

The `FailoverQuorum` resource works in conjunction with PostgreSQL synchronous
replication.

!!! Warning
    There is no guarantee that `COMMIT` operations acknowledged to the client
    but not performed synchronously, such as those issued after explicitly
    disabling synchronous replication with `SET synchronous_commit TO local`,
    will be present on a promoted replica.

### Quorum Failover Example Scenarios

In the following scenarios, `R` is the number of promotable replicas, `W` is
the number of replicas that must acknowledge a write before commit, and `N` is
the total number of potentially synchronous replicas. The "Failover" column
indicates whether failover is allowed under quorum failover rules.

#### Scenario 1: Three-node cluster, failing pod(s)

A cluster with `instances: 3`, `synchronous.number=1`, and
`dataDurability=required`.
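
A minimal sketch of such a configuration (the placement of `dataDurability`
under `postgresql.synchronous` is assumed here, following the snippets shown
in Scenario 4):

```yaml
instances: 3
postgresql:
  synchronous:
    method: any
    number: 1
    dataDurability: required
```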

- If only the primary fails, two promotable replicas remain (R=2).
Since `R + W > N` (2 + 1 > 2), failover is allowed and safe.
- If both the primary and one replica fail, only one promotable replica
remains (R=1). Since `R + W = N` (1 + 1 = 2), failover is not allowed to
prevent possible data loss.

| R | W | N | Failover |
| :-: | :-: | :-: | :------: |
| 2 | 1 | 2 | ✅ |
| 1 | 1 | 2 | ❌ |

#### Scenario 2: Three-node cluster, network partition

A cluster with `instances: 3`, `synchronous.number=1`, and
`dataDurability=required` experiences a network partition.

- If the operator can communicate with the primary, no failover occurs. The
cluster can be impacted if the primary cannot reach any standby, since it
won't commit transactions due to synchronous replication requirements.
- If the operator cannot reach the primary but can reach both replicas (R=2),
  failover is allowed. If the operator can reach only one replica (R=1),
  failover is not allowed, as the unreachable replica may be the synchronous
  one.

| R | W | N | Failover |
| :-: | :-: | :-: | :------: |
| 2 | 1 | 2 | ✅ |
| 1 | 1 | 2 | ❌ |

#### Scenario 3: Five-node cluster, network partition

A cluster with `instances: 5`, `synchronous.number=2`, and
`dataDurability=required` experiences a network partition.
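
A corresponding sketch, with the same assumption about where `dataDurability`
is configured:

```yaml
instances: 5
postgresql:
  synchronous:
    method: any
    number: 2
    dataDurability: required
```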

- If the operator can communicate with the primary, no failover occurs. The
  cluster can be impacted if the primary cannot reach at least two standbys,
  since it won't commit transactions due to synchronous replication
  requirements.
- If the operator cannot reach the primary but can reach at least three
  replicas (R=3), failover is allowed. If the operator can reach only two
  replicas (R=2), failover is not allowed, as the required synchronous
  standbys may be among the unreachable replicas.

| R | W | N | Failover |
| :-: | :-: | :-: | :------: |
| 3 | 2 | 4 | ✅ |
| 2 | 2 | 4 | ❌ |

#### Scenario 4: Three-node cluster with remote synchronous replicas

Consider a cluster with `instances: 3` and remote synchronous replicas defined
in `standbyNamesPre` or `standbyNamesPost`. We assume that the primary is
failing.

This scenario requires an important consideration. Replicas listed in
`standbyNamesPre` or `standbyNamesPost` are not counted in
`R` (they cannot be promoted), but are included in `N` (they may have received
synchronous writes). So, if
`synchronous.number <= len(standbyNamesPre) + len(standbyNamesPost)`, failover
is not possible, as no local replica can be guaranteed to have the required
data. The operator prevents such configurations during validation, but some
invalid configurations are shown below for clarity.

**Example configurations:**

Configuration #1 (valid):

```yaml
instances: 3
postgresql:
  synchronous:
    method: any
    number: 2
    standbyNamesPre:
      - angus
```

In this configuration, when the primary fails, `R = 2` (the local replicas),
`W = 2`, and `N = 3` (2 local replicas + 1 remote), so failover is allowed.
If an additional replica fails (`R = 1`), failover is not allowed.

| R | W | N | Failover |
| :-: | :-: | :-: | :------: |
| 2 | 2 | 3 | ✅ |
| 1 | 2 | 3 | ❌ |

Configuration #2 (invalid):

```yaml
instances: 3
postgresql:
  synchronous:
    method: any
    number: 1
    maxStandbyNamesFromCluster: 1
    standbyNamesPre:
      - angus
```

In this configuration, only one local replica can appear in the synchronous
standby names (`maxStandbyNamesFromCluster: 1`), so `R = 1`, `W = 1`, and
`N = 2` (1 local replica + 1 remote).
Failover is never possible in this setup, so quorum failover cannot be
enabled with this configuration.

| R | W | N | Failover |
| :-: | :-: | :-: | :------: |
| 1 | 1 | 2 | ❌ |

Configuration #3 (invalid):

```yaml
instances: 3
postgresql:
  synchronous:
    method: any
    number: 1
    maxStandbyNamesFromCluster: 0
    standbyNamesPre:
      - angus
      - malcolm
```

In this configuration, `R = 0` (no local replica can appear in the synchronous
standby names), `W = 1`, and `N = 2` (0 local replicas + 2 remote).
Failover is never possible in this setup, so quorum failover cannot be
enabled with this configuration.

| R | W | N | Failover |
| :-: | :-: | :-: | :------: |
| 0 | 1 | 2 | ❌ |

#### Scenario 5: Three-node cluster, preferred data durability, network partition

Consider a cluster with `instances: 3`, `synchronous.number=1`, and
`dataDurability=preferred` that experiences a network partition.
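
The configuration differs from Scenario 1 only in the data durability setting;
as a sketch (same assumptions as above):

```yaml
instances: 3
postgresql:
  synchronous:
    method: any
    number: 1
    dataDurability: preferred
```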

- If the operator can communicate with both the primary and the API server,
the primary continues to operate, removing unreachable standbys from the
`synchronous_standby_names` set.
- If the primary cannot reach the operator or the API server, a quorum check
  is performed. The `FailoverQuorum` status cannot have changed in the
  meantime, since the primary cannot have received a new configuration. If the
  operator can reach both replicas, failover is allowed (`R=2`). If only one
  replica is reachable (`R=1`), failover is not allowed.

| R | W | N | Failover |
| :-: | :-: | :-: | :------: |
| 2 | 1 | 2 | ✅ |
| 1 | 1 | 2 | ❌ |

[^1]: [Dynamo: Amazon’s highly available key-value store](https://www.amazon.science/publications/dynamo-amazons-highly-available-key-value-store)