# Secondary WAL failover store must be durable #20113

Merged

src/current/v24.1/wal-failover.md: 6 additions & 0 deletions
@@ -472,6 +472,12 @@ Store _A_ will failover to store _B_, store _B_ will failover to store _C_, and

However, the WAL failback operation will not cascade back until **all drives are available** - that is, if store _A_'s disk unstalls while store _B_ is still stalled, store _C_ will not failback to store _A_ until _B_ also becomes available again. In other words, _C_ must failback to _B_, which must then failback to _A_.

### 13. Can I use an ephemeral disk for the secondary storage device?

No, the secondary (failover) disk **must be durable and retain its data across VM or instance restarts**. Using an ephemeral volume (for example, the root volume of a cloud VM that is recreated on reboot) risks permanent data loss: if CockroachDB has failed over recent WAL entries to that disk and the disk is subsequently wiped, the node will start up with an incomplete [Raft log]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and will refuse to join the cluster. In this scenario the node must be treated as lost and replaced.

Always provision the failover disk with the same persistence guarantees as the primary store.

## Video demo: WAL failover

For a demo of WAL Failover in CockroachDB and what happens when you enable or disable it, play the following video:

src/current/v24.3/wal-failover.md: 6 additions & 0 deletions
@@ -470,6 +470,12 @@ Store _A_ will failover to store _B_, store _B_ will failover to store _C_, and

However, the WAL failback operation will not cascade back until **all drives are available** - that is, if store _A_'s disk unstalls while store _B_ is still stalled, store _C_ will not failback to store _A_ until _B_ also becomes available again. In other words, _C_ must failback to _B_, which must then failback to _A_.

### 13. Can I use an ephemeral disk for the secondary storage device?

No, the secondary (failover) disk **must be durable and retain its data across VM or instance restarts**. Using an ephemeral volume (for example, the root volume of a cloud VM that is recreated on reboot) risks permanent data loss: if CockroachDB has failed over recent WAL entries to that disk and the disk is subsequently wiped, the node will start up with an incomplete [Raft log]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and will refuse to join the cluster. In this scenario the node must be treated as lost and replaced.

Always provision the failover disk with the same persistence guarantees as the primary store.

## Video demo: WAL failover

For a demo of WAL Failover in CockroachDB and what happens when you enable or disable it, play the following video:

src/current/v25.2/wal-failover.md: 6 additions & 0 deletions
@@ -470,6 +470,12 @@ Store _A_ will failover to store _B_, store _B_ will failover to store _C_, and

However, the WAL failback operation will not cascade back until **all drives are available** - that is, if store _A_'s disk unstalls while store _B_ is still stalled, store _C_ will not failback to store _A_ until _B_ also becomes available again. In other words, _C_ must failback to _B_, which must then failback to _A_.

### 13. Can I use an ephemeral disk for the secondary storage device?

No, the secondary (failover) disk **must be durable and retain its data across VM or instance restarts**. Using an ephemeral volume (for example, the root volume of a cloud VM that is recreated on reboot) risks permanent data loss: if CockroachDB has failed over recent WAL entries to that disk and the disk is subsequently wiped, the node will start up with an incomplete [Raft log]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and will refuse to join the cluster. In this scenario the node must be treated as lost and replaced.

Always provision the failover disk with the same persistence guarantees as the primary store.

## Video demo: WAL failover

For a demo of WAL Failover in CockroachDB and what happens when you enable or disable it, play the following video:

src/current/v25.3/wal-failover.md: 6 additions & 0 deletions
@@ -470,6 +470,12 @@ Store _A_ will failover to store _B_, store _B_ will failover to store _C_, and

However, the WAL failback operation will not cascade back until **all drives are available** - that is, if store _A_'s disk unstalls while store _B_ is still stalled, store _C_ will not failback to store _A_ until _B_ also becomes available again. In other words, _C_ must failback to _B_, which must then failback to _A_.

### 13. Can I use an ephemeral disk for the secondary storage device?

No, the secondary (failover) disk **must be durable and retain its data across VM or instance restarts**. Using an ephemeral volume (for example, the root volume of a cloud VM that is recreated on reboot) risks permanent data loss: if CockroachDB has failed over recent WAL entries to that disk and the disk is subsequently wiped, the node will start up with an incomplete [Raft log]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) and will refuse to join the cluster. In this scenario the node must be treated as lost and replaced.

Always provision the failover disk with the same persistence guarantees as the primary store.

## Video demo: WAL failover

For a demo of WAL Failover in CockroachDB and what happens when you enable or disable it, play the following video: