From 4d532f61c9eef67480b7292e8a83c3e14bf3b16a Mon Sep 17 00:00:00 2001
From: Rich Loveland
Date: Thu, 14 Aug 2025 17:16:49 -0400
Subject: [PATCH 1/2] Add more info re: WAL failover probe file, logging

Fixes DOC-14573

Summary of changes:

- Update the step-by-step sequence of actions when WAL failover is enabled
- Introduce the 'probe file' used to assess primary store health
- Include a sample log line for searchability
- Clarify WAL failover scope: only WAL writes move to the secondary store, data files remain on the primary
- Add more info on read behavior during a disk stall

NB. Also removes some dangling incorrect references to WAL failover being in Preview
---
 src/current/_includes/v25.3/wal-failover-intro.md | 15 +++++++++++----
 .../_includes/v25.3/wal-failover-metrics.md | 6 ++++++
 src/current/v25.3/cockroach-start.md | 6 +-----
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/current/_includes/v25.3/wal-failover-intro.md b/src/current/_includes/v25.3/wal-failover-intro.md
index 6d69ddc2fd8..2a6244d34db 100644
--- a/src/current/_includes/v25.3/wal-failover-intro.md
+++ b/src/current/_includes/v25.3/wal-failover-intro.md
@@ -2,8 +2,15 @@ On a CockroachDB [node]({% link {{ page.version.version }}/architecture/overview
 
 Failing over the WAL may allow some operations against a store to continue to complete despite temporary unavailability of the underlying storage. For example, if the node's primary store is stalled, and the node can't read from or write to it, the node can still write to the WAL on another store. This can allow the node to continue to service requests during momentary unavailability of the underlying storage device.
 
-When WAL failover is enabled, CockroachDB will take the the following actions:
+When WAL failover is enabled, CockroachDB:
 
-- At node startup, each store is assigned another store to be its failover destination.
-- CockroachDB will begin monitoring the latency of all WAL writes. If latency to the WAL exceeds the value of the [cluster setting `storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node will attempt to write WAL entries to a secondary store's volume.
-- CockroachDB will update the [store status endpoint]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) at `/_status/stores` so you can monitor the store's status.
+- Pairs each primary store with a secondary failover store at node startup.
+- Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
+- Probes the primary store while failed over by `fsync`ing a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.
+- Switches back to the primary store once a probe `fsync` on its volume completes within [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables). If a probe `fsync` blocks longer than this duration, CockroachDB emits a log like: `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
+- Exposes status at [`/_status/stores`]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) so you can monitor each store's health and failover state.
+
+{{site.data.alerts.callout_info}}
+- WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled; caches typically limit blast radius, but some reads may see elevated latency.
+- [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables) is chosen to bound long cloud disk stalls without flapping; tune with care. High tail-latency cloud volumes (for example, oversubscribed [AWS EBS gp3](https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html#gp3-ebs-volume-type)) are more prone to transient stalls.
+{{site.data.alerts.end}}
diff --git a/src/current/_includes/v25.3/wal-failover-metrics.md b/src/current/_includes/v25.3/wal-failover-metrics.md
index 96d17a83d48..3315d1f8531 100644
--- a/src/current/_includes/v25.3/wal-failover-metrics.md
+++ b/src/current/_includes/v25.3/wal-failover-metrics.md
@@ -10,3 +10,9 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+In addition to metrics, logs help identify disk stalls during WAL failover. The following message indicates a disk stall on the primary store's volume:
+
+~~~
+disk stall detected: sync on file probe-file has been ongoing for 40.0s
+~~~
diff --git a/src/current/v25.3/cockroach-start.md b/src/current/v25.3/cockroach-start.md
index c7c16cfde01..b6f7d0fdbb3 100644
--- a/src/current/v25.3/cockroach-start.md
+++ b/src/current/v25.3/cockroach-start.md
@@ -71,7 +71,7 @@ Flag | Description
 `--max-tsdb-memory` | Maximum memory capacity available to store temporary data for use by the time-series database to display metrics in the [DB Console]({% link {{ page.version.version }}/ui-overview.md %}). Consider raising this value if your cluster is comprised of a large number of nodes where individual nodes have very limited memory available (e.g., under `8 GiB`). Insufficient memory capacity for the time-series database can constrain the ability of the DB Console to process the time-series queries used to render metrics for the entire cluster. This capacity constraint does not affect SQL query execution. This flag accepts numbers interpreted as bytes, size suffixes (e.g., `1GB` and `1GiB`) or a percentage of physical memory (e.g., `0.01`).

**Note:** The sum of `--cache`, `--max-sql-memory`, and `--max-tsdb-memory` should not exceed 75% of the memory available to the `cockroach` process.

**Default:** `0.01` (i.e., 1%) of physical memory or `64 MiB`, whichever is greater. `--pid-file` | The file to which the node's process ID will be written as soon as the node is ready to accept connections. When `--background` is used, this happens before the process detaches from the terminal. When this flag is not set, the process ID is not written to file. `--store`
`-s` | The file path to a storage device and, optionally, store attributes and maximum size. When using multiple storage devices for a node, this flag must be specified separately for each device, for example:

`--store=/mnt/ssd01 --store=/mnt/ssd02`

For more details, see [Store](#store) below. -`--wal-failover` | Used to configure [WAL failover](#write-ahead-log-wal-failover) on [nodes]({% link {{ page.version.version }}/architecture/overview.md %}#node) with [multiple stores](#store). To enable WAL failover, pass `--wal-failover=among-stores`. To disable, pass `--wal-failover=disabled` on [node restart]({% link {{ page.version.version }}/node-shutdown.md %}#stop-and-restart-a-node). This feature is in [preview]({% link {{page.version.version}}/cockroachdb-feature-availability.md %}#features-in-preview). +`--wal-failover` | Used to configure [WAL failover](#write-ahead-log-wal-failover) on [nodes]({% link {{ page.version.version }}/architecture/overview.md %}#node) with [multiple stores](#store). To enable WAL failover, pass `--wal-failover=among-stores`. To disable, pass `--wal-failover=disabled` on [node restart]({% link {{ page.version.version }}/node-shutdown.md %}#stop-and-restart-a-node). `--spatial-libs` | The location on disk where CockroachDB looks for [spatial]({% link {{ page.version.version }}/spatial-data-overview.md %}) libraries.

**Defaults:**
`--temp-dir` | The path of the node's temporary store directory. On node start up, the location for the temporary files is printed to the standard output.

**Default:** Subdirectory of the first [store](#store)
@@ -237,10 +237,6 @@ Field | Description
 
 {% include {{ page.version.version }}/wal-failover-intro.md %}
 
-{{site.data.alerts.callout_info}}
-{% include feature-phases/preview.md %}
-{{site.data.alerts.end}}
-
 This page has basic instructions on how to enable WAL failover, disable WAL failover, and monitor WAL failover.
 
 For more detailed instructions showing how to use, test, and monitor WAL failover, as well as descriptions of how WAL failover works in multi-store configurations, see [WAL Failover]({% link {{ page.version.version }}/wal-failover.md %}).

From 1fef6ce3d94ad6eb12302d70072b09057eccac03 Mon Sep 17 00:00:00 2001
From: Rich Loveland
Date: Wed, 20 Aug 2025 12:15:39 -0400
Subject: [PATCH 2/2] Update with sumeerbhola feedback (1)

---
 src/current/_includes/v25.3/wal-failover-intro.md | 7 +++----
 src/current/_includes/v25.3/wal-failover-metrics.md | 6 ------
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/src/current/_includes/v25.3/wal-failover-intro.md b/src/current/_includes/v25.3/wal-failover-intro.md
index 2a6244d34db..0169badc895 100644
--- a/src/current/_includes/v25.3/wal-failover-intro.md
+++ b/src/current/_includes/v25.3/wal-failover-intro.md
@@ -5,12 +5,11 @@ Failing over the WAL may allow some operations against a store to continue to co
 When WAL failover is enabled, CockroachDB:
 
 - Pairs each primary store with a secondary failover store at node startup.
-- Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
-- Probes the primary store while failed over by `fsync`ing a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.
-- Switches back to the primary store once a probe `fsync` on its volume completes within [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables). If a probe `fsync` blocks longer than this duration, CockroachDB emits a log like: `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
+- Monitors latency of all write operations against the primary WAL. If any operation exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
+- While failed over, checks the primary store by performing a set of filesystem operations against a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.
+- Switches back to the primary store once the set of filesystem operations against the probe file on its volume completes within a latency threshold (on the order of tens of milliseconds). If a probe `fsync` blocks longer than [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables), CockroachDB emits a log message like `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
 - Exposes status at [`/_status/stores`]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) so you can monitor each store's health and failover state.
 
 {{site.data.alerts.callout_info}}
 - WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled; caches typically limit blast radius, but some reads may see elevated latency.
-- [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables) is chosen to bound long cloud disk stalls without flapping; tune with care. High tail-latency cloud volumes (for example, oversubscribed [AWS EBS gp3](https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html#gp3-ebs-volume-type)) are more prone to transient stalls.
 {{site.data.alerts.end}}
diff --git a/src/current/_includes/v25.3/wal-failover-metrics.md b/src/current/_includes/v25.3/wal-failover-metrics.md
index 3315d1f8531..96d17a83d48 100644
--- a/src/current/_includes/v25.3/wal-failover-metrics.md
+++ b/src/current/_includes/v25.3/wal-failover-metrics.md
@@ -10,9 +10,3 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
-
-In addition to metrics, logs help identify disk stalls during WAL failover. The following message indicates a disk stall on the primary store's volume:
-
-~~~
-disk stall detected: sync on file probe-file has been ongoing for 40.0s
-~~~
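Taken together, the two patches above describe how WAL failover is enabled and what it does at runtime. The following is a minimal sketch of starting a node with WAL failover among two stores; it assumes an insecure local test cluster, and the store paths, ports, and join addresses are placeholders rather than recommendations:

~~~ shell
# Start a node with two stores; --wal-failover=among-stores lets CockroachDB
# pair each primary store with a secondary failover store at startup.
cockroach start \
  --insecure \
  --store=/mnt/ssd01 \
  --store=/mnt/ssd02 \
  --wal-failover=among-stores \
  --listen-addr=localhost:26257 \
  --http-addr=localhost:8080 \
  --join=localhost:26257,localhost:26258,localhost:26259
~~~

To turn the feature off, restart the node with `--wal-failover=disabled`.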
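The failover trigger and the stall bound referenced in the updated bullets are controlled by the `storage.wal_failover.unhealthy_op_threshold` cluster setting and the `COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT` environment variable. A sketch of adjusting both follows; the `250ms` and `40s` values are illustrative only, not recommended defaults, and the environment variable must be set in the environment of the `cockroach` process before it starts:

~~~ shell
# Adjust the WAL write latency threshold that triggers failover to the
# secondary store (illustrative value, not a recommendation).
cockroach sql --insecure --host=localhost:26257 \
  --execute="SET CLUSTER SETTING storage.wal_failover.unhealthy_op_threshold = '250ms';"

# Bound how long a sync (including the probe-file sync) may remain blocked
# before the node treats the disk as stalled (illustrative value; must be
# exported before running `cockroach start`).
export COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT=40s
~~~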
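To observe the behavior the new include text describes, you can poll the store status endpoint and search the node logs for the disk-stall message. This sketch assumes a local node with the DB Console HTTP port on `8080`, that the store status endpoint accepts a node ID path segment such as `local`, and the default log location under the first store directory:

~~~ shell
# Inspect per-store status, including WAL failover state.
curl --silent http://localhost:8080/_status/stores/local

# Search the logs for the disk stall message emitted while a sync on the
# probe file is blocked on the primary store's volume.
grep "disk stall detected" /mnt/ssd01/logs/cockroach.log
~~~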