
Add more info re: WAL failover probe file, logging #20129


Open · wants to merge 2 commits into main from 20250814-DOC-14573-wal-failover-explain-log-output

Conversation

@rmloveland (Contributor) commented Aug 14, 2025

Fixes DOC-14573

Summary of changes:

  • Update the step-by-step sequence of actions when WAL failover is enabled

  • Introduce the 'probe file' used to assess primary store health

  • Include a sample log line for searchability

  • Clarify WAL failover scope: only WAL writes move to the secondary store; data files remain on the primary

  • Add more info on read behavior during a disk stall

N.B. This also removes some dangling, incorrect references to WAL failover being in Preview.

@rmloveland marked this pull request as draft on August 14, 2025 21:19

@rmloveland force-pushed the 20250814-DOC-14573-wal-failover-explain-log-output branch from 6692045 to 4d532f6 on August 19, 2025 15:17
@rmloveland marked this pull request as ready for review on August 19, 2025 15:18
@rmloveland requested a review from sumeerbhola on August 19, 2025 15:21

@sumeerbhola left a comment


mostly nits

Reviewed 2 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


src/current/_includes/v25.3/wal-failover-intro.md line 8 at r1 (raw file):

> - Pairs each primary store with a secondary failover store at node startup.
> - Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.

nit: monitors all write operations, not just fsync.

If any operation exceeds ...
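
To make the behavior under discussion concrete, here is a minimal Go sketch of the mechanism: time every write operation against the primary WAL (not only the `fsync`), and redirect new WAL writes to the secondary store once any single operation exceeds the unhealthy-op threshold. This is an illustration, not CockroachDB's implementation; the `walWriter` type and the 100ms threshold value are assumptions.

```go
// Illustrative sketch, not CockroachDB's implementation: a WAL writer
// that times every write operation against the primary WAL and fails
// over to the secondary store when any single op exceeds the
// unhealthy-op threshold. walWriter and its fields are hypothetical.
package walfailover

import (
	"log"
	"os"
	"sync/atomic"
	"time"
)

// unhealthyOpThreshold stands in for the cluster setting
// storage.wal_failover.unhealthy_op_threshold (value assumed here).
const unhealthyOpThreshold = 100 * time.Millisecond

type walWriter struct {
	primary, secondary *os.File
	failedOver         atomic.Bool
}

// append writes one WAL record and syncs it, timing the whole
// operation rather than only the fsync. A single slow operation on
// the primary flips all *new* WAL writes to the secondary store;
// data files stay on the primary volume.
func (w *walWriter) append(rec []byte) error {
	target := w.primary
	if w.failedOver.Load() {
		target = w.secondary
	}
	start := time.Now()
	if _, err := target.Write(rec); err != nil {
		return err
	}
	err := target.Sync()
	if elapsed := time.Since(start); target == w.primary && elapsed > unhealthyOpThreshold {
		if w.failedOver.CompareAndSwap(false, true) {
			log.Printf("WAL op took %s (> %s); redirecting new WAL writes to secondary",
				elapsed, unhealthyOpThreshold)
		}
	}
	return err
}
```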


src/current/_includes/v25.3/wal-failover-intro.md line 9 at r1 (raw file):

> - Pairs each primary store with a secondary failover store at node startup.
> - Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
> - Probes the primary store while failed over by `fsync`ing a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.

Prober also does a bunch of operations: create, write, fsync on the probe file.

It switches back when a full (remove, create, write, fsync) pass starts consuming < 25ms.
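
A Go sketch of that probe loop, with the same caveat that this is illustrative rather than CockroachDB's code (the file name `probe-file` and the one-second polling interval are made up; the 25ms pass threshold comes from the comment above):

```go
// Illustrative sketch only: repeatedly run a remove/create/write/fsync
// pass against a small probe file on the primary volume while failed
// over, and switch WAL writes back once a full pass is fast enough.
package walprobe

import (
	"os"
	"path/filepath"
	"time"
)

const (
	healthyPassThreshold = 25 * time.Millisecond // per the review comment
	probeInterval        = time.Second           // assumed
)

// probePass performs one full remove/create/write/fsync cycle against
// the probe file and reports how long the whole pass took. The file
// contains no user data.
func probePass(dir string) (time.Duration, error) {
	path := filepath.Join(dir, "probe-file") // hypothetical name
	start := time.Now()
	_ = os.Remove(path) // ignore "does not exist" on the first pass
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	if _, err := f.Write([]byte("probe")); err != nil {
		return 0, err
	}
	if err := f.Sync(); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

// probeUntilHealthy polls the primary volume while failed over and
// returns once a full pass completes under the healthy threshold.
func probeUntilHealthy(dir string) {
	for {
		if d, err := probePass(dir); err == nil && d < healthyPassThreshold {
			return // caller redirects new WAL writes back to the primary
		}
		time.Sleep(probeInterval)
	}
}
```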


src/current/_includes/v25.3/wal-failover-intro.md line 15 at r1 (raw file):

> {{site.data.alerts.callout_info}}
> - WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled; caches typically limit blast radius, but some reads may see elevated latency.
> - [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables) is chosen to bound long cloud disk stalls without flapping; tune with care. High tail-latency cloud volumes (for example, oversubscribed [AWS EBS gp3](https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html#gp3-ebs-volume-type)) are more prone to transient stalls.

I'm not sure we want to say something about "tune with care". We have a recommended value for this when WAL failover is enabled, we don't really want people to deviate from it, and we do say strong things about it (as we should):

> Additionally, you must set the value of the environment variable COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT to 40s. By default, CockroachDB detects prolonged stalls and crashes the node after 20s. With WAL failover enabled, CockroachDB should be able to survive stalls of up to 40s with minimal impact to the workload.

So do we even need this paragraph?
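
For context on what that environment variable governs, a hedged Go sketch of a disk-stall watchdog of the kind described: if a sync stays outstanding past the maximum sync duration (20s by default, 40s recommended with WAL failover, per the quote above), the process exits rather than limping along on a stalled disk. Type and method names here are hypothetical.

```go
// Illustrative sketch only: a watchdog that crashes the process when
// a disk sync has been outstanding longer than the configured maximum
// sync duration, as the quoted guidance describes.
package stallwatch

import (
	"log"
	"sync/atomic"
	"time"
)

type syncWatchdog struct {
	maxSyncDuration time.Duration // e.g. 20s default, 40s with WAL failover
	// startedAtNanos is 0 when no sync is in flight, else the
	// time.Now().UnixNano() at which the current sync began.
	startedAtNanos atomic.Int64
}

func (w *syncWatchdog) beginSync() { w.startedAtNanos.Store(time.Now().UnixNano()) }
func (w *syncWatchdog) endSync()   { w.startedAtNanos.Store(0) }

// run periodically checks whether the in-flight sync has exceeded the
// limit, and if so crashes the node (log.Fatalf exits the process).
func (w *syncWatchdog) run() {
	for range time.Tick(time.Second) {
		started := w.startedAtNanos.Load()
		if started == 0 {
			continue
		}
		if stalled := time.Since(time.Unix(0, started)); stalled > w.maxSyncDuration {
			log.Fatalf("disk stall detected: sync outstanding for %s (limit %s)",
				stalled, w.maxSyncDuration)
		}
	}
}
```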


src/current/_includes/v25.3/wal-failover-metrics.md line 14 at r1 (raw file):

> - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
>
> In addition to metrics, logs help identify disk stalls during WAL failover. The following message indicates a disk stall on the primary store's volume:

These logs are independent of WAL failover. I don't know if we want to speak about them here.


@rmloveland (Contributor, Author) left a comment


Thanks Sumeer! Updated based on your feedback - PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


src/current/_includes/v25.3/wal-failover-intro.md line 8 at r1 (raw file):
Updated to say:

> Monitors latency of all write operations against the primary WAL. If any operation exceeds ...


src/current/_includes/v25.3/wal-failover-intro.md line 9 at r1 (raw file):
Updated to say:

> • Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume ...
> • Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds)

Trying to stay less specific about the exact ops performed and the exact timing for a full pass, for ease of future doc maintenance as implementation changes happen. PTAL and let me know what you think; open to suggestions.


src/current/_includes/v25.3/wal-failover-intro.md line 15 at r1 (raw file):

Previously, sumeerbhola wrote…

> I'm not sure we want to say something about "tune with care". We have a recommended value for this when WAL failover is enabled and we don't really want people to deviate from that, and we do say strong things about it (as we should):
>
> Additionally, you must set the value of the environment variable COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT to 40s. By default, CockroachDB detects prolonged stalls and crashes the node after 20s. With WAL failover enabled, CockroachDB should be able to survive stalls of up to 40s with minimal impact to the workload.
>
> So do we even need this paragraph?

Removed the whole paragraph at this bullet.

Left the other bullet re: caching since it seems useful?

But please let me know if by "this paragraph" you meant the whole bulleted list; I can remove that too.


src/current/_includes/v25.3/wal-failover-metrics.md line 14 at r1 (raw file):

Previously, sumeerbhola wrote…

> These logs are independent of WAL failover. I don't know if we want to speak about them here.

Removed!


@sumeerbhola left a comment


:lgtm:

@sumeerbhola reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rmloveland)


src/current/_includes/v25.3/wal-failover-intro.md line 9 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

> Updated to say:
>
> • Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume ...
> • Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds)
>
> Trying to stay less specific about the exact ops performed and the exact timing for a full pass, for ease of future doc maintenance as implementation changes happen. PTAL and let me know what you think; open to suggestions.

looks good


src/current/_includes/v25.3/wal-failover-intro.md line 15 at r1 (raw file):

Previously, rmloveland wrote…

> Left the other bullet re: caching since it seems useful?

Agreed. That is useful.

@rmloveland requested a review from taroface on August 20, 2025 21:50