
Add more info re: WAL failover probe file, logging #20129


Open · wants to merge 2 commits into main from 20250814-DOC-14573-wal-failover-explain-log-output

Conversation

@rmloveland (Contributor) commented Aug 14, 2025

Fixes DOC-14573

Summary of changes:

  • Update the step-by-step sequence of actions when WAL failover is enabled

  • Introduce the 'probe file' used to assess primary store health

  • Include a sample log line for searchability

  • Clarify WAL failover scope: only WAL writes move to the secondary store; data files remain on the primary

  • Add more info on read behavior during a disk stall

N.B. This also removes some dangling, incorrect references to WAL failover being in Preview.

@rmloveland marked this pull request as draft on August 14, 2025 21:19

@rmloveland force-pushed the 20250814-DOC-14573-wal-failover-explain-log-output branch from 6692045 to 4d532f6 on August 19, 2025 15:17
@rmloveland marked this pull request as ready for review on August 19, 2025 15:18
@rmloveland requested a review from sumeerbhola on August 19, 2025 15:21

@sumeerbhola left a comment


mostly nits

Reviewed 2 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


src/current/_includes/v25.3/wal-failover-intro.md line 8 at r1 (raw file):

> - Pairs each primary store with a secondary failover store at node startup.
> - Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.

nit: monitors all write operations, not just fsync.

If any operation exceeds ...
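
To make the behavior under discussion concrete, here is a minimal Go sketch of the mechanism: time every write operation against the primary WAL (not only the `fsync`), and redirect new WAL writes to the secondary store once any single operation exceeds the unhealthy-op threshold. This is an illustration, not CockroachDB's implementation; the `walWriter` type and the 100ms threshold value are assumptions.

```go
// Illustrative sketch, not CockroachDB's implementation: a WAL writer
// that times every write operation against the primary WAL and fails
// over to the secondary store when any single op exceeds the
// unhealthy-op threshold. walWriter and its fields are hypothetical.
package walfailover

import (
	"log"
	"os"
	"sync/atomic"
	"time"
)

// unhealthyOpThreshold stands in for the cluster setting
// storage.wal_failover.unhealthy_op_threshold (value assumed here).
const unhealthyOpThreshold = 100 * time.Millisecond

type walWriter struct {
	primary, secondary *os.File
	failedOver         atomic.Bool
}

// append writes one WAL record and syncs it, timing the whole
// operation rather than only the fsync. A single slow operation on
// the primary flips all *new* WAL writes to the secondary store;
// data files stay on the primary volume.
func (w *walWriter) append(rec []byte) error {
	target := w.primary
	if w.failedOver.Load() {
		target = w.secondary
	}
	start := time.Now()
	if _, err := target.Write(rec); err != nil {
		return err
	}
	err := target.Sync()
	if elapsed := time.Since(start); target == w.primary && elapsed > unhealthyOpThreshold {
		if w.failedOver.CompareAndSwap(false, true) {
			log.Printf("WAL op took %s (> %s); redirecting new WAL writes to secondary",
				elapsed, unhealthyOpThreshold)
		}
	}
	return err
}
```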


src/current/_includes/v25.3/wal-failover-intro.md line 9 at r1 (raw file):

> - Pairs each primary store with a secondary failover store at node startup.
> - Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
> - Probes the primary store while failed over by `fsync`ing a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.

Prober also does a bunch of operations: create, write, fsync on the probe file.

It switches back when a full (remove, create, write, fsync) pass starts consuming < 25ms.
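
A Go sketch of that probe loop, with the same caveat that this is illustrative rather than CockroachDB's code (the file name `probe-file` and the one-second polling interval are made up; the 25ms pass threshold comes from the comment above):

```go
// Illustrative sketch only: repeatedly run a remove/create/write/fsync
// pass against a small probe file on the primary volume while failed
// over, and switch WAL writes back once a full pass is fast enough.
package walprobe

import (
	"os"
	"path/filepath"
	"time"
)

const (
	healthyPassThreshold = 25 * time.Millisecond // per the review comment
	probeInterval        = time.Second           // assumed
)

// probePass performs one full remove/create/write/fsync cycle against
// the probe file and reports how long the whole pass took. The file
// contains no user data.
func probePass(dir string) (time.Duration, error) {
	path := filepath.Join(dir, "probe-file") // hypothetical name
	start := time.Now()
	_ = os.Remove(path) // ignore "does not exist" on the first pass
	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	if _, err := f.Write([]byte("probe")); err != nil {
		return 0, err
	}
	if err := f.Sync(); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

// probeUntilHealthy polls the primary volume while failed over and
// returns once a full pass completes under the healthy threshold.
func probeUntilHealthy(dir string) {
	for {
		if d, err := probePass(dir); err == nil && d < healthyPassThreshold {
			return // caller redirects new WAL writes back to the primary
		}
		time.Sleep(probeInterval)
	}
}
```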


src/current/_includes/v25.3/wal-failover-intro.md line 15 at r1 (raw file):

> {{site.data.alerts.callout_info}}
> - WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled; caches typically limit blast radius, but some reads may see elevated latency.
> - [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables) is chosen to bound long cloud disk stalls without flapping; tune with care. High tail-latency cloud volumes (for example, oversubscribed [AWS EBS gp3](https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html#gp3-ebs-volume-type)) are more prone to transient stalls.

I'm not sure we want to say something about "tune with care". We have a recommended value for this when WAL failover is enabled, we don't really want people to deviate from it, and we do say strong things about it (as we should):

> Additionally, you must set the value of the environment variable COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT to 40s. By default, CockroachDB detects prolonged stalls and crashes the node after 20s. With WAL failover enabled, CockroachDB should be able to survive stalls of up to 40s with minimal impact to the workload.

So do we even need this paragraph?
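
For context on what that environment variable governs, a hedged Go sketch of a disk-stall watchdog of the kind described: if a sync stays outstanding past the maximum sync duration (20s by default, 40s recommended with WAL failover, per the quote above), the process exits rather than limping along on a stalled disk. Type and method names here are hypothetical.

```go
// Illustrative sketch only: a watchdog that crashes the process when
// a disk sync has been outstanding longer than the configured maximum
// sync duration, as the quoted guidance describes.
package stallwatch

import (
	"log"
	"sync/atomic"
	"time"
)

type syncWatchdog struct {
	maxSyncDuration time.Duration // e.g. 20s default, 40s with WAL failover
	// startedAtNanos is 0 when no sync is in flight, else the
	// time.Now().UnixNano() at which the current sync began.
	startedAtNanos atomic.Int64
}

func (w *syncWatchdog) beginSync() { w.startedAtNanos.Store(time.Now().UnixNano()) }
func (w *syncWatchdog) endSync()   { w.startedAtNanos.Store(0) }

// run periodically checks whether the in-flight sync has exceeded the
// limit, and if so crashes the node (log.Fatalf exits the process).
func (w *syncWatchdog) run() {
	for range time.Tick(time.Second) {
		started := w.startedAtNanos.Load()
		if started == 0 {
			continue
		}
		if stalled := time.Since(time.Unix(0, started)); stalled > w.maxSyncDuration {
			log.Fatalf("disk stall detected: sync outstanding for %s (limit %s)",
				stalled, w.maxSyncDuration)
		}
	}
}
```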


src/current/_includes/v25.3/wal-failover-metrics.md line 14 at r1 (raw file):

> - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
>
> In addition to metrics, logs help identify disk stalls during WAL failover. The following message indicates a disk stall on the primary store's volume:

These logs are independent of WAL failover. I don't know if we want to speak about them here.


@rmloveland (Contributor, Author) left a comment


Thanks Sumeer! Updated based on your feedback - PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


src/current/_includes/v25.3/wal-failover-intro.md line 8 at r1 (raw file):
Updated to say:

> Monitors latency of all write operations against the primary WAL. If any operation exceeds ...


src/current/_includes/v25.3/wal-failover-intro.md line 9 at r1 (raw file):
Updated to say:

> • Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume ...
> • Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds)

Trying to stay less specific about the exact ops performed and the exact timing for a full pass, for ease of future doc maintenance as implementation changes happen. PTAL and let me know what you think; open to suggestions.


src/current/_includes/v25.3/wal-failover-intro.md line 15 at r1 (raw file):

Previously, sumeerbhola wrote…

> I'm not sure we want to say something about "tune with care". We have a recommended value for this when WAL failover is enabled and we don't really want people to deviate from that, and we do say strong things about it (as we should):
>
> Additionally, you must set the value of the environment variable COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT to 40s. By default, CockroachDB detects prolonged stalls and crashes the node after 20s. With WAL failover enabled, CockroachDB should be able to survive stalls of up to 40s with minimal impact to the workload.
>
> So do we even need this paragraph?

Removed the whole paragraph at this bullet.

Left the other bullet re: caching since it seems useful?

But please let me know if by "this paragraph" you meant the whole bulleted list; I can remove that too.


src/current/_includes/v25.3/wal-failover-metrics.md line 14 at r1 (raw file):

Previously, sumeerbhola wrote…

> These logs are independent of WAL failover. I don't know if we want to speak about them here.

Removed!


@sumeerbhola left a comment


:lgtm:

@sumeerbhola reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rmloveland)


src/current/_includes/v25.3/wal-failover-intro.md line 9 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

> Updated to say:
>
> • Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume ...
> • Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds)
>
> Trying to stay less specific about the exact ops performed and the exact timing for a full pass, for ease of future doc maintenance as implementation changes happen. PTAL and let me know what you think; open to suggestions.

looks good


src/current/_includes/v25.3/wal-failover-intro.md line 15 at r1 (raw file):

Previously, rmloveland wrote…

> Left the other bullet re: caching since it seems useful?

Agreed. That is useful.

@rmloveland requested a review from taroface on August 20, 2025 21:50