Add more info re: WAL failover probe file, logging #20129
Conversation
Fixes DOC-14573

Summary of changes:
- Update the step-by-step sequence of actions when WAL failover is enabled
- Introduce the 'probe file' used to assess primary store health
- Include a sample log line for searchability
- Clarify WAL failover scope: only WAL writes move to the secondary store, data files remain on the primary
- Add more info on read behavior during a disk stall

N.B. Also removes some dangling incorrect references to WAL failover being in Preview.
Force-pushed from 6692045 to 4d532f6
mostly nits
Reviewed 2 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
src/current/_includes/v25.3/wal-failover-intro.md
line 8 at r1 (raw file):
- Pairs each primary store with a secondary failover store at node startup.
- Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
nit: monitors all write operations, not just fsync.
If any operation exceeds ...
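(For quick reference, the setting named in the excerpt can be inspected and changed from a SQL shell. A minimal sketch, assuming an insecure local cluster; the 100ms value is only an example, not necessarily the shipped default.)

```shell
# Sketch only: inspect and adjust the failover threshold named in the excerpt
# above. Assumes an insecure local cluster; the 100ms value is illustrative.
cockroach sql --insecure --execute \
  "SHOW CLUSTER SETTING storage.wal_failover.unhealthy_op_threshold;"
cockroach sql --insecure --execute \
  "SET CLUSTER SETTING storage.wal_failover.unhealthy_op_threshold = '100ms';"
```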
src/current/_includes/v25.3/wal-failover-intro.md
line 9 at r1 (raw file):
- Pairs each primary store with a secondary failover store at node startup.
- Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
- Probes the primary store while failed over by `fsync`ing a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.
Prober also does a bunch of operations: create, write, fsync on the probe file.
It switches back when a full (remove, create, write, fsync) pass starts consuming < 25ms.
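(To make the pass described above concrete, here is a rough shell sketch that times one remove/create/write/fsync cycle against a small file. It is illustrative only, not the actual Pebble prober, and /mnt/data1 is a hypothetical store path.)

```shell
# Sketch only: time a remove/create/write/fsync pass against a small file on
# the primary store's volume. /mnt/data1 is a hypothetical store path; this is
# not the real prober, just the shape of the check it performs.
PROBE=/mnt/data1/probe-file
START=$(date +%s%N)
rm -f "$PROBE"
dd if=/dev/zero of="$PROBE" bs=4k count=1 conv=fsync status=none
END=$(date +%s%N)
echo "probe pass took $(( (END - START) / 1000000 )) ms"
```

Per the comment above, the prober repeats this kind of pass while failed over and switches WAL writes back to the primary once a full pass stays under the latency threshold.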
src/current/_includes/v25.3/wal-failover-intro.md
line 15 at r1 (raw file):
{{site.data.alerts.callout_info}}
- WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled; caches typically limit blast radius, but some reads may see elevated latency.
- [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables) is chosen to bound long cloud disk stalls without flapping; tune with care. High tail-latency cloud volumes (for example, oversubscribed [AWS EBS gp3](https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html#gp3-ebs-volume-type)) are more prone to transient stalls.
I'm not sure we want to say something about "tune with care". We have a recommended value for this when WAL failover is enabled and we don't really want people to deviate from that, and we do say strong things about it (as we should):
Additionally, you must set the value of the environment variable COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT to 40s. By default, CockroachDB detects prolonged stalls and crashes the node after 20s. With WAL failover enabled, CockroachDB should be able to survive stalls of up to 40s with minimal impact to the workload.
So do we even need this paragraph?
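(For context, the recommendation quoted above amounts to setting the environment variable when starting the node. A minimal sketch, assuming WAL failover between two stores; the store paths and join list are placeholders, and the `--wal-failover` value should be confirmed against the WAL failover page.)

```shell
# Sketch only: start a node with the recommended 40s stall threshold and WAL
# failover between two stores. Store paths and the join list are placeholders;
# confirm the exact --wal-failover flag value against the WAL failover docs.
COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT=40s \
cockroach start \
  --store=path=/mnt/data1 \
  --store=path=/mnt/data2 \
  --wal-failover=among-stores \
  --join={join list} \
  --insecure
```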
src/current/_includes/v25.3/wal-failover-metrics.md
line 14 at r1 (raw file):
- By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).

In addition to metrics, logs help identify disk stalls during WAL failover. The following message indicates a disk stall on the primary store's volume:
These logs are independent of WAL failover. I don't know if we want to speak about them here.
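(On the metrics side of this include: one way to see which WAL failover series a node exposes is to scrape its Prometheus endpoint directly. A minimal sketch; port 8080 is the default HTTP port, and the grep pattern is an assumption, so take exact metric names from the metrics list itself.)

```shell
# Sketch only: list WAL-failover-related series from the node's Prometheus
# endpoint. Port 8080 is the default HTTP port; the grep pattern is a guess,
# so rely on the documented metric names rather than this filter.
curl -s http://localhost:8080/_status/vars | grep -i wal_failover
```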
thanks Sumeer! Updated based on your feedback - PTAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
src/current/_includes/v25.3/wal-failover-intro.md
line 8 at r1 (raw file):
Updated to say:
Monitors latency of all write operations against the primary WAL. If any operation exceeds ...
src/current/_includes/v25.3/wal-failover-intro.md
line 9 at r1 (raw file):
Updated to say:
- Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume ...
- Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds)
Trying to stay less specific about the exact ops performed and exact timing for full pass for ease of future doc maintenance as implementation changes happen. PTAL and let me know what you think, open to suggestions
src/current/_includes/v25.3/wal-failover-intro.md
line 15 at r1 (raw file):
Previously, sumeerbhola wrote…
I'm not sure we want to say something about "tune with care". We have a recommended value for this when WAL failover is enabled and we don't really want people to deviate from that, and we do say strong things about it (as we should):
Additionally, you must set the value of the environment variable COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT to 40s. By default, CockroachDB detects prolonged stalls and crashes the node after 20s. With WAL failover enabled, CockroachDB should be able to survive stalls of up to 40s with minimal impact to the workload.
So do we even need this paragraph?
Removed the whole paragraph at this bullet.
Left the other bullet re: caching since it seems useful?
But please let me know if by "this paragraph" you meant this whole bulleted list; I can remove that too.
src/current/_includes/v25.3/wal-failover-metrics.md
line 14 at r1 (raw file):
Previously, sumeerbhola wrote…
These logs are independent of WAL failover. I don't know if we want to speak about them here.
Removed!
@sumeerbhola reviewed 2 of 2 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rmloveland)
src/current/_includes/v25.3/wal-failover-intro.md
line 9 at r1 (raw file):
Previously, rmloveland (Rich Loveland) wrote…
Updated to say:
- Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume ...
- Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds)
Trying to stay less specific about the exact ops performed and exact timing for full pass for ease of future doc maintenance as implementation changes happen. PTAL and let me know what you think, open to suggestions
looks good
src/current/_includes/v25.3/wal-failover-intro.md
line 15 at r1 (raw file):
Left the other bullet re: caching since it seems useful?
Agreed. That is useful.