Various and Sundry Cache V2 improvements #3525
A site-local cache (Cache.EnableSiteLocalMode) is meant to appear to the federation as a client and fetch objects from other caches rather than directly from origins. The V1 (XRootD) cache achieves this by setting XRD_PELICANDIRECTORYQUERYMODE=cache (see commit 75d93f4). The V2 (persistent) cache had no equivalent: every upstream fetch unconditionally used WithCacheEmbeddedClientMode(), which routes the director query through /api/v1.0/director/origin/ and pulls straight from origins. A site-local V2 cache therefore ignored the federation's caches.
Make WithCacheEmbeddedClientMode take a bool and gate it on a new useEmbeddedCacheMode() helper that returns false when site-local mode is enabled, so the director redirects the cache to other caches instead. All six embedded-fetch sites in the persistent cache are routed through it.
Tests:
- client: WithCacheEmbeddedClientMode(true) routes to the origin endpoint, (false) to the director's shortcut (cache) endpoint.
- local_cache: useEmbeddedCacheMode() reflects Cache.EnableSiteLocalMode.
- e2e: stand up a federation (director + origin + advertised V2 cache) plus a separate `pelican cache serve` child in site-local mode, then download through the site-local cache and assert (via Cache-Control: only-if-cached) that the upstream cache received the object, confirming the fetch routed through the cache, not the origin.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
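The gating described in this commit can be pictured with a minimal sketch. Everything here is illustrative: the `cacheConfig` struct, the helper taking a config argument, and `upstreamKind` are assumptions made for the example, not the signatures used in the PR.

```go
package main

import "fmt"

// cacheConfig is a hypothetical stand-in for the cache configuration;
// only the site-local flag matters for this sketch.
type cacheConfig struct {
	EnableSiteLocalMode bool // mirrors Cache.EnableSiteLocalMode
}

// useEmbeddedCacheMode reports whether upstream fetches should use the
// embedded-cache client mode (director query routed to origins).
// A site-local cache returns false so the director sends it to other caches.
func useEmbeddedCacheMode(cfg cacheConfig) bool {
	return !cfg.EnableSiteLocalMode
}

// upstreamKind is an illustrative label for where the director would
// redirect the fetch given the chosen mode.
func upstreamKind(embedded bool) string {
	if embedded {
		return "origin"
	}
	return "cache"
}

func main() {
	for _, siteLocal := range []bool{false, true} {
		cfg := cacheConfig{EnableSiteLocalMode: siteLocal}
		mode := useEmbeddedCacheMode(cfg)
		fmt.Printf("site-local=%v embedded=%v upstream=%s\n",
			siteLocal, mode, upstreamKind(mode))
	}
}
```

Each fetch site would then pass the helper's result to WithCacheEmbeddedClientMode rather than enabling the mode unconditionally.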
The V2 cache's periodic data-integrity scan re-reads and re-checksums every complete object on each cycle. For deployments whose underlying storage already guarantees at-rest integrity (e.g. ZFS with scrubbing), repeatedly re-reading every object is wasteful; an initial baseline checksum is still wanted, ideally compared against the origin's reported value.
Add Cache.DataScanMode (default "all"). When set to "once", the data scan reads back and checksums each object's on-disk data exactly once: it records the checksum in the cache database and, when the object already carries an origin-reported checksum, the existing verify path compares the on-disk data against it. A new max-time CacheMetadata.DataVerified timestamp marks objects that have been verified; subsequent scans skip them without re-reading. The default "all" mode is unchanged and records no DataVerified timestamp (no extra metadata write per object per scan).
Tests cover both modes: "once" verifies an object a single time then skips it, and "all" re-verifies on every scan.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
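A rough sketch of the per-object decision in the two scan modes follows. The `object` struct, the `scanObject` name, and the use of SHA-256 are assumptions for illustration; the commit does not state the checksum algorithm or the real metadata layout.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// object is a hypothetical, in-memory stand-in for a cached object and
// the metadata fields that matter to the scan.
type object struct {
	Data           []byte
	OriginChecksum string    // hex digest reported by the origin; empty if none
	LocalChecksum  string    // digest recorded by the scan
	DataVerified   time.Time // zero value means "never verified"
}

type scanResult int

const (
	skipped scanResult = iota
	verified
	mismatch
)

// scanObject checksums the object unless mode is "once" and the object
// already carries a DataVerified timestamp.
func scanObject(obj *object, mode string) scanResult {
	if mode == "once" && !obj.DataVerified.IsZero() {
		return skipped // already verified; do not re-read
	}
	sum := sha256.Sum256(obj.Data) // SHA-256 is an assumption here
	got := hex.EncodeToString(sum[:])
	if obj.OriginChecksum != "" && obj.OriginChecksum != got {
		return mismatch
	}
	obj.LocalChecksum = got
	if mode == "once" {
		obj.DataVerified = time.Now() // "all" mode writes no timestamp
	}
	return verified
}

func main() {
	o := &object{Data: []byte("payload")}
	fmt.Println(scanObject(o, "once") == verified) // first pass reads the data
	fmt.Println(scanObject(o, "once") == skipped)  // later passes skip it
}
```

Note that with no origin checksum the first pass only records a baseline; it has nothing to compare against.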
Add a "chaos monkey" for the V2 cache that deliberately corrupts or truncates a
cached object's on-disk data, to exercise the cache's integrity-detection paths
(the read-time AES-GCM check and the periodic data-integrity scan).
Because BadgerDB is single-process, a CLI cannot open the cache database while
the server holds it. The injection therefore runs in-process in the cache
server, exposed via an admin-authenticated endpoint
(POST /api/v1.0/cache/introspect/chaos) that `pelican cache chaos` drives
against a running cache. The endpoint is destructive, so it is registered only
when the new Cache.EnableChaosAPI parameter (hidden, default false) is set.
ChaosInjector (local_cache/chaos.go) wraps the live database and storage and
implements:
- CorruptBlock: flip the first N bytes of a block's encrypted on-disk
representation so its authentication tag fails.
- TruncateObject: drop trailing block(s) from a chunk file.
Both map a federation object (URL+ETag or instance hash) to its on-disk chunk
file and block offset; inline (in-database) objects are rejected.
Detection is not necessarily immediate: blocks still warm in the in-memory
caches keep reading until evicted; corruption is caught on the next cold read,
the data scan, or after a restart.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
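As an illustration of the CorruptBlock idea above, the sketch below flips the leading bytes of one fixed-size block in a file. The block size constant, the function names, and the flat block layout are assumptions for the example; the real ChaosInjector resolves objects through the cache database and its own storage layout.

```go
package main

import (
	"fmt"
	"os"
)

// blockTotalSize is an assumed on-disk block size (ciphertext plus
// authentication tag); the real constant lives in the cache code.
const blockTotalSize = 4096 + 16

// flipPrefix inverts the first n bytes of buf (clamped to len(buf)) and
// returns how many bytes were flipped.
func flipPrefix(buf []byte, n int) int {
	if n < 0 {
		n = 0
	}
	if n > len(buf) {
		n = len(buf)
	}
	for i := 0; i < n; i++ {
		buf[i] ^= 0xFF
	}
	return n
}

// corruptBlock flips the first numBytes bytes of the given block in place.
func corruptBlock(path string, block int64, numBytes int) error {
	if numBytes <= 0 || numBytes > blockTotalSize {
		numBytes = blockTotalSize // clamp, as the handler does
	}
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	var buf [blockTotalSize]byte // fixed-size buffer, no unbounded allocation
	off := block * blockTotalSize
	n, err := f.ReadAt(buf[:numBytes], off)
	if n == 0 {
		return fmt.Errorf("block %d not present: %w", block, err)
	}
	flipPrefix(buf[:n], n)
	_, err = f.WriteAt(buf[:n], off)
	return err
}

func main() {
	f, err := os.CreateTemp("", "chunk-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.Write(make([]byte, 2*blockTotalSize)); err != nil {
		panic(err)
	}
	f.Close()

	if err := corruptBlock(f.Name(), 1, 8); err != nil {
		panic(err)
	}
	data, _ := os.ReadFile(f.Name())
	fmt.Printf("first byte of block 1: %#x\n", data[blockTotalSize])
}
```

Because the bytes are part of the encrypted representation, any flipped bit should cause the block's authentication check to fail on the next cold read.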
    numBytes = BlockTotalSize
    }
    f, err := os.OpenFile(chunkPath, os.O_RDWR, 0)

    dropBytes = BlockTotalSize
    }
    f, err := os.OpenFile(chunkPath, os.O_RDWR, 0)

    c.JSON(http.StatusBadRequest, gin.H{"error": "invalid bytes parameter"})
    return
    }
    result, err = injector.CorruptBlock(objectURL, etag, instance, uint32(block), int(nbytes))

    c.JSON(http.StatusBadRequest, gin.H{"error": "invalid drop-bytes parameter"})
    return
    }
    result, err = injector.TruncateObject(objectURL, etag, instance, int(chunk), drop)

    components: ["cache"]
    hidden: true
    ---
    name: Cache.DataScanMode
My one real worry in here: "once" mode assumes a fact Pelican can't see (that the underlying storage scrubs), and nothing warns the admin/operator when it isn't true. If someone sets "once" on a non-ZFS filesystem (or even on ZFS with no scrub scheduled), subsequent bitrot never gets caught.
The more serious case is the no-origin-checksum path: if len(meta.Checksums) == 0 we compute the on-disk hash, record it as the baseline, mark it verified, and move on, with nothing to compare against. That permanently blesses a corrupt-at-ingest file, which is exactly what bit CIT last week, no? So "once" is weakest in the case we care most about.
Could we at least (a) log a loud WARN and export a "scan mode = once" metric so a misconfigured cache is visible, and (b) keep a low-rate random re-sample even in once mode, so the floor is some at-rest detection rather than zero? Other ideas?
    default: true
    components: ["cache"]
    ---
    name: Cache.EnableChaosAPI
Defaults look right. Off, hidden, admin-auth, staging-only in the docs.
Two small things: (1) Can enabling this trigger a persistent WARN and a visible "chaos enabled" status somewhere? It's a deliberate data-corruption feature, and the obvious failure mode is a staging config accidentally making it into production. (2) Does the tool reproduce the actual CIT failure (the ingest race: concurrent pulls, a timeout, partial-then-completed), or only corrupt already-cached objects at rest? If just the latter, it's exercising the path that didn't break.
This is great, but just to be clear: this seems like the "detect and work around" half of what we discussed yesterday, not the "prevent" half, correct? The defaults look right: disabled by default, the param's hidden, the endpoint needs admin auth, and the docs say staging-only.
The actual CIT issue was a bad file ingest, and nothing did a full-file or origin check before serving it; the scan only catches that after a job has already read corruption. Is the completion-path fix coming separately? (Don't mark an object serveable until a full-file checksum clears, against the origin's when we have one.) That's the piece that actually keeps corrupt data from getting to jobs.
IMHO the most important bit is the origin checksum. Verify-once and self-heal only work if the cache can get truth from there. What's our actual coverage across OSDF origins today? (I.e., the standing "Pelican object checksum" item.)
Two inline comments are below on the new knobs...
The POSIXv2 origin emits XRootD-style monitoring packets (user-login, f-stream
open/close) for each served transfer; the XRootD-based cache does so via its
XRootD process. The V2 persistent cache, having no XRootD, emitted none. Add
equivalent monitoring for completeness.
serveObject now emits a transfer event (via metrics.EmitTransferEvent) after a
GET is served, reporting the object path, bytes served, client IP, user agent,
project, and best-effort user/issuer attribution parsed from the (already
authorized) bearer token. Bytes served are tracked via the existing
trailerWriter and the no-store io.Copy path; 304s and other zero-byte responses
emit nothing.
Because the persistent cache has no XRootD to launch the monitoring shoveler,
cacheServeWithPersistentCache now starts it when Shoveler.Enable is set, mirroring
the POSIXv2 origin launcher, so the in-process packets reach the configured
collectors.
Tests:
- unit coverage of the emit helper (packet emitted when enabled; no-op when
disabled or zero-byte) and client-IP extraction;
- TestCacheMonitoringUDPCapture: a full handler-level end-to-end test that
serves a GET through serveObject, runs the real shoveler forwarding to a UDP
collector, and asserts the user-login and f-stream packets (including the
object path) are captured off the wire. It avoids the slow e2e_fed_tests
federation harness by building the cache with DeferConfig against a stub
federation and injecting a public namespace. (There was previously no
UDP-capture test for the origin either; the existing origin test only
inspects the internal channel.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
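The byte accounting and the "emit nothing for zero-byte responses" rule described in this commit could look roughly like the sketch below. The `countingWriter`, `transferEvent`, `maybeEmit`, and `clientIP` names are hypothetical; the PR itself uses the existing trailerWriter and metrics.EmitTransferEvent, whose signatures are not shown here.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// countingWriter wraps a ResponseWriter and tallies bytes written.
type countingWriter struct {
	http.ResponseWriter
	bytes int64
}

func (w *countingWriter) Write(p []byte) (int, error) {
	n, err := w.ResponseWriter.Write(p)
	w.bytes += int64(n)
	return n, err
}

// transferEvent is an illustrative subset of the reported fields.
type transferEvent struct {
	Path     string
	Bytes    int64
	ClientIP string
}

// clientIP strips the port from a RemoteAddr-style string; on parse
// failure it returns the input unchanged.
func clientIP(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

// maybeEmit calls emit only when monitoring is enabled and at least one
// byte was served; it reports whether an event was emitted.
func maybeEmit(enabled bool, ev transferEvent, emit func(transferEvent)) bool {
	if !enabled || ev.Bytes == 0 {
		return false
	}
	emit(ev)
	return true
}

func main() {
	rec := httptest.NewRecorder()
	cw := &countingWriter{ResponseWriter: rec}
	_, _ = io.Copy(cw, strings.NewReader("hello, cache"))

	ev := transferEvent{Path: "/ns/obj", Bytes: cw.bytes, ClientIP: clientIP("192.0.2.7:4242")}
	emitted := maybeEmit(true, ev, func(e transferEvent) { fmt.Printf("emit %+v\n", e) })
	fmt.Println("emitted:", emitted)
}
```

Passing the emitter as a function keeps the helper easy to unit-test, in the same spirit as the emit-helper tests listed above.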
That was already done as part of #3477 (provided you're doing a full-object download; a partial object doesn't have the same protection). When doing a full GET, the cache object will never be served if it doesn't match the origin's checksum. This PR takes care of (eventually) the partial-object case: once the object copy is complete, it's eligible for a complete scan.
…afeguards)
CodeQL findings on the chaos API:
- Path traversal: validate that a (caller-supplied) instance hash is hex
before it is used to build a filesystem path, and verify the resolved
chunk-file path stays within its storage directory (safeChunkPath).
- Unbounded slice allocation: corrupt-block now uses a fixed-size stack
buffer (numBytes is already clamped to <= BlockTotalSize).
- Unchecked integer narrowing: the chaos handler validates each query
parameter against an explicit [min, max] range before converting to
uint32/int.
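The path-traversal safeguard in the first bullet might be sketched as follows. The directory layout (a two-character prefix directory and a `hash.chunk` file name) and the helper signatures are assumptions for illustration; only the two checks themselves (hex validation, then containment) come from the commit message.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// validInstanceHash accepts only non-empty, even-length hex strings.
func validInstanceHash(h string) bool {
	if h == "" {
		return false
	}
	_, err := hex.DecodeString(h)
	return err == nil
}

// withinDir reports whether p, once cleaned, stays inside base.
func withinDir(base, p string) bool {
	rel, err := filepath.Rel(base, p)
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// safeChunkPath builds a chunk-file path from a caller-supplied hash.
// The layout used here is hypothetical.
func safeChunkPath(storageDir, instanceHash string, chunk int) (string, error) {
	if !validInstanceHash(instanceHash) {
		return "", errors.New("instance hash must be hex")
	}
	if chunk < 0 {
		return "", errors.New("chunk index must be non-negative")
	}
	base := filepath.Clean(storageDir)
	p := filepath.Join(base, instanceHash[:2], fmt.Sprintf("%s.%d", instanceHash, chunk))
	// Defense in depth: re-check containment even though the hash is hex.
	if !withinDir(base, p) {
		return "", errors.New("resolved path escapes storage directory")
	}
	return p, nil
}

func main() {
	fmt.Println(safeChunkPath("/var/cache/pelican", "abcd1234", 0))
	fmt.Println(safeChunkPath("/var/cache/pelican", "../../etc/passwd", 0))
}
```

The hex check rejects traversal input before any path is built; the containment check is a second line of defense if the layout ever changes.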
Review feedback on Cache.DataScanMode=once (couvares): "once" mode trusts the
underlying storage to detect bitrot after the initial check, with nothing
warning when that assumption is false.
- Emit a loud startup WARN and export pelican_cache_data_scan_mode_once when
"once" mode is active.
- Keep a floor of at-rest detection: the scan re-verifies ~1 in N
already-checked objects each cycle (new Cache.DataScanResampleInterval,
default 100 = ~1%; 0 disables). Documented that "once" cannot catch
corruption present at ingest — that needs a completion-path checksum,
which is independent of this setting.
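One way to picture the resample floor is the sketch below. It assumes a random 1-in-N draw per already-verified object; the commit only says "~1 in N", so the actual selection method may differ, and the `shouldVerify` name and signature are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldVerify decides whether the scan reads an object this cycle.
// Unverified objects are always read. Verified objects are re-read with
// probability 1/resampleInterval; an interval of 0 disables resampling.
// intn is injected (e.g. rng.Intn) so the choice is testable.
func shouldVerify(alreadyVerified bool, resampleInterval int, intn func(int) int) bool {
	if !alreadyVerified {
		return true
	}
	if resampleInterval <= 0 {
		return false
	}
	return intn(resampleInterval) == 0
}

func main() {
	rng := rand.New(rand.NewSource(1))
	const trials = 100000
	hits := 0
	for i := 0; i < trials; i++ {
		if shouldVerify(true, 100, rng.Intn) {
			hits++
		}
	}
	// With an interval of 100 the re-verified fraction should be near 1%.
	fmt.Printf("re-verified %d of %d already-checked objects\n", hits, trials)
}
```

An interval of 1 makes every draw return 0, so every object is re-verified each cycle, which matches the ResampleInterval=1 test case mentioned in this commit.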
Review feedback on Cache.EnableChaosAPI visibility: export
pelican_cache_chaos_api_enabled (in addition to the existing startup WARN) so a
staging config that leaks into production is observable.
Tests: resample floor (ResampleInterval=1 re-verifies every cycle).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This branch has a grab-bag of improvements in the (unreleased) Cache V2. Highlights: