Various and Sundry Cache V2 improvements by bbockelm · Pull Request #3525 · PelicanPlatform/pelican

bbockelm · 2026-06-19T01:15:56Z

This branch has a grab-bag of improvements in the (unreleased) Cache V2. Highlights:

Site-local mode now works for cache V2.
Addition of a new "verify once" data-scan mode. If you have ZFS data scrubbing enabled, there's no point in periodically recalculating the checksum: do it once and it should be fine subsequently.
Add a "chaos monkey" API that will corrupt cached files in admin-specified ways.
Add XRootD-compatible monitoring packets so we don't lose monitoring info when we start updating services.

A site-local cache (Cache.EnableSiteLocalMode) is meant to appear to the federation as a client and fetch objects from other caches rather than directly from origins. The V1 (XRootD) cache achieves this by setting XRD_PELICANDIRECTORYQUERYMODE=cache (see commit 75d93f4). The V2 (persistent) cache had no equivalent: every upstream fetch unconditionally used WithCacheEmbeddedClientMode(), which routes the director query through /api/v1.0/director/origin/ and pulls straight from origins. A site-local V2 cache therefore ignored the federation's caches.

Make WithCacheEmbeddedClientMode take a bool and gate it on a new useEmbeddedCacheMode() helper that returns false when site-local mode is enabled, so the director redirects the cache to other caches instead. All six embedded-fetch sites in the persistent cache are routed through it.

Tests:
- client: WithCacheEmbeddedClientMode(true) routes to the origin endpoint, (false) to the director's shortcut (cache) endpoint.
- local_cache: useEmbeddedCacheMode() reflects Cache.EnableSiteLocalMode.
- e2e: stand up a federation (director + origin + advertised V2 cache) plus a separate `pelican cache serve` child in site-local mode, then download through the site-local cache and assert (via Cache-Control: only-if-cached) that the upstream cache received the object, confirming the fetch routed through the cache, not the origin.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
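The gating described in this commit can be sketched roughly as follows. This is an illustration only, not the PR's code: the real useEmbeddedCacheMode() reads Cache.EnableSiteLocalMode from configuration, whereas here it takes the flag as an argument, and the queryTarget type and directorTarget function are hypothetical names introduced for the sketch.

```go
package main

// queryTarget names the kind of server the director is asked to redirect to.
// Hypothetical type, introduced only for this sketch.
type queryTarget int

const (
	targetCache  queryTarget = iota // director redirects to another cache
	targetOrigin                    // director redirects straight to an origin
)

// useEmbeddedCacheMode mirrors the helper named in the commit message:
// embedded (origin-direct) mode is used unless site-local mode is enabled.
// Assumption: the siteLocal argument stands in for Cache.EnableSiteLocalMode.
func useEmbeddedCacheMode(siteLocal bool) bool {
	return !siteLocal
}

// directorTarget models the effect of the boolean passed to
// WithCacheEmbeddedClientMode: true queries the origin endpoint,
// false lets the director pick a cache.
func directorTarget(embedded bool) queryTarget {
	if embedded {
		return targetOrigin
	}
	return targetCache
}
```

With site-local mode on, directorTarget(useEmbeddedCacheMode(true)) yields targetCache, which is the behavior the e2e test asserts through the upstream cache.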

The V2 cache's periodic data-integrity scan re-reads and re-checksums every complete object on each cycle. For deployments whose underlying storage already guarantees at-rest integrity (e.g. ZFS with scrubbing), repeatedly re-reading every object is wasteful; an initial baseline checksum is still wanted, ideally compared against the origin's reported value.

Add Cache.DataScanMode (default "all"). When set to "once", the data scan reads back and checksums each object's on-disk data exactly once: it records the checksum in the cache database and, when the object already carries an origin-reported checksum, the existing verify path compares the on-disk data against it. A new max-time CacheMetadata.DataVerified timestamp marks objects that have been verified; subsequent scans skip them without re-reading. The default "all" mode is unchanged and records no DataVerified timestamp (no extra metadata write per object per scan).

Tests cover both modes: "once" verifies an object a single time then skips it, and "all" re-verifies on every scan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
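A minimal sketch of the per-object decision described in this commit, under stated assumptions: the objectMeta struct, the function names, and the use of Unix seconds for DataVerified are hypothetical stand-ins (the real field is CacheMetadata.DataVerified, whose type is not shown here), and "max-time" is read as "keep the later of two timestamps".

```go
package main

// Values of Cache.DataScanMode as described in the commit message.
const (
	scanModeAll  = "all"
	scanModeOnce = "once"
)

// objectMeta is a hypothetical stand-in for the cache's per-object metadata.
type objectMeta struct {
	// DataVerified is the time of the last successful verification in
	// Unix seconds; 0 means the object has never been verified.
	DataVerified int64
}

// needsDataScan reports whether the scan should re-read and checksum the
// object. In "once" mode an already-verified object is skipped; in "all"
// mode every object is scanned on every cycle.
func needsDataScan(mode string, meta objectMeta) bool {
	return !(mode == scanModeOnce && meta.DataVerified != 0)
}

// recordVerified stores the verification time, but only in "once" mode,
// so "all" mode incurs no extra metadata write. The timestamp only moves
// forward (assumed meaning of "max-time").
func recordVerified(mode string, meta *objectMeta, now int64) {
	if mode != scanModeOnce {
		return
	}
	if now > meta.DataVerified {
		meta.DataVerified = now
	}
}
```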

Add a "chaos monkey" for the V2 cache that deliberately corrupts or truncates a cached object's on-disk data, to exercise the cache's integrity-detection paths (the read-time AES-GCM check and the periodic data-integrity scan).

Because BadgerDB is single-process, a CLI cannot open the cache database while the server holds it. The injection therefore runs in-process in the cache server, exposed via an admin-authenticated endpoint (POST /api/v1.0/cache/introspect/chaos) that `pelican cache chaos` drives against a running cache. The endpoint is destructive, so it is registered only when the new Cache.EnableChaosAPI parameter (hidden, default false) is set.

ChaosInjector (local_cache/chaos.go) wraps the live database and storage and implements:
- CorruptBlock: flip the first N bytes of a block's encrypted on-disk representation so its authentication tag fails.
- TruncateObject: drop trailing block(s) from a chunk file.

Both map a federation object (URL+ETag or instance hash) to its on-disk chunk file and block offset; inline (in-database) objects are rejected.

Detection is not necessarily immediate: blocks still warm in the in-memory caches keep reading until evicted; corruption is caught on the next cold read, the data scan, or after a restart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
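To illustrate why flipping the first bytes of an encrypted block is enough to trip the read-time check, here is a self-contained sketch using Go's standard AES-GCM implementation. The nonce-then-ciphertext-then-tag layout and all function names are assumptions made for the example; the cache's actual on-disk block format is not shown in this PR excerpt.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealBlock encrypts plaintext and returns nonce || ciphertext || tag.
// This layout is an assumption of the sketch.
func sealBlock(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends ciphertext and tag to its first argument (the nonce).
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openBlock authenticates and decrypts a block produced by sealBlock.
// It returns an error if any byte of the block was modified.
func openBlock(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed block shorter than nonce")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

// flipFirstBytes inverts the first n bytes of buf in place, clamping n
// to the buffer length, in the spirit of the CorruptBlock operation.
func flipFirstBytes(buf []byte, n int) {
	if n > len(buf) {
		n = len(buf)
	}
	for i := 0; i < n; i++ {
		buf[i] ^= 0xFF
	}
}
```

Sealing a block, calling flipFirstBytes on the result, and then calling openBlock returns an authentication error instead of plaintext, which is the failure mode the chaos API is meant to provoke on a cold read.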

+		numBytes = BlockTotalSize
+	}
+
+	f, err := os.OpenFile(chunkPath, os.O_RDWR, 0)


+		dropBytes = BlockTotalSize
+	}
+
+	f, err := os.OpenFile(chunkPath, os.O_RDWR, 0)


+			c.JSON(http.StatusBadRequest, gin.H{"error": "invalid bytes parameter"})
+			return
+		}
+		result, err = injector.CorruptBlock(objectURL, etag, instance, uint32(block), int(nbytes))


+			c.JSON(http.StatusBadRequest, gin.H{"error": "invalid drop-bytes parameter"})
+			return
+		}
+		result, err = injector.TruncateObject(objectURL, etag, instance, int(chunk), drop)
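The handler lines above convert parsed query parameters with uint32(block) and int(nbytes). A conversion like that is only safe when the value has been range-checked first. The following is a hedged sketch of such a check; parseBounded and blockIndex are hypothetical helpers, not functions from the PR.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseBounded parses raw as a base-10 integer and rejects anything
// outside the inclusive range [lo, hi].
func parseBounded(raw string, lo, hi int64) (int64, error) {
	v, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("not an integer: %w", err)
	}
	if v < lo || v > hi {
		return 0, fmt.Errorf("value %d outside [%d, %d]", v, lo, hi)
	}
	return v, nil
}

// blockIndex parses a block-number parameter. Because the value is
// checked against the uint32 range first, the narrowing conversion
// cannot wrap around.
func blockIndex(raw string) (uint32, error) {
	v, err := parseBounded(raw, 0, math.MaxUint32)
	if err != nil {
		return 0, err
	}
	return uint32(v), nil
}
```

A handler would respond with http.StatusBadRequest whenever blockIndex returns an error, in the same way as the "invalid bytes parameter" responses shown in the diff.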


couvares · 2026-06-19T10:27:30Z

+components: ["cache"]
+hidden: true
+---
+name: Cache.DataScanMode


My one real worry in here: once mode assumes a fact Pelican can't see (that the underlying storage scrubs), and nothing warns the admin/operator when it isn't true. If someone sets once on a non-ZFS filesystem (or even on ZFS with no scrub scheduled), subsequent bitrot never gets caught.
The more serious case is the no-origin-checksum path: if len(meta.Checksums) == 0 we compute the on-disk hash, record it as the baseline, mark it verified, and move on, with nothing to compare against. That permanently blesses a corrupt-at-ingest file, which is exactly what bit CIT last week, no? So once is weakest in the case we care most about.

Could we at least (a) log a loud WARN and export a "scan mode = once" metric so a misconfigured cache is visible, and (b) keep a low-rate random re-sample even in once mode, so the floor is some at-rest detection rather than zero? Other ideas?

couvares · 2026-06-19T10:31:29Z

 default: true
 components: ["cache"]
 ---
+name: Cache.EnableChaosAPI


Defaults look right. Off, hidden, admin-auth, staging-only in the docs.

Two small things: (1) Can enabling this trigger a persistent WARN and a visible "chaos enabled" status somewhere? It's a deliberate data-corruption feature, and the obvious failure mode is a staging config accidentally making it into production. (2) Does the tool reproduce the actual CIT failure (the ingest race: concurrent pulls, a timeout, partial-then-completed), or only corrupt already-cached objects at rest? If just the latter, it's exercising the path that didn't break.

couvares

This is great, but just to be clear, this seems like the "detect and work around" half of what we discussed yesterday, not the "prevent" half, correct? The defaults look right: disabled by default, the param's hidden, the endpoint needs admin auth, and the docs say staging-only.

The actual CIT issue was a bad file ingest, and nothing did a full-file or origin check before serving it; the scan only catches that after a job has already read corruption. Is the completion-path fix coming separately? (Don't mark an object serveable until a full-file checksum clears, against the origin's when we have one.) That's the piece that actually keeps corrupt data from getting to jobs.

IMHO the most important bit is the origin checksum. Verify-once and self-heal only work if the cache can get truth from there. What's our actual coverage across OSDF origins today? (I.e., the standing "Pelican object checksum" item.)

Two inline comments are below on the new knobs...

The POSIXv2 origin emits XRootD-style monitoring packets (user-login, f-stream open/close) for each served transfer; the XRootD-based cache does so via its XRootD process. The V2 persistent cache, having no XRootD, emitted none. Add equivalent monitoring for completeness.

serveObject now emits a transfer event (via metrics.EmitTransferEvent) after a GET is served, reporting the object path, bytes served, client IP, user agent, project, and best-effort user/issuer attribution parsed from the (already authorized) bearer token. Bytes served are tracked via the existing trailerWriter and the no-store io.Copy path; 304s and other zero-byte responses emit nothing.

Because the persistent cache has no XRootD to launch the monitoring shoveler, cacheServeWithPersistentCache now starts it when Shoveler.Enable is set, mirroring the POSIXv2 origin launcher, so the in-process packets reach the configured collectors.

Tests:
- unit coverage of the emit helper (packet emitted when enabled; no-op when disabled or zero-byte) and client-IP extraction;
- TestCacheMonitoringUDPCapture: a full handler-level end-to-end test that serves a GET through serveObject, runs the real shoveler forwarding to a UDP collector, and asserts the user-login and f-stream packets (including the object path) are captured off the wire. It avoids the slow e2e_fed_tests federation harness by building the cache with DeferConfig against a stub federation and injecting a public namespace. (There was previously no UDP-capture test for the origin either; the existing origin test only inspects the internal channel.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
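Two of the small pieces mentioned in this commit, client-IP extraction and the "emit nothing for zero-byte responses" rule, might look roughly like this. Both functions are hypothetical sketches; how the PR's emit helper actually derives the client address (for example, whether it consults proxy headers) is not shown here.

```go
package main

import "net"

// clientIP extracts the host part from an address in "host:port" form,
// such as the RemoteAddr field of an http.Request. If the input has no
// port, it is returned unchanged.
func clientIP(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

// shouldEmitTransfer reports whether a transfer event should be emitted:
// monitoring must be enabled and at least one byte must have been served,
// so 304s and other zero-byte responses produce no packet.
func shouldEmitTransfer(monitoringEnabled bool, bytesServed int64) bool {
	return monitoringEnabled && bytesServed > 0
}
```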

bbockelm · 2026-06-20T02:35:27Z

> Is the completion-path fix coming separately? (Don't mark an object serveable until a full-file checksum clears, against the origin's when we have one.) That's the piece that actually keeps corrupt data from getting to jobs.

That was already done as part of #3477 (provided you're doing a full-object download; a partial object doesn't have the same protection). When doing a full GET, the cache object will never be served if it doesn't match the origin's checksum.

This PR (eventually) takes care of the partial-object case: once the object copy is complete, it is eligible for a complete scan.

…afeguards)

CodeQL findings on the chaos API:
- Path traversal: validate that a (caller-supplied) instance hash is hex before it is used to build a filesystem path, and verify the resolved chunk-file path stays within its storage directory (safeChunkPath).
- Unbounded slice allocation: corrupt-block now uses a fixed-size stack buffer (numBytes is already clamped to <= BlockTotalSize).
- Unchecked integer narrowing: the chaos handler validates each query parameter against an explicit [min, max] range before converting to uint32/int.

Review feedback on Cache.DataScanMode=once (couvares): "once" mode trusts the underlying storage to detect bitrot after the initial check, with nothing warning when that assumption is false.
- Emit a loud startup WARN and export pelican_cache_data_scan_mode_once when "once" mode is active.
- Keep a floor of at-rest detection: the scan re-verifies ~1 in N already-checked objects each cycle (new Cache.DataScanResampleInterval, default 100 = ~1%; 0 disables).

Documented that "once" cannot catch corruption present at ingest; that needs a completion-path checksum, which is independent of this setting.

Review feedback on Cache.EnableChaosAPI visibility: export pelican_cache_chaos_api_enabled (in addition to the existing startup WARN) so a staging config that leaks into production is observable.

Tests: resample floor (ResampleInterval=1 re-verifies every cycle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
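The two path-traversal guards listed above (hex validation, then a containment check) can be sketched as follows. The commit names a safeChunkPath helper but its implementation is not shown, so isHex, withinDir, and resolveChunkPath are hypothetical, as is the assumption that the chunk file sits directly under the storage directory and is named after the instance hash.

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

// isHex reports whether s is non-empty and contains only hex digits,
// which rules out path separators and ".." components.
func isHex(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		switch {
		case r >= '0' && r <= '9', r >= 'a' && r <= 'f', r >= 'A' && r <= 'F':
		default:
			return false
		}
	}
	return true
}

// withinDir reports whether candidate resolves to base or a path below it.
// filepath.Rel cleans both paths, so "a/../.." style escapes are caught.
func withinDir(base, candidate string) bool {
	rel, err := filepath.Rel(base, candidate)
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// resolveChunkPath builds the chunk-file path for an instance hash and
// applies both guards before returning it.
func resolveChunkPath(baseDir, instanceHash string) (string, error) {
	if !isHex(instanceHash) {
		return "", errors.New("instance hash is not hexadecimal")
	}
	candidate := filepath.Join(baseDir, instanceHash)
	if !withinDir(baseDir, candidate) {
		return "", errors.New("chunk path escapes storage directory")
	}
	return candidate, nil
}
```

The containment check is redundant once the hash is known to be hex, but keeping both means a later change to how the path is assembled does not silently reopen the traversal.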

bbockelm and others added 3 commits June 18, 2026 20:11

github-advanced-security AI found potential problems Jun 19, 2026

couvares reviewed Jun 19, 2026

bbockelm force-pushed the cachev2-integrity-monitoring branch from 231040a to ad5dba2 Compare June 20, 2026 02:24

Various and Sundry Cache V2 improvements #3525
bbockelm wants to merge 5 commits into PelicanPlatform:main from bbockelm:cachev2-integrity-monitoring

3 participants
