Various and Sundry Cache V2 improvements#3525

Draft
bbockelm wants to merge 5 commits into
PelicanPlatform:mainfrom
bbockelm:cachev2-integrity-monitoring


@bbockelm

Collaborator

This branch has a grab-bag of improvements in the (unreleased) Cache V2. Highlights:

  • Site-local mode now works for cache V2.
  • Addition of a new "verify once" data-scan mode. If you have ZFS data scrubbing enabled, there's no point in periodically recalculating the checksum: do it once and it should be fine subsequently.
  • Add a "chaos monkey" API that will corrupt cached files in admin-specified ways.
  • Add XRootD-compatible monitoring packets so we don't lose monitoring info when we start updating services.

bbockelm and others added 3 commits June 18, 2026 20:11
A site-local cache (Cache.EnableSiteLocalMode) is meant to appear to the
federation as a client and fetch objects from other caches rather than
directly from origins. The V1 (XRootD) cache achieves this by setting
XRD_PELICANDIRECTORYQUERYMODE=cache (see commit 75d93f4).

The V2 (persistent) cache had no equivalent: every upstream fetch
unconditionally used WithCacheEmbeddedClientMode(), which routes the
director query through /api/v1.0/director/origin/ and pulls straight from
origins. A site-local V2 cache therefore ignored the federation's caches.

Make WithCacheEmbeddedClientMode take a bool and gate it on a new
useEmbeddedCacheMode() helper that returns false when site-local mode is
enabled, so the director redirects the cache to other caches instead. All
six embedded-fetch sites in the persistent cache are routed through it.
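
The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code in this branch: `fetchOptions`, `FetchOption`, and `directorRoute` are hypothetical stand-ins, and the real `WithCacheEmbeddedClientMode` and `useEmbeddedCacheMode` signatures may differ (here the site-local flag is passed in explicitly rather than read from configuration).

```go
package main

import "fmt"

// fetchOptions is a hypothetical stand-in for the client's transfer options.
type fetchOptions struct {
	embeddedCacheMode bool
}

// FetchOption mutates fetchOptions (functional-option pattern).
type FetchOption func(*fetchOptions)

// WithCacheEmbeddedClientMode mirrors the option named in the commit message.
// Assumption: the real option lives in the Pelican client and may look different.
func WithCacheEmbeddedClientMode(enabled bool) FetchOption {
	return func(o *fetchOptions) { o.embeddedCacheMode = enabled }
}

// useEmbeddedCacheMode returns false when site-local mode is enabled, so the
// cache behaves like an ordinary client and is redirected to other caches.
func useEmbeddedCacheMode(siteLocalMode bool) bool {
	return !siteLocalMode
}

// directorRoute reports which kind of server the director would redirect to:
// "origin" for the embedded-cache query, "cache" for the shortcut endpoint.
func directorRoute(opts ...FetchOption) string {
	var o fetchOptions
	for _, opt := range opts {
		opt(&o)
	}
	if o.embeddedCacheMode {
		return "origin"
	}
	return "cache"
}

func main() {
	for _, siteLocal := range []bool{false, true} {
		route := directorRoute(WithCacheEmbeddedClientMode(useEmbeddedCacheMode(siteLocal)))
		fmt.Printf("siteLocal=%v -> director redirects to %s\n", siteLocal, route)
	}
}
```

Routing every embedded-fetch site through one helper keeps the site-local decision in a single place instead of repeating the configuration check at each call site.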

Tests:
  - client: WithCacheEmbeddedClientMode(true) routes to the origin
    endpoint, (false) to the director's shortcut (cache) endpoint.
  - local_cache: useEmbeddedCacheMode() reflects Cache.EnableSiteLocalMode.
  - e2e: stand up a federation (director + origin + advertised V2 cache)
    plus a separate `pelican cache serve` child in site-local mode, then
    download through the site-local cache and assert (via
    Cache-Control: only-if-cached) that the upstream cache received the
    object — confirming the fetch routed through the cache, not the origin.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The V2 cache's periodic data-integrity scan re-reads and re-checksums every
complete object on each cycle. For deployments whose underlying storage already
guarantees at-rest integrity (e.g. ZFS with scrubbing), repeatedly re-reading
every object is wasteful; an initial baseline checksum is still wanted, ideally
compared against the origin's reported value.

Add Cache.DataScanMode (default "all"). When set to "once", the data scan reads
back and checksums each object's on-disk data exactly once: it records the
checksum in the cache database and, when the object already carries an
origin-reported checksum, the existing verify path compares the on-disk data
against it. A new max-time CacheMetadata.DataVerified timestamp marks objects
that have been verified; subsequent scans skip them without re-reading. The
default "all" mode is unchanged and records no DataVerified timestamp (no extra
metadata write per object per scan).

Tests cover both modes: "once" verifies an object a single time then skips it,
and "all" re-verifies on every scan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "chaos monkey" for the V2 cache that deliberately corrupts or truncates a
cached object's on-disk data, to exercise the cache's integrity-detection paths
(the read-time AES-GCM check and the periodic data-integrity scan).

Because BadgerDB is single-process, a CLI cannot open the cache database while
the server holds it. The injection therefore runs in-process in the cache
server, exposed via an admin-authenticated endpoint
(POST /api/v1.0/cache/introspect/chaos) that `pelican cache chaos` drives
against a running cache. The endpoint is destructive, so it is registered only
when the new Cache.EnableChaosAPI parameter (hidden, default false) is set.

ChaosInjector (local_cache/chaos.go) wraps the live database and storage and
implements:
  - CorruptBlock: flip the first N bytes of a block's encrypted on-disk
    representation so its authentication tag fails.
  - TruncateObject: drop trailing block(s) from a chunk file.
Both map a federation object (URL+ETag or instance hash) to its on-disk chunk
file and block offset; inline (in-database) objects are rejected.

Detection is not necessarily immediate: blocks still warm in the in-memory
caches keep reading until evicted; corruption is caught on the next cold read,
the data scan, or after a restart.
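
Why a cold read catches the flipped bytes: AES-GCM authenticates the ciphertext, so decryption of a modified block fails outright. A minimal standard-library demonstration follows. How the cache actually derives keys and nonces is not described in this PR, so the all-zero nonce and random key below are assumptions for illustration only (a real system needs a unique nonce per key and block).

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealBlock encrypts plaintext and appends the authentication tag.
func sealBlock(aead cipher.AEAD, nonce, plaintext []byte) []byte {
	return aead.Seal(nil, nonce, plaintext, nil)
}

// openBlock decrypts a sealed block; it returns an error if the ciphertext
// or tag was modified after sealing.
func openBlock(aead cipher.AEAD, nonce, sealed []byte) ([]byte, error) {
	return aead.Open(nil, nonce, sealed, nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	aead, err := newGCM(key)
	if err != nil {
		panic(err)
	}
	// Illustration only: a fixed nonce is acceptable here because the key is
	// random and used for a single message.
	nonce := make([]byte, aead.NonceSize())
	sealed := sealBlock(aead, nonce, []byte("cached object bytes"))

	_, err = openBlock(aead, nonce, sealed)
	fmt.Println("intact block error:", err)

	sealed[0] ^= 0xFF // simulate the chaos monkey flipping an on-disk byte
	_, err = openBlock(aead, nonce, sealed)
	fmt.Println("corrupted block error:", err)
}
```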

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread local_cache/chaos.go
numBytes = BlockTotalSize
}

f, err := os.OpenFile(chunkPath, os.O_RDWR, 0)
Comment thread local_cache/chaos.go Fixed
Comment thread local_cache/chaos.go
dropBytes = BlockTotalSize
}

f, err := os.OpenFile(chunkPath, os.O_RDWR, 0)
c.JSON(http.StatusBadRequest, gin.H{"error": "invalid bytes parameter"})
return
}
result, err = injector.CorruptBlock(objectURL, etag, instance, uint32(block), int(nbytes))
c.JSON(http.StatusBadRequest, gin.H{"error": "invalid drop-bytes parameter"})
return
}
result, err = injector.TruncateObject(objectURL, etag, instance, int(chunk), drop)
Comment thread docs/parameters.yaml
components: ["cache"]
hidden: true
---
name: Cache.DataScanMode


My one real worry in here: once mode assumes a fact Pelican can't see (that the underlying storage scrubs), and nothing warns the admin/operator when it isn't true. If someone sets once on a non-ZFS f/s (or even on ZFS with no scrub scheduled), subsequent bitrot never gets caught.
The more serious case is the no-origin-checksum path: if len(meta.Checksums) == 0 we compute the on-disk hash, record it as the baseline, mark it verified, and move on, with nothing to compare against. That permanently blesses a corrupt-at-ingest file which is exactly what bit CIT last week, no? So once is weakest in the case we care most about.

Could we at least (a) log a loud WARN and export a "scan mode = once" metric so a misconfigured cache is visible, and (b) keep a low-rate random re-sample even in once mode, so the floor is some at-rest detection rather than zero? Other ideas?

Comment thread docs/parameters.yaml
default: true
components: ["cache"]
---
name: Cache.EnableChaosAPI


Defaults look right. Off, hidden, admin-auth, staging-only in the docs.

Two small things: (1) Can enabling this trigger a persistent WARN and a visible "chaos enabled" status somewhere? It's a deliberate data-corruption feature, and the obvious failure mode is a staging config accidentally making it into production. (2) Does the tool reproduce the actual CIT failure (the ingest race: concurrent pulls, a timeout, partial-then-completed), or only corrupt already-cached objects at rest? If just the latter, it's exercising the path that didn't break.

@couvares couvares left a comment


This is great, but just to be clear: this seems like the "detect and work around" half of what we discussed yesterday, not the "prevent" half, correct? The defaults look right: disabled by default, the param's hidden, the endpoint needs admin auth, and the docs say staging-only.

The actual CIT issue was a bad file ingest, and nothing did a full-file or origin check before serving it; the scan only catches that after a job has already read corruption. Is the completion-path fix coming separately? (Don't mark an object serveable until a full-file checksum clears, against the origin's when we have one.) That's the piece that actually keeps corrupt data from getting to jobs.

IMHO the most important bit is the origin checksum. Verify-once and self-heal only work if the cache can get truth from there. What's our actual coverage across OSDF origins today? (I.e., the standing "Pelican object checksum" item.)

Two inline comments are below on the new knobs...

The POSIXv2 origin emits XRootD-style monitoring packets (user-login, f-stream
open/close) for each served transfer; the XRootD-based cache does so via its
XRootD process. The V2 persistent cache, having no XRootD, emitted none. Add
equivalent monitoring for completeness.

serveObject now emits a transfer event (via metrics.EmitTransferEvent) after a
GET is served, reporting the object path, bytes served, client IP, user agent,
project, and best-effort user/issuer attribution parsed from the (already
authorized) bearer token. Bytes served are tracked via the existing
trailerWriter and the no-store io.Copy path; 304s and other zero-byte responses
emit nothing.

Because the persistent cache has no XRootD to launch the monitoring shoveler,
cacheServeWithPersistentCache now starts it when Shoveler.Enable is set, mirroring
the POSIXv2 origin launcher, so the in-process packets reach the configured
collectors.

Tests:
  - unit coverage of the emit helper (packet emitted when enabled; no-op when
    disabled or zero-byte) and client-IP extraction;
  - TestCacheMonitoringUDPCapture: a full handler-level end-to-end test that
    serves a GET through serveObject, runs the real shoveler forwarding to a UDP
    collector, and asserts the user-login and f-stream packets (including the
    object path) are captured off the wire. It avoids the slow e2e_fed_tests
    federation harness by building the cache with DeferConfig against a stub
    federation and injecting a public namespace. (There was previously no
    UDP-capture test for the origin either; the existing origin test only
    inspects the internal channel.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bbockelm bbockelm force-pushed the cachev2-integrity-monitoring branch from 231040a to ad5dba2 Compare June 20, 2026 02:24
@bbockelm

Collaborator Author

> Is the completion-path fix coming separately? (Don't mark an object serveable until a full-file checksum clears, against the origin's when we have one.) That's the piece that actually keeps corrupt data from getting to jobs.

That was already done as part of #3477 (provided you're doing a full-object download; a partial object doesn't have the same protection). When doing a full GET, the cache object will never be served if it doesn't match the origin's checksum.

This PR (eventually) takes care of the partial-object case: once the object copy is complete, it becomes eligible for a complete scan.

…afeguards)

CodeQL findings on the chaos API:
  - Path traversal: validate that a (caller-supplied) instance hash is hex
    before it is used to build a filesystem path, and verify the resolved
    chunk-file path stays within its storage directory (safeChunkPath).
  - Unbounded slice allocation: corrupt-block now uses a fixed-size stack
    buffer (numBytes is already clamped to <= BlockTotalSize).
  - Unchecked integer narrowing: the chaos handler validates each query
    parameter against an explicit [min, max] range before converting to
    uint32/int.

Review feedback on Cache.DataScanMode=once (couvares): "once" mode trusts the
underlying storage to detect bitrot after the initial check, with nothing
warning when that assumption is false.
  - Emit a loud startup WARN and export pelican_cache_data_scan_mode_once when
    "once" mode is active.
  - Keep a floor of at-rest detection: the scan re-verifies ~1 in N
    already-checked objects each cycle (new Cache.DataScanResampleInterval,
    default 100 = ~1%; 0 disables). Documented that "once" cannot catch
    corruption present at ingest — that needs a completion-path checksum,
    which is independent of this setting.

Review feedback on Cache.EnableChaosAPI visibility: export
pelican_cache_chaos_api_enabled (in addition to the existing startup WARN) so a
staging config that leaks into production is observable.

Tests: resample floor (ResampleInterval=1 re-verifies every cycle).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3 participants