fix(watchdog): honest error reporting (PROBE_FAILED + TOOL_ERROR vs DRIFT)#119
fix(watchdog): honest error reporting (PROBE_FAILED + TOOL_ERROR vs DRIFT)#119gHashTag wants to merge 1 commit into
Conversation
…RIFT)
Closes the 5-day window where audit-watchdog comments on trios#143 said
'verdict: DRIFT, ledger_rows: 0' even when tri-railway never produced a
JSON line at all (token rotated / auth scope wrong / network timeout).
Root cause:
Both 'Audit Acc1/Acc2' steps had a hardcoded fallback that synthesized
'{verdict:DRIFT, ledger_rows:0, services:0, exit_code:1}' on missing
JSON output. This conflated three distinct failure modes:
- genuine DRIFT (services found, divergence detected)
- silent writer (services up, but writer not emitting rows)
- tool/auth failure (we couldn't even ask)
Reporting all three as 'DRIFT, 0 rows' was an R5 honesty violation.
Fix:
1. Pre-probe each Railway token via 'query{me{id}}' against
backboard.railway.com/graphql/v2 BEFORE running tri-railway.
Probe outcomes: ok / missing_token / not_authorized / network_error
/ http_<code> / unknown_response.
2. If probe != ok, skip tri-railway and emit:
{verdict:'PROBE_FAILED', exit_code:3, services:null,
ledger_rows:null, probe_reason:<reason>}
3. If probe == ok but tri-railway emits no JSON line:
{verdict:'TOOL_ERROR', exit_code:4, services:null, ledger_rows:null}
4. Digest renders null services/ledger_rows as '?' instead of '0'
(zero has a specific meaning - empty table, writer silent - and
must not be conflated with 'we could not ask').
5. Combined-exit logic: 3 (probe) and 4 (tool) are NOT folded into 1
(DRIFT). New comment block surfaces the probe reason; new warn
steps log without failing the workflow on probe/tool errors.
Cross-validation (TRI gateway, 2026-05-03 14:30Z):
- Acc1 IGLA project list: 200 OK, 6 services
- default project list: GraphQL 'Not Authorized' - matches the
'Acc2 token rotated since 2026-04-27' hypothesis from #61
Refs:
- issue #16 (audit-watchdog scope)
- issue #61 (RailwayMultiClient P0 - the underlying single-token
limitation that PROBE_FAILED surfaces honestly until #61 lands)
- trios#143 (race tracker - 5 days of misleading watchdog comments)
phi^2 + phi^-2 = 3 - TRINITY - NEVER STOP
CI status — R5-honest updateThe 3 failing checks ( This PR (#119) does not touch
Resolution paths
This PR's actual changes (watchdog honest reporting) are independent of the failing crate. Recommending option 1 (merge #64 first). |
Summary
Two honesty fixes to
.github/workflows/audit-watchdog.yml:Active lie (today's symptom): the digest comment posted to trios#143 every hour renders
Ledger rows: 0while Neonbpb_sampleshas 27 335 rows across all seeds (20 246 in the last 12 h). Readers of the tracker see "writer is silent" when the writer is very much alive — it just does not write to the--ledger-paththe audit workflow never passes.Latent booby-trap: the existing hardcoded fallback on missing
tri-railwayJSON output synthesizes{verdict:"DRIFT", exit_code:1, services:0, ledger_rows:0}. Iftri-railwayever panics, loses its auth, or hangs, readers would see a confident "DRIFT" on the race tracker when in reality we did not manage to ask. That is an R5 violation waiting to happen.This PR fixes both by separating three distinct failure modes and by making the "we could not ask" state unmistakable.
Today's ground truth (2026-05-03 14:30 UTC)
Pulled with the TRI MCP gateway, which uses the same Acc1 workflow secret:
bpb_sampleslast 6 he4fe33bb-...200 OK, 6 services39d833c1-...Not Authorizedfrom the gatewayThe last 8 watchdog comments on trios#143 report
Acc1: 6 services · 0 ledger rows · NOT YETandAcc2: 9 services · 0 ledger rows · 3 drift events · NOT YET. Services count matches reality; 0 ledger rows does not — there are 27 335 rows inbpb_samplesand 20 246 of them land in the last 12 hours. Readingmain.rs::run_auditshows why:The audit pipeline was designed around a JSONL file; the actual write path has moved to Neon
bpb_sampleswithout the audit side catching up. #62 bpb_samples DDL is the longer fix; this PR is the short-term workaround that stops the lie and adds an honest probe layer.Fix (scoped, additive, workflow-only)
No Rust / CLI changes. Everything happens in the watchdog YAML so the existing
tri-railway --jsoncontract is preserved.1. Read ledger row count from Neon directly before digest
(Currently the patch in this PR does not call Neon directly — that is a follow-up hook pending
NEON_DATABASE_URLsecret ownership confirmation. This PR as-filed covers only the probe/fallback hardening. I will open a follow-up PR for the Neon lookup the moment the secret is sanctioned; doing it in one PR would couple the safety fix to a secret decision.)2. Pre-probe each Railway token before running
tri-railwayRuns against
backboard.railway.com/graphql/v2with the workflow-scopedRAILWAY_TOKEN; terminates after 10 s (no infinite hangs). Verified today that an Acc1-scoped token answers{"data":{"me":{"id":...}}}and an Acc2-scoped token (tested via the MCP gateway) returnsNot Authorizedagainst the default project — this patch will now classify the latter asprobe=not_authorized.3.
PROBE_FAILEDandTOOL_ERRORare distinct fromDRIFTIf the probe fails, the step writes:
{"verdict":"PROBE_FAILED","exit_code":3,"services":null,"ledger_rows":null, "events":[],"probe_reason":"<reason>",...}If the probe is OK but
tri-railwaystill emits no JSON line (panic, OOM, schema mismatch), the step writes{"verdict":"TOOL_ERROR","exit_code":4, ...}.servicesandledger_rowsare JSONnull, not0.4. Digest renders
nullas?, not0A new
Probecolumn surfaces the exact failure reason (not_authorized,http_502,network_error, …) when present. "0 rows" keeps its meaning ("writer silent / table empty"). "?" means "we could not ask". The current comment stream on trios#143 would continue to show0 rowsuntil the Neon hook (follow-up PR) lands — which is honest: the audit pipeline genuinely has zero rows on its own path, separate frombpb_samples.5. Combined-exit routing keeps DRIFT precise
PROBE_FAILED(3) andTOOL_ERROR(4) emit::warning::and let the run finish green — they are not DRIFT. Only a real DRIFT still triggersexit 1. The close-on-PASS path is unchanged.Scope boundaries (what this PR does not do)
ledger_rows: 0. That requires either passing--ledger-pathto a valid JSONL, or teachingtri-railway audit runto readbpb_samplesfrom Neon. Filed as a follow-up once this guardrail merges.tri-railwayRust binary. Existing--jsonschema stays the same; new verdict strings (PROBE_FAILED,TOOL_ERROR) are emitted by the workflow shell, so external parsers keep working.concurrency.group: audit-watchdogor the cron cadence.RailwayMultiClientP0. Instead it surfaces the Acc2-token-scope mismatch honestly asprobe=not_authorizeduntil P0: tri-railway-core::RailwayMultiClient — Acc1/Acc2/Acc3 routing (BLOCKS #58 Live arm) #61 lands.Test plan
python3 -c "import yaml; yaml.safe_load(open('.github/workflows/audit-watchdog.yml'))"→ 14 steps (was 9; added 2 probes + 2 warn steps, rebuilt digest).not_authorizedpath emits valid JSON withservices:null, ledger_rows:null, probe_reason:not_authorized, exit_code:3; digest renders asServices=? Ledger=? Verdict=PROBE_FAILED Probe=not_authorized.:05 UTCrun — expected:NOT YET, 6 services, 0 rows(unchanged from today, because Neon hook is follow-up).probe=ok(token valid, project access partial) producing the sameNOT YET, 9 servicesline; ORPROBE_FAILED, probe=not_authorizedif the gateway result matches the workflow-scoped secret. Either way the comment is honest about what was asked.git revertreturns to current behavior.Refs
bpb_samplesDDL (the real fix forledger_rows: 0)phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP