fix(watchdog): honest error reporting (PROBE_FAILED + TOOL_ERROR vs DRIFT) #119

Open
gHashTag wants to merge 1 commit into main from fix/watchdog-honest-error-reporting

gHashTag commented May 3, 2026

Summary

Two honesty fixes to .github/workflows/audit-watchdog.yml:

  1. Active lie (today's symptom): the digest comment posted to trios#143 every hour renders Ledger rows: 0 while Neon bpb_samples has 27 335 rows across all seeds (20 246 in the last 12 h). Readers of the tracker see "writer is silent" when the writer is very much alive. It writes directly to Neon, while the audit reads only the JSONL file given by --ledger-path, which the workflow never passes.

  2. Latent booby-trap: the existing hardcoded fallback on missing tri-railway JSON output synthesizes {verdict:"DRIFT", exit_code:1, services:0, ledger_rows:0}. If tri-railway ever panics, loses its auth, or hangs, readers would see a confident "DRIFT" on the race tracker when in reality we did not manage to ask. That is an R5 violation waiting to happen.

This PR fixes both by separating three distinct failure modes and by making the "we could not ask" state unmistakable.

Today's ground truth (2026-05-03 14:30 UTC)

Pulled with the TRI MCP gateway, which uses the same Acc1 workflow secret:

| Project | Gateway call | bpb_samples last 6 h | Writer status |
|---|---|---|---|
| Acc1 IGLA e4fe33bb-... | 200 OK, 6 services | seed 1597 → 3 940 rows, best BPB 2.5449 @ step 81 000 | alive, writes direct to Neon |
| Acc2 IGLA-MIRROR-2 39d833c1-... | Not Authorized from the gateway | seeds 2584 (4 024 rows, best 2.5809) and 4181 (4 073 rows, best 2.6512), last write 2026-05-03 14:27:46Z | alive, writes direct to Neon |

The last 8 watchdog comments on trios#143 report Acc1: 6 services · 0 ledger rows · NOT YET and Acc2: 9 services · 0 ledger rows · 3 drift events · NOT YET. The services count matches reality; 0 ledger rows does not: bpb_samples holds 27 335 rows, 20 246 of them written in the last 12 hours. Reading main.rs::run_audit shows why:

let ledger = match ledger_path.as_deref() {
    Some(p) => load_ledger(p).await?,
    None => Vec::new(),          // <-- workflow passes nothing -> always 0
};

The audit pipeline was designed around a JSONL file; the actual write path has moved to Neon bpb_samples without the audit side catching up. #62 bpb_samples DDL is the longer fix; this PR is the short-term workaround that stops the lie and adds an honest probe layer.

Fix (scoped, additive, workflow-only)

No Rust / CLI changes. Everything happens in the watchdog YAML so the existing tri-railway --json contract is preserved.

1. Read ledger row count from Neon directly before digest

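- name: Fetch ledger row count from Neon (last 12 h)
  id: neon_rows
  env:
    NEON_DATABASE_URL: ${{ secrets.NEON_DATABASE_URL }}
  run: |
    # The secrets context cannot be used in a step-level `if:`, so guard inside the script.
    if [[ -z "${NEON_DATABASE_URL:-}" ]]; then echo '::notice::NEON_DATABASE_URL not set, skipping Neon lookup'; exit 0; fi
    # -A: unaligned output, -t: tuples only, so each query prints a bare integer.
    ledger12=$(psql "$NEON_DATABASE_URL" -At -c "SELECT count(*) FROM bpb_samples WHERE ts > now() - interval '12 hours';")
    ledger_all=$(psql "$NEON_DATABASE_URL" -At -c "SELECT count(*) FROM bpb_samples;")
    echo "ledger12=$ledger12" >> "$GITHUB_OUTPUT"
    echo "ledger_all=$ledger_all" >> "$GITHUB_OUTPUT"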

(The patch in this PR does not yet call Neon directly. That step is a follow-up, pending confirmation of who owns the NEON_DATABASE_URL secret. This PR as filed covers only the probe/fallback hardening. I will open a follow-up PR for the Neon lookup as soon as the secret is sanctioned; doing both in one PR would couple the safety fix to a secret decision.)

2. Pre-probe each Railway token before running tri-railway

- name: Pre-probe Acc1 token (auth sanity)
  id: acc1_probe
  env:
    RAILWAY_TOKEN: ${{ env.ACC1_TOKEN }}
  run: |
    if [[ -z "${RAILWAY_TOKEN:-}" ]]; then echo 'probe=missing_token' >> "$GITHUB_OUTPUT"; exit 0; fi
    http=$(curl -sS -o /tmp/acc1_probe.json -w '%{http_code}' \
        -H "Authorization: Bearer $RAILWAY_TOKEN" \
        -H 'Content-Type: application/json' \
        --max-time 10 \
        -d '{"query":"query{me{id}}"}' \
        https://backboard.railway.com/graphql/v2)
    # ...classify into ok / not_authorized / http_<code> / network_error / unknown_response

Runs against backboard.railway.com/graphql/v2 with the workflow-scoped RAILWAY_TOKEN; terminates after 10 s (no infinite hangs). Verified today that an Acc1-scoped token answers {"data":{"me":{"id":...}}} and an Acc2-scoped token (tested via the MCP gateway) returns Not Authorized against the default project — this patch will now classify the latter as probe=not_authorized.
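
The classification that the excerpt elides could look roughly like the sketch below. This is not the code in the patch: classify_probe and its argument order are hypothetical, and it assumes the gateway reports a scope mismatch as a "Not Authorized" message in the response body, as seen in today's gateway check.

```shell
# Hypothetical helper: map one curl result to a probe outcome.
#   $1 = curl exit status, $2 = HTTP status code, $3 = response body
classify_probe() {
  local curl_rc="$1" http="$2" body="$3"
  if [[ "$curl_rc" != "0" ]]; then
    echo "network_error"        # timeout, DNS failure, TLS error, ...
  elif [[ "$body" == *"Not Authorized"* ]]; then
    echo "not_authorized"       # assumed: auth failure is signalled in the body
  elif [[ "$http" != "200" ]]; then
    echo "http_${http}"         # e.g. http_502
  elif [[ "$body" == *'"me"'*'"id"'* ]]; then
    echo "ok"                   # {"data":{"me":{"id":...}}}
  else
    echo "unknown_response"
  fi
}

# Usage inside the step (the && / || list keeps `bash -e` from aborting
# the step when curl itself fails, so network_error can still be reported):
#   http=$(curl ... -w '%{http_code}' ...) && rc=0 || rc=$?
#   probe=$(classify_probe "$rc" "$http" "$(cat /tmp/acc1_probe.json 2>/dev/null)")
#   echo "probe=$probe" >> "$GITHUB_OUTPUT"
```

Checking the body before the status code means a "Not Authorized" reply is classified the same way whether it arrives with HTTP 200 or with a 4xx status.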

3. PROBE_FAILED and TOOL_ERROR are distinct from DRIFT

If the probe fails, the step writes:

{"verdict":"PROBE_FAILED","exit_code":3,"services":null,"ledger_rows":null,
 "events":[],"probe_reason":"<reason>",...}

If the probe is OK but tri-railway still emits no JSON line (panic, OOM, schema mismatch), the step writes {"verdict":"TOOL_ERROR","exit_code":4, ...}. services and ledger_rows are JSON null, not 0.
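
A minimal sketch of how the workflow shell might build these two fallback records. synth_verdict is a hypothetical name, and only the fields quoted above are emitted; the real records carry additional fields that are elided here.

```shell
# Hypothetical helper: print a fallback verdict record as one JSON line.
#   $1 = verdict string, $2 = exit code, $3 = probe reason (optional)
# Reasons come from a fixed set (not_authorized, http_502, ...), so no
# JSON escaping is attempted here.
synth_verdict() {
  local verdict="$1" exit_code="$2" reason="${3:-}"
  if [[ -n "$reason" ]]; then
    # Probe failed: we never asked, so the counts are null, not 0.
    printf '{"verdict":"%s","exit_code":%d,"services":null,"ledger_rows":null,"events":[],"probe_reason":"%s"}\n' \
      "$verdict" "$exit_code" "$reason"
  else
    # Probe was fine but the tool produced no JSON line.
    printf '{"verdict":"%s","exit_code":%d,"services":null,"ledger_rows":null}\n' \
      "$verdict" "$exit_code"
  fi
}

# Example:
#   synth_verdict PROBE_FAILED 3 not_authorized > /tmp/acc2.json
#   synth_verdict TOOL_ERROR 4 > /tmp/acc1.json
```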

4. Digest renders null as ?, not 0

fmt() { jq -r 'if . == null then "?" else tostring end' <<<"$1"; }
a1l=$(fmt "$(jq -r '.ledger_rows' /tmp/acc1.json)")

A new Probe column surfaces the exact failure reason (not_authorized, http_502, network_error, …) when present. "0 rows" keeps its meaning ("writer silent / table empty"). "?" means "we could not ask". The current comment stream on trios#143 would continue to show 0 rows until the Neon hook (follow-up PR) lands — which is honest: the audit pipeline genuinely has zero rows on its own path, separate from bpb_samples.
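
To illustrate the 0 versus ? distinction, here is a jq-free sketch of rendering one digest row. The row layout and the names cell and digest_row are assumptions for illustration, not the actual digest template.

```shell
# Hypothetical: "null" (what `jq -r` prints for JSON null) or an empty
# string means "we could not ask" and renders as "?"; a literal 0 stays 0.
cell() {
  if [[ -z "$1" || "$1" == "null" ]]; then echo "?"; else echo "$1"; fi
}

# Hypothetical row layout: account, services, ledger rows, verdict, probe.
digest_row() {
  local account="$1" services="$2" ledger="$3" verdict="$4" probe="${5:-}"
  printf '| %s | %s | %s | %s | %s |\n' \
    "$account" "$(cell "$services")" "$(cell "$ledger")" "$verdict" "${probe:--}"
}

# Example:
#   digest_row Acc1 6 0 'NOT YET'
#   digest_row Acc2 null null PROBE_FAILED not_authorized
```

The first example keeps "0" because the audit really did count zero rows on its own path; the second shows "?" because nothing was counted at all.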

5. Combined-exit routing keeps DRIFT precise

if   [[ "$x1" == "3" || "$x2" == "3" ]]; then combined=3      # PROBE_FAILED
elif [[ "$x1" == "4" || "$x2" == "4" ]]; then combined=4      # TOOL_ERROR
elif [[ "$x1" == "1" || "$x2" == "1" ]]; then combined=1      # DRIFT
elif [[ "$x1" == "0" && "$x2" == "0" ]]; then combined=0      # GATE-2 PASS
else combined=2                                               # NOT YET
fi

PROBE_FAILED (3) and TOOL_ERROR (4) emit ::warning:: and let the run finish green — they are not DRIFT. Only a real DRIFT still triggers exit 1. The close-on-PASS path is unchanged.
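
The routing described above can be sketched as a final step like the one below. finish and the message wording are hypothetical; ::warning:: and ::error:: are standard GitHub Actions workflow commands.

```shell
# Hypothetical final step: turn the combined code into a workflow outcome.
# Returns 0 (green run) for everything except a real DRIFT.
finish() {
  case "$1" in
    3) echo "::warning::PROBE_FAILED: could not authenticate or reach Railway"; return 0 ;;
    4) echo "::warning::TOOL_ERROR: tri-railway produced no JSON line"; return 0 ;;
    1) echo "::error::DRIFT detected"; return 1 ;;
    *) return 0 ;;   # 0 = GATE-2 PASS, 2 = NOT YET
  esac
}

# Example (as the last line of the step, so its status becomes the step's):
#   finish "$combined"
```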

Scope boundaries (what this PR does not do)

  • Does not fix the root cause of ledger_rows: 0. That requires either passing --ledger-path to a valid JSONL, or teaching tri-railway audit run to read bpb_samples from Neon. Filed as a follow-up once this guardrail merges.
  • Does not touch the tri-railway Rust binary. Existing --json schema stays the same; new verdict strings (PROBE_FAILED, TOOL_ERROR) are emitted by the workflow shell, so external parsers keep working.
  • Does not change concurrency.group: audit-watchdog or the cron cadence.
  • Does not fix #61 (P0: tri-railway-core::RailwayMultiClient, Acc1/Acc2/Acc3 routing, which blocks the #58 Live arm). Instead it surfaces the Acc2-token-scope mismatch honestly as probe=not_authorized until #61 lands.

Test plan

  1. YAML parse: python3 -c "import yaml; yaml.safe_load(open('.github/workflows/audit-watchdog.yml'))" → 14 steps (was 9; added 2 probes + 2 warn steps, rebuilt digest).
  2. Probe JSON shape verified locally: the not_authorized path emits valid JSON with services:null, ledger_rows:null, probe_reason:not_authorized, exit_code:3; digest renders as Services=? Ledger=? Verdict=PROBE_FAILED Probe=not_authorized.
  3. First post-merge :05 UTC run — expected:
    • Acc1 cell → NOT YET, 6 services, 0 rows (unchanged from today, because Neon hook is follow-up).
    • Acc2 cell → probably probe=ok (token valid, project access partial) producing the same NOT YET, 9 services line; OR PROBE_FAILED, probe=not_authorized if the gateway result matches the workflow-scoped secret. Either way the comment is honest about what was asked.
  4. Rollback — single-file change; git revert returns to current behavior.

Refs

phi^2 + phi^-2 = 3 · TRINITY · NEVER STOP

…RIFT)

Closes the 5-day window where audit-watchdog comments on trios#143 said
'verdict: DRIFT, ledger_rows: 0' even when tri-railway never produced a
JSON line at all (token rotated / auth scope wrong / network timeout).

Root cause:
  Both 'Audit Acc1/Acc2' steps had a hardcoded fallback that synthesized
  '{verdict:DRIFT, ledger_rows:0, services:0, exit_code:1}' on missing
  JSON output. This conflated three distinct failure modes:
    - genuine DRIFT (services found, divergence detected)
    - silent writer (services up, but writer not emitting rows)
    - tool/auth failure (we couldn't even ask)
  Reporting all three as 'DRIFT, 0 rows' was an R5 honesty violation.

Fix:
  1. Pre-probe each Railway token via 'query{me{id}}' against
     backboard.railway.com/graphql/v2 BEFORE running tri-railway.
     Probe outcomes: ok / missing_token / not_authorized / network_error
     / http_<code> / unknown_response.
  2. If probe != ok, skip tri-railway and emit:
       {verdict:'PROBE_FAILED', exit_code:3, services:null,
        ledger_rows:null, probe_reason:<reason>}
  3. If probe == ok but tri-railway emits no JSON line:
       {verdict:'TOOL_ERROR', exit_code:4, services:null, ledger_rows:null}
  4. Digest renders null services/ledger_rows as '?' instead of '0'
     (zero has a specific meaning - empty table, writer silent - and
     must not be conflated with 'we could not ask').
  5. Combined-exit logic: 3 (probe) and 4 (tool) are NOT folded into 1
     (DRIFT). New comment block surfaces the probe reason; new warn
     steps log without failing the workflow on probe/tool errors.

Cross-validation (TRI gateway, 2026-05-03 14:30Z):
  - Acc1 IGLA project list: 200 OK, 6 services
  - default project list: GraphQL 'Not Authorized' - matches the
    'Acc2 token rotated since 2026-04-27' hypothesis from #61

Refs:
  - issue #16 (audit-watchdog scope)
  - issue #61 (RailwayMultiClient P0 - the underlying single-token
    limitation that PROBE_FAILED surfaces honestly until #61 lands)
  - trios#143 (race tracker - 5 days of misleading watchdog comments)

phi^2 + phi^-2 = 3 - TRINITY - NEVER STOP

gHashTag commented May 3, 2026

CI status — R5-honest update

The 3 failing checks (build-test, Smoke Test (<60s), Audit DDL Smoke Test) are pre-existing on main as of run 25255968251 (2026-05-02 16:03Z) — same 5 clippy errors in crates/trios-railway-smoke/src/lib.rs (map_unwrap_or, doc_markdown, must_use_candidate).

This PR (#119) does not touch crates/trios-railway-smoke. As stated in the description, the diff is scoped to a single file:

  • .github/workflows/audit-watchdog.yml (new pre-probe steps; PROBE_FAILED / TOOL_ERROR verdicts emitted by the workflow shell)

GitGuardian is green. IGLA Race Smoke Test is SKIPPED (gated, not failed).

Resolution paths

  1. Recommended: merge #64 (feat(gardener): R0 leaderboard-first invariant + 3 BpbSource impls) first. It deletes crates/trios-railway-smoke/ (5 source files including the offending lib.rs), so CI on main becomes green again. Then rebase #119 and CI clears automatically.

  2. Alternative: apply a 3-line clippy fix to crates/trios-railway-smoke/src/lib.rs as a follow-up PR if #64 lands later than expected:

    • line 99: outputs.last().map_or(0.0, |o| o.val_bpb)
    • line 106: backtick parse_step_output in doc comment
    • line 126: add #[must_use] to pub fn run_local

  3. Worst case: allow these specific lints in crates/trios-railway-smoke/Cargo.toml until #64 lands.

This PR's actual changes (watchdog honest reporting) are independent of the failing crate. Recommending option 1 (merge #64 first).
