feat(events): IOJournalRecorder context manager (M3 / slice 2 of #517) #534

Merged: Q00 merged 1 commit into Q00:main from shaun0927:feat/517-2-io-journal-recorder on May 2, 2026

@shaun0927
Collaborator

Summary

Second slice of #517: introduces the IOJournalRecorder async context manager that wraps an outbound LLM or tool call and emits the paired started/returned journal events with shared call_id, duration, hashing, privacy-aware preview shaping, and exception capture.

The recorder makes per-adapter wiring a 4-line addition rather than duplicating factory + hashing + privacy + timing boilerplate at every call site. Subsequent slices (Anthropic / LiteLLM / Claude Code / Codex CLI / Gemini CLI / OpenCode / MCP tool dispatch) just need to pass a recorder where appropriate and call record_completion / record_result inside the context block.

Stack notice. Depends on #532 (slice 1 — I/O Journal foundation).

Usage example

async with recorder.record_llm_call(
    model_id="claude-opus-4",
    prompt_text=prompt_str,
    caller="anthropic_adapter",
    max_tokens=2048,
) as call:
    response = await client.messages.create(**kwargs)
    call.record_completion(
        completion_text=response_text,
        finish_reason="stop",
        token_count_in=120,
        token_count_out=80,
    )

Design choices baked in

  • Always opt-in. event_store=None produces a no-op recorder that yields the same shape but emits nothing. Adapters that have not yet adopted the journal pass None and continue to work unchanged.
  • The recorder owns the boilerplate. call_id, timing (duration_ms), prompt/result hashing (sha256:), and privacy-aware preview shaping all happen inside the recorder. Callers only provide payload text and metadata.
  • Exception-aware. On an exception inside the context block the recorder re-raises but still emits the paired *.returned event with is_error=True and the exception type name in error_kind. Projections see the failure rather than a half-open call.
  • Observational-first. A broken EventStore (append() raises) does NOT propagate the failure. The journal stays out of the way; LLM/tool calls never fail because the recorder could not persist.
  • Duck-typed event_store. Protocol with one append(event) coroutine. No import from ouroboros.persistence, so the helper is cheap to test with a list-backed fake.
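
The design choices above can be illustrated with a minimal sketch. This is not the code in io_recorder.py: `SketchRecorder`, `CallHandle`, `ListStore`, the event dictionaries, and the `llm.call.requested` type string are all hypothetical stand-ins, and privacy-aware preview shaping is left out for brevity.

```python
import asyncio
import hashlib
import time
import uuid
from contextlib import asynccontextmanager


class ListStore:
    """List-backed fake store; `append` is the only method the recorder needs."""

    def __init__(self):
        self.events = []

    async def append(self, event):
        self.events.append(event)


class CallHandle:
    """Handle yielded to the caller; collects completion fields."""

    def __init__(self):
        self.fields = {}

    def record_completion(self, **fields):
        self.fields.update(fields)


class SketchRecorder:
    def __init__(self, event_store=None):
        self._store = event_store  # None: same API shape, nothing emitted

    async def _append(self, event):
        if self._store is None:
            return
        try:
            await self._store.append(event)
        except Exception:
            pass  # observational-first: a broken store never fails the call

    @asynccontextmanager
    async def record_llm_call(self, *, model_id, prompt_text, caller):
        call_id = uuid.uuid4().hex
        digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
        await self._append({
            "type": "llm.call.requested",  # assumed event name
            "call_id": call_id,
            "model_id": model_id,
            "caller": caller,
            "prompt_hash": f"sha256:{digest}",
        })
        handle = CallHandle()
        started = time.monotonic()
        error_kind = None
        try:
            yield handle
        except Exception as exc:
            error_kind = type(exc).__name__
            raise  # the caller still sees the original exception
        finally:
            # Emitted on success and on failure, so the pair is never half-open.
            await self._append({
                "type": "llm.call.returned",
                "call_id": call_id,
                "duration_ms": int((time.monotonic() - started) * 1000),
                "is_error": error_kind is not None,
                "error_kind": error_kind,
                **handle.fields,
            })
```

Passing `ListStore()` gives a test double that exposes the emitted events as a plain list; passing `None` keeps the same call shape while emitting nothing.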

Changes

  • src/ouroboros/events/io_recorder.py — new module, ~330 LOC.
  • tests/unit/events/test_io_recorder.py — 8 cases.

Verification

| Check | Result |
|-------|--------|
| uv run ruff check src/ouroboros/events/io_recorder.py tests/unit/events/test_io_recorder.py | clean (after import-order auto-fix) |
| uv run ruff format ... | no diff |
| uv run pytest tests/unit/events/test_io_recorder.py tests/unit/events/test_io_events.py | 38 passed |

Pre-merge checklist

  • record_llm_call emits paired requested/returned with shared call_id
  • Completion fields filled by the caller propagate to the returned event
  • Exception inside the block → is_error=True + error_kind populated, exception re-raised
  • record_tool_call mirrors the LLM shape for tool dispatch
  • event_store=None → no events emitted (recorder is idle)
  • Privacy switch resolved from env or explicit override
  • Broken EventStore does not propagate (observational-first)
  • No emission site introduced — adapters adopt in follow-up PRs
  • CI: ruff + format + targeted pytest all green

Post-merge checklist

Rollback

The change is purely additive: a new module + tests. No existing behaviour or schema is touched. Rollback steps:

  1. Revert this PR. The new helper disappears; no follow-up adapter PR has merged yet, so no caller depends on it.
  2. No data, schema, or runtime behaviour change to undo.

Stack: depends on #532.

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 5d7eb35 for PR #534

Review record: f5b49864-a309-4821-97c0-b595356c7b8e

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/events/io_recorder.py:348 | BLOCKING | _append() swallows every append failure, and record_llm_call() / record_tool_call() persist the start and return events as two independent best-effort writes. If the first append fails and the second later succeeds, the journal will contain an orphaned *.returned event with no matching request/start payload. That breaks the pairing invariant this module documents and defeats the “reconstructable from the journal alone” contract on transient EventStore failures. The current tests only cover “both appends fail” and miss this partial-write case. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

None.

Design Notes

The event factories are straightforward and the recorder API is a reasonable adapter boundary, but the persistence strategy undermines the journal’s core reliability goal. Pair integrity needs to be preserved explicitly, or partial persistence needs to be surfaced and handled rather than silently ignored.
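
One way to preserve pair integrity is to remember whether the start event was persisted and to skip the return event when it was not. The sketch below is illustrative only; `FlakyStore`, `try_append`, and `paired_call` are hypothetical names, and the fix that actually landed in this PR may differ.

```python
import asyncio
from contextlib import asynccontextmanager


class FlakyStore:
    """Fake store whose first append fails and whose later appends succeed."""

    def __init__(self):
        self.events = []
        self._calls = 0

    async def append(self, event):
        self._calls += 1
        if self._calls == 1:
            raise ConnectionError("transient failure")
        self.events.append(event)


async def try_append(store, event):
    """Best-effort append that reports success instead of hiding the outcome."""
    try:
        await store.append(event)
    except Exception:
        return False
    return True


@asynccontextmanager
async def paired_call(store, call_id):
    start_persisted = await try_append(
        store, {"type": "tool.call.started", "call_id": call_id}
    )
    try:
        yield
    finally:
        # Only write the return event when its start event made it to the store,
        # so the journal never holds an orphaned *.returned entry.
        if start_persisted:
            await try_append(
                store, {"type": "tool.call.returned", "call_id": call_id}
            )
```

With `FlakyStore`, the first wrapped call leaves the journal empty rather than half-written, and the second call produces a complete pair.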


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927
Collaborator Author

Follow-up architectural decisions for #517 slices 5–10

This PR (#534) is the shared primitive every per-adapter migration consumes — IOJournalRecorder. Slices 3 (Anthropic) and 4 (LiteLLM) have already landed in #535 / #536 and proved the wiring shape. The remaining slices below all reuse the same pattern, but each adapter has its own set of small architectural decisions that benefit from being on the record.

This PR (slice 2) is purely additive — it ships the recorder helper plus tests. The decisions below land in subsequent slices, not in this PR.


Slice 5 — Claude Code adapter (claude_code_adapter.py)

Goal. ClaudeCodeAdapter wraps every outbound call in IOJournalRecorder.record_llm_call, identical pattern to the Anthropic slice (#535).

Open decisions:

  1. Where is "the LLM call" in Claude Code? Unlike the direct Anthropic SDK, Claude Code drives a CLI subprocess. The "call boundary" is debatable:

    • Option A: Wrap each subprocess invocation (one journal entry per CLI run).
    • Option B: Parse the streamed Claude Code transcript and emit one entry per turn.
    • Recommendation: A. The subprocess invocation is the natural atomic unit; turn-level granularity is a follow-up if usage demands.
  2. Prompt hashing. Claude Code sends a system prompt + tool catalog + history on each call. Hash the entire kwargs envelope, or only the user-facing prompt?

    • Recommendation: hash the kwargs envelope. Identical kwargs across runs collapse to the same hash; history changes are part of "this is a different call".
  3. Token counts. Claude Code's CLI does not always echo usage in a stable shape. Open: read the SDK-style usage from the subprocess return, or accept None?
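
The recommendation for decision 2 (hash the kwargs envelope) could look roughly like the sketch below. `hash_envelope` is a hypothetical helper, not part of the recorder; it assumes the envelope is JSON-serialisable apart from values that can safely fall back to `str()`, and it reuses the `sha256:` prefix mentioned in the PR description.

```python
import hashlib
import json


def hash_envelope(kwargs: dict) -> str:
    """Hash a call's kwargs envelope in a key-order-independent way."""
    canonical = json.dumps(
        kwargs,
        sort_keys=True,          # identical kwargs collapse to the same hash
        separators=(",", ":"),   # no whitespace differences
        ensure_ascii=False,
        default=str,             # assumption: non-JSON values are stringified
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the history is part of the envelope, appending one turn yields a different hash, which matches the "history changes are a different call" reasoning.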


Slice 6 — Codex CLI adapter (codex_cli_adapter.py)

Goal. Same as slice 5 but for Codex.

Open decisions:

  1. Coordinating with #505. PR #505 (open; "feat(orchestrator): Agent OS runtime_profile (Codex backend, supersedes #488)") introduces runtime_profile for Codex. The journal recorder integration must coexist:

  2. Streaming responses. Codex CLI streams; the recorder needs a single completion text snapshot at the end. Open: should the recorder also expose a streaming record_token API (Tier-3) or stick with one snapshot?

    • Recommendation: single snapshot for slice 6. Streaming hooks are a follow-up if downstream projectors need per-token granularity.
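
The single-snapshot recommendation amounts to draining the stream and recording the joined text once. A minimal sketch, assuming the CLI output is exposed as an async iterator of text chunks (`collect_snapshot` and `fake_stream` are hypothetical names):

```python
import asyncio


async def collect_snapshot(chunks) -> str:
    """Drain a stream of text chunks into one completion snapshot."""
    parts = []
    async for chunk in chunks:
        parts.append(chunk)
    return "".join(parts)


async def fake_stream():
    """Stand-in for a streaming CLI response."""
    for chunk in ["Hel", "lo", "!"]:
        yield chunk
```

The resulting string would then be passed once to `record_completion(completion_text=...)` inside the recorder's context block.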

Slice 7 — Gemini CLI adapter (gemini_cli_adapter.py)

Goal. Same shape as slice 5 / 6.

Open decisions:

  1. finish_reason vocabulary. Gemini's vocabulary differs from Anthropic / OpenAI ("STOP", "MAX_TOKENS", "SAFETY", …). Per the foundation slice (#532), finish_reason is opaque: provider strings pass through verbatim. The Gemini slice must not normalise.

  2. Tool-call interleaving. Gemini CLI mixes function calls and text in the response. The recorder's completion_text is a single string; multimodal/tool blocks need a serialisation choice.


Slice 8 — OpenCode adapter (opencode_adapter.py)

Goal. Same shape; the trickiest of the four because OpenCode has both a plugin mode and a subprocess mode (per should_dispatch_via_plugin).

Open decisions:

  1. Per-mode wiring. Should the recorder wrap the plugin path, the subprocess path, or both?

    • Option A: Subprocess path only — the plugin path emits its own envelopes that already become _subagent events.
    • Option B: Both paths so the journal is mode-agnostic.
    • Recommendation: A. The plugin envelope is a different abstraction; double-wrapping introduces redundant events.
  2. Caller string. When the adapter is invoked from a _subagent envelope, the recorder's caller field could be "opencode_adapter" (mechanical) or "opencode_subagent" (semantic). Pin one convention so projections can group consistently.


Slice 9 — MCP tool dispatch (mcp/tools/execution_handlers.py or central dispatch)

Goal. Wrap every tool dispatch in IOJournalRecorder.record_tool_call so tool.call.started / tool.call.returned events land alongside llm.call.*.

Open decisions:

  1. Where is "the dispatch boundary"? MCP tool calls flow through several layers (handler → bridge → MCP server → tool). The recorder needs one canonical wrap point. Which?

    • Option A: At the handler level (one wrap per tools/call invocation; misses inter-handler tool calls).
    • Option B: At the bridge level (every external MCP call wraps).
    • Recommendation: B. Aligns with the M3 north-star "explain a retry from the journal alone" — every outbound call must be journaled.
  2. tool_name taxonomy. Should tool_name be the bare MCP name (fs.read), the prefixed name (<server>.fs.read), or both (one canonical, one as extra)?

    • Recommendation: prefixed canonical; bare in extra for backward compat. Aligns with how mcp_tool_prefix already labels tools elsewhere.
  3. Args sanitisation. args_preview may include sensitive paths or credentials. The privacy switch from #532 already covers this, but the dispatch path must respect operator config. Open: does the recorder receive the args verbatim and let the privacy switch handle redaction, or sanitise before passing?

    • Recommendation: verbatim. Privacy is a single-source-of-truth concern; the dispatch should not double-sanitise.
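
The tool_name recommendation in decision 2 (prefixed canonical, bare name in extra) could be sketched as below. `qualify_tool_name` and the `bare_tool_name` key are hypothetical; only the `<server>.fs.read` shape comes from the text above.

```python
def qualify_tool_name(server: str, bare_name: str) -> dict:
    """Build the canonical prefixed tool name and keep the bare name in extra."""
    return {
        "tool_name": f"{server}.{bare_name}",    # canonical: <server>.<bare name>
        "extra": {"bare_tool_name": bare_name},  # assumed key, kept for backward compat
    }
```

A projection that groups on `tool_name` then sees one stable key per server and tool, while older consumers can still read the bare name.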

Slice 10 — M3 acceptance scenario (closes #517)

Goal. A single integration test that pins down the M3 invariant: "Why did the evaluator retry?" must be answerable from the journal alone, with no log files needed.

Open decisions:

  1. Fixture shape. What workflow does the fixture run?

    • Option A: A scripted fake evaluator that retries deterministically, exercises both LLM and tool calls, and asserts the journal contains the expected story.
    • Option B: A real evaluation against a small repo, with the LLM call mocked to fail-then-succeed.
    • Recommendation: A. Determinism > fidelity for an acceptance test.
  2. Assertion shape. What does the test assert?

    • Option A: Event count + types in order.
    • Option B: Each retry has a control.directive.emitted RETRY adjacent to an llm.call.returned with the matching prompt_hash.
    • Recommendation: B. That's the M3 north star written as code.
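
Assertion shape B could be written along these lines. The sketch treats events as plain dictionaries and reads "adjacent" as "immediately preceding"; the `directive` and `prompt_hash` fields on the directive event are assumptions, not a confirmed schema.

```python
def assert_retries_explained(events: list) -> None:
    """Check that every RETRY directive directly follows a matching llm.call.returned."""
    for index, event in enumerate(events):
        is_retry = (
            event["type"] == "control.directive.emitted"
            and event.get("directive") == "RETRY"
        )
        if not is_retry:
            continue
        assert index > 0, "RETRY directive has no preceding event"
        previous = events[index - 1]
        assert previous["type"] == "llm.call.returned", (
            f"RETRY at position {index} is not adjacent to llm.call.returned"
        )
        assert previous["prompt_hash"] == event["prompt_hash"], (
            f"RETRY at position {index} does not match the preceding prompt_hash"
        )
```

A journal with no RETRY directives passes trivially, so the acceptance fixture would also need to assert that at least one retry occurred.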

Decision summary table

| Slice | Hard dependency | Open architectural decisions |
|-------|-----------------|------------------------------|
| 5 (Claude Code) | #532 + #534 | (1) call boundary (subprocess vs turn), (2) prompt hash payload, (3) token-count parsing |
| 6 (Codex CLI) | #532 + #534 + #505 | (1) runtime_profile model_id, (2) streaming vs snapshot |
| 7 (Gemini CLI) | #532 + #534 | (1) finish_reason opacity (already locked), (2) multimodal serialisation |
| 8 (OpenCode) | #532 + #534 | (1) plugin vs subprocess wrap, (2) caller string convention |
| 9 (MCP tool dispatch) | #532 + #534 | (1) wrap layer (handler / bridge), (2) tool_name taxonomy, (3) args sanitisation policy |
| 10 (acceptance) | All of 1–9 | (1) fixture shape, (2) assertion shape |

This PR (slice 2) does not require any of the above to be decided yet. The recorder is intentionally provider-agnostic; each adapter slice picks its answers locally.


If maintainers prefer this folded into the PR body itself rather than a follow-up comment, happy to push an edit.

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 118e19b for PR #534

Review record: f101ba22-486a-4f8a-bd86-803e1878efcd

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/events/io_recorder.py:199 | BLOCKING | IOJournalRecorder shapes previews before calling the event factories, but each factory reshapes the same field again (src/ouroboros/events/io.py:282, 335, 390, 455). That breaks the recorder contract in two production cases: a caller-supplied preview_cap above 256 is silently truncated back to the factory default, and an explicit recorder privacy override like privacy=PrivacyMode.ON is ignored if the process env is OUROBOROS_IO_JOURNAL_PREVIEWS=redacted because the factory applies env-based redaction a second time. The current tests only cover the privacy=OFF case, which passes because None survives the second pass, so this regression would ship unnoticed. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

None.

Design Notes

The split between pure event factories and an async recorder helper is sensible and keeps adapter integrations small. The main architectural flaw is that preview/privacy policy is implemented at both layers, which creates conflicting sources of truth.
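
A single source of truth for preview policy can be sketched as "resolve once, shape once, store verbatim". Everything below is hypothetical (`PreviewPolicy`, `resolve_policy`, `shape_preview`, `build_returned_event`); only the environment variable name, the `redacted` value, and the 256-character default come from the finding above, and treating any other env value as "show previews" is an assumption.

```python
import os
from enum import Enum

ENV_VAR = "OUROBOROS_IO_JOURNAL_PREVIEWS"


class PreviewPolicy(Enum):
    SHOW = "show"
    REDACT = "redacted"


def resolve_policy(override=None, environ=None):
    """An explicit override wins; the environment is only the fallback."""
    if override is not None:
        return override
    environ = os.environ if environ is None else environ
    if environ.get(ENV_VAR) == "redacted":
        return PreviewPolicy.REDACT
    return PreviewPolicy.SHOW


def shape_preview(text, policy, cap=256):
    """Apply redaction and truncation exactly once, at the recorder layer."""
    if policy is PreviewPolicy.REDACT:
        return None
    return text[:cap]


def build_returned_event(call_id, preview):
    """Factory stores the already-shaped preview verbatim; no second pass."""
    return {
        "type": "llm.call.returned",
        "call_id": call_id,
        "completion_preview": preview,
    }
```

With this split, a caller-supplied cap of 512 and an explicit `PreviewPolicy.SHOW` both survive even when the process environment requests redaction.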


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927
Collaborator Author

Autopilot follow-up: pushed review-addressing commits and confirmed the GitHub check suite is green on the latest head. Current merge state is clean; remaining CHANGES_REQUESTED status appears to be awaiting fresh bot/human re-review rather than failing CI.

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 9461344 for PR #534

Review record: ff043e7d-fc6d-4cb2-98cd-74dd62e7f634

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/events/io_recorder.py:222 | BLOCKING | is_error is derived only from an exception escaping the context manager, but the yielded record objects do not let callers explicitly mark a handled failure. That means any adapter/tool dispatcher that catches an exception, receives a provider-level error response, or otherwise continues after a failed call will still emit llm.call.returned / tool.call.returned with is_error=False (and for LLM there is no error_kind field at all). Since this recorder is the shared primitive for later adapter migrations, it will silently misclassify real failed calls as successes and break the journal’s stated goal of explaining retries from persisted events alone. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

None.

Design Notes

The split between pure event factories and the async recorder helper is clean, and the privacy/hash/timing concerns are well isolated. The main gap is contract expressiveness: the recorder currently models only “success” or “uncaught exception,” which is too narrow for a foundation layer that future adapters are supposed to share.
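
A handle that can express a handled failure might look like the sketch below. It is synchronous for brevity and uses hypothetical names (`CallHandle`, `record_call`); it illustrates the three outcomes (success, handled error, uncaught exception) rather than the recorder's real API.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallHandle:
    is_error: bool = False
    error_kind: Optional[str] = None

    def record_error(self, error_kind: str) -> None:
        """Mark a failure the caller handled itself (no exception escapes)."""
        self.is_error = True
        self.error_kind = error_kind


@contextmanager
def record_call(events: list, call_id: str):
    handle = CallHandle()
    try:
        yield handle
    except Exception as exc:
        # An escaping exception always marks the call as failed.
        handle.record_error(type(exc).__name__)
        raise
    finally:
        events.append({
            "type": "llm.call.returned",
            "call_id": call_id,
            "is_error": handle.is_error,
            "error_kind": handle.error_kind,
        })
```

An adapter that catches a provider timeout and continues would call `call.record_error("TimeoutError")` inside the block, so the journal records the failure even though nothing propagates.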


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927
Collaborator Author

Autopilot follow-up: latest head 4b4370f includes explicit handled-error recording support (record_error / is_error / error_kind) with local recorder tests passing. Requesting fresh bot review on current head.

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit d65f11d for PR #534

Review record: ea93fada-cc2c-4c4a-b8aa-1399463980e4

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Follow-up Findings

  • src/ouroboros/events/io_recorder.py:331 [warning] record_tool_call's exception path is part of the recorder's core contract, but this PR never tests that a raised tool exception both propagates and still emits tool.call.returned with is_error=True and error_kind. Because the tool path is duplicated rather than shared with the LLM path, this can regress silently and leave the journal unable to explain tool-call failures, which is exactly the M3 invariant this primitive is supposed to guarantee.

Non-blocking Suggestions

| # | File:Line | Category | Suggestion |
|---|-----------|----------|------------|
| 1 | src/ouroboros/events/io_recorder.py:108 | Refactoring | record_error() on both record handle types cannot attach extra metadata, even though the returned events support extra. If adapters later need structured handled-error context, callers will have to mutate .extra directly. |

Design Notes

The slice is well-factored overall: schema shaping lives in events/io.py, and the recorder cleanly centralizes hashing, privacy, timing, and best-effort persistence. The main design risk is drift between the duplicated LLM and tool recorder paths, which makes mirrored failure-path tests especially important.
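
A mirrored failure-path check can be written once and applied to both recorder paths. The sketch below uses a toy `make_recorder` stand-in rather than the real `record_llm_call` / `record_tool_call`, and the start event type strings are assumptions; the point is the shared `check_error_path` helper.

```python
import asyncio
from contextlib import asynccontextmanager


def make_recorder(events, start_type, returned_type):
    """Toy stand-in for a recorder method; appends dict events to a list."""

    @asynccontextmanager
    async def record(call_id):
        events.append({"type": start_type, "call_id": call_id})
        error_kind = None
        try:
            yield
        except Exception as exc:
            error_kind = type(exc).__name__
            raise
        finally:
            events.append({
                "type": returned_type,
                "call_id": call_id,
                "is_error": error_kind is not None,
                "error_kind": error_kind,
            })

    return record


async def check_error_path(record, events, returned_type):
    """Assert the exception propagates and the returned event is still emitted."""
    try:
        async with record("c1"):
            raise TimeoutError("boom")
    except TimeoutError:
        pass
    else:
        raise AssertionError("exception was swallowed")
    last = events[-1]
    assert last["type"] == returned_type
    assert last["is_error"] is True
    assert last["error_kind"] == "TimeoutError"
```

Running the same helper against the LLM path and the tool path keeps the two from drifting apart unnoticed.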

Policy Notes

  • No in-scope blocking findings remained after policy filtering; downgraded verdict accordingly.

Recovery Notes

First recoverable review artifact generated from codex analysis log.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

Owner

@Q00 Q00 left a comment

Reviewed IOJournalRecorder. The context-manager API keeps adapter integrations small, pairs started/returned events cleanly, and handles error paths without creating orphan returned events.

@Q00 Q00 force-pushed the feat/517-2-io-journal-recorder branch from d65f11d to 0b4da0b on May 2, 2026 19:52
@Q00 Q00 merged commit 72f63b5 into Q00:main May 2, 2026
6 checks passed

Development

Successfully merging this pull request may close these issues.

M3 I/O Journal — tool.call.* and llm.call.* event categories

2 participants