feat(front-door): three-arm SWE-bench docs-on/off ablation harness (EVAL.md §2)#16

Closed
mcp-tool-shop wants to merge 1 commit into main from feat/front-door-swe-ablation


@mcp-tool-shop (Member)

What

Implements the experiment documented in cli/front-door/EVAL.md §2 — does a good front door actually help a coding agent, and at what cost? — as a runnable harness. This is distinct from the verifier self-eval in eval.mjs (which only proves the verifier behaves correctly).

Design (per EVAL.md)

Three arms, with the agent / model / tool-schema / prompts PINNED so only the front door varies:

  • A — repo as-is (baseline)
  • B — repo + front-door (real-world delta)
  • C — docs-stripped repo + front-door (isolates the front door's marginal contribution)
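One way to picture the design is a single pinned configuration shared by every arm, plus two per-arm switches. This is a minimal sketch; `PinnedConfig`, `ArmSpec`, and `buildArms` are hypothetical names for illustration, not the harness's actual API.

```typescript
// Hypothetical types for illustration; the real harness may model this differently.
interface PinnedConfig {
  readonly agent: string;
  readonly model: string;
  readonly toolSchemaHash: string;
  readonly promptHash: string;
}

interface ArmSpec {
  readonly id: "A" | "B" | "C";
  readonly stripDocs: boolean;    // remove the repo's existing docs before the run
  readonly addFrontDoor: boolean; // add the front door before the run
  readonly pin: PinnedConfig;     // identical object in every arm
}

// Every arm receives the same pin, so only the two switches differ between arms.
function buildArms(pin: PinnedConfig): ArmSpec[] {
  return [
    { id: "A", stripDocs: false, addFrontDoor: false, pin },
    { id: "B", stripDocs: false, addFrontDoor: true, pin },
    { id: "C", stripDocs: true, addFrontDoor: true, pin },
  ];
}
```

Sharing one `pin` object across the arms makes an accidental per-arm change to the agent, model, tool schema, or prompts easy to detect with a reference-equality check.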

Grading is by code execution against hidden fail-to-pass and pass-to-pass tests (Jimenez et al. arXiv:2310.06770), never by an LLM grade, and the agent never sees the hidden tests. The harness reports resolution rate (Wilson CI, primary) and cost (tokens, steps), with paired McNemar tests, bootstrapped cost deltas, and a power section sized for a ~2-4pp effect. Gloaguen et al. (arXiv:2602.11988) found that added context can lower success and add 20-23% cost, so the effect is measured, not assumed. Any qualitative call would use a different model family with the generator's reasoning hidden (Huang et al. arXiv:2310.01798; Kambhampati et al. arXiv:2402.01817).
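The execution-graded rule can be sketched as follows: an instance counts as resolved only if every hidden fail-to-pass test and every hidden pass-to-pass test passes after the agent's patch is applied. The types and the `isResolved` name are assumptions for illustration, not the harness's grade module.

```typescript
// Hypothetical shape: test id -> did the test pass after the agent's patch was applied?
type TestResults = Readonly<Record<string, boolean>>;

interface HiddenTests {
  readonly failToPass: readonly string[]; // failed before the fix, must pass after it
  readonly passToPass: readonly string[]; // passed before the fix, must still pass
}

// A test that is missing from the results counts as a failure, so a crashed or
// partial test run can never be graded as resolved.
function isResolved(hidden: HiddenTests, results: TestResults): boolean {
  const passed = (id: string): boolean => results[id] === true;
  return hidden.failToPass.every(passed) && hidden.passToPass.every(passed);
}
```

Treating a missing result as a failure is a conservative choice made for this sketch; it keeps the grade from depending on anything other than observed test outcomes.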

Scope — honest partial run

The harness is complete and runnable. A synthetic, execution-graded self-validation (n=60) ran end-to-end with real code execution and is published as a receipt (receipts/front-door-ablation.{json,md}):

| Arm | Resolved | Rate (95% Wilson CI) | Mean tokens |
| --- | --- | --- | --- |
| A: repo as-is | 38/60 | 63.3% [50.7, 74.4] | 30.4 |
| B: + front-door | 53/60 | 88.3% [77.8, 94.2] | 42.4 |
| C: docs-stripped + front-door | 42/60 | 70.0% [57.5, 80.1] | 35.6 |
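For readers who want to check the intervals, the Wilson score interval can be computed as below. This is a minimal sketch using the standard formula; `wilsonInterval` is a hypothetical name and is not taken from the harness's stats module.

```typescript
// Wilson score interval for a binomial proportion.
// z defaults to the two-sided 95% normal quantile.
function wilsonInterval(successes: number, n: number, z = 1.959964): [number, number] {
  if (n <= 0) throw new RangeError("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}
```

`wilsonInterval(38, 60)` gives roughly [0.507, 0.744], in line with the 63.3% [50.7, 74.4] row for arm A.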

Paired deltas: A→B +25.0pp (exact McNemar p≈0.0001, +12 tokens/instance overhead); B→C −18.3pp (strip risk); A→C +6.7pp (not significant).
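The exact McNemar test looks only at discordant pairs: instances resolved in one arm but not the other. A minimal sketch follows, assuming a two-sided exact binomial test; `mcnemarExact` is a hypothetical name, and the receipt's actual discordant counts are not quoted in this PR.

```typescript
// Two-sided exact McNemar test.
// b = pairs resolved only in the second arm, c = pairs resolved only in the first arm.
// Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
function mcnemarExact(b: number, c: number): number {
  const n = b + c;
  if (n === 0) return 1; // no discordant pairs: no evidence either way
  const k = Math.min(b, c);
  let term = Math.pow(0.5, n); // C(n, 0) * 0.5^n
  let tail = term;
  for (let i = 1; i <= k; i++) {
    term = (term * (n - i + 1)) / i; // C(n, i) * 0.5^n from the previous term
    tail += term;
  }
  return Math.min(1, 2 * tail);
}
```

As an illustration with an assumed split (not taken from the receipt), 15 instances gained and none lost gives `mcnemarExact(15, 0)` ≈ 6.1e-5.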

This proves the machinery, NOT the real front-door effect. The receipt carries measuresFrontDoorEffect: false, and the n=60 run is deliberately underpowered (achieved power 0.07 for a 3pp effect; a powered run needs N≈2178). The real-effect run needs SWE-bench Verified, an external agent, per-instance containers, and a model. Those seams are shipped (parseSweBenchVerified, makeCommandAgent, makeCommandExecutor), and the CLI prints the contract and exits rather than faking a number. The full real-run contract, what was and wasn't covered, standards compliance, and compensators are in cli/front-door/ABLATION.md.
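The power figures can be approximated with a normal-approximation calculation for paired proportions. Everything in this sketch is an assumption rather than the harness's code: the formula choice, a two-sided α of 0.05, a target power of 0.80, and a discordant-pair rate `psi` of 0.25.

```typescript
const Z_ALPHA = 1.959964; // two-sided alpha = 0.05 (assumed)
const Z_BETA = 0.841621;  // target power = 0.80 (assumed)

// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation.
function normalCdf(x: number): number {
  const u = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * u);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  const erf = 1 - poly * Math.exp(-u * u);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Pairs needed to detect a paired difference `delta` when a fraction `psi`
// of pairs is discordant.
function requiredPairs(delta: number, psi: number): number {
  const root = Z_ALPHA * Math.sqrt(psi) + Z_BETA * Math.sqrt(psi - delta * delta);
  return Math.ceil((root * root) / (delta * delta));
}

// Approximate power achieved with n pairs under the same assumptions.
function achievedPower(n: number, delta: number, psi: number): number {
  const zScore =
    (delta * Math.sqrt(n) - Z_ALPHA * Math.sqrt(psi)) / Math.sqrt(psi - delta * delta);
  return normalCdf(zScore);
}
```

Under these assumptions `requiredPairs(0.03, 0.25)` returns 2178 and `achievedPower(60, 0.03, 0.25)` is about 0.07, which lines up with the figures quoted above. That agreement does not confirm the harness uses this exact formula.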

Verification

  • 56 new tests in tests/front-door/ablation/; full suite 185/185 pass
  • biome check clean on new files (LF in index); tsc --noEmit clean
  • Receipt is byte-replayable (deterministic agent + seeded bootstrap)
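Replayability of the bootstrap depends on a seeded random source rather than `Math.random`. The sketch below uses a percentile bootstrap over paired per-instance cost deltas (for example, tokens in arm B minus tokens in arm A). The mulberry32 generator and the name `bootstrapMeanDeltaCI` are assumptions for illustration; the harness may use a different generator or interval method.

```typescript
// mulberry32: a small deterministic PRNG returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// 95% percentile bootstrap CI for the mean of paired deltas.
// The same seed and the same input always produce the same interval.
function bootstrapMeanDeltaCI(
  deltas: readonly number[],
  seed: number,
  resamples = 2000,
): [number, number] {
  if (deltas.length === 0) throw new RangeError("need at least one paired delta");
  const rand = mulberry32(seed);
  const n = deltas.length;
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += deltas[Math.floor(rand() * n)] ?? 0; // resample with replacement
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * (resamples - 1))] ?? NaN;
  const hi = means[Math.ceil(0.975 * (resamples - 1))] ?? NaN;
  return [lo, hi];
}
```

Because the generator state is derived only from the seed, rerunning with the same seed and input reproduces the interval exactly, which is the property a byte-replayable receipt needs.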

Note on overlap

The same harness commit is also the foundation of the separately-developed feat/front-door-v2.1 branch (which adds doctest + MCP slices and a v2.1.0 release). This PR is the focused, single-purpose ablation deliverable; coordinate merge order as preferred.

🤖 Generated with Claude Code

…VAL.md §2)

Implements the experiment documented in cli/front-door/EVAL.md §2 — does a good
front door actually help a coding agent, and at what cost? — as a runnable
harness, distinct from the verifier self-eval in eval.mjs.

Three arms with the agent/model/tool-schema/prompts PINNED so only the front
door varies: A repo as-is, B repo + front-door, C docs-stripped + front-door.
Grading is code execution (hidden fail-to-pass + pass-to-pass), never an LLM
grade; the agent never sees the hidden tests. Reports resolution rate (Wilson)
and cost (tokens, steps), with paired McNemar + bootstrapped cost deltas and a
power section sized for a ~2-4pp effect.

cli/front-door/ablation/: pin (PIN_PER_STEP + andon + cross-family judge),
arms, grade, execute (node + command executors), agent (deterministic lookup
reference agent + real external-agent hook), corpus (synthetic micro-bench +
SWE-bench Verified parser), stats (probit/normalCdf/Wilson/McNemar/power/
seeded bootstrap), runner, CLI (front-door ablation run).

Partial run, honestly scoped: a synthetic, execution-graded self-validation
(n=60) ran end-to-end with real code execution and is published as a receipt
(receipts/front-door-ablation.{json,md}). It proves the machinery, NOT the
real front-door effect — the receipt carries measuresFrontDoorEffect:false and
the n=60 run is deliberately underpowered (achieved power 0.07 for 3pp; needs
N~2178). The real-effect run needs SWE-bench Verified + an external agent +
per-instance containers + a model; those seams are shipped and the CLI prints
the contract + exits rather than faking a number. See cli/front-door/ABLATION.md
for the real-run contract, what was/wasn't covered, and standards compliance.

56 new tests; vitest run + lint clean (new files LF in index).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@mcp-tool-shop (Member, Author)

Superseded by #15 (feat/front-door-v2.1): its ablation content is byte-identical to this PR (verified empty diff across cli/front-door/ablation, ABLATION.md, EVAL.md). Closing to avoid a duplicate merge; the ablation harness lands via #15.
