feat(front-door): three-arm SWE-bench docs-on/off ablation harness (EVAL.md §2) by mcp-tool-shop · Pull Request #16 · mcp-tool-shop-org/site-theme

mcp-tool-shop · 2026-06-16T09:22:02Z

What

Implements the experiment documented in cli/front-door/EVAL.md §2 — does a good front door actually help a coding agent, and at what cost? — as a runnable harness. This is distinct from the verifier self-eval in eval.mjs (which only proves the verifier behaves correctly).

Design (per EVAL.md)

Three arms, with the agent / model / tool-schema / prompts PINNED so only the front door varies:

A — repo as-is (baseline)
B — repo + front-door (real-world delta)
C — docs-stripped repo + front-door (isolates the front door's marginal contribution)
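
A minimal sketch of how the three arms and the shared pin could be represented. The names here (`ArmSpec`, `Pin`, `planRuns`, and the placeholder pin values) are hypothetical and chosen for illustration; the harness's real types in cli/front-door/ablation/ may differ.

```typescript
// Hypothetical shapes for illustration only; not the harness's actual API.
type ArmId = "A" | "B" | "C";

interface ArmSpec {
  id: ArmId;
  stripDocs: boolean;    // remove the repo's existing docs before the run
  addFrontDoor: boolean; // add the front door to the repo
}

// Everything except the repo contents is frozen once and shared by all arms.
interface Pin {
  agent: string;
  model: string;
  toolSchema: string;
  prompts: string;
}

const ARMS: ArmSpec[] = [
  { id: "A", stripDocs: false, addFrontDoor: false }, // repo as-is
  { id: "B", stripDocs: false, addFrontDoor: true },  // repo + front-door
  { id: "C", stripDocs: true, addFrontDoor: true },   // docs-stripped + front-door
];

interface PlannedRun {
  instanceId: string;
  arm: ArmSpec;
  pin: Pin;
}

// Every instance runs under every arm with the same pin object,
// so the arm definition is the only thing that varies.
function planRuns(pin: Pin, instanceIds: string[]): PlannedRun[] {
  const runs: PlannedRun[] = [];
  for (const instanceId of instanceIds) {
    for (const arm of ARMS) {
      runs.push({ instanceId, arm, pin });
    }
  }
  return runs;
}

// Placeholder pin values, not taken from the PR.
const examplePin: Pin = {
  agent: "example-agent",
  model: "example-model",
  toolSchema: "example-schema",
  prompts: "example-prompts",
};
const examplePlan = planRuns(examplePin, ["inst-1", "inst-2"]);
console.log(examplePlan.length);
```

Running each instance under all three arms is also what makes the paired statistics possible, since every instance has one outcome per arm.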

Grading is code execution — hidden fail-to-pass + pass-to-pass (Jimenez et al. arXiv:2310.06770) — never an LLM grade; the agent never sees the hidden tests. Reports resolution rate (Wilson CI, primary) and cost (tokens, steps), with paired McNemar + bootstrapped cost deltas and a power section sized for a ~2-4pp effect (Gloaguen et al. arXiv:2602.11988 found context can lower success and add 20-23% cost — measured, not assumed). Any qualitative call would use a different model family with the generator's reasoning hidden (Huang et al. arXiv:2310.01798; Kambhampati et al. arXiv:2402.01817).
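
The execution-graded rule can be sketched as follows: an instance counts as resolved only if every hidden fail-to-pass test now passes and every pass-to-pass test still passes. The type and function names are assumptions for illustration and are not taken from the harness's grade module.

```typescript
// Hypothetical names; a sketch of the resolution rule, not the harness's code.
type TestOutcome = "pass" | "fail";

interface HiddenTests {
  failToPass: string[]; // tests that fail before the fix and must pass after it
  passToPass: string[]; // tests that passed before and must not regress
}

function isResolved(
  hidden: HiddenTests,
  outcomes: Record<string, TestOutcome>,
): boolean {
  // A test with no recorded outcome is treated as not passing.
  const passed = (name: string): boolean => outcomes[name] === "pass";
  return hidden.failToPass.every(passed) && hidden.passToPass.every(passed);
}

const hiddenExample: HiddenTests = {
  failToPass: ["test_bug_is_fixed"],
  passToPass: ["test_existing_behaviour"],
};

// The fix works but breaks an existing test, so the instance is not resolved.
console.log(
  isResolved(hiddenExample, {
    test_bug_is_fixed: "pass",
    test_existing_behaviour: "fail",
  }),
);
```

Because the verdict is a pure function of test outcomes, no model judgement enters the primary metric.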

Scope — honest partial run

The harness is complete and runnable. A synthetic, execution-graded self-validation (n=60) ran end-to-end with real code execution and is published as a receipt (receipts/front-door-ablation.{json,md}):

Arm	Resolved	Rate (95% Wilson)	Mean tokens
A repo as-is	38/60	63.3% [50.7, 74.4]	30.4
B + front-door	53/60	88.3% [77.8, 94.2]	42.4
C docs-stripped + front-door	42/60	70.0% [57.5, 80.1]	35.6
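
For reference, the intervals in the table are consistent with the standard Wilson score interval at 95%. The sketch below is an independent re-computation, not the harness's stats module; the function name and the hard-coded z value are assumptions.

```typescript
// Wilson score interval for a binomial proportion.
// z defaults to the two-sided 95% normal quantile.
function wilsonInterval(
  successes: number,
  n: number,
  z: number = 1.959964,
): [number, number] {
  if (n <= 0) throw new Error("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [center - half, center + half];
}

// Re-derive the three rows of the table (resolved counts out of 60).
const resolvedByArm: Array<[string, number]> = [["A", 38], ["B", 53], ["C", 42]];
for (const [arm, resolved] of resolvedByArm) {
  const [lo, hi] = wilsonInterval(resolved, 60);
  console.log(`${arm}: [${(lo * 100).toFixed(1)}, ${(hi * 100).toFixed(1)}]`);
}
// A: [50.7, 74.4]
// B: [77.8, 94.2]
// C: [57.5, 80.1]
```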

Paired deltas: A→B +25.0pp (McNemar exact p≈0.0001, at a cost of +12 mean tokens per instance); B→C −18.3pp (the risk of stripping docs); A→C +6.7pp (not significant).
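
A sketch of an exact McNemar test on paired outcomes, using the usual two-sided binomial form on the discordant pairs. This is an assumption about the general technique, not a copy of the harness's stats code. The PR description does not give the discordant counts, so the counts in the usage line are purely illustrative.

```typescript
// Exact two-sided McNemar test.
// b = instances resolved only under the second arm,
// c = instances resolved only under the first arm.
// Under H0 each discordant pair is a fair coin flip, so the p-value is
// 2 * P(X <= min(b, c)) with X ~ Binomial(b + c, 0.5), capped at 1.
function mcnemarExact(b: number, c: number): number {
  const n = b + c;
  if (n === 0) return 1; // no discordant pairs, no evidence either way
  const k = Math.min(b, c);
  let term = Math.pow(0.5, n); // C(n, 0) * 0.5^n
  let tail = term;
  for (let i = 1; i <= k; i++) {
    term = (term * (n - i + 1)) / i; // C(n, i) * 0.5^n from the previous term
    tail += term;
  }
  return Math.min(1, 2 * tail);
}

// Illustrative counts only (not from the receipt): 9 vs 2 discordant pairs.
console.log(mcnemarExact(9, 2));
```

Only discordant pairs carry information here; instances that resolve under both arms or under neither do not affect the p-value.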

This proves the machinery, NOT the real front-door effect. The receipt carries measuresFrontDoorEffect: false, and the n=60 run is deliberately underpowered (achieved power 0.07 for 3pp; needs N≈2178). The real-effect run needs SWE-bench Verified + an external agent + per-instance containers + a model — those seams are shipped (parseSweBenchVerified, makeCommandAgent, makeCommandExecutor) and the CLI prints the contract + exits rather than faking a number. Full real-run contract + what was/wasn't covered + standards-compliance + compensators: cli/front-door/ABLATION.md.

Verification

56 new tests in tests/front-door/ablation/; full suite 185/185 pass
biome check clean on new files (LF in index); tsc --noEmit clean
Receipt is byte-replayable (deterministic agent + seeded bootstrap)
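
To illustrate why a seeded bootstrap makes a receipt replayable, here is a minimal sketch of a paired bootstrap over per-instance cost deltas driven by a seeded generator. The choice of mulberry32, the percentile interval, and all names and sample values are assumptions; the harness may use a different generator and interval.

```typescript
// mulberry32: a small seeded PRNG returning floats in [0, 1).
// Assumed here for illustration; the harness's generator is not specified.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface DeltaEstimate {
  mean: number;
  lo: number;
  hi: number;
}

// Paired bootstrap: resample per-instance deltas (treat - base) with
// replacement and take a percentile interval over the resampled means.
function pairedBootstrapMeanDelta(
  base: number[],
  treat: number[],
  seed: number,
  resamples: number = 2000,
  alpha: number = 0.05,
): DeltaEstimate {
  if (base.length === 0 || base.length !== treat.length) {
    throw new Error("inputs must be non-empty and paired");
  }
  const n = base.length;
  const deltas = base.map((b, i) => treat[i] - b);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hiIndex = Math.ceil((1 - alpha / 2) * resamples) - 1;
  const hi = means[Math.min(resamples - 1, hiIndex)];
  const mean = deltas.reduce((s, d) => s + d, 0) / n;
  return { mean, lo, hi };
}

// Made-up token counts for four paired instances.
const baseTokens = [30, 28, 33, 31];
const treatTokens = [41, 44, 40, 45];
console.log(pairedBootstrapMeanDelta(baseTokens, treatTokens, 42));
```

With the seed fixed, the same inputs always yield the same interval, which is the property a byte-replayable receipt depends on.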

Note on overlap

The same harness commit is also the foundation of the separately-developed feat/front-door-v2.1 branch (which adds doctest + MCP slices and a v2.1.0 release). This PR is the focused, single-purpose ablation deliverable; coordinate merge order as preferred.

🤖 Generated with Claude Code

…VAL.md §2)

Implements the experiment documented in cli/front-door/EVAL.md §2 (does a good front door actually help a coding agent, and at what cost?) as a runnable harness, distinct from the verifier self-eval in eval.mjs.

Three arms with the agent/model/tool-schema/prompts PINNED so only the front door varies: A repo as-is, B repo + front-door, C docs-stripped + front-door. Grading is code execution (hidden fail-to-pass + pass-to-pass), never an LLM grade; the agent never sees the hidden tests. Reports resolution rate (Wilson) and cost (tokens, steps), with paired McNemar + bootstrapped cost deltas and a power section sized for a ~2-4pp effect.

cli/front-door/ablation/: pin (PIN_PER_STEP + andon + cross-family judge), arms, grade, execute (node + command executors), agent (deterministic lookup reference agent + real external-agent hook), corpus (synthetic micro-bench + SWE-bench Verified parser), stats (probit/normalCdf/Wilson/McNemar/power/seeded bootstrap), runner, CLI (front-door ablation run).

Partial run, honestly scoped: a synthetic, execution-graded self-validation (n=60) ran end-to-end with real code execution and is published as a receipt (receipts/front-door-ablation.{json,md}). It proves the machinery, NOT the real front-door effect: the receipt carries measuresFrontDoorEffect:false and the n=60 run is deliberately underpowered (achieved power 0.07 for 3pp; needs N~2178). The real-effect run needs SWE-bench Verified + an external agent + per-instance containers + a model; those seams are shipped and the CLI prints the contract + exits rather than faking a number. See cli/front-door/ABLATION.md for the real-run contract, what was/wasn't covered, and standards compliance.

56 new tests; vitest run + lint clean (new files LF in index).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mcp-tool-shop · 2026-06-16T09:39:21Z

Superseded by #15 (feat/front-door-v2.1): its ablation content is byte-identical to this PR (verified empty diff across cli/front-door/ablation, ABLATION.md, EVAL.md). Closing to avoid a duplicate merge; the ablation harness lands via #15.

mcp-tool-shop closed this Jun 16, 2026

mcp-tool-shop wants to merge 1 commit into main from feat/front-door-swe-ablation
