feat(front-door): three-arm SWE-bench docs-on/off ablation harness (EVAL.md §2)#16
Closed
mcp-tool-shop wants to merge 1 commit into
…VAL.md §2)
Implements the experiment documented in cli/front-door/EVAL.md §2 — does a good
front door actually help a coding agent, and at what cost? — as a runnable
harness, distinct from the verifier self-eval in eval.mjs.
Three arms with the agent/model/tool-schema/prompts PINNED so only the front
door varies: A repo as-is, B repo + front-door, C docs-stripped + front-door.
Grading is code execution (hidden fail-to-pass + pass-to-pass), never an LLM
grade; the agent never sees the hidden tests. Reports resolution rate (Wilson)
and cost (tokens, steps), with paired McNemar + bootstrapped cost deltas and a
power section sized for a ~2-4pp effect.
cli/front-door/ablation/: pin (PIN_PER_STEP + andon + cross-family judge),
arms, grade, execute (node + command executors), agent (deterministic lookup
reference agent + real external-agent hook), corpus (synthetic micro-bench +
SWE-bench Verified parser), stats (probit/normalCdf/Wilson/McNemar/power/
seeded bootstrap), runner, CLI (front-door ablation run).
Partial run, honestly scoped: a synthetic, execution-graded self-validation
(n=60) ran end-to-end with real code execution and is published as a receipt
(receipts/front-door-ablation.{json,md}). It proves the machinery, NOT the
real front-door effect — the receipt carries measuresFrontDoorEffect:false and
the n=60 run is deliberately underpowered (achieved power 0.07 for 3pp; needs
N~2178). The real-effect run needs SWE-bench Verified + an external agent +
per-instance containers + a model; those seams are shipped and the CLI prints
the contract + exits rather than faking a number. See cli/front-door/ABLATION.md
for the real-run contract, what was/wasn't covered, and standards compliance.
56 new tests; vitest run + lint clean (new files LF in index).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Implements the experiment documented in cli/front-door/EVAL.md §2 (does a good front door actually help a coding agent, and at what cost?) as a runnable harness. This is distinct from the verifier self-eval in eval.mjs, which only proves the verifier behaves correctly.
Design (per EVAL.md)
Three arms, with the agent / model / tool-schema / prompts PINNED so only the front door varies:
A: repo as-is
B: repo + front-door
C: docs-stripped + front-door
Grading is code execution — hidden fail-to-pass + pass-to-pass (Jimenez et al. arXiv:2310.06770) — never an LLM grade; the agent never sees the hidden tests. Reports resolution rate (Wilson CI, primary) and cost (tokens, steps), with paired McNemar + bootstrapped cost deltas and a power section sized for a ~2-4pp effect (Gloaguen et al. arXiv:2602.11988 found context can lower success and add 20-23% cost — measured, not assumed). Any qualitative call would use a different model family with the generator's reasoning hidden (Huang et al. arXiv:2310.01798; Kambhampati et al. arXiv:2402.01817).
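As a rough illustration of the statistics named above, here is a minimal sketch of a Wilson score interval for the resolution rate and a normal-approximation power calculation. It is not the harness's stats module: the function names, the worst-case per-instance standard deviation of 0.5, the rounded z values (1.96 and 0.84), and the single-tail power approximation are all assumptions made for this example.

```javascript
// Sketch only: names and defaults are illustrative, not the harness's API.

// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erf approximation
// (absolute error around 1.5e-7, enough for power estimates).
function normalCdf(x) {
  const t = 1 / (1 + (0.3275911 * Math.abs(x)) / Math.SQRT2);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-(x * x) / 2); // erf(|x| / sqrt(2))
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Wilson score interval for a binomial proportion (successes out of n).
function wilsonInterval(successes, n, z = 1.96) {
  if (n <= 0) throw new RangeError("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

// Approximate power to detect an absolute difference `delta` with n instances,
// assuming per-instance SD `sigma` (0.5 is the worst case for a 0/1 outcome).
// Only the dominant tail of the two-sided test is counted.
function approxPower(n, delta, { sigma = 0.5, zAlpha = 1.96 } = {}) {
  return normalCdf((delta * Math.sqrt(n)) / sigma - zAlpha);
}

// Instances needed for 80% power (zBeta = 0.84) at two-sided alpha = 0.05.
function requiredN(delta, { sigma = 0.5, zAlpha = 1.96, zBeta = 0.84 } = {}) {
  return Math.ceil((((zAlpha + zBeta) * sigma) / delta) ** 2);
}

console.log(wilsonInterval(30, 60));   // 95% interval around a 50% rate
console.log(approxPower(60, 0.03));    // power at n=60 for a 3pp effect
console.log(requiredN(0.03));          // N for a 3pp effect
```

Under these assumptions the sketch gives roughly 0.07 power at n=60 and N=2178 for a 3pp effect, which is close to the figures reported for the n=60 run. The harness's actual formula is not shown in this description, so that agreement is suggestive rather than confirmed.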
Scope — honest partial run
The harness is complete and runnable. A synthetic, execution-graded self-validation (n=60) ran end-to-end with real code execution and is published as a receipt (receipts/front-door-ablation.{json,md}): A→B +25pp (McNemar exact p≈0.0001, +12 tokens/inst overhead); B→C −18.3pp (strip risk); A→C +6.7pp (n.s.).
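For readers unfamiliar with the paired test behind those p-values, the following is a minimal sketch of an exact two-sided McNemar test on discordant pairs. The function name is hypothetical and the counts in the usage lines are made up for illustration; the receipt's actual discordant counts are not shown in this description.

```javascript
// Sketch only: not the harness's implementation.
// b = instances resolved only in the first arm,
// c = instances resolved only in the second arm.
// Concordant pairs (both resolved, or both failed) do not enter the test.
function mcnemarExact(b, c) {
  const n = b + c;
  if (n === 0) return 1; // no discordant pairs: no evidence either way
  const k = Math.min(b, c);
  // Lower tail of Binomial(n, 0.5): sum of C(n, i) for i = 0..k,
  // building each coefficient from the previous one.
  let coeff = 1; // C(n, 0)
  let tail = 0;
  for (let i = 0; i <= k; i++) {
    tail += coeff;
    coeff = (coeff * (n - i)) / (i + 1);
  }
  // Two-sided p-value: double the smaller tail, capped at 1.
  return Math.min(1, 2 * tail * Math.pow(0.5, n));
}

console.log(mcnemarExact(3, 12)); // hypothetical counts
console.log(mcnemarExact(5, 5));  // balanced discordance
```

The coefficients are held as plain doubles, which is fine for a few hundred discordant pairs; a much larger corpus would call for log-space arithmetic.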
This proves the machinery, NOT the real front-door effect. The receipt carries measuresFrontDoorEffect: false, and the n=60 run is deliberately underpowered (achieved power 0.07 for 3pp; needs N≈2178). The real-effect run needs SWE-bench Verified + an external agent + per-instance containers + a model. Those seams are shipped (parseSweBenchVerified, makeCommandAgent, makeCommandExecutor), and the CLI prints the contract and exits rather than faking a number. The full real-run contract, what was and wasn't covered, standards compliance, and compensators are in cli/front-door/ABLATION.md.
Verification
tests/front-door/ablation/; full suite 185/185 pass
biome check clean on new files (LF in index); tsc --noEmit clean
Note on overlap
The same harness commit is also the foundation of the separately developed feat/front-door-v2.1 branch (which adds doctest + MCP slices and a v2.1.0 release). This PR is the focused, single-purpose ablation deliverable; coordinate merge order as preferred.
🤖 Generated with Claude Code