mentor-worker-benchmark

mentor-worker-benchmark is a fully local benchmark for measuring whether a mentor LLM improves a worker LLM on deterministic, objectively scored coding tasks.

Core docs:

If this benchmark saves you local eval time, the smallest support path is the $5 Codex run receipt: https://nicdunz.gumroad.com/l/smrimu.

For self-serve browser/account/public-action control templates around eval approvals, proof capture, handoffs, and go/no-go checks, use Agent Browser Operator OS: https://nicdunz.gumroad.com/l/agent-browser-operator-os.

For a written no-call audit of a local mentor/worker eval workflow, use the paid setup audit path:

Redacted configs, result summaries, and public repo links only. Do not paste API keys, provider credentials, private transcripts, auth files, or personal data. No call required. The browser operator kit is self-serve material only; it does not include Chrome plugin repair, guaranteed automation, account access, custom setup, calls, or posting without human approval.

  • Inference is local via Ollama (no paid APIs required).
  • Scoring is objective: generated patches are applied, then pytest decides pass/fail.
  • Outputs are reproducible artifacts (results.json, markdown leaderboard, optional static docs page).

Fastest Credible First Run

If you want the shortest path from clone to a verified artifact:

python3 -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.lock
python -m mentor_worker_benchmark setup
./scripts/run_local_verification.sh

That path gives you:

  • a backend preflight record
  • a resumable benchmark artifact plus checkpoint log under results/
  • a verified community (not official) submission zip under submissions/

Motivation

Many “AI collaboration” evaluations are hard to verify and easy to game. This project tests a narrower, auditable question:

When a mentor can only send natural-language guidance, does the worker solve more tasks than the worker alone?

The benchmark includes controls and ablations (baseline worker-only and dummy-mentor control), plus guardrails against mentor cheating and unsafe patches.

Benchmark Construct

  • The worker sees the task prompt, a workspace snapshot, failing pytest output, and test files.
  • The benchmark therefore measures test-driven repair ability, not blind synthesis from a hidden oracle.
  • Tasks are deterministic local Python repair tasks.
  • No internet access is allowed or required during execution.
  • The evaluation oracle is the local pytest suite bundled with each task.

What It Measures (And What It Doesn’t)

What it measures:

  • Task success rate on deterministic Python repair tasks scored by unit tests.
  • Mentorship lift: mentored pass rate minus worker-only baseline.
  • Control performance with non-informative mentor advice.
  • Mentor constraint violation rate.

What it does not measure:

  • Open-ended coding quality beyond test coverage.
  • Subjective style or readability judgments.
  • Real-world long-horizon software engineering workflows.

Open-benchmark limitation:

  • The task corpus, tests, and submission bundles are public, so leaderboard overfitting is possible. Treat leaderboard results as transparent benchmark behavior on an open corpus, not as performance on a hidden holdout.

Task Pack and Splits

Default pack: task_pack_v2 (mini-repo realism).

task_pack_v2 contains 473 active deterministic tasks after exact-family deduplication for split independence.

Source provenance:

  • 652 source tasks were generated and audited.
  • 38 exact duplicate families were detected in the source corpus.
  • The active release keeps one representative per exact family.

Categories:

  1. string_regex_parsing
  2. ds_algo
  3. file_io_serialization
  4. concurrency_basics
  5. numerical_edge_cases
  6. multi_file_mini_module
  7. mini_repo_bugfix
  8. mini_repo_feature
  9. mini_repo_cli
  10. mini_repo_tool_sim

Splits:

  • train: 265
  • dev: 104
  • test: 104
  • quick: 30 curated eval tasks (balanced fast profile)

Default benchmark behavior runs dev+test unless overridden. task_pack_v1 is still available via --task-pack task_pack_v1.

Why v2 is more realistic:

  • Tasks include multi-module interactions and integration-style failures.
  • CLI behavior and tool-output parsing patterns mimic real local workflows.
  • Worker context now includes a concise file tree plus size-limited file excerpts.

Install

Python 3.11+ is required.

python3 -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements.lock

Local Setup (Ollama)

python -m mentor_worker_benchmark setup

Default models (pulled if missing):

  • llama3.1:8b
  • qwen2.5-coder:7b
  • mistral:7b
  • phi3:mini
  • gemma2:9b

If Ollama is installed but not running, start it with the desktop app or:

ollama serve

Local Verification (Recommended On This Machine)

Use the sanctioned local verification path before any headline/publication run.

  • It is a single-seed local release-health check.
  • It is intended to be operationally viable on this 16 GB MacBook Air.
  • It is not a headline benchmark publication path.

./scripts/run_local_verification.sh

Default local verification profile:

  • task_pack_v2
  • suite=dev10
  • seed=1337
  • worker=phi3:mini
  • mentor=phi3:mini
  • run-modes=worker_only,mentor_worker
  • backend preflight required before the run starts

What the script leaves behind:

  • preflight JSON for backend stability
  • results JSON with pass/fail outcomes, compute budget, environment, and failure accounting
  • checkpoint JSONL for resumable reruns of the same config
  • run log plus a verified community submission zip

How to read the core numbers:

  • Baseline: worker-only pass rate
  • Mentored: same worker with mentor guidance
  • Lift: Mentored - Baseline
  • Model Errors and Timeouts: backend failures that should be interpreted alongside pass rate
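
The arithmetic behind these numbers can be sketched as follows. The function names and the per-task boolean lists are illustrative assumptions, not the layout of results.json.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def lift(baseline: list[bool], mentored: list[bool]) -> float:
    """Lift = mentored pass rate minus worker-only baseline pass rate."""
    return pass_rate(mentored) - pass_rate(baseline)


# Hypothetical outcomes for four tasks, in the same task order for both modes.
baseline_outcomes = [True, False, True, False]  # 2 of 4 pass -> 0.50
mentored_outcomes = [True, True, True, False]   # 3 of 4 pass -> 0.75
print(lift(baseline_outcomes, mentored_outcomes))  # 0.25
```

A positive lift on a run with many model errors or timeouts deserves less trust than the same lift on a clean run, which is why those counts are reported next to the pass rates.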

Manual equivalent:

python -m mentor_worker_benchmark preflight \
  --models phi3:mini \
  --model-timeout 30 \
  --attempts 2

python -m mentor_worker_benchmark run \
  --task-pack task_pack_v2 \
  --suite dev10 \
  --mentor-model phi3:mini \
  --worker-model phi3:mini \
  --run-modes worker_only,mentor_worker \
  --repro \
  --max-turns 3 \
  --model-timeout 180 \
  --test-timeout 8 \
  --model-retries 1 \
  --model-retry-backoff 1.0 \
  --worker-num-predict 512 \
  --mentor-num-predict 256 \
  --seed 1337 \
  --results-path results/local_verification_dev10.json

--repro fixes key generation/runtime settings (temperature, top_p, seeds, max tokens, max turns). If you rerun the same command with the same --results-path on the same benchmark code/task-pack revision, completed units are resumed from <results-stem>.checkpoint.jsonl. If you change suite, seed, models, run modes, or benchmark revision, use a new --results-path.

Run With OpenAI SOTA Models

You can run remote models by switching provider(s) from the default ollama to openai.

Requirements:

  • Set OPENAI_API_KEY in your environment.
  • Pick explicit model names (for example gpt-5, gpt-5-mini, o4-mini) using role flags.

Examples:

# Use OpenAI for both mentor and worker.
python -m mentor_worker_benchmark run \
  --provider openai \
  --mentor-model gpt-5 \
  --worker-model gpt-5-mini \
  --suite quick \
  --run-modes worker_only,mentor_worker \
  --repro \
  --max-turns 3 \
  --model-timeout 180 \
  --test-timeout 8 \
  --worker-num-predict 512 \
  --mentor-num-predict 256

# Hybrid run: local worker via Ollama, remote mentor via OpenAI.
python -m mentor_worker_benchmark run \
  --provider ollama \
  --mentor-provider openai \
  --worker-provider ollama \
  --mentor-model gpt-5 \
  --worker-model phi3:mini \
  --suite quick \
  --run-modes worker_only,mentor_worker \
  --repro \
  --max-turns 3 \
  --model-timeout 180 \
  --test-timeout 8 \
  --worker-num-predict 512 \
  --mentor-num-predict 256

# Optional reasoning hint for supported OpenAI models.
python -m mentor_worker_benchmark run \
  --provider openai \
  --mentor-model gpt-5 \
  --worker-model gpt-5-mini \
  --reasoning-level medium

Warning:

  • OpenAI runs are not free. You are responsible for API cost and rate limits.
  • Large suites can trigger throttling; start with --suite quick and the reproducible quick profile shown above.

Official Suites

Standardized scripts (macOS/Linux):

./scripts/run_local_verification.sh
./scripts/run_official_quick.sh
./scripts/run_official_dev.sh
./scripts/run_official_dev_v1.sh

Operational policy:

  • ./scripts/run_local_verification.sh is the sanctioned local release-health path on this machine.
  • ./scripts/run_official_dev.sh and headline dev/dev50/test publication paths remain scientifically valid, but they are not a practical default local gate on this hardware.
  • ./scripts/run_official_quick.sh produces an official sanity artifact; it is still heavier than the sanctioned local verification path.

run_official_dev_v1.sh accepts TASK_SUITE=dev|dev50|test (default dev50).

Each script runs a fixed-suite benchmark configuration, exports a submission bundle, and verifies it. Headline policy:

  • Headline official baseline numbers come from dev/dev50/test suites.
  • Official dev10/quick runs are sanity checks for harness health and error-rate visibility, not headline performance claims.
  • Headline suites run three deterministic seeds by default: 1337, 2026, 9001.

How To Interpret Headline Numbers

  • Headline Baseline, Mentored, and Lift are multi-seed means (not single-point pass rates).
  • Confidence intervals are 95% bootstrap CIs computed deterministically from task-family outcomes.
  • A sig lift marker means the 95% lift CI excludes 0.
  • Sanity suites (quick/dev10) remain harness-health checks, not headline claims.
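
One way to obtain a deterministic bootstrap interval is to resample task families with a fixed random seed. The sketch below is a minimal percentile bootstrap under that assumption. The resampling scheme, the resample count, the seed, and the function names are all illustrative and may differ from what the benchmark actually does.

```python
import math
import random


def bootstrap_lift_ci(
    family_lifts: list[float],
    n_resamples: int = 2000,
    seed: int = 1337,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean per-family lift.

    `family_lifts` holds one value per task family (mentored minus baseline).
    A fixed seed makes repeated calls return the same interval.
    """
    if not family_lifts:
        raise ValueError("need at least one task family")
    rng = random.Random(seed)
    n = len(family_lifts)
    means = []
    for _ in range(n_resamples):
        # Resample families with replacement and record the mean lift.
        sample = [family_lifts[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = max(0, math.floor((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, math.ceil((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def excludes_zero(ci: tuple[float, float]) -> bool:
    """True when the whole interval lies strictly above or strictly below 0."""
    lo, hi = ci
    return lo > 0 or hi < 0
```

Resampling at the family level rather than the task level keeps related tasks together, so near-identical tasks do not make the interval look narrower than it should.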

Sanity Check (No Model Calls)

Validate task harness and starters without Ollama:

python -m mentor_worker_benchmark sanity --task-pack task_pack_v1 --suite quick --seed 1337
python -m mentor_worker_benchmark sanity --task-pack task_pack_v2 --suite quick --seed 1337
python -m mentor_worker_benchmark.tasks.task_pack_v1.validate
python -m mentor_worker_benchmark.tasks.task_pack_v2.validate
python -m mentor_worker_benchmark provenance --task-pack task_pack_v2

CLI

python -m mentor_worker_benchmark setup [--models default|m1,m2] [--skip-pull]
python -m mentor_worker_benchmark preflight [--models m1,m2] [--model-timeout 30] [--attempts 2]
python -m mentor_worker_benchmark run [--task-pack task_pack_v2|task_pack_v1] [--suite quick|dev10|dev50|dev|test|all] [--seed 1337|--seeds 1337,2026,9001] [--model-timeout N] [--test-timeout M] [--repro] [--debug]
python -m mentor_worker_benchmark run --task-pack-path /abs/path/to/pack --suite dev
python -m mentor_worker_benchmark sanity [--task-pack task_pack_v2|task_pack_v1] [--suite quick|dev10|dev50|dev|test|all]
python -m mentor_worker_benchmark leaderboard --results results/results.json --output results/leaderboard.md
python -m mentor_worker_benchmark compare --before before.json --after after.json
python -m mentor_worker_benchmark analyze --results results/results.json --out results/analysis.json
python -m mentor_worker_benchmark export --results results/results.json --out submissions/<name>.zip [--official]
python -m mentor_worker_benchmark verify --submission submissions/<name>.zip
python -m mentor_worker_benchmark curate --task-pack task_pack_v1 --seed 1337
python -m mentor_worker_benchmark provenance --task-pack task_pack_v2 [--fail-on-overlap]

Convenience:

make setup
make quick

Current Snapshot

Use docs/post_ready_summary.md for a compact, shareable summary generated from leaderboard/summary.json. Use the live leaderboard or docs/leaderboard.md for the full verified submission table.

Generated files:

  • results/results.json
  • results/<stem>.checkpoint.jsonl (append-only resume log; source of truth for interrupted single-seed runs)
  • results/<stem>.seed-<seed>.json (written for completed seeds before multi-seed final merge)
  • results/analysis.json (from analyze, or bundled automatically during export)
  • results/leaderboard.md
  • docs/index.html (optional GitHub Pages view)

Checkpointing And Resume

  • Resume unit: (seed, mode, task_id, worker_model, mentor_model).
  • Checkpoints are append-only JSONL files stored next to --results-path.
  • Re-running the same command with the same --results-path and the same checkpoint metadata skips already completed units deterministically.
  • Checkpoint metadata includes the benchmark git commit and task-pack metadata, so resume does not cross code/task-pack revisions.
  • Multi-seed runs write per-seed partial JSON artifacts before the final merged results.json.
  • Not part of the resume state: the final merged multi-seed results.json, exported submission ZIPs, and leaderboard.md are produced only when a run completes.
  • benchmark_wall_time_seconds reflects accumulated completed run time across resumed units; checkpointing.session_wall_time_seconds records the current invocation.
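
The resume logic described here can be sketched as a small append-only JSONL log keyed by the resume unit. The record fields below mirror the resume-unit tuple, but the exact on-disk schema, the `passed` field, and the helper names are assumptions for illustration. The real checkpoint also carries git-commit and task-pack metadata, which this sketch leaves out.

```python
import json
from pathlib import Path

# (seed, mode, task_id, worker_model, mentor_model)
ResumeUnit = tuple[int, str, str, str, str]


def checkpoint_path(results_path: Path) -> Path:
    """Place <results-stem>.checkpoint.jsonl next to the results file."""
    return results_path.with_name(f"{results_path.stem}.checkpoint.jsonl")


def load_completed(path: Path) -> set[ResumeUnit]:
    """Read every finished unit from the log; a missing file means none."""
    done: set[ResumeUnit] = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            done.add(
                (rec["seed"], rec["mode"], rec["task_id"],
                 rec["worker_model"], rec["mentor_model"])
            )
    return done


def append_unit(path: Path, unit: ResumeUnit, passed: bool) -> None:
    """Append one finished unit; existing lines are never rewritten."""
    seed, mode, task_id, worker_model, mentor_model = unit
    record = {
        "seed": seed, "mode": mode, "task_id": task_id,
        "worker_model": worker_model, "mentor_model": mentor_model,
        "passed": passed,  # hypothetical outcome field
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

A runner built on this would call `load_completed` once at startup and skip any unit already in the returned set, so an interrupted run only repeats the unit that was in flight.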

How To Submit Results

  1. Run ./scripts/run_local_verification.sh first.
  2. If you are producing a publication/leaderboard artifact, run an official suite (./scripts/run_official_quick.sh or ./scripts/run_official_dev.sh).
  3. If needed, manually export:

python -m mentor_worker_benchmark export \
  --results results/results.json \
  --out submissions/my_run.zip

  4. Verify your bundle:

python -m mentor_worker_benchmark verify --submission submissions/my_run.zip

  5. Open a submission issue and attach/link the zip.

Submission details and maintainer verification flow are documented in docs/SUBMIT_RESULTS.md. Pack registry/data-card guidance is documented in docs/PACKS.md.

Community Leaderboard Automation

Repository structure:

  • submissions/: working area for exported bundles and archived publication bundles.
  • submissions/archive/: tracked historical/public submission bundles grouped by task-pack version.
  • leaderboard/submissions/: normalized per-submission JSON extracted from verified bundles.
  • leaderboard/summary.json: aggregated view used by docs.

Automation:

  • PRs touching submission bundles under submissions/ trigger .github/workflows/submissions-pr.yml.
  • CI verifies each changed submission zip.
  • CI regenerates leaderboard/summary.json, docs/leaderboard.md, and docs/index.html.
  • For same-repo PRs, CI auto-commits refreshed artifacts back to the PR branch.

docs/index.html now includes:

  • headline official baselines plus official sanity runs,
  • pack filter (task_pack_v1 / task_pack_v2),
  • suite filter (quick / dev10 / dev50 / dev / test),
  • explicit community (not official) labeling.

Official Baselines

Currently tracked official publication archive on this branch:

Local health artifacts produced by ./scripts/run_local_verification.sh are intentionally not treated as headline baselines and are ignored by the public leaderboard regeneration step.

Lightweight Leaderboard Publishing

Generate markdown + static HTML:

python scripts/publish_leaderboard.py \
  --results results/results.json \
  --markdown-out results/leaderboard.md \
  --html-out docs/single_run.html

Enable GitHub Pages (repo settings):

  1. Open Settings → Pages.
  2. Set Source to “Deploy from a branch”.
  3. Select main branch and /docs folder.
  4. Save. GitHub publishes docs/index.html (community leaderboard).

Regenerate community artifacts from the tracked submission bundles under submissions/. The scan is recursive and includes submissions/archive/...; local scratch bundles such as submissions/local_* and submissions/tmp_* are ignored:

python scripts/build_community_leaderboard.py --strict

Methodology Guardrails

  • Mentor can only send natural-language guidance; code-like mentor output is blocked/sanitized and logged.
  • Worker must return a unified diff patch; patch format and paths are validated.
  • Patch application forbids traversal outside the task workspace.
  • Tests run in isolated temp directories with a configurable per-test timeout and network disabled.
  • Run metadata logs environment and provenance (Python, platform, Ollama version/model tags, git commit hash).
  • Strict reproducibility claims should only be made when the backend preflight passes and the same model digests are used.
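
To make the path guardrail concrete, here is a minimal sketch of rejecting a unified diff whose target paths would escape the task workspace. The helper names are hypothetical and the header parsing is deliberately simple: it treats any line starting with `--- ` or `+++ ` as a file header, which a real validator would need to refine. It is not the benchmark's own validation code.

```python
from pathlib import Path, PurePosixPath


def patch_target_paths(diff_text: str) -> list[str]:
    """Collect file paths from the '---'/'+++' header lines of a unified diff."""
    paths: list[str] = []
    for line in diff_text.splitlines():
        if not line.startswith(("--- ", "+++ ")):
            continue
        target = line[4:].split("\t", 1)[0].strip()  # drop optional timestamp
        if target == "/dev/null":  # marker for file creation or deletion
            continue
        if target.startswith(("a/", "b/")):  # git-style prefixes
            target = target[2:]
        paths.append(target)
    return paths


def stays_in_workspace(workspace: Path, relative: str) -> bool:
    """Reject absolute paths and anything that resolves outside `workspace`."""
    if PurePosixPath(relative).is_absolute():
        return False
    root = workspace.resolve()
    return (root / relative).resolve().is_relative_to(root)


def patch_is_safe(workspace: Path, diff_text: str) -> bool:
    """A patch is accepted only if it names files and all stay inside."""
    paths = patch_target_paths(diff_text)
    return bool(paths) and all(stays_in_workspace(workspace, p) for p in paths)
```

Resolving the candidate path before comparing it to the workspace root is what catches `..` segments, which a plain string-prefix check would miss.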

Compute Budget

Each run writes a compute_budget manifest in results.json and in exported submission_manifest.json:

  • max_turns
  • timeout_seconds (legacy alias for model timeout)
  • model_timeout_seconds
  • test_timeout_seconds
  • total_model_calls_attempted
  • total_tokens_estimate (or explicit "unavailable")
  • total_wall_time_seconds
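
A quick way to confirm that a results file carries the full manifest is to check for these field names. The sketch below assumes `compute_budget` is a top-level key of the loaded results.json, which is an assumption about the file layout. The helper name is hypothetical.

```python
EXPECTED_BUDGET_FIELDS = (
    "max_turns",
    "timeout_seconds",
    "model_timeout_seconds",
    "test_timeout_seconds",
    "total_model_calls_attempted",
    "total_tokens_estimate",
    "total_wall_time_seconds",
)


def missing_budget_fields(results: dict) -> list[str]:
    """Return the manifest fields that are absent from a loaded results dict."""
    budget = results.get("compute_budget", {})  # assumed top-level location
    return [name for name in EXPECTED_BUDGET_FIELDS if name not in budget]
```

In use, load the file with `json.load` and pass the resulting dict. An empty list means every listed field is present, including the case where `total_tokens_estimate` is the explicit "unavailable" marker.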

Test Strength Gates

Task-pack validation now includes deterministic test-strength gates to make results harder to game:

  • Static test strength heuristics per task:
    • assertion count (AST-based),
    • edge-case keyword coverage,
    • negative-test presence (e.g., exception expectations),
    • multi-file interaction signal from test/source imports.
  • Counterexample mutation harness:
    • runs tests on the starter task workspace,
    • applies a deterministic wrong patch to the likely target module/function,
    • verifies tests fail on the wrong patch.
  • Strict mode (--strict) fails validation when:
    • too many tasks are mutation-skipped,
    • non-allowlisted tasks do not fail under the wrong patch,
    • low-strength scores exceed conservative policy thresholds.

Allowlists live in each pack (strength_allowlist.json) to explicitly grandfather legacy tasks while keeping strict checks transparent.

What this does not guarantee:

  • It is not a full mutation-testing framework and does not prove complete behavioral coverage.
  • It reduces obvious weak-test/trivial-task failure modes but cannot eliminate every possible benchmark gaming path.

Provenance & Limitations

task_pack_v2 includes generated provenance artifacts:

  • mentor_worker_benchmark/tasks/task_pack_v2/provenance.json
  • mentor_worker_benchmark/tasks/task_pack_v2/PROVENANCE.md

They are generated by in-repo scripts and include:

  • Generator version, git commit hash, and seed.
  • A contamination-risk checklist with explicit did/did-not-do statements.
  • Intra-pack overlap scan (hashed token/char n-gram cosine on prompt+tests).
  • Originality marker scan for obvious external-source references in task files.

Regenerate and re-check:

python -m mentor_worker_benchmark provenance --task-pack task_pack_v2

No-overclaim policy:

  • See docs/benchmark_policy.md for allowed/disallowed claims and responsible citation guidance.
  • We intentionally do not claim zero contamination risk; models may have seen similar patterns during pretraining.

Cite / Reference

Use this short reference when citing the benchmark:

dicnunz. mentor-worker-benchmark: Local benchmark for mentor-worker LLM collaboration on objective coding tasks. GitHub repository, 2026. https://github.com/dicnunz/mentor-worker-benchmark

BibTeX:

@misc{dicnunz_mentor_worker_benchmark_2026,
  author = {dicnunz},
  title = {mentor-worker-benchmark: Local benchmark for mentor-worker LLM collaboration on objective coding tasks},
  year = {2026},
  howpublished = {\url{https://github.com/dicnunz/mentor-worker-benchmark}},
  note = {Accessed: 2026-03-01}
}

Quality Gates

task_pack_v1 includes an automated curation pipeline used to keep the base 300-task corpus credible before composing v2.

Run:

python -m mentor_worker_benchmark curate --task-pack task_pack_v1 --seed 1337

What curate does:

  • Detects near-duplicates using hashed token/character n-gram cosine similarity.
  • Flags trivial tasks (low test depth, short starter code, and phi3 one-turn pass checks).
  • Flags ambiguity (missing explicit prompt I/O examples, weak edge-case/invalid-input test coverage).
  • Rebalances difficulty to target distribution (easy 35%, medium 45%, hard 20%) with DEV calibration.
  • Runs DEV one-turn worker-only calibration on phi3:mini and qwen2.5-coder:7b before/after.
  • Regenerates flagged tasks deterministically while preserving category/split and exact split counts.
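
The near-duplicate check in the first bullet can be illustrated with character n-grams hashed into a fixed number of buckets and compared by cosine similarity. The n-gram size, bucket count, hash function, and function names below are assumptions chosen for the sketch, and the pipeline's real parameters and flagging threshold are not stated here.

```python
import hashlib
import math


def hashed_ngram_vector(text: str, n: int = 3, buckets: int = 1024) -> dict[int, int]:
    """Map character n-grams of normalized text into a sparse count vector."""
    normalized = " ".join(text.lower().split())
    vec: dict[int, int] = {}
    for i in range(max(len(normalized) - n + 1, 0)):
        gram = normalized[i : i + n]
        # A stable hash keeps results identical across processes, unlike
        # the built-in hash(), which is salted per interpreter run.
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        bucket = int.from_bytes(digest, "big") % buckets
        vec[bucket] = vec.get(bucket, 0) + 1
    return vec


def cosine(a: dict[int, int], b: dict[int, int]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b.get(bucket, 0) for bucket, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two task prompts would be flagged as near-duplicates when their similarity exceeds a chosen threshold. Hashing into buckets keeps the vectors small at the cost of occasional collisions.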

Curation artifacts:

  • results/curation_report.json
  • results/curation_report.md

Adding or Updating Tasks

See CONTRIBUTING.md for task authoring standards, split rules, and validation commands.

License

MIT (LICENSE)

All benchmark task content under mentor_worker_benchmark/tasks/task_pack_v1 and mentor_worker_benchmark/tasks/task_pack_v2 is synthetic in-repo content and is MIT-licensed as part of this repository.
