feat(signum-evolve): add v1 candidate review layer #115

Merged
t3chn merged 1 commit into main from codex/signum-evolve-v1-candidate-review
May 16, 2026

t3chn commented on May 16, 2026

Summary

  • Adds signum-evolve v1 candidate review evidence for offline policy catalog experiments.
  • Adds bounded multi-prefix scope candidates through maxMutationDepth while preserving v0 default behavior.
  • Archives deterministic catalog_diff.json per candidate and surfaces compact diff/ranking metadata in the leaderboard.
  • Extends adoption bundles with catalog diff evidence and checklist coverage.
  • Adds a v1 smoke test for multi-prefix candidate generation, diff output, export, and scanner/catalog immutability.
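
The bounded multi-prefix generation described above can be pictured roughly as follows. This is a minimal sketch, not the code in candidate.py: the function name `scope_candidates`, the example prefix strings, and the tuple representation are assumptions made for illustration; only the `maxMutationDepth` setting and its default of 1 come from this PR.

```python
from itertools import combinations


def scope_candidates(prefixes, max_mutation_depth=1):
    """Return every prefix combination of size 1..max_mutation_depth.

    Hypothetical helper: sorting and de-duplicating the input first makes the
    output order independent of the order in which prefixes were supplied.
    """
    if max_mutation_depth < 1:
        raise ValueError("max_mutation_depth must be at least 1")
    ordered = sorted(set(prefixes))
    candidates = []
    for depth in range(1, max_mutation_depth + 1):
        candidates.extend(combinations(ordered, depth))
    return candidates


# Depth 1 (the v0 default) yields one prefix per candidate.
print(scope_candidates(["src/", "docs/", "tests/"]))
# Depth 2 adds every pair on top of the single-prefix candidates.
print(scope_candidates(["src/", "docs/", "tests/"], max_mutation_depth=2))
```

Capping the combination size keeps the candidate count bounded: with three prefixes, depth 2 produces six candidates instead of three.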

Issue: Closes #114

Why

  • The v0 loop proved candidate generation, eval, compare, replay, and export.
  • Reviewers now need clearer candidate review evidence before any future adoption PR: what changed in the catalog, how candidates rank, and whether critical rules stayed untouched.

What changed

  • experiments/signum_evolve/candidate.py can generate deterministic prefix combinations when config sets maxMutationDepth > 1.
  • experiments/signum_evolve/catalog_diff.py records rule-level catalog diffs without touching source catalogs.
  • experiments/signum_evolve/report.py adds stable rank/score fields and compact catalog diff metadata to leaderboards.
  • experiments/signum_evolve/export.py includes catalog diff evidence in adoption bundles.
  • experiments/signum_evolve/configs/evolve.v1.json enables depth 2 for v1 experiments.
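
A rule-level catalog diff of the kind catalog_diff.py records might look like the sketch below. The catalog shape (a mapping from rule id to rule body), the `added`/`removed`/`changed` keys, and both function names are assumptions for illustration; the actual `catalog_diff.json` schema is not shown in this PR description.

```python
import json


def catalog_diff(source_rules, candidate_rules):
    """Compare two {rule_id: rule_dict} mappings without mutating either."""
    source_ids, candidate_ids = set(source_rules), set(candidate_rules)
    changed = sorted(
        rule_id
        for rule_id in source_ids & candidate_ids
        if source_rules[rule_id] != candidate_rules[rule_id]
    )
    return {
        "added": sorted(candidate_ids - source_ids),
        "removed": sorted(source_ids - candidate_ids),
        "changed": changed,
    }


def dump_diff(diff):
    # sort_keys plus a fixed indent keeps the serialized text stable across runs.
    return json.dumps(diff, indent=2, sort_keys=True) + "\n"


# Hypothetical rule ids and scopes, used only to exercise the helpers.
source = {"R1": {"scope": ["src/"]}, "R2": {"scope": ["docs/"]}}
candidate = {"R1": {"scope": ["src/", "tests/"]}, "R2": {"scope": ["docs/"]}}
print(dump_diff(catalog_diff(source, candidate)))
```

Sorting every list and serializing with sorted keys is one simple way to get byte-identical output for identical inputs, which is what makes archived diffs comparable across runs.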

Review focus

  • Confirm v0 compatibility: evolve.v0.json still defaults to one scope mutation per candidate.
  • Confirm v1 remains scope-only and non-critical through existing mutation validation.
  • Confirm catalog diff output is deterministic and reviewable.
  • Confirm no runtime scanner/catalog/Codex/CI behavior changed.
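
One way a reviewer could spot-check the "no source files changed" claim is to snapshot file digests before and after an experiment run. This is a hedged sketch under assumptions: `digest_files` and `changed_files` are hypothetical helpers, and the temporary file stands in for a real catalog path; it is not how tests/test-signum-evolve-v1.sh is known to work.

```python
import hashlib
import tempfile
from pathlib import Path


def digest_files(paths):
    """Map each path to the SHA-256 hex digest of its current contents."""
    return {
        path: hashlib.sha256(Path(path).read_bytes()).hexdigest()
        for path in sorted(map(str, paths))
    }


def changed_files(before, after):
    """Return paths whose digest differs or that appear in only one snapshot."""
    return sorted(
        path
        for path in set(before) | set(after)
        if before.get(path) != after.get(path)
    )


with tempfile.TemporaryDirectory() as tmp:
    catalog = Path(tmp) / "catalog.json"  # stand-in for a source catalog
    catalog.write_text("{}\n")
    before = digest_files([catalog])
    # An offline experiment run would happen here.
    after = digest_files([catalog])
    print(changed_files(before, after))
```

An empty result means every tracked file has the same contents after the run as before it.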

Test plan

  • python3 -m py_compile experiments/signum_evolve/*.py
  • bash tests/test-signum-evolve-v0.sh
  • bash tests/test-signum-evolve-replay.sh
  • bash tests/test-signum-evolve-v1.sh
  • bash tests/test-policy-scanner-evals.sh
  • bash tests/test-policy-scanner-eval-compare.sh
  • bash tests/test-codex-prompt-evals.sh
  • bash tests/test-codex-prompt-eval-compare.sh
  • python3 evals/policy_scanner/run_policy_scanner_eval.py --repo-root . --json-output /tmp/policy-current-v1.json
  • python3 evals/policy_scanner/compare_policy_scanner_eval.py --baseline evals/policy_scanner/baselines/current.json --candidate /tmp/policy-current-v1.json
  • PATH=/opt/homebrew/bin:$PATH bash scripts/run-deterministic-tests.sh

Not changed

  • Scanner behavior
  • Source policy rule catalog
  • Claude overlay runtime
  • Codex prompt
  • CI wiring
  • Eval baselines
  • Candidate auto-apply behavior

Risks

  • narrow - Candidate ranking fields are new review metadata; future changes should preserve deterministic ordering for archived run comparisons.
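
Deterministic ordering of the kind this risk refers to usually comes down to an explicit tie-break. The sketch below assumes hypothetical `score` and `candidate_id` fields and a higher-is-better score; the real leaderboard field names in report.py are not given in this description.

```python
def rank_candidates(results):
    """Attach a 1-based rank; ties on score fall back to candidate id."""
    ordered = sorted(results, key=lambda r: (-r["score"], r["candidate_id"]))
    # dict(r, rank=i) copies each entry, so the input list is left untouched.
    return [dict(r, rank=i) for i, r in enumerate(ordered, start=1)]


# Hypothetical results with a tie between two candidates.
results = [
    {"candidate_id": "c-002", "score": 0.9},
    {"candidate_id": "c-001", "score": 0.9},
    {"candidate_id": "c-003", "score": 0.5},
]
for entry in rank_candidates(results):
    print(entry["rank"], entry["candidate_id"], entry["score"])
```

Because the sort key never depends on input order, re-running the ranking on a shuffled result list gives the same leaderboard, so archived runs stay comparable.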

Rollout / migration

  • N/A. This is an offline experiment harness change only.

Breaking changes

  • None.

Follow-ups

  • Add richer candidate scoring only after this review layer is accepted.
  • Keep future mutation families separate from this PR.

Merge strategy recommendation

  • Recommended: squash
  • Reason: this is one logical feature slice with one commit.

Why:
- Reviewers need more than fixture pass/fail when assessing candidate policy catalogs from signum-evolve.
- Issue #114 tracks adding bounded candidate review evidence without changing scanner behavior or auto-applying candidate catalogs.

What changed:
- Add bounded multi-prefix scope candidates through maxMutationDepth while preserving the v0 default depth of 1.
- Add deterministic catalog_diff.json archives, compact leaderboard diff metadata, candidate ranking scores, and adoption bundle diff reporting.
- Add evolve.v1.json and a v1 smoke test covering multi-prefix output, catalog diff export, and source scanner/catalog immutability.

Testing:
- python3 -m py_compile experiments/signum_evolve/*.py
- bash tests/test-signum-evolve-v0.sh
- bash tests/test-signum-evolve-replay.sh
- bash tests/test-signum-evolve-v1.sh
- bash tests/test-policy-scanner-evals.sh
- bash tests/test-policy-scanner-eval-compare.sh
- bash tests/test-codex-prompt-evals.sh
- bash tests/test-codex-prompt-eval-compare.sh
- python3 evals/policy_scanner/run_policy_scanner_eval.py --repo-root . --json-output /tmp/policy-current-v1.json
- python3 evals/policy_scanner/compare_policy_scanner_eval.py --baseline evals/policy_scanner/baselines/current.json --candidate /tmp/policy-current-v1.json
- PATH=/opt/homebrew/bin:$PATH bash scripts/run-deterministic-tests.sh

Risk:
- narrow - Changes are isolated to the offline experiment harness, but candidate ordering and leaderboard scoring should be preserved deliberately because reviewers may compare archived runs.

Constraint: Do not modify scanner/catalog/Codex/runtime/CI behavior or eval baselines, and do not auto-apply candidates.
Related: #114
github-actions bot added the intake/pass label (PR intake passed) on May 16, 2026
t3chn marked this pull request as ready for review on May 16, 2026 at 14:18
t3chn merged commit d0deda1 into main on May 16, 2026
4 checks passed
t3chn deleted the codex/signum-evolve-v1-candidate-review branch on May 16, 2026 at 14:18

Development

Successfully merging this pull request may close these issues.

Implement signum-evolve v1 candidate review layer
