feat(signum-evolve): add v1 candidate review layer #115
Merged
Why:
- Reviewers need more than fixture pass/fail when assessing candidate policy catalogs from signum-evolve.
- Issue #114 tracks adding bounded candidate review evidence without changing scanner behavior or auto-applying candidate catalogs.

What changed:
- Add bounded multi-prefix scope candidates through maxMutationDepth while preserving the v0 default depth of 1.
- Add deterministic catalog_diff.json archives, compact leaderboard diff metadata, candidate ranking scores, and adoption bundle diff reporting.
- Add evolve.v1.json and a v1 smoke test covering multi-prefix output, catalog diff export, and source scanner/catalog immutability.

Testing:
- python3 -m py_compile experiments/signum_evolve/*.py
- bash tests/test-signum-evolve-v0.sh
- bash tests/test-signum-evolve-replay.sh
- bash tests/test-signum-evolve-v1.sh
- bash tests/test-policy-scanner-evals.sh
- bash tests/test-policy-scanner-eval-compare.sh
- bash tests/test-codex-prompt-evals.sh
- bash tests/test-codex-prompt-eval-compare.sh
- python3 evals/policy_scanner/run_policy_scanner_eval.py --repo-root . --json-output /tmp/policy-current-v1.json
- python3 evals/policy_scanner/compare_policy_scanner_eval.py --baseline evals/policy_scanner/baselines/current.json --candidate /tmp/policy-current-v1.json
- PATH=/opt/homebrew/bin:$PATH bash scripts/run-deterministic-tests.sh

Risk:
- narrow
- Changes are isolated to the offline experiment harness, but candidate ordering and leaderboard scoring should be preserved deliberately because reviewers may compare archived runs.

Constraint: Do not modify scanner/catalog/Codex/runtime/CI behavior or eval baselines, and do not auto-apply candidates.

Related: #114
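To make the depth-bounded candidate generation concrete, here is a minimal sketch of how multi-prefix scope candidates could be enumerated deterministically. It is not the code in `experiments/signum_evolve/candidate.py`: the function name `scope_candidates`, the `max_mutation_depth` parameter, and the example prefixes are all hypothetical, and the only behavior taken from the PR is that depth defaults to 1 and larger depths add prefix combinations.

```python
from itertools import combinations


def scope_candidates(prefixes, max_mutation_depth=1):
    """Yield prefix tuples of size 1..max_mutation_depth in a stable order.

    Hypothetical helper: the default depth of 1 mirrors the v0 behavior
    described in the PR; everything else is an illustrative assumption.
    """
    if max_mutation_depth < 1:
        raise ValueError("max_mutation_depth must be >= 1")
    # Deduplicate and sort so the output does not depend on input order.
    ordered = sorted(set(prefixes))
    depth_cap = min(max_mutation_depth, len(ordered))
    for depth in range(1, depth_cap + 1):
        # combinations() emits tuples in lexicographic order of the sorted input.
        yield from combinations(ordered, depth)


prefixes = ["src/", "docs/", "tests/"]  # made-up scope prefixes
v0 = list(scope_candidates(prefixes))  # depth 1: one prefix per candidate
v1 = list(scope_candidates(prefixes, max_mutation_depth=2))  # adds pairs
print(len(v0), len(v1))
```

Emitting all depth-1 candidates before any depth-2 candidate keeps the v0 candidates as a stable head of the v1 list, which is one way to keep archived runs comparable across depths.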
Summary
- Adds `signum-evolve v1` candidate review evidence for offline policy catalog experiments.
- Adds bounded multi-prefix scope candidates through `maxMutationDepth` while preserving v0 default behavior.
- Archives a `catalog_diff.json` per candidate and surfaces compact diff/ranking metadata in the leaderboard.

Issue: Closes #114
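As a rough illustration of the per-candidate diff and the compact leaderboard metadata mentioned in the summary, the sketch below computes a rule-level diff, serializes it deterministically, and ranks candidates with a stable tie-break. The catalog shape (`{rule_id: rule}`), the diff keys, the score values, and every function name here are assumptions made for the example; the actual `catalog_diff.json` schema and ranking formula are not shown in this PR description.

```python
import json


def catalog_diff(source, candidate):
    """Rule-level diff between two catalogs shaped as {rule_id: rule_dict}.

    Reads both inputs without mutating them (assumed catalog shape).
    """
    shared = source.keys() & candidate.keys()
    return {
        "added": sorted(candidate.keys() - source.keys()),
        "removed": sorted(source.keys() - candidate.keys()),
        "changed": sorted(r for r in shared if source[r] != candidate[r]),
    }


def dump_deterministic(obj):
    """Serialize with sorted keys so identical diffs give identical bytes."""
    return json.dumps(obj, sort_keys=True, indent=2) + "\n"


def leaderboard(entries):
    """Rank (candidate_id, score, diff) entries: score descending, id ascending.

    The id tie-break keeps the order stable when scores are equal.
    """
    ordered = sorted(entries, key=lambda e: (-e[1], e[0]))
    return [
        {
            "rank": rank,
            "candidate": cid,
            "score": score,
            # Compact metadata: counts only, not the full rule lists.
            "diff_counts": {kind: len(ids) for kind, ids in diff.items()},
        }
        for rank, (cid, score, diff) in enumerate(ordered, start=1)
    ]


# Made-up catalogs and scores, purely for illustration.
source = {"R1": {"scope": ["src/"]}, "R2": {"scope": ["docs/"]}}
cand_a = {"R1": {"scope": ["src/", "tests/"]}, "R2": {"scope": ["docs/"]}}
cand_b = {"R1": {"scope": ["src/"]}, "R3": {"scope": ["tests/"]}}

diff_a = catalog_diff(source, cand_a)
diff_b = catalog_diff(source, cand_b)
board = leaderboard([("cand-b", 0.9, diff_b), ("cand-a", 0.9, diff_a)])
print(dump_deterministic(board))
```

Sorting every list and every JSON key is what makes the output reproducible from run to run, so two archived runs can be compared with a plain text diff.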
Why
What changed
- `experiments/signum_evolve/candidate.py` can generate deterministic prefix combinations when the config sets `maxMutationDepth > 1`.
- `experiments/signum_evolve/catalog_diff.py` records rule-level catalog diffs without touching source catalogs.
- `experiments/signum_evolve/report.py` adds stable rank/score fields and compact catalog diff metadata to leaderboards.
- `experiments/signum_evolve/export.py` includes catalog diff evidence in adoption bundles.
- `experiments/signum_evolve/configs/evolve.v1.json` enables depth 2 for v1 experiments.

Review focus
- `evolve.v0.json` still defaults to one scope mutation per candidate.

Test plan
- `python3 -m py_compile experiments/signum_evolve/*.py`
- `bash tests/test-signum-evolve-v0.sh`
- `bash tests/test-signum-evolve-replay.sh`
- `bash tests/test-signum-evolve-v1.sh`
- `bash tests/test-policy-scanner-evals.sh`
- `bash tests/test-policy-scanner-eval-compare.sh`
- `bash tests/test-codex-prompt-evals.sh`
- `bash tests/test-codex-prompt-eval-compare.sh`
- `python3 evals/policy_scanner/run_policy_scanner_eval.py --repo-root . --json-output /tmp/policy-current-v1.json`
- `python3 evals/policy_scanner/compare_policy_scanner_eval.py --baseline evals/policy_scanner/baselines/current.json --candidate /tmp/policy-current-v1.json`
- `PATH=/opt/homebrew/bin:$PATH bash scripts/run-deterministic-tests.sh`

Not changed
Risks
Rollout / migration
Breaking changes
Follow-ups
Merge strategy recommendation