Context
SPEC-03 defines Dimension 5 (Semantic Quality, weight 0.20) as an LLM review of the diff that evaluates code quality, readability, and adherence to conventions.
Current Behavior
runner.py hardcodes dimensions["semantic"] = 0.8. No LLM call is made.
Expected Behavior
After a mutation, send the diff to an LLM (via the Reflector or a dedicated scorer) to get a semantic quality assessment. Parse the response into a score between 0.0 and 1.0 and store it in dimensions["semantic"] in place of the hardcoded value.
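A minimal sketch of what the parsing step could look like. Everything named here is hypothetical and not taken from the codebase: the `SCORE:` reply format, the prompt wording, the `ask_llm` callable (a stand-in for however the Reflector or a dedicated scorer reaches the model), and the choice to fall back to the current 0.8 when the reply cannot be parsed.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt; the review criteria come from SPEC-03's description.
PROMPT_TEMPLATE = (
    "Review the following diff for code quality, readability, and adherence "
    "to conventions. End your reply with a line of the form "
    "'SCORE: <number between 0.0 and 1.0>'.\n\n{diff}"
)

# Assumed reply format: a line such as "SCORE: 0.65".
_SCORE_RE = re.compile(r"SCORE:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_semantic_score(response: str) -> Optional[float]:
    """Extract the score from an LLM reply, clamped to [0.0, 1.0].

    Returns None when no score line is found.
    """
    match = _SCORE_RE.search(response)
    if match is None:
        return None
    return min(1.0, max(0.0, float(match.group(1))))


def score_semantic(
    diff: str,
    ask_llm: Callable[[str], str],
    fallback: float = 0.8,  # assumption: keep today's placeholder on parse failure
) -> float:
    """Ask the LLM to review the diff and return a 0.0-1.0 semantic score."""
    response = ask_llm(PROMPT_TEMPLATE.format(diff=diff))
    score = parse_semantic_score(response)
    return fallback if score is None else score
```

In runner.py this would be wired in roughly as dimensions["semantic"] = score_semantic(diff, ask_llm). Whether an unparseable reply should fall back to 0.8, score 0.0, or fail the evaluation is a design decision this sketch leaves open.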
Impact
- 20% of the composite score (the Semantic Quality weight of 0.20) is constant, so it provides no signal
- Every experiment gets the same semantic score regardless of the quality of its diff
- IMPL-08 notes this as intentional: a placeholder until the evolution-layer LLM review is wired up
References
atlas-specs/03-EVALUATION.md — Dimension 5: Semantic Quality
atlas/evaluation/runner.py — dimensions["semantic"] = 0.8
atlas/evolution/reflector.py — could be extended to also produce a semantic score