Skip to content

Semantic quality score is a hardcoded 0.8 placeholder #16

@DarlingtonDeveloper

Description

@DarlingtonDeveloper

Context

SPEC-03 defines Dimension 5 (Semantic Quality, weight 0.20) as an LLM review of the diff that evaluates code quality, readability, and adherence to conventions.

Current Behavior

runner.py hardcodes dimensions["semantic"] = 0.8. No LLM call is made.

Expected Behavior

After a mutation, send the diff to an LLM (via the Reflector or a dedicated scorer) to get a semantic quality assessment. Parse the response into a 0.0-1.0 score.

Impact

  • 20% of the composite score is constant, providing no signal
  • All experiments get the same semantic score regardless of quality
  • Noted as intentional in IMPL-08 — placeholder until evolution layer LLM review is wired up

References

  • atlas-specs/03-EVALUATION.md — Dimension 5: Semantic Quality
  • atlas/evaluation/runner.pydimensions["semantic"] = 0.8
  • atlas/evolution/reflector.py — could be extended to also produce a semantic score

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions