LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison by alessandralanz · Pull Request #265 · lightspeed-core/lightspeed-evaluation

alessandralanz · 2026-06-26T21:22:32Z

Description

Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:

- 97 OKP RAG benchmark conversations covering single-turn knowledge, multi-turn retention, edge cases, and negative/off-topic queries
- Baseline comparison script (`compare_against_baseline`) with `--check-only` mode for CI gating
- System config for PR-gate evaluation using gpt-4o-mini as the judge LLM
- Version-controlled baseline snapshot (102 evaluations)
- A/B/C comparison script for comparing multiple evaluation runs

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Claude (Claude Code CLI)
Generated by: N/A

Related Tickets & Documents

Related Issue #
Closes # LCORE-2802

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.
- `make pre-commit` passes all checks (pylint, pyright, ruff, black, etc.)
- GitLab CI pipeline runs 97 conversations producing 506 evaluations against a live lightspeed-stack instance with OKP enabled
- Baseline comparison script validates current results against stored baseline

Summary by CodeRabbit

New Features
- Added regression gate comparison tools for evaluating runs against baselines, including a three-run A/B/C flow.
- Added terminal/markdown reporting plus structured summaries for regression outcomes (including gate verdicts and per-metric deltas).
- Introduced a complete LCORE regression gate system configuration (judge + embeddings, metrics, thresholds, and result storage).
Tests
- Added unit and CLI-style integration tests for A/B/C and baseline comparison logic, including report output and exit-code behavior.
Documentation
- Added module-level documentation for the regression scripts.


coderabbitai · 2026-06-26T21:22:44Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1d1ea876-ad88-4a29-aca5-53e861861b04

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

Adds a regression gate config plus baseline and three-run CLI comparison scripts for evaluation summaries. The scripts compute metric deltas, classify PASS/WARN/FAIL outcomes, generate markdown reports, and are covered by new pytest fixtures and integration tests.

Changes

Regression gate updates

| Layer / File(s) | Summary |
| --- | --- |
| Gate config and package doc `config/lcore_regression/system-config-pr-gate.yaml`, `script/regression/__init__.py` | Defines the PR gate runtime settings, metric thresholds, result storage, visualization defaults, logging, and the regression package docstring. |
| Baseline loading and deltas `script/regression/compare_against_baseline.py`, `tests/script/conftest.py`, `tests/script/test_compare_against_baseline.py` | Loads one summary JSON per directory, computes metric deltas, and adds shared summary fixtures plus loading and delta tests. |
| Baseline reporting and entrypoint `script/regression/compare_against_baseline.py`, `tests/script/test_compare_against_baseline.py` | Formats the baseline markdown report and wires check-only and fail-on-critical exit handling, with markdown tests. |
| A/B/C parsing and verdicts `script/regression/compare_abc_runs.py`, `tests/script/test_compare_abc_runs.py` | Parses Run A/B/C inputs, loads summaries, and derives OKP/PR/total PASS/WARN/FAIL verdicts. |
| A/B/C markdown and integration `script/regression/compare_abc_runs.py`, `tests/script/test_compare_abc_runs.py` | Builds the A/B/C markdown report and exercises CLI output writing, missing-run handling, and baseline fallback. |

Sequence Diagram(s)

Baseline comparison flow

sequenceDiagram
  participant main
  participant find_and_load_summary
  participant compute_metric_deltas
  participant generate_markdown_summary
  main->>find_and_load_summary: load baseline and current summary JSON
  main->>compute_metric_deltas: compute metric deltas
  main->>generate_markdown_summary: build the markdown summary
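
The three calls in the diagram above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names come from the diagram, but the signatures, the `*summary*.json` file pattern, and the summary schema (a top-level `metrics` mapping of metric name to mean score) are all assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def find_and_load_summary(directory: Path) -> dict:
    """Load the single summary JSON in `directory` (file pattern is assumed)."""
    if not directory.is_dir():
        raise FileNotFoundError(f"No such directory: {directory}")
    matches = sorted(directory.glob("*summary*.json"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one summary JSON in {directory}, found {len(matches)}"
        )
    return json.loads(matches[0].read_text(encoding="utf-8"))


def compute_metric_deltas(baseline: dict, current: dict) -> dict[str, Optional[float]]:
    """Return current-minus-baseline per baseline metric; None if missing in current."""
    deltas: dict[str, Optional[float]] = {}
    for name, base_score in baseline["metrics"].items():
        cur_score = current["metrics"].get(name)
        deltas[name] = None if cur_score is None else cur_score - base_score
    return deltas


def generate_markdown_summary(deltas: dict[str, Optional[float]]) -> str:
    """Render the deltas as a small markdown table, one row per metric."""
    lines = ["| Metric | Delta |", "| --- | --- |"]
    for name, delta in sorted(deltas.items()):
        shown = "missing" if delta is None else f"{delta:+.3f}"
        lines.append(f"| {name} | {shown} |")
    return "\n".join(lines)
```

Keeping a `None` entry for a metric that is absent from the current run, rather than dropping it, lets the later status step decide how severe a missing metric is.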

A/B/C regression attribution flow

sequenceDiagram
  participant main
  participant find_and_load_summary
  participant compute_metric_deltas
  participant determine_gate_verdict
  participant generate_abc_markdown_summary
  main->>find_and_load_summary: load Run C and optional Run A/B summaries
  main->>compute_metric_deltas: compute available pairwise deltas
  main->>determine_gate_verdict: evaluate OKP, PR, and total deltas
  main->>generate_abc_markdown_summary: build the markdown report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main additions: OKP RAG regression benchmarks and baseline comparison tooling. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.
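
To make the split-regression case concrete, here is a small sketch of a gate that also checks the cumulative delta. It is illustrative only. `determine_gate_verdict` and the three `*_has_critical` names appear in this PR, but the signature, the `CRITICAL_DROP` threshold, the `has_critical` helper, the metric name, and the mapping of OKP to Run A→B and PR to Run B→C are assumptions. WARN handling is omitted for brevity.

```python
CRITICAL_DROP = 0.05  # hypothetical threshold: a drop this large counts as critical


def has_critical(deltas: dict[str, float]) -> bool:
    """True when any metric dropped by at least CRITICAL_DROP."""
    return any(delta <= -CRITICAL_DROP for delta in deltas.values())


def determine_gate_verdict(
    okp_has_critical: bool,
    pr_has_critical: bool,
    total_has_critical: bool,
) -> str:
    """Fail the gate if any of the three comparisons shows a critical regression."""
    if okp_has_critical or pr_has_critical or total_has_critical:
        return "FAIL"
    return "PASS"


# Two small drops that are each below the threshold but add up past it.
okp_deltas = {"context_precision": -0.03}    # assumed: Run A -> Run B
pr_deltas = {"context_precision": -0.03}     # assumed: Run B -> Run C
total_deltas = {"context_precision": -0.06}  # assumed: Run A -> Run C

verdict = determine_gate_verdict(
    has_critical(okp_deltas),
    has_critical(pr_deltas),
    has_critical(total_deltas),
)
print(verdict)  # FAIL, because only the cumulative delta crosses the threshold
```

Without the `total_has_critical` term, the same inputs would pass, which is the gap this comment describes.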

In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.
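
Both findings for this file can be illustrated with a short sketch. It assumes a per-metric status function and an output helper that do not necessarily exist in the PR: `metric_status`, `render_output`, the 0.05 threshold, and the contents of `CRITICAL_METRICS` are hypothetical. Only the `CRITICAL_METRICS` name, the PASS/WARN/FAIL labels, and the `ok`/`regression` tokens come from the review.

```python
from typing import Optional

CRITICAL_METRICS = frozenset({"context_recall"})  # hypothetical contents


def metric_status(
    name: str,
    baseline: Optional[float],
    current: Optional[float],
    threshold: float = 0.05,  # hypothetical drop that counts as a regression
) -> str:
    """Classify one metric; a metric missing from the current run is degraded."""
    degraded = "FAIL" if name in CRITICAL_METRICS else "WARN"
    if baseline is not None and current is None:
        return degraded  # present in baseline, missing in current: never PASS
    if baseline is None:
        return "PASS"  # new metric, nothing to compare against
    return degraded if baseline - current >= threshold else "PASS"


def render_output(statuses: dict[str, str], check_only: bool) -> str:
    """Return the text to print; check-only mode yields a single token."""
    regressed = "FAIL" in statuses.values()
    if check_only:
        return "regression" if regressed else "ok"
    lines = [f"{name}: {status}" for name, status in sorted(statuses.items())]
    lines.append("regression" if regressed else "ok")
    return "\n".join(lines)
```

Building the output as a string and printing it once at the end makes it easy to guarantee that check-only mode writes nothing else to stdout.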

In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.
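
A deterministic version of the test might look like the sketch below. The `find_and_load_summary` defined here is a stand-in so the example runs on its own. The real function lives in `script/regression/compare_against_baseline.py` and its exact signature is an assumption.

```python
from pathlib import Path

import pytest


def find_and_load_summary(directory: Path) -> dict:
    """Stand-in loader; only the missing-directory branch matters here."""
    if not directory.is_dir():
        raise FileNotFoundError(f"No such directory: {directory}")
    return {}


def test_raises_on_missing_directory(tmp_path: Path) -> None:
    # tmp_path is a fresh per-test directory, so this child cannot already exist.
    missing = tmp_path / "does_not_exist"
    with pytest.raises(FileNotFoundError):
        find_and_load_summary(missing)
```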


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff47a6 and 5d59325.

📒 Files selected for processing (9)

baselines/lcore_regression/current_baseline_summary.json
config/lcore_regression/system-config-pr-gate.yaml
eval_data/lcore_regression/okp_rag_quality.yaml
script/regression/__init__.py
script/regression/compare_abc_runs.py
script/regression/compare_against_baseline.py
tests/script/conftest.py
tests/script/test_compare_abc_runs.py
tests/script/test_compare_against_baseline.py


xmican10 · 2026-07-01T07:07:50Z

Thanks!! I came across your PR, and I'm not sure if it's ready yet, but I'm wondering a few things:

- What is the source of these QnAs? Is it our private GitLab repo?
- Since OKP is a paid Red Hat capability, I'm wondering if keeping the QnAs and expected responses in a public repo is a risk. Even if the answers seem generic, they could reveal internal knowledge base structure, coverage, and quality characteristics. Should these live in a private repo instead?
- The data files are huge. Can't we move them out of the ls-eval tooling, or create even more separation and move them all to a separate repo?

cc: @asamal4 @Anxhela21

alessandralanz added 5 commits June 26, 2026 16:24

Add OKP RAG quality benchmarks and system config for LCORE regression…

ff73802

…. Checked against a live OKP image with 85% pass rate

Add baseline comparison script for regression gating

7e29a9a

Revised benchmarks to adhere to Evaluation Data Collection Standard +…

74768c9

… Guide's dataset sizing and distribution recommendations

Add version-controlled baseline for regression gating

5e1f772

Add three run A/B/C comparison script and regression tests. Attribute…

5d59325

…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py

coderabbitai Bot reviewed Jun 26, 2026


Comment thread script/regression/compare_abc_runs.py

Comment thread script/regression/compare_against_baseline.py

Comment thread script/regression/compare_against_baseline.py Outdated

Comment thread tests/script/test_compare_against_baseline.py Outdated

alessandralanz added 7 commits June 29, 2026 13:18

LCORE-2802: Update baseline from 102 to 506 evaluations

c03b462

Remove context-requiring metrics from turns where RAG returns no cont…

261f0b0

…exts

Update baseline to 491 evaluations after removing inapplicable contex…

f67c2d8

…t metrics

fix coderabbit review findings: missing-metric handling, check-only s…

7d4c4f2

…tdout, verdict logic, test determinism

switch to context-only RAGAS metrics, replace response-quality metric…

6f44c8b

…s with 4 context/retrieval metrics, and remove judge_panel config since response-quality judging no longer needed

Increase max_threads to 4 and disable skip_on_failure

e7ab2ae

update baseline bc we updated config

6ee6bb9

alessandralanz marked this pull request as draft July 1, 2026 13:10

alessandralanz added 2 commits July 1, 2026 10:28

remove OKP eval data from public repo and strip baseline to aggregates

c82599a

Merge branch 'lightspeed-core:main' into lcore-regression

b382b29

alessandralanz wants to merge 14 commits into lightspeed-core:main from alessandralanz:lcore-regression
