LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265

Draft
alessandralanz wants to merge 14 commits into
lightspeed-core:mainfrom
alessandralanz:lcore-regression

Conversation


@alessandralanz alessandralanz commented Jun 26, 2026


Description

Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:

  • 97 OKP RAG benchmark conversations covering single-turn knowledge, multi-turn retention, edge cases, and
    negative/off-topic queries
  • Baseline comparison script (compare_against_baseline) with --check-only mode for CI gating
  • System config for PR-gate evaluation using gpt-4o-mini as the judge LLM
  • Version-controlled baseline snapshot (102 evaluations)
  • A/B/C comparison script for comparing multiple evaluation runs
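The `--check-only` mode mentioned above could be wired roughly as follows. This is a minimal sketch, not the PR's implementation: the `ok`/`regression` tokens follow the review discussion on this page, while the `run` helper, its `has_regression` argument, the verbose message, and the exit codes are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny CLI that only knows the --check-only flag."""
    parser = argparse.ArgumentParser(description="Baseline comparison (sketch)")
    parser.add_argument(
        "--check-only",
        action="store_true",
        help="print a single ok/regression token and nothing else",
    )
    return parser


def run(argv: list[str], has_regression: bool) -> int:
    """Return a process exit code.

    has_regression is a hypothetical stand-in for the real comparison result.
    """
    args = build_parser().parse_args(argv)
    if args.check_only:
        # CI gating: stdout carries exactly one token, so no summary is printed.
        print("regression" if has_regression else "ok")
    else:
        # Hypothetical verbose output for interactive use.
        print(f"Regression detected: {has_regression}")
    # Assumed convention: a non-zero exit code fails the CI job.
    return 1 if has_regression else 0
```

A CI job could then branch on either the token or the exit code; which of the two the real pipeline relies on is not stated in this PR.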

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (Claude Code CLI)
  • Generated by: N/A

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.

  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

  • make pre-commit passes all checks (pylint, pyright, ruff, black, etc.)

    • GitLab CI pipeline runs 97 conversations producing 506 evaluations against a live lightspeed-stack instance with OKP enabled
    • Baseline comparison script validates current results against stored baseline

Summary by CodeRabbit

  • New Features
    • Added regression gate comparison tools for evaluating runs against baselines, including a three-run A/B/C flow.
    • Added terminal/markdown reporting plus structured summaries for regression outcomes (including gate verdicts and per-metric deltas).
    • Introduced a complete LCORE regression gate system configuration (judge + embeddings, metrics, thresholds, and result storage).
  • Tests
    • Added unit and CLI-style integration tests for A/B/C and baseline comparison logic, including report output and exit-code behavior.
  • Documentation
    • Added module-level documentation for the regression scripts.

…. Checked against a live OKP image with 85% pass rate
… Guide's dataset sizing and distribution recommendations
…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py

coderabbitai Bot commented Jun 26, 2026

Contributor


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1d1ea876-ad88-4a29-aca5-53e861861b04

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

Adds a regression gate config plus baseline and three-run CLI comparison scripts for evaluation summaries. The scripts compute metric deltas, classify PASS/WARN/FAIL outcomes, generate markdown reports, and are covered by new pytest fixtures and integration tests.

Changes

Regression gate updates

| Layer | File(s) | Summary |
|---|---|---|
| Gate config and package doc | `config/lcore_regression/system-config-pr-gate.yaml`, `script/regression/__init__.py` | Defines the PR gate runtime settings, metric thresholds, result storage, visualization defaults, logging, and the regression package docstring. |
| Baseline loading and deltas | `script/regression/compare_against_baseline.py`, `tests/script/conftest.py`, `tests/script/test_compare_against_baseline.py` | Loads one summary JSON per directory, computes metric deltas, and adds shared summary fixtures plus loading and delta tests. |
| Baseline reporting and entrypoint | `script/regression/compare_against_baseline.py`, `tests/script/test_compare_against_baseline.py` | Formats the baseline markdown report and wires check-only and fail-on-critical exit handling, with markdown tests. |
| A/B/C parsing and verdicts | `script/regression/compare_abc_runs.py`, `tests/script/test_compare_abc_runs.py` | Parses Run A/B/C inputs, loads summaries, and derives OKP/PR/total PASS/WARN/FAIL verdicts. |
| A/B/C markdown and integration | `script/regression/compare_abc_runs.py`, `tests/script/test_compare_abc_runs.py` | Builds the A/B/C markdown report and exercises CLI output writing, missing-run handling, and baseline fallback. |
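The "one summary JSON per directory" loading step could look something like the sketch below. It reuses the name `find_and_load_summary` from the PR for readability, but the body is illustrative only: the `*summary*.json` file-name pattern, the error on multiple matches, and the sample file name are assumptions, not the PR's actual behavior. The demo at the end also shows a deterministic way to get a guaranteed-missing directory (a child of a fresh temporary directory that is never created).

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def find_and_load_summary(directory: Path) -> dict[str, Any]:
    """Load the single summary JSON found in a results directory (sketch)."""
    if not directory.is_dir():
        raise FileNotFoundError(f"Directory not found: {directory}")
    # Assumed naming pattern; the real script may match files differently.
    candidates = sorted(directory.glob("*summary*.json"))
    if not candidates:
        raise FileNotFoundError(f"No summary JSON found in {directory}")
    if len(candidates) > 1:
        raise ValueError(f"Expected one summary JSON, found {len(candidates)}")
    with candidates[0].open(encoding="utf-8") as handle:
        return json.load(handle)


# Demo: one directory with a summary file, one path that was never created.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    run_dir.mkdir()
    # Hypothetical file name and content, for illustration only.
    (run_dir / "evaluation_summary.json").write_text(
        '{"faithfulness": 0.9}', encoding="utf-8"
    )
    loaded = find_and_load_summary(run_dir)
    try:
        find_and_load_summary(Path(tmp) / "never_created")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

In a pytest test the same guaranteed-missing path can be built from the `tmp_path` fixture instead of `tempfile`.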

Sequence Diagram(s)

Baseline comparison flow

sequenceDiagram
  participant main
  participant find_and_load_summary
  participant compute_metric_deltas
  participant generate_markdown_summary
  main->>find_and_load_summary: load baseline and current summary JSON
  main->>compute_metric_deltas: compute metric deltas
  main->>generate_markdown_summary: build the markdown summary
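The delta step in this flow might be sketched as below. Only the names `compute_metric_deltas`, `score_delta`, and `CRITICAL_METRICS` come from this page; the flat `metric -> score` summary shape, the metric names, the threshold values, and the `classify` helper are hypothetical. The sketch treats a metric that exists in the baseline but is missing from the current run as degraded rather than as a pass.

```python
from typing import Optional

# Hypothetical values: the PR's real metric names and thresholds are not shown here.
CRITICAL_METRICS = {"faithfulness"}
WARN_DROP = 0.02
FAIL_DROP = 0.05


def classify(metric: str, baseline: float, current: Optional[float]) -> str:
    """Map one metric's change to PASS, WARN, or FAIL."""
    if current is None:
        # Present in the baseline but missing now: never report PASS.
        return "FAIL" if metric in CRITICAL_METRICS else "WARN"
    drop = baseline - current
    if drop >= FAIL_DROP and metric in CRITICAL_METRICS:
        return "FAIL"
    if drop >= WARN_DROP:
        return "WARN"
    return "PASS"


def compute_metric_deltas(
    baseline: dict[str, float], current: dict[str, float]
) -> dict[str, dict]:
    """Compare every baseline metric against the current run."""
    deltas = {}
    for metric, base_score in baseline.items():
        cur_score = current.get(metric)
        deltas[metric] = {
            "score_delta": None
            if cur_score is None
            else round(cur_score - base_score, 4),
            "status": classify(metric, base_score, cur_score),
        }
    return deltas
```

Whether a missing non-critical metric should be a WARN or a FAIL is a policy choice; the sketch picks WARN only to show the two paths.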

A/B/C regression attribution flow

sequenceDiagram
  participant main
  participant find_and_load_summary
  participant compute_metric_deltas
  participant determine_gate_verdict
  participant generate_abc_markdown_summary
  main->>find_and_load_summary: load Run C and optional Run A/B summaries
  main->>compute_metric_deltas: compute available pairwise deltas
  main->>determine_gate_verdict: evaluate OKP, PR, and total deltas
  main->>generate_abc_markdown_summary: build the markdown report
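A gate that evaluates the OKP, PR, and total deltas together could be sketched like this. It is an illustration under stated assumptions: each argument is a hypothetical `metric -> status` mapping (or `None` when that pairwise comparison is unavailable), and the rule "any FAIL in any group fails the gate" is one possible policy, not necessarily the one in `compare_abc_runs.py`.

```python
from typing import Optional

Statuses = Optional[dict[str, str]]  # metric name -> "PASS" / "WARN" / "FAIL"


def has_status(deltas: Statuses, status: str) -> bool:
    """True if the comparison exists and at least one metric has the status."""
    return deltas is not None and status in deltas.values()


def determine_gate_verdict(okp: Statuses, pr: Statuses, total: Statuses) -> str:
    """Combine the three pairwise comparisons into one verdict (sketch)."""
    groups = (okp, pr, total)
    # The total comparison is checked on every path, so a regression that is
    # split across two smaller steps still fails the gate.
    if any(has_status(group, "FAIL") for group in groups):
        return "FAIL"
    if any(has_status(group, "WARN") for group in groups):
        return "WARN"
    return "PASS"
```

For example, two WARN-level steps whose cumulative effect is a FAIL in the total comparison yield `FAIL` here, and a run with only the total comparison available is judged on that comparison alone.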

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main additions: OKP RAG regression benchmarks and baseline comparison tooling. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.

In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.

In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff47a6 and 5d59325.

📒 Files selected for processing (9)
  • baselines/lcore_regression/current_baseline_summary.json
  • config/lcore_regression/system-config-pr-gate.yaml
  • eval_data/lcore_regression/okp_rag_quality.yaml
  • script/regression/__init__.py
  • script/regression/compare_abc_runs.py
  • script/regression/compare_against_baseline.py
  • tests/script/conftest.py
  • tests/script/test_compare_abc_runs.py
  • tests/script/test_compare_against_baseline.py

Comment thread script/regression/compare_abc_runs.py
Comment thread script/regression/compare_against_baseline.py
Comment thread script/regression/compare_against_baseline.py Outdated
Comment thread tests/script/test_compare_against_baseline.py Outdated

xmican10 commented Jul 1, 2026

Collaborator

Thanks!! I came across your PR, and I'm not sure if it's ready yet, but I'm wondering a few things:

  • What is the source of these QnAs? Is it our private GitLab repo?
  • Since OKP is a paid Red Hat capability, I'm wondering if keeping the QnAs and expected responses in a public repo is a risk. Even if the answers seem generic, they could reveal internal knowledge base structure, coverage, and quality characteristics. Should these live in a private repo instead?
  • The data files are huge. Can we move them out of the ls-eval tooling, or go further and move them all to a separate repo?

cc: @asamal4 @Anxhela21

@alessandralanz alessandralanz marked this pull request as draft July 1, 2026 13:10