LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265
LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265alessandralanz wants to merge 14 commits into
Conversation
…. Checked against a live OKP image with 85% pass rate
… Guide's dataset sizing and distribution recommendations
…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughAdds a regression gate config plus baseline and three-run CLI comparison scripts for evaluation summaries. The scripts compute metric deltas, classify PASS/WARN/FAIL outcomes, generate markdown reports, and are covered by new pytest fixtures and integration tests. ChangesRegression gate updates
Sequence Diagram(s)Baseline comparison flowsequenceDiagram
participant main
participant find_and_load_summary
participant compute_metric_deltas
participant generate_markdown_summary
main->>find_and_load_summary: load baseline and current summary JSON
main->>compute_metric_deltas: compute metric deltas
main->>generate_markdown_summary: build the markdown summary
A/B/C regression attribution flowsequenceDiagram
participant main
participant find_and_load_summary
participant compute_metric_deltas
participant determine_gate_verdict
participant generate_abc_markdown_summary
main->>find_and_load_summary: load Run C and optional Run A/B summaries
main->>compute_metric_deltas: compute available pairwise deltas
main->>determine_gate_verdict: evaluate OKP, PR, and total deltas
main->>generate_abc_markdown_summary: build the markdown report
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.
In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.
In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc
📒 Files selected for processing (9)
baselines/lcore_regression/current_baseline_summary.jsonconfig/lcore_regression/system-config-pr-gate.yamleval_data/lcore_regression/okp_rag_quality.yamlscript/regression/__init__.pyscript/regression/compare_abc_runs.pyscript/regression/compare_against_baseline.pytests/script/conftest.pytests/script/test_compare_abc_runs.pytests/script/test_compare_against_baseline.py
…tdout, verdict logic, test determinism
…s with 4 context/retrieval metrics, and remove judge_panel config since response-quality judging no longer needed
|
Thanks!! I came across your PR, and I'm not sure if it's ready yet, but I'm wondering a few things:
cc: @asamal4 @Anxhela21 |
Description
Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:
negative/off-topic queries
compare_against_baseline) with--check-onlymode for CI gatinggpt-4o-minias the judge LLMType of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.
make pre-commitpasses all checks (pylint, pyright, ruff, black, etc.)lightspeed-stackinstance with OKP enabledSummary by CodeRabbit