Research for evaluating and improving agentic approaches by sebastianwessel · Pull Request #149 · eggai-tech/qualops

sebastianwessel · 2026-05-08T11:12:20Z

No description provided.

github-actions · 2026-05-08T11:13:03Z

QualOps Code Quality Analysis

Status: ✅ PASSED - No issues found

Summary

Total Issues: 0
Critical: 0 🔴
High: 0 🟠
Medium: 0 🟡
Low: 0 🟢
Files Analyzed: 0

No issues found in the analyzed code.

📊 Full Report

View detailed report

Powered by QualOps

valdis · 2026-05-08T12:30:13Z

+| **TRAJECT-Bench** | 2025 | Trajectory-quality metrics over outcomes. |
+| **Holistic Agent Leaderboard** ([HAL](https://arxiv.org/pdf/2510.11977)) | 2025 | Variance-decomposed reporting; good model for our internal dashboards. |
+
+The single most influential idea for QualOps is **SWE-bench's "tests as oracle"**: apply the agent's patch, run FAIL_TO_PASS + PASS_TO_PASS, classify. Fully deterministic, fully outcome-based, ignores how the agent got there, resists most reward hacking. We adopt the *methodology* directly for the Fix stage in Part 6.


SWE-bench evaluates an LLM's ability to solve known issues, not its ability to identify issues that have not yet been spotted. That is why we are using the CRB (Code Review Benchmark, https://codereview.withmartian.com/).
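
For reference, the "tests as oracle" classification quoted above can be sketched in a few lines. This is a minimal illustration, not the SWE-bench harness: the `PatchTestRun` type and the `resolved` / `regressed` / `unresolved` labels are assumptions made for the example. It only covers the Fix-style case (a known issue with known tests), which is exactly the limitation raised in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchTestRun:
    """Names of the tests that passed after the agent's patch was applied."""
    passed: frozenset[str]


def classify_patch(run: PatchTestRun, fail_to_pass: set[str], pass_to_pass: set[str]) -> str:
    """Classify a patch purely from test outcomes (labels are hypothetical).

    fail_to_pass: tests that failed before the patch and must pass after it.
    pass_to_pass: tests that passed before the patch and must keep passing.
    """
    fixed = fail_to_pass <= run.passed          # every target test now passes
    no_regression = pass_to_pass <= run.passed  # nothing that worked is broken
    if fixed and no_regression:
        return "resolved"
    if not no_regression:
        return "regressed"
    return "unresolved"


if __name__ == "__main__":
    run = PatchTestRun(passed=frozenset({"test_bug", "test_existing"}))
    print(classify_patch(run, {"test_bug"}, {"test_existing"}))
```

The classifier never looks at how the agent produced the patch, which is what makes it outcome-based; it also cannot say anything about findings that no test encodes.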

valdis · 2026-05-08T12:34:02Z

+1. **Per-PR (fast tier, ~3–5 min)** — Promptfoo YAML with ~30 small assertions on the output of each pipeline stage; runs as a required GitHub check. Posts a PR comment with the diff vs. main.
+2. **Nightly / weekly (slow tier, ~30–60 min)** — Langfuse experiment over a 100–200 item dataset, running the full pipeline, with LLM-as-judge scorers on the final report and tool-call F1 scorers on each stage. Plus a quarterly **Inspect AI** capability eval against held-out fixture repos.
+
+This gives developers fast PR feedback and the team a slower, deeper truth.


We are already using Langfuse. Currently we populate the Langfuse dataset by running `npm run eval:upload:all` in qualops.

Yeah, that comes from the research being written to be more generic and not 100% bound to QualOps. The idea is that we can use this as a baseline, improve it, and transform it into a general pattern/approach/doc, similar to what we have for ground truth: https://egg-ai.atlassian.net/wiki/spaces/PAD/pages/259325954/Ground+Truth

That's why the expected audience for this research is set to devs plus less technical readers.
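
For the less technical audience, the "tool-call F1 scorers" in the quoted slow tier can be illustrated as follows. This is a sketch under assumptions: it treats tool calls as an unordered multiset of tool names and ignores arguments, and the tool names in the usage example are made up. A real scorer might also compare arguments or call order.

```python
from collections import Counter


def tool_call_f1(expected: list[str], actual: list[str]) -> float:
    """F1 between the tool calls we expected and the ones the agent made.

    Calls are compared by name only, as multisets (order is ignored).
    """
    if not expected and not actual:
        return 1.0  # nothing expected, nothing called: treat as a perfect match
    exp, act = Counter(expected), Counter(actual)
    overlap = sum((exp & act).values())  # calls present in both, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(act.values())  # share of actual calls that were expected
    recall = overlap / sum(exp.values())     # share of expected calls that were made
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical tool names, for illustration only.
    print(tool_call_f1(["read_file", "grep", "run_tests"], ["read_file", "grep", "grep"]))
```

In the usage example two of three expected calls were made and two of three actual calls were expected, so precision and recall are both 2/3.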

valdis · 2026-05-08T12:46:06Z

+- Full execution context is versioned: prompt + model + temperature + tool list + retrieval config. A prompt that worked on Sonnet 4.0 may regress on 4.7.
+- Two-axis A/B: by prompt version and by traffic slice.
+
+### 5.5 Automated prompt optimization


I think this is where we need to focus for a bit. For this to become realistic, we need to simplify adding identified eval cases and then streamline experimentation with prompts.
A caveat with the prompts is that in some cases we construct them at run time, so we likely need a way to extract the prompts from qualops so that they can be experimented with.
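
One way to make runtime-constructed prompts comparable across experiments is to hash the rendered prompt together with the rest of the execution context listed in the quoted section (prompt, model, temperature, tool list, retrieval config). The sketch below is an assumption about how that could look, not how QualOps does it today; `ExecutionContext` and `context_hash` are hypothetical names, and sorting the tool list assumes tool order does not affect behavior.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ExecutionContext:
    """Everything that should be versioned together (hypothetical shape)."""
    prompt: str                      # the prompt as rendered at run time
    model: str
    temperature: float
    tools: tuple[str, ...] = ()
    retrieval_config: dict = field(default_factory=dict)


def context_hash(ctx: ExecutionContext) -> str:
    """Return a short, stable content hash of the full execution context."""
    payload = asdict(ctx)
    payload["tools"] = sorted(payload["tools"])  # assumption: tool order is irrelevant
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    ctx = ExecutionContext(
        prompt="Review this diff: ...",
        model="model-a",  # placeholder model id
        temperature=0.0,
        tools=("read_file", "grep"),
    )
    print(context_hash(ctx))
```

Recording this hash next to each eval result would let two runs be compared only when their contexts are actually identical, and would flag a model bump as a new version even if the prompt text is unchanged.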

valdis · 2026-05-08T12:49:34Z

+
+For QualOps: never feed the entire repo. Feed (a) the diff, (b) the immediate symbol context (callers/callees of touched symbols), (c) project conventions for the relevant language, and nothing else. Add a tool the agent can call when it needs more.
+
+### 5.9 Sub-agent decomposition and Skills


That is something we already support: an agent can run sub-agents with different prompts and then process the returned results. What we still need to learn here is what the exact prompts should be.

valdis · 2026-05-08T12:50:31Z

+- **Verifier model trick.** Run Fix with Sonnet, Judge with Opus. Different models reduce shared blindspots.
+- **Test execution as ground-truth critic.** For Fix outputs, the actual unit-test result is the highest-quality verifier you will ever have. Use it.
+
+### 5.12 Fine-tuning, distillation, DPO


Fine-tuning is out of scope for now.

valdis · 2026-05-08T13:01:42Z

+
+We are deliberately **not migrating away from Langfuse** to LangSmith or Braintrust. The marginal UX wins do not justify a closed-source migration for a small team given Langfuse's current feature set.
+
+### 6.4 Two-tier eval cadence


I think we need to distinguish eval purpose.

Every PR: smoke test. Run qualops on a small subset of evals to check that our recent changes have not broken the overall flow.

Regression testing: run the full set of evals to see whether a prompt change or model change performs well enough. Needed on PRs that change qualops default models or default prompts.

Research experiments: run a set of experimental prompts on a set of models against a subset of evals, to automate experimentation and collect statistics so we can understand which direction to take as we develop.
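
The three purposes above could be wired into CI with a small selector that picks the suite from the files a PR touches. This is only a sketch: `evals/qualopsrc/` is taken from the table quoted later in this thread, while `config/default-models.*` and the suite names are hypothetical placeholders for wherever default prompts and models actually live.

```python
from fnmatch import fnmatch

# Paths whose changes should trigger the full regression suite.
# Assumption: the second pattern is a placeholder, not a real QualOps path.
REGRESSION_TRIGGERS = ("evals/qualopsrc/*", "config/default-models.*")


def select_eval_suite(changed_paths: list[str], experiment: bool = False) -> str:
    """Pick which eval suite to run for a change set.

    experiment=True models an explicitly requested research run.
    """
    if experiment:
        return "research"
    touches_defaults = any(
        fnmatch(path, pattern)
        for path in changed_paths
        for pattern in REGRESSION_TRIGGERS
    )
    return "regression" if touches_defaults else "smoke"


if __name__ == "__main__":
    print(select_eval_suite(["src/cli.ts"]))
    print(select_eval_suite(["evals/qualopsrc/review.yaml"]))
```

Keeping the trigger list explicit makes it easy to review which changes are considered risky enough to pay for the full run.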

valdis · 2026-05-08T13:07:11Z

+| Per-stage tracing in code (Analyze/Review/Fix/Report/Judge as distinct spans) | **Mostly have** | Audit; ensure consistent span names + attributes |
+| Versioned prompts in repo (not in UI) | **Have** (`evals/qualopsrc/`) | None |
+| Prompt-as-code promotion infra (content hashes, dev → staging → prod gates) | **Partial** (presets exist; gating infra implicit) | Add explicit promotion workflow + version pinning |
+| A starter golden set of real PRs with labels | **Partial** (CRB datasets exist, internal labels TBD) | Label 50 internal PRs with finding-level + fix-level annotations |


I think we also need to consider when we add items to, or remove them from, the golden set. Over time the collection might grow quite large, and we likely don't want to spend all our tokens on running the evals.
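
As one possible starting point for that discussion, a run could be capped by a token budget instead of pruning the set itself. The sketch below is purely illustrative: the `EvalItem` fields, the "recent failures first" ordering, and the greedy selection are all assumptions, not an agreed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalItem:
    item_id: str
    est_tokens: int        # estimated cost of running this item once
    recent_failures: int   # failures over some recent window of runs


def select_within_budget(items: list[EvalItem], token_budget: int) -> list[EvalItem]:
    """Greedily pick items for one run without exceeding the token budget.

    Items that failed recently come first, then cheaper items; the id is a
    tie-breaker so the selection is deterministic.
    """
    ranked = sorted(items, key=lambda i: (-i.recent_failures, i.est_tokens, i.item_id))
    chosen: list[EvalItem] = []
    spent = 0
    for item in ranked:
        if spent + item.est_tokens <= token_budget:
            chosen.append(item)
            spent += item.est_tokens
    return chosen


if __name__ == "__main__":
    golden = [
        EvalItem("a", 1000, 0),
        EvalItem("b", 4000, 2),
        EvalItem("c", 2000, 1),
    ]
    print([i.item_id for i in select_within_budget(golden, 6000)])
```

Items that are never selected over a long period would then be natural candidates for removal from the golden set.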

doc: agentic research eval

1eb187a

sebastianwessel marked this pull request as draft May 8, 2026 11:12

valdis reviewed May 8, 2026

sebastianwessel wants to merge 1 commit into main from reasearch_agentic_evals
