Research for evaluating and improving agentic approaches #149

Draft
sebastianwessel wants to merge 1 commit into main from reasearch_agentic_evals

Conversation

@sebastianwessel
Contributor

No description provided.

@sebastianwessel sebastianwessel marked this pull request as draft May 8, 2026 11:12
@github-actions

github-actions Bot commented May 8, 2026

QualOps Code Quality Analysis

Status: ✅ PASSED - No issues found

Summary

  • Total Issues: 0
  • Critical: 0 🔴
  • High: 0 🟠
  • Medium: 0 🟡
  • Low: 0 🟢
  • Files Analyzed: 0

No issues found in the analyzed code.

📊 Full Report

View detailed report


Powered by QualOps

| **TRAJECT-Bench** | 2025 | Trajectory-quality metrics over outcomes. |
| **Holistic Agent Leaderboard** ([HAL](https://arxiv.org/pdf/2510.11977)) | 2025 | Variance-decomposed reporting; good model for our internal dashboards. |

The single most influential idea for QualOps is **SWE-bench's "tests as oracle"**: apply the agent's patch, run FAIL_TO_PASS + PASS_TO_PASS, classify. Fully deterministic, fully outcome-based, ignores how the agent got there, resists most reward hacking. We adopt the *methodology* directly for the Fix stage in Part 6.
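A minimal sketch of the "tests as oracle" classification described above, assuming test outcomes are already available as a mapping from test id to pass/fail after the patch is applied. The function name `classify_patch` and the three labels are illustrative choices, not the SWE-bench harness API.

```python
def classify_patch(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> str:
    """Classify a patch from post-patch test outcomes.

    results      -- test id -> True if the test passed after applying the patch
    fail_to_pass -- tests that failed before the patch and should now pass
    pass_to_pass -- tests that passed before the patch and must keep passing

    A test missing from `results` is treated as failed (assumption).
    """
    fixed = [t for t in fail_to_pass if results.get(t, False)]
    kept = [t for t in pass_to_pass if results.get(t, False)]

    no_regressions = len(kept) == len(pass_to_pass)
    if no_regressions and len(fixed) == len(fail_to_pass):
        return "resolved"
    if no_regressions and fixed:
        return "partial"
    return "unresolved"
```

The classification looks only at test outcomes, never at the agent's trajectory, which is what makes it deterministic and outcome-based.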
Collaborator

SWE-bench evaluates an LLM's ability to solve known issues, not its ability to identify issues that have not yet been spotted. That is why we are using the CRB (Code Review Benchmark, https://codereview.withmartian.com/).

1. **Per-PR (fast tier, ~3–5 min)** — Promptfoo YAML with ~30 small assertions on the output of each pipeline stage; runs as a required GitHub check. Posts a PR comment with the diff vs. main.
2. **Nightly / weekly (slow tier, ~30–60 min)** — Langfuse experiment over a 100–200 item dataset, running the full pipeline, with LLM-as-judge scorers on the final report and tool-call F1 scorers on each stage. Plus a quarterly **Inspect AI** capability eval against held-out fixture repos.

This gives developers fast PR feedback and the team a slower, deeper truth.
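The text does not define the tool-call F1 scorer. One plausible reading, sketched here as an assumption, compares the multiset of tool names the agent actually called against an expected multiset and takes the harmonic mean of precision and recall. The function name `tool_call_f1` and the tool names in the usage note are hypothetical.

```python
from collections import Counter


def tool_call_f1(expected: list[str], actual: list[str]) -> float:
    """F1 over tool names, treating both call lists as multisets.

    Call order and arguments are ignored in this sketch.
    """
    if not expected and not actual:
        return 1.0  # nothing expected, nothing called: perfect agreement

    # Multiset intersection: each expected call can be matched at most once.
    overlap = sum((Counter(expected) & Counter(actual)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(actual)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)
```

For example, with expected calls `["read_file", "grep", "run_tests"]` and actual calls `["read_file", "run_tests", "run_tests"]`, two calls match, so precision and recall are both 2/3 and the score is 2/3.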
Collaborator

We are already using Langfuse. Currently we populate the Langfuse dataset by running a command in QualOps: `npm run eval:upload:all`

Contributor Author

Yeah, the research is intentionally more generic and not 100% bound to QualOps. The idea is that we can use it as a baseline, improve it, and turn it into a general pattern/approach/doc, similar to what we have for ground truth: https://egg-ai.atlassian.net/wiki/spaces/PAD/pages/259325954/Ground+Truth

That's why the expected audience for this research is set to devs plus less technical readers.

- Full execution context is versioned: prompt + model + temperature + tool list + retrieval config. A prompt that worked on Sonnet 4.0 may regress on 4.7.
- Two-axis A/B: by prompt version and by traffic slice.
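One way to make "full execution context is versioned" concrete is to hash the whole context rather than the prompt alone. The sketch below is illustrative: the `ExecutionContext` fields mirror the list above, while the field names, the 12-character truncation, and the decision to ignore tool order are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass(frozen=True)
class ExecutionContext:
    prompt: str
    model: str
    temperature: float
    tools: tuple[str, ...]
    retrieval: dict[str, str] = field(default_factory=dict)


def context_hash(ctx: ExecutionContext) -> str:
    """Short content hash over every field that can change behavior."""
    payload = asdict(ctx)
    # Assumption: tool order does not affect behavior, so sort before hashing.
    payload["tools"] = sorted(payload["tools"])
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder model names; same prompt, different model, different version id.
base = ExecutionContext("Review this diff.", "model-a", 0.0, ("grep", "read_file"))
upgraded = replace(base, model="model-b")
print(context_hash(base) != context_hash(upgraded))
```

Because the model name is part of the hashed payload, a model upgrade produces a new version id even when the prompt text is unchanged, so eval results can be attributed to the right context.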

### 5.5 Automated prompt optimization
Collaborator

I think this is where we need to focus for a bit. For this to become realistic, we need to simplify adding identified eval cases and then streamline experimentation with prompts.
A caveat with the prompts is that in some cases we construct them at run time, so we likely need a way to extract the prompts from QualOps so that they can be experimented with.


For QualOps: never feed the entire repo. Feed (a) the diff, (b) the immediate symbol context (callers/callees of touched symbols), (c) project conventions for the relevant language, and nothing else. Add a tool the agent can call when it needs more.
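The three-part context above can be sketched as a small assembly function. This is an illustration under assumptions, not QualOps code: `build_context`, the section titles, and the character budget on symbol snippets are all hypothetical, and how callers/callees are extracted is out of scope here.

```python
def build_context(
    diff: str,
    symbol_snippets: list[str],
    conventions: dict[str, str],
    language: str,
    max_symbol_chars: int = 4000,
) -> str:
    """Assemble (a) the diff, (b) symbol context, (c) conventions for one language."""
    # Keep symbol snippets in order until the (assumed) character budget is spent.
    kept: list[str] = []
    used = 0
    for snippet in symbol_snippets:
        if used + len(snippet) > max_symbol_chars:
            break
        kept.append(snippet)
        used += len(snippet)

    sections = [
        ("DIFF", diff),
        ("SYMBOL CONTEXT", "\n\n".join(kept)),
        # Only the conventions for the relevant language are included.
        ("CONVENTIONS", conventions.get(language, "")),
    ]
    # Empty sections are dropped; nothing outside these three is ever added.
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections if body)
```

Anything that does not fit stays out of the prompt and is left to the on-demand tool mentioned above.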

### 5.9 Sub-agent decomposition and Skills
Collaborator

We already support that: an agent can run subagents with different prompts and then process the returned results. What we still need to learn is what the exact prompts should be.

- **Verifier model trick.** Run Fix with Sonnet, Judge with Opus. Different models reduce shared blindspots.
- **Test execution as ground-truth critic.** For Fix outputs, the actual unit-test result is the highest-quality verifier you will ever have. Use it.

### 5.12 Fine-tuning, distillation, DPO
Collaborator

@valdis valdis May 8, 2026

Fine-tuning is out of scope for now.


We are deliberately **not migrating away from Langfuse** to LangSmith or Braintrust. The marginal UX wins do not justify a closed-source migration for a small team given Langfuse's current feature set.

### 6.4 Two-tier eval cadence
Collaborator

I think we need to distinguish evals by purpose.

  1. Every PR: smoke test. Run QualOps on a small subset of evals to check that our recent changes have not broken the overall flow.
  2. Regression testing: run the full set of evals to see whether a prompt change or model change performs well enough. Needed on PRs that change QualOps default models or default prompts.
  3. Research experiments: run a set of experimental prompts on a set of models against a subset of evals, to automate experimentation and collect statistics so we can understand which direction to take as we develop.
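The three purposes above could be turned into a selection rule along these lines. It is only a sketch: the path patterns in `REGRESSION_TRIGGERS` are hypothetical placeholders rather than actual QualOps paths, and the `experiment_requested` flag stands in for whatever manual trigger is eventually chosen.

```python
from fnmatch import fnmatch

# Hypothetical patterns for files that hold default prompts or default models.
REGRESSION_TRIGGERS = ("prompts/*", "config/default-models.*")


def select_eval_suites(
    changed_files: list[str],
    experiment_requested: bool = False,
) -> list[str]:
    """Decide which eval suites a change should run."""
    suites = ["smoke"]  # every PR gets the small smoke subset

    touches_defaults = any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in REGRESSION_TRIGGERS
    )
    if touches_defaults:
        suites.append("regression")  # full set on prompt or model changes

    if experiment_requested:
        suites.append("research")  # opt-in experiment runs

    return suites
```

Keeping the rule in one function makes the cost of each PR predictable: most changes run only the smoke subset, and the full set runs only when defaults change.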

| Per-stage tracing in code (Analyze/Review/Fix/Report/Judge as distinct spans) | **Mostly have** | Audit; ensure consistent span names + attributes |
| Versioned prompts in repo (not in UI) | **Have** (`evals/qualopsrc/`) | None |
| Prompt-as-code promotion infra (content hashes, dev → staging → prod gates) | **Partial** (presets exist; gating infra implicit) | Add explicit promotion workflow + version pinning |
| A starter golden set of real PRs with labels | **Partial** (CRB datasets exist, internal labels TBD) | Label 50 internal PRs with finding-level + fix-level annotations |
Collaborator

I think we also need to consider when we add items to and remove items from the golden set. Over time the collection might grow quite large, and we likely don't want to spend all our tokens on running the evals.
