Research for evaluating and improving agentic approaches#149
sebastianwessel wants to merge 1 commit into
QualOps Code Quality Analysis
Status: ✅ PASSED - No issues found
Summary: No issues found in the analyzed code.
Powered by QualOps
| **TRAJECT-Bench** | 2025 | Trajectory-quality metrics over outcomes. |
| **Holistic Agent Leaderboard** ([HAL](https://arxiv.org/pdf/2510.11977)) | 2025 | Variance-decomposed reporting; good model for our internal dashboards. |

The single most influential idea for QualOps is **SWE-bench's "tests as oracle"**: apply the agent's patch, run FAIL_TO_PASS + PASS_TO_PASS, classify. Fully deterministic, fully outcome-based, ignores how the agent got there, resists most reward hacking. We adopt the *methodology* directly for the Fix stage in Part 6.
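The classification step can be sketched roughly as follows. This is a minimal illustration, not SWE-bench's harness or the QualOps Fix stage: the `TestResults` shape, the `OracleVerdict` type, and `classifyPatch` are hypothetical names, and the sketch assumes the test run has already been reduced to a pass/fail map per test id.

```typescript
// Hypothetical types for illustration; not an existing QualOps or SWE-bench API.
// Maps a test id to whether it passed after the agent's patch was applied.
type TestResults = Record<string, boolean>;

interface OracleVerdict {
  resolved: boolean;
  stillFailing: string[]; // FAIL_TO_PASS tests that did not pass
  regressions: string[]; // PASS_TO_PASS tests that no longer pass
}

function classifyPatch(
  failToPass: string[],
  passToPass: string[],
  results: TestResults,
): OracleVerdict {
  // Assumption: a test that is missing from the results counts as not passing.
  const passed = (id: string): boolean => results[id] === true;

  const stillFailing = failToPass.filter((id) => !passed(id));
  const regressions = passToPass.filter((id) => !passed(id));

  return {
    // Resolved only if the target tests now pass and nothing else broke.
    resolved: stillFailing.length === 0 && regressions.length === 0,
    stillFailing,
    regressions,
  };
}
```

The verdict depends only on the test outcomes, so two agents that reach the same patch by different routes get the same classification.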
SWE-bench evaluates an LLM's ability to solve known issues, not its ability to identify issues that have not yet been spotted. That is why we are using the CRB (Code Review Benchmark, https://codereview.withmartian.com/).
1. **Per-PR (fast tier, ~3–5 min):** Promptfoo YAML with ~30 small assertions on the output of each pipeline stage; runs as a required GitHub check. Posts a PR comment with the diff vs. main.
2. **Nightly / weekly (slow tier, ~30–60 min):** Langfuse experiment over a 100–200 item dataset, running the full pipeline, with LLM-as-judge scorers on the final report and tool-call F1 scorers on each stage. Plus a quarterly **Inspect AI** capability eval against held-out fixture repos.

This gives developers fast PR feedback and the team a slower, deeper truth.
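A tool-call F1 scorer of the kind mentioned for the slow tier could look roughly like this. It is a sketch under assumptions: tool calls are compared by name only (arguments and order are ignored), duplicates are collapsed, and `toolCallF1` is a hypothetical helper, not a Langfuse or Promptfoo API.

```typescript
interface F1Score {
  precision: number;
  recall: number;
  f1: number;
}

// Collapse repeated tool names so each tool counts once.
function uniqueNames(items: string[]): string[] {
  return items.filter((item, index) => items.indexOf(item) === index);
}

function toolCallF1(expected: string[], actual: string[]): F1Score {
  const exp = uniqueNames(expected);
  const act = uniqueNames(actual);

  // Assumption: a stage that needed no tools and called none scores perfectly.
  if (exp.length === 0 && act.length === 0) {
    return { precision: 1, recall: 1, f1: 1 };
  }

  const hits = act.filter((name) => exp.indexOf(name) !== -1).length;
  const precision = act.length === 0 ? 0 : hits / act.length;
  const recall = exp.length === 0 ? 0 : hits / exp.length;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);

  return { precision, recall, f1 };
}
```

Comparing by name only is the simplest variant; a stricter scorer would also match arguments, at the cost of more brittle expected values.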
We are already using Langfuse. Currently we populate the Langfuse dataset by running a command in qualops: `npm run eval:upload:all`.
Yeah, that comes from the research being more generic and not 100% bound to QualOps. The idea is that we can use this as a baseline, improve it, and turn it into a general pattern/approach/doc, similar to what we have for ground truth: https://egg-ai.atlassian.net/wiki/spaces/PAD/pages/259325954/Ground+Truth
That's why the expected audience for this research is set to devs plus less technical readers.
- Full execution context is versioned: prompt + model + temperature + tool list + retrieval config. A prompt that worked on Sonnet 4.0 may regress on 4.7.
- Two-axis A/B: by prompt version and by traffic slice.
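One way to make "full execution context is versioned" concrete is to derive a single deterministic key from all five parts, so that changing any one of them yields a new version. The sketch below is illustrative only: `ExecutionContext` and `contextKey` are hypothetical names, the model strings in the usage note are placeholders, and it assumes tool order does not matter. A real setup would likely hash the resulting string.

```typescript
interface ExecutionContext {
  promptText: string;
  model: string;
  temperature: number;
  tools: string[];
  retrievalConfig: Record<string, string | number | boolean>;
}

function contextKey(ctx: ExecutionContext): string {
  // Sort retrieval keys so key order in the object does not change the result.
  const retrieval = Object.keys(ctx.retrievalConfig)
    .sort()
    .map((key) => [key, ctx.retrievalConfig[key]]);

  return JSON.stringify({
    promptText: ctx.promptText,
    model: ctx.model,
    temperature: ctx.temperature,
    // Assumption: the tool list is treated as a set, so it is sorted too.
    tools: ctx.tools.slice().sort(),
    retrieval,
  });
}
```

Two contexts that differ only in `model` (say `"model-a"` versus `"model-b"`) produce different keys, which is what lets a regression be attributed to a model change and not to a prompt change.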

### 5.5 Automated prompt optimization
I think this is where we need to focus for a bit. For this to become realistic, we need to simplify adding identified eval cases and then streamline experimentation with prompts.
A caveat with the prompts is that in some cases we construct them at run time, so we likely need a way to extract the prompts from qualops so that they can be experimented with.
For QualOps: never feed the entire repo. Feed (a) the diff, (b) the immediate symbol context (callers/callees of touched symbols), (c) project conventions for the relevant language, and nothing else. Add a tool the agent can call when it needs more.
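The three-part context could be assembled along these lines. This is a sketch with hypothetical names (`ReviewInput`, `SymbolContext`, `buildReviewContext`) and an assumed plain-text section layout; how QualOps actually resolves callers and callees is not covered here.

```typescript
interface SymbolContext {
  symbol: string;
  callers: string[];
  callees: string[];
}

interface ReviewInput {
  diff: string;
  language: string;
  symbols: SymbolContext[];
  // Conventions text keyed by language; only the relevant entry is used.
  conventions: Record<string, string>;
}

function buildReviewContext(input: ReviewInput): string {
  const sections: string[] = [];

  // (a) the diff itself
  sections.push("## Diff\n" + input.diff);

  // (b) immediate symbol context for touched symbols only
  const symbolLines = input.symbols.map(
    (s) =>
      `${s.symbol}: callers=[${s.callers.join(", ")}] callees=[${s.callees.join(", ")}]`,
  );
  if (symbolLines.length > 0) {
    sections.push("## Symbol context\n" + symbolLines.join("\n"));
  }

  // (c) conventions for the relevant language, and nothing else
  const conventions = input.conventions[input.language];
  if (conventions !== undefined) {
    sections.push("## Conventions (" + input.language + ")\n" + conventions);
  }

  return sections.join("\n\n");
}
```

Anything outside these three sections is left for the agent to request through a tool call, which keeps the default prompt small and makes extra context an explicit, traceable decision.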

### 5.9 Sub-agent decomposition and Skills
That is something we already support: an agent can run subagents with different prompts and then process the returned results. Here we need to learn what the exact prompts should be.
- **Verifier model trick.** Run Fix with Sonnet, Judge with Opus. Different models reduce shared blindspots.
- **Test execution as ground-truth critic.** For Fix outputs, the actual unit-test result is the highest-quality verifier you will ever have. Use it.

### 5.12 Fine-tuning, distillation, DPO
Fine-tuning is out of scope for now.
We are deliberately **not migrating away from Langfuse** to LangSmith or Braintrust. The marginal UX wins do not justify a closed-source migration for a small team given Langfuse's current feature set.

### 6.4 Two-tier eval cadence
I think we need to distinguish evals by purpose.
- every PR - smoke test, running qualops on a small subset of evals to check that our recent changes have not broken the overall flow
- regression testing - running the full set of evals to see whether a prompt change or model change performs well enough. Needed on PRs that change qualops default models or default prompts.
- research experiments - running a set of experimental prompts on a set of models against a subset of evals to automate experimentation and collect statistics, so we can understand which direction to take as we develop
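As a rough illustration of the first two purposes, a CI step could pick the suite from the files a PR touches. Everything here is hypothetical: `selectEvalPurpose` is not an existing qualops function, and the path prefixes are placeholders to be replaced with wherever default models and default prompts actually live. Research experiments are assumed to be started manually and are not part of this selection.

```typescript
type EvalPurpose = "smoke" | "regression";

function selectEvalPurpose(
  changedFiles: string[],
  sensitivePrefixes: string[],
): EvalPurpose {
  // A PR that touches default models or default prompts needs the full set.
  const touchesDefaults = changedFiles.some((file) =>
    sensitivePrefixes.some((prefix) => file.indexOf(prefix) === 0),
  );
  return touchesDefaults ? "regression" : "smoke";
}
```

For example, with placeholder prefixes `["config/prompts/", "config/models/"]`, a PR changing only `src/report.ts` would run the smoke subset, while one changing `config/prompts/review.md` would run the full regression set.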
| Per-stage tracing in code (Analyze/Review/Fix/Report/Judge as distinct spans) | **Mostly have** | Audit; ensure consistent span names + attributes |
| Versioned prompts in repo (not in UI) | **Have** (`evals/qualopsrc/`) | None |
| Prompt-as-code promotion infra (content hashes, dev → staging → prod gates) | **Partial** (presets exist; gating infra implicit) | Add explicit promotion workflow + version pinning |
| A starter golden set of real PRs with labels | **Partial** (CRB datasets exist, internal labels TBD) | Label 50 internal PRs with finding-level + fix-level annotations |
I think we also need to consider when we add or remove items from the golden set, because over time the collection might grow quite large and we likely don't want to spend all our tokens on running the evals.
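One possible policy, sketched here only to make the trade-off concrete, is to cap each run by a token budget and prefer cases that failed most recently. The `GoldenCase` shape, the `lastFailedRun` counter, and `selectWithinBudget` are all hypothetical; nothing like this is confirmed to exist in qualops, and other ranking criteria may fit better.

```typescript
interface GoldenCase {
  id: string;
  estimatedTokens: number;
  // Run number of the most recent failure, or null if the case never failed.
  lastFailedRun: number | null;
}

function selectWithinBudget(
  cases: GoldenCase[],
  tokenBudget: number,
): GoldenCase[] {
  // Most recently failed first; cases that never failed go last.
  const ranked = cases
    .slice()
    .sort((a, b) => (b.lastFailedRun ?? -1) - (a.lastFailedRun ?? -1));

  const selected: GoldenCase[] = [];
  let used = 0;
  for (const candidate of ranked) {
    // Greedy fill: skip a case that does not fit, but keep trying smaller ones.
    if (used + candidate.estimatedTokens <= tokenBudget) {
      selected.push(candidate);
      used += candidate.estimatedTokens;
    }
  }
  return selected;
}
```

Cases that are never selected over many runs would then be natural candidates for review and possible removal from the golden set.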
No description provided.