
Fixes while trying #1

Open
samuelcolvin wants to merge 2 commits into liquidos-ai:main from samuelcolvin:fixes-while-trying

Conversation

samuelcolvin commented Feb 23, 2026

This includes some fixes made while looking at your benchmarks. I stopped because I think there are bigger issues:

  1. Using the actual OpenAI API introduces far too much variability for these benchmarks to be meaningful. To make this work, you need to create a dedicated OpenAI dummy endpoint with a fixed latency, and point the benchmarks at that instead of the real OpenAI API.
  2. When I ran the benchmarks, pydantic-ai consistently performed better than LangChain; I assume your results reflect API latency variability.
  3. Claude found lots of issues when comparing the LangChain and pydantic-ai benchmarks, see below.
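As a rough sketch of what point 1 could look like (my own illustration, not code from this repo — the port, latency value, and canned response text are all arbitrary), a stdlib HTTP server that sleeps a fixed interval and returns a chat-completions-shaped payload:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LATENCY_S = 0.25  # fixed latency, so variance comes from the frameworks, not the API


def canned_completion(text: str = '{"value": 42.0}') -> dict:
    """Build a minimal OpenAI chat-completions-shaped response body."""
    return {
        "id": "chatcmpl-stub",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "gpt-stub",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
    }


class StubHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # drain the request body, wait the fixed latency, return the canned reply
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        time.sleep(LATENCY_S)
        body = json.dumps(canned_completion()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # point both benchmark scripts at http://localhost:8080/v1 via the
    # SDKs' base-URL setting instead of api.openai.com
    HTTPServer(("localhost", 8080), StubHandler).serve_forever()
```

Both client SDKs support overriding the base URL, so neither benchmark script's logic would need to change beyond configuration.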

Claude conversation about fairness of benchmarks:


╭─── Claude Code v2.1.50 ─────────────────────────────────────────────────────────────────────────╮
│                             │ Tips for getting started                                          │
│     Welcome back Samuel!    │ Run /init to create a CLAUDE.md file with instructions for Claude │
│                             │ ───────────────────────────────────────────────────────────────── │
│                             │ Recent activity                                                   │
│                             │ No recent activity                                                │
│              ✻              │                                                                   │
│              |              │                                                                   │
│             ▟█▙             │                                                                   │
│           ▐▛███▜▌           │                                                                   │
│          ▝▜█████▛▘          │                                                                   │
│            ▘▘ ▝▝            │                                                                   │
│    Opus 4.6 · Claude Max    │                                                                   │
│   ~/code/autoagents-bench   │                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯

❯ review @_src/pydantic_ai_bench.py and @_src/langchain_bench.py and check that the two
  benchmarks are a fair comparison of the two libraries
  ⎿  Read _src/pydantic_ai_bench.py (404 lines)
  ⎿  Read _src/langchain_bench.py (433 lines)

⏺ I've reviewed both benchmarks carefully. Here are the fairness issues I found:

  Fairness Issues

  1. Temperature Mismatch (Major)

  - PydanticAI (line 200): ModelSettings(temperature=0.2)
  - LangChain (line 246): ChatOpenAI(model=config.model) — no temperature set, defaults to ~1.0

  This heavily biases the determinism rate in PydanticAI's favor. Lower temperature = more
  consistent outputs.
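One way to keep the two scripts from diverging again (a sketch with invented helper names, not repo code) is a single shared constant that both constructors read:

```python
# Shared temperature both benchmark scripts read, so the setting cannot
# silently diverge again. Helper names here are suggestions, not repo code.
TEMPERATURE = 0.2  # the value pydantic_ai_bench.py already uses


def langchain_llm_kwargs(model: str) -> dict:
    # splat into ChatOpenAI(**langchain_llm_kwargs(config.model))
    return {"model": model, "temperature": TEMPERATURE}


def pydantic_ai_settings_kwargs() -> dict:
    # splat into ModelSettings(**pydantic_ai_settings_kwargs())
    return {"temperature": TEMPERATURE}
```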

  2. Missing Error Handling in LangChain Worker (Major)

  - PydanticAI (lines 263-268): Worker wraps agent.run() in try/except, gracefully marking
  failures.
  - LangChain (lines 296-298): No try/except around run_call(). An exception will propagate up
  through asyncio.gather and crash the entire benchmark.
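A sketch of the missing guard (the `run_call` stub here is a stand-in for the real LangChain invocation, wired to fail on odd ids purely for illustration):

```python
import asyncio


async def run_call(call_id: int) -> float:
    # stand-in for the real LangChain invocation; fails on odd ids here
    if call_id % 2:
        raise RuntimeError("boom")
    return float(call_id)


async def worker(call_id: int) -> dict:
    try:
        value = await run_call(call_id)
        return {"id": call_id, "ok": True, "value": value}
    except Exception as exc:
        # record the failure instead of letting it propagate through
        # asyncio.gather and kill the whole benchmark run
        return {"id": call_id, "ok": False, "error": repr(exc)}


async def main() -> list:
    return await asyncio.gather(*(worker(i) for i in range(4)))


results = asyncio.run(main())
```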

  3. LLM-Only Mode Asymmetry (Moderate)

  - PydanticAI: Uses a full Agent(...) with system prompt, structured output schema, and the agent
  loop — even in LLM-only mode.
  - LangChain: Uses a lightweight structured_llm.ainvoke(prompt) — a single direct call with no
  agent overhead.

  This makes PydanticAI look artificially slower in LLM-only mode. For a fair comparison,
  PydanticAI's LLM-only mode should use the model directly (e.g., model.request(...) or
  equivalent), or LangChain should also go through its agent abstraction.
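The symmetry could be enforced by timing both frameworks' LLM-only paths through one shim that takes a single direct awaitable — a sketch under that assumption, with `fake_direct_call` as a placeholder for either framework's real direct call:

```python
import asyncio
import time
from typing import Awaitable, Callable


async def timed(call: Callable[[], Awaitable[str]]) -> tuple:
    # both frameworks' LLM-only paths get timed through the same shim
    start = time.perf_counter()
    out = await call()
    return out, time.perf_counter() - start


# Each benchmark would pass its *direct* call here, e.g. (hypothetical):
#   langchain:   lambda: structured_llm.ainvoke(prompt)  # already direct
#   pydantic-ai: a direct model request rather than the full Agent loop


async def fake_direct_call() -> str:
    # placeholder standing in for either framework's direct call
    await asyncio.sleep(0)
    return '{"value": 1.0}'


out, elapsed = asyncio.run(timed(fake_direct_call))
```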

  4. Tool Mode Output Handling (Moderate)

  - PydanticAI: Uses output_type=FloatResponse — the framework guarantees structured JSON output
  from the LLM.
  - LangChain: Uses AgentExecutor which returns free-form text, then parses it with extract_value()
   (JSON parse + regex fallback). The extra prompt hint ('Return JSON only in the format {"value":
  <number>}.' at line 286) compensates but isn't equivalent.

  These are measuring different framework capabilities, which is fine if intentional, but the regex
   fallback in LangChain is more lenient and could inflate its success rate.
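One way to remove the leniency gap would be a single strict parser both benchmarks share in place of `extract_value`'s regex fallback — a sketch, with the acceptance rule (JSON object, numeric `"value"` key, nothing else) chosen by me rather than taken from the repo:

```python
import json
from typing import Optional


def parse_value(text: str) -> Optional[float]:
    # strict: a JSON object with a numeric "value" key; no regex fallback
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    value = obj.get("value") if isinstance(obj, dict) else None
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return None
```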

  5. System Prompt Inconsistency (Minor)

  - PydanticAI: Always sends a system prompt ("You are a helpful assistant. Use tools when
  available.")
  - LangChain LLM mode: No system prompt at all
  - LangChain tool mode: System prompt ("You are a helpful assistant.") — without the "Use tools"
  hint

  Recommendations

  1. Set the same temperature on both (e.g., temperature=0.2 on both, or make it configurable via
  benchmark.yaml).
  2. Add try/except in the LangChain worker to match PydanticAI's error handling.
  3. Equalize LLM-only mode — either both go through their full agent abstraction, or both use a
  direct model call.
  4. Use consistent system prompts across both benchmarks.
  5. Extract shared code — BenchmarkConfig, BenchmarkResult, TimingBreakdown, percentile,
  persist_result, load_config, etc. are duplicated verbatim. A shared module would prevent drift
  and make it easier to keep the benchmarks aligned.
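A sketch of what recommendation 5 might look like — the names `BenchmarkConfig` and `percentile` come from the duplicated helpers listed above, but the fields, defaults, and percentile method (nearest rank) are my assumptions:

```python
from dataclasses import dataclass


@dataclass
class BenchmarkConfig:
    # field names echo the duplicated helpers; defaults are assumptions
    model: str
    concurrency: int = 10
    temperature: float = 0.2  # shared so the two benchmarks can't diverge


def percentile(samples: list, pct: float) -> float:
    # nearest-rank percentile over the recorded latencies
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = int(round(pct / 100 * (len(ordered) - 1)))
    return ordered[max(0, min(len(ordered) - 1, k))]
```

Both benchmark scripts would then `from bench_common import BenchmarkConfig, percentile`, so any fix lands in both at once.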

  Want me to fix these issues?

✻ Churned for 44s

@samuelcolvin (Author)

any update here?
