
Fixes while trying #1

Open
samuelcolvin wants to merge 2 commits into liquidos-ai:main from samuelcolvin:fixes-while-trying

Conversation

samuelcolvin commented Feb 23, 2026

This includes some fixes made while looking at your benchmarks. I stopped because I think there are bigger issues:

  1. Using the actual OpenAI API introduces far too much variability for these benchmarks to be meaningful. To make this work, you need to create a dedicated OpenAI dummy endpoint with a fixed latency, and point the benchmarks at that instead of the real OpenAI API.
  2. When I ran the benchmarks, pydantic-ai consistently performed better than LangChain; I assume your results reflect API latency variability.
  3. Claude found lots of issues when comparing the LangChain and pydantic-ai benchmarks, see below.
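As a rough sketch of what point 1 could look like (my own illustration, not code from this repo — the port, latency value, and canned response text are all arbitrary), a stdlib HTTP server that sleeps a fixed interval and returns a chat-completions-shaped payload:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LATENCY_S = 0.25  # fixed latency, so variance comes from the frameworks, not the API


def canned_completion(text: str = '{"value": 42.0}') -> dict:
    """Build a minimal OpenAI chat-completions-shaped response body."""
    return {
        "id": "chatcmpl-stub",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "gpt-stub",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
    }


class StubHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # drain the request body, wait the fixed latency, return the canned reply
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        time.sleep(LATENCY_S)
        body = json.dumps(canned_completion()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # point both benchmark scripts at http://localhost:8080/v1 via the
    # SDKs' base-URL setting instead of api.openai.com
    HTTPServer(("localhost", 8080), StubHandler).serve_forever()
```

Both client SDKs support overriding the base URL, so neither benchmark script's logic would need to change beyond configuration.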

Claude conversation about fairness of benchmarks:


╭─── Claude Code v2.1.50 ─────────────────────────────────────────────────────────────────────────╮
│                             │ Tips for getting started                                          │
│     Welcome back Samuel!    │ Run /init to create a CLAUDE.md file with instructions for Claude │
│                             │ ───────────────────────────────────────────────────────────────── │
│                             │ Recent activity                                                   │
│                             │ No recent activity                                                │
│              ✻              │                                                                   │
│              |              │                                                                   │
│             ▟█▙             │                                                                   │
│           ▐▛███▜▌           │                                                                   │
│          ▝▜█████▛▘          │                                                                   │
│            ▘▘ ▝▝            │                                                                   │
│    Opus 4.6 · Claude Max    │                                                                   │
│   ~/code/autoagents-bench   │                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯

❯ review @_src/pydantic_ai_bench.py and @_src/langchain_bench.py and check that the two
  benchmarks are a fair comparison of the two libraries
  ⎿  Read _src/pydantic_ai_bench.py (404 lines)
  ⎿  Read _src/langchain_bench.py (433 lines)

⏺ I've reviewed both benchmarks carefully. Here are the fairness issues I found:

  Fairness Issues

  1. Temperature Mismatch (Major)

  - PydanticAI (line 200): ModelSettings(temperature=0.2)
  - LangChain (line 246): ChatOpenAI(model=config.model) — no temperature set, defaults to ~1.0

  This heavily biases the determinism rate in PydanticAI's favor. Lower temperature = more
  consistent outputs.
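One way to keep the two scripts from diverging again (a sketch with invented helper names, not repo code) is a single shared constant that both constructors read:

```python
# Shared temperature both benchmark scripts read, so the setting cannot
# silently diverge again. Helper names here are suggestions, not repo code.
TEMPERATURE = 0.2  # the value pydantic_ai_bench.py already uses


def langchain_llm_kwargs(model: str) -> dict:
    # splat into ChatOpenAI(**langchain_llm_kwargs(config.model))
    return {"model": model, "temperature": TEMPERATURE}


def pydantic_ai_settings_kwargs() -> dict:
    # splat into ModelSettings(**pydantic_ai_settings_kwargs())
    return {"temperature": TEMPERATURE}
```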

  2. Missing Error Handling in LangChain Worker (Major)

  - PydanticAI (lines 263-268): Worker wraps agent.run() in try/except, gracefully marking
  failures.
  - LangChain (lines 296-298): No try/except around run_call(). An exception will propagate up
  through asyncio.gather and crash the entire benchmark.
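A sketch of the missing guard (the `run_call` stub here is a stand-in for the real LangChain invocation, wired to fail on odd ids purely for illustration):

```python
import asyncio


async def run_call(call_id: int) -> float:
    # stand-in for the real LangChain invocation; fails on odd ids here
    if call_id % 2:
        raise RuntimeError("boom")
    return float(call_id)


async def worker(call_id: int) -> dict:
    try:
        value = await run_call(call_id)
        return {"id": call_id, "ok": True, "value": value}
    except Exception as exc:
        # record the failure instead of letting it propagate through
        # asyncio.gather and kill the whole benchmark run
        return {"id": call_id, "ok": False, "error": repr(exc)}


async def main() -> list:
    return await asyncio.gather(*(worker(i) for i in range(4)))


results = asyncio.run(main())
```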

  3. LLM-Only Mode Asymmetry (Moderate)

  - PydanticAI: Uses a full Agent(...) with system prompt, structured output schema, and the agent
  loop — even in LLM-only mode.
  - LangChain: Uses a lightweight structured_llm.ainvoke(prompt) — a single direct call with no
  agent overhead.

  This makes PydanticAI look artificially slower in LLM-only mode. For a fair comparison,
  PydanticAI's LLM-only mode should use the model directly (e.g., model.request(...) or
  equivalent), or LangChain should also go through its agent abstraction.
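The symmetry could be enforced by timing both frameworks' LLM-only paths through one shim that takes a single direct awaitable — a sketch under that assumption, with `fake_direct_call` as a placeholder for either framework's real direct call:

```python
import asyncio
import time
from typing import Awaitable, Callable


async def timed(call: Callable[[], Awaitable[str]]) -> tuple:
    # both frameworks' LLM-only paths get timed through the same shim
    start = time.perf_counter()
    out = await call()
    return out, time.perf_counter() - start


# Each benchmark would pass its *direct* call here, e.g. (hypothetical):
#   langchain:   lambda: structured_llm.ainvoke(prompt)  # already direct
#   pydantic-ai: a direct model request rather than the full Agent loop


async def fake_direct_call() -> str:
    # placeholder standing in for either framework's direct call
    await asyncio.sleep(0)
    return '{"value": 1.0}'


out, elapsed = asyncio.run(timed(fake_direct_call))
```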

  4. Tool Mode Output Handling (Moderate)

  - PydanticAI: Uses output_type=FloatResponse — the framework guarantees structured JSON output
  from the LLM.
  - LangChain: Uses AgentExecutor which returns free-form text, then parses it with extract_value()
   (JSON parse + regex fallback). The extra prompt hint ('Return JSON only in the format {"value":
  <number>}.' at line 286) compensates but isn't equivalent.

  These are measuring different framework capabilities, which is fine if intentional, but the regex
   fallback in LangChain is more lenient and could inflate its success rate.
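One way to remove the leniency gap would be a single strict parser both benchmarks share in place of `extract_value`'s regex fallback — a sketch, with the acceptance rule (JSON object, numeric `"value"` key, nothing else) chosen by me rather than taken from the repo:

```python
import json
from typing import Optional


def parse_value(text: str) -> Optional[float]:
    # strict: a JSON object with a numeric "value" key; no regex fallback
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    value = obj.get("value") if isinstance(obj, dict) else None
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return None
```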

  5. System Prompt Inconsistency (Minor)

  - PydanticAI: Always sends a system prompt ("You are a helpful assistant. Use tools when
  available.")
  - LangChain LLM mode: No system prompt at all
  - LangChain tool mode: System prompt ("You are a helpful assistant.") — without the "Use tools"
  hint

  Recommendations

  1. Set the same temperature on both (e.g., temperature=0.2 on both, or make it configurable via
  benchmark.yaml).
  2. Add try/except in the LangChain worker to match PydanticAI's error handling.
  3. Equalize LLM-only mode — either both go through their full agent abstraction, or both use a
  direct model call.
  4. Use consistent system prompts across both benchmarks.
  5. Extract shared code — BenchmarkConfig, BenchmarkResult, TimingBreakdown, percentile,
  persist_result, load_config, etc. are duplicated verbatim. A shared module would prevent drift
  and make it easier to keep the benchmarks aligned.
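A sketch of what recommendation 5 might look like — the names `BenchmarkConfig` and `percentile` come from the duplicated helpers listed above, but the fields, defaults, and percentile method (nearest rank) are my assumptions:

```python
from dataclasses import dataclass


@dataclass
class BenchmarkConfig:
    # field names echo the duplicated helpers; defaults are assumptions
    model: str
    concurrency: int = 10
    temperature: float = 0.2  # shared so the two benchmarks can't diverge


def percentile(samples: list, pct: float) -> float:
    # nearest-rank percentile over the recorded latencies
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = int(round(pct / 100 * (len(ordered) - 1)))
    return ordered[max(0, min(len(ordered) - 1, k))]
```

Both benchmark scripts would then `from bench_common import BenchmarkConfig, percentile`, so any fix lands in both at once.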

  Want me to fix these issues?

✻ Churned for 44s

@samuelcolvin (Author)

any update here?
