Cost tracking discrepancies: SDK vs LiteLLM proxy virtual key costs diverge by agent type #603

@simonrosenberg

Summary

During PR validation runs for OpenHands/software-agent-sdk#2656, we observed three distinct cost tracking failures depending on agent type, plus a token-count aggregation bug that affects every agent type. No single cost source works correctly across all agent types.

Evidence

Runs from 2026-04-02 validating pr/acp-node22-and-defer-init-main (SDK commit 217c454), eval_limit=5:

K8s Job                        Benchmark             Agent Type    Model
eval-23899965586-claude-son    swebench              acp-claude    claude-sonnet-4-5-20250929
eval-23899965217-claude-son    swebench              acp-gemini    claude-sonnet-4-5-20250929
eval-23899966017-claude-4-6    swebench              default       claude-4.6-opus
eval-23899971488-claude-son    swebenchmultimodal    acp-claude    claude-sonnet-4-5-20250929
eval-23899973560-claude-son    swebenchmultimodal    acp-gemini    claude-sonnet-4-5-20250929
eval-23899975488-claude-4-6    swebenchmultimodal    default       claude-4.6-opus

GCS results: gs://openhands-evaluation-results/{benchmark}/{model_slug}/{eval_run_id}/

Bug 1: acp-gemini — SDK reports $0 cost, only proxy tracks spend

Gemini CLI does not report cost or token usage back to the SDK. metrics.costs and metrics.token_usages are empty arrays. metrics.accumulated_cost is $0.00.

Meanwhile, the LiteLLM proxy virtual key correctly tracks roughly $7–14 per instance via test_result.proxy_cost.

Impact: Without proxy tracking, gemini costs are completely invisible. Also notable: by proxy cost, acp-gemini runs are roughly 40–57× more expensive than the acp-claude runs on the same instances (about 47× in total, $50.56 vs $1.08 across the five instances).

Instance                                    SDK Cost    Proxy Cost
django__django-12155                        $0.0000     $7.08
django__django-13279                        $0.0000     $12.26
django__django-14434                        $0.0000     $10.18
scikit-learn__scikit-learn-13439            $0.0000     $7.49
scikit-learn__scikit-learn-25232            $0.0000     $13.55

Bug 2: default (OpenHands agent) — Proxy reports $0, only SDK tracks cost

The default OpenHands agent has full per-turn SDK cost breakdowns (via LiteLLM response headers), but test_result.proxy_cost is $0.00 for every instance.

Root cause: Virtual keys are only created and injected for ACP agents (in benchmarks/utils/acp.py), not for the default agent path. The default agent uses the base API key directly, bypassing proxy spend tracking.

Instance                                    SDK Cost    Proxy Cost
django__django-12155                        $0.2047     $0.00
django__django-13279                        $0.5177     $0.00
django__django-14434                        $0.4271     $0.00
scikit-learn__scikit-learn-13439            $0.0988     $0.00
scikit-learn__scikit-learn-25232            $0.4557     $0.00
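
One way the default agent path could get its own virtual key is to call the LiteLLM proxy's /key/generate endpoint once per instance. The sketch below only builds the HTTP request and does not send it. The function name, the key_alias scheme, and the metadata layout are illustrative assumptions, and the existing logic in benchmarks/utils/acp.py may do this differently.

```python
import json
import urllib.request


def build_key_generate_request(
    proxy_base_url: str, master_key: str, instance_id: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to the LiteLLM proxy's /key/generate.

    Assumption: one virtual key per evaluation instance, tagged through
    key_alias and metadata so spend can be attributed afterwards.
    """
    body = {
        "key_alias": f"eval-{instance_id}",        # hypothetical naming scheme
        "metadata": {"instance_id": instance_id},  # hypothetical metadata layout
    }
    return urllib.request.Request(
        url=f"{proxy_base_url.rstrip('/')}/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example with placeholder values; nothing is sent over the network here.
req = build_key_generate_request(
    "http://localhost:4000/", "sk-master-placeholder", "django__django-12155"
)
print(req.get_method(), req.full_url)
```

The key returned by the proxy would then need to be injected into the default agent's LLM configuration in place of the base API key, so that spend is recorded against the virtual key.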

Bug 3: acp-claude — SDK overestimates cost by 5–38% vs proxy

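Both sources report non-zero costs, but they diverge. SDK cost (from UsageUpdate.cost reported by claude-agent-acp) is consistently higher than the proxy-tracked cost. The Diff column below is (proxy - SDK) / SDK, so the proxy figure is 5–38% lower than the SDK figure.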

Possible cause: claude-agent-acp reports list price while the proxy applies prompt caching discounts.

Instance                                    SDK Cost    Proxy Cost    Diff
django__django-14434                        $0.2106     $0.1978       -6.1%
django__django-13279                        $0.4767     $0.2931       -38.5%
scikit-learn__scikit-learn-13439            $0.1784     $0.1689       -5.3%
scikit-learn__scikit-learn-25232            $0.2592     $0.2376       -8.3%
django__django-12155                        $0.2011     $0.1789       -11.0%
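
For reference, the Diff column can be reproduced from the two cost columns as (proxy - SDK) / SDK. The snippet below is a minimal check using only the numbers from the table above; the variable and function names are illustrative.

```python
# (instance, SDK cost, proxy cost) for acp-claude, copied from the table above.
ROWS = [
    ("django__django-14434", 0.2106, 0.1978),
    ("django__django-13279", 0.4767, 0.2931),
    ("scikit-learn__scikit-learn-13439", 0.1784, 0.1689),
    ("scikit-learn__scikit-learn-25232", 0.2592, 0.2376),
    ("django__django-12155", 0.2011, 0.1789),
]


def relative_diff(sdk_cost: float, proxy_cost: float) -> float:
    """Return (proxy - SDK) / SDK as a percentage."""
    return (proxy_cost - sdk_cost) / sdk_cost * 100.0


diffs = {name: round(relative_diff(sdk, proxy), 1) for name, sdk, proxy in ROWS}
for name, pct in diffs.items():
    print(f"{name:<40} {pct:+.1f}%")
```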

Bug 4: accumulated_token_counts is always zero

The top-level metrics.accumulated_token_counts field (with prompt, completion, cache_read keys) is 0 for ALL agent types. Actual token data exists in the per-turn metrics.token_usages array for default and acp-claude; for acp-gemini that array is empty as well.

This field appears to be stale or never aggregated, which means any downstream reporting that reads accumulated_token_counts will see zeros.
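
A minimal sketch of the missing aggregation follows. It assumes each per-turn entry in metrics.token_usages can be read as a mapping with the same prompt, completion, and cache_read keys as the top-level field; the real per-turn field names in the SDK may differ, and the sample turns are made up for illustration.

```python
from typing import Iterable, Mapping

# Keys of the top-level field as named in the issue; assumed (not confirmed)
# to also be the per-turn key names.
TOKEN_KEYS = ("prompt", "completion", "cache_read")


def aggregate_token_counts(token_usages: Iterable[Mapping[str, int]]) -> dict:
    """Sum per-turn token usage entries into one accumulated mapping."""
    totals = {key: 0 for key in TOKEN_KEYS}
    for usage in token_usages:
        for key in TOKEN_KEYS:
            # Treat missing or None values as zero.
            totals[key] += int(usage.get(key) or 0)
    return totals


# Hypothetical per-turn data, not taken from the runs above.
sample_turns = [
    {"prompt": 1200, "completion": 300, "cache_read": 800},
    {"prompt": 1500, "completion": 250},
]
print(aggregate_token_counts(sample_turns))
```

With an empty token_usages array, as observed for acp-gemini, this returns all zeros, so aggregation alone does not fix the gemini case.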

Summary Table

Agent Type    SDK Cost                  Proxy Cost    Token Counts     Which source works?
acp-claude    ✓ (overestimates ~15%)    ✓             Per-turn only    Both (proxy more accurate)
acp-gemini    ✗ ($0)                    ✓             ✗ (empty)        Only proxy
default       ✓                         ✗ ($0)        Per-turn only    Only SDK
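
Until the underlying bugs are fixed, downstream reporting could pick a cost source per instance along the lines of the summary above. This is a sketch only: the function name is hypothetical, and preferring the proxy when both sources are non-zero is an assumption that depends on the acp-claude divergence being resolved in the proxy's favor.

```python
from typing import Optional, Tuple


def pick_cost(sdk_cost: Optional[float], proxy_cost: Optional[float]) -> Tuple[float, str]:
    """Return (cost, source) for one instance, treating None as zero."""
    sdk = sdk_cost or 0.0
    proxy = proxy_cost or 0.0
    if proxy > 0:
        # Covers acp-gemini (SDK is $0) and acp-claude (both non-zero).
        return proxy, "proxy"
    if sdk > 0:
        # Covers the default agent, where the proxy reports $0.
        return sdk, "sdk"
    return 0.0, "none"


print(pick_cost(0.0, 7.08))        # acp-gemini, django__django-12155
print(pick_cost(0.2047, 0.0))      # default, django__django-12155
print(pick_cost(0.2106, 0.1978))   # acp-claude, django__django-14434
```

Returning the source label alongside the value lets a report flag which instances relied on a fallback.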

Suggested Fixes

  1. Enable virtual key tracking for default agent — create/inject virtual keys in the default evaluation path, not just ACP
  2. Fix gemini SDK cost reporting — either parse gemini CLI's cost output or estimate from token counts
  3. Aggregate accumulated_token_counts — sum token_usages array into the top-level field, or deprecate it
  4. Investigate acp-claude cost divergence — determine whether SDK or proxy is billing-accurate
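
For fix 2, estimating cost from token counts could look like the sketch below. It presupposes that token counts can be obtained for acp-gemini at all, which is currently not the case. The prices are placeholders rather than real rates, and the sketch assumes prompt counts exclude cached tokens, which would need to be confirmed against the actual usage format.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PricePerMillion:
    """USD per one million tokens. Values passed in are supplied by the caller."""
    prompt: float
    completion: float
    cache_read: float


def estimate_cost(tokens: Mapping[str, int], price: PricePerMillion) -> float:
    """Estimate USD cost from token counts.

    Assumption: 'prompt' does not include the tokens counted under 'cache_read'.
    """
    total = (
        tokens.get("prompt", 0) * price.prompt
        + tokens.get("completion", 0) * price.completion
        + tokens.get("cache_read", 0) * price.cache_read
    )
    return total / 1_000_000


# Placeholder prices and token counts, for illustration only.
placeholder_price = PricePerMillion(prompt=2.0, completion=10.0, cache_read=0.5)
example_tokens = {"prompt": 1_000_000, "completion": 500_000, "cache_read": 2_000_000}
print(estimate_cost(example_tokens, placeholder_price))
```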
