Summary
During PR validation runs for OpenHands/software-agent-sdk#2656, we observed three distinct cost-tracking failures, one per agent type, plus a token-count aggregation bug that affects every agent type. No single cost source works correctly across all agent types.
Evidence
Runs from 2026-04-02 validating pr/acp-node22-and-defer-init-main (SDK commit 217c454), eval_limit=5:
| K8s Job | Benchmark | Agent Type | Model |
|---|---|---|---|
| eval-23899965586-claude-son | swebench | acp-claude | claude-sonnet-4-5-20250929 |
| eval-23899965217-claude-son | swebench | acp-gemini | claude-sonnet-4-5-20250929 |
| eval-23899966017-claude-4-6 | swebench | default | claude-4.6-opus |
| eval-23899971488-claude-son | swebenchmultimodal | acp-claude | claude-sonnet-4-5-20250929 |
| eval-23899973560-claude-son | swebenchmultimodal | acp-gemini | claude-sonnet-4-5-20250929 |
| eval-23899975488-claude-4-6 | swebenchmultimodal | default | claude-4.6-opus |
GCS results: gs://openhands-evaluation-results/{benchmark}/{model_slug}/{eval_run_id}/
Bug 1: acp-gemini — SDK reports $0 cost, only proxy tracks spend
Gemini CLI does not report cost or token usage back to the SDK. metrics.costs and metrics.token_usages are empty arrays. metrics.accumulated_cost is $0.00.
Meanwhile, the LiteLLM proxy virtual key correctly tracks $7–13 per instance via test_result.proxy_cost.
Impact: Without proxy tracking, Gemini costs are completely invisible. Also notable: by proxy cost, acp-gemini runs are roughly 40 to 57× (about 50×) more expensive than the acp-claude runs on the same instances.
| Instance | SDK Cost | Proxy Cost |
|---|---|---|
| django__django-12155 | $0.0000 | $7.08 |
| django__django-13279 | $0.0000 | $12.26 |
| django__django-14434 | $0.0000 | $10.18 |
| scikit-learn__scikit-learn-13439 | $0.0000 | $7.49 |
| scikit-learn__scikit-learn-25232 | $0.0000 | $13.55 |
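The "~50×" figure can be checked against the numbers in this issue. The sketch below divides the acp-gemini proxy cost by the acp-claude proxy cost for each instance, using the values from the Bug 1 and Bug 3 tables. The helper name `cost_ratios` is made up for illustration and is not part of the SDK or the benchmarks repo.

```python
# Proxy-tracked cost per instance in USD, copied from the tables in this issue.
GEMINI_PROXY = {
    "django__django-12155": 7.08,
    "django__django-13279": 12.26,
    "django__django-14434": 10.18,
    "scikit-learn__scikit-learn-13439": 7.49,
    "scikit-learn__scikit-learn-25232": 13.55,
}
CLAUDE_PROXY = {
    "django__django-12155": 0.1789,
    "django__django-13279": 0.2931,
    "django__django-14434": 0.1978,
    "scikit-learn__scikit-learn-13439": 0.1689,
    "scikit-learn__scikit-learn-25232": 0.2376,
}


def cost_ratios(expensive, baseline):
    """Per-instance ratio expensive/baseline, skipping missing or zero baselines."""
    return {
        name: cost / baseline[name]
        for name, cost in expensive.items()
        if baseline.get(name)
    }


ratios = cost_ratios(GEMINI_PROXY, CLAUDE_PROXY)
overall = sum(GEMINI_PROXY.values()) / sum(CLAUDE_PROXY.values())

for name, ratio in sorted(ratios.items()):
    print(f"{name}: {ratio:.1f}x")
print(f"overall: {overall:.1f}x")
```

With only five instances per run, these ratios are a rough indication rather than a stable estimate.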
Bug 2: default (OpenHands agent) — Proxy reports $0, only SDK tracks cost
The default OpenHands agent has full per-turn SDK cost breakdowns (via LiteLLM response headers), but test_result.proxy_cost is $0.00 for every instance.
Root cause: Virtual keys are only created and injected for ACP agents (in benchmarks/utils/acp.py), not for the default agent path. The default agent uses the base API key directly, bypassing proxy spend tracking.
| Instance | SDK Cost | Proxy Cost |
|---|---|---|
| django__django-12155 | $0.2047 | $0.00 |
| django__django-13279 | $0.5177 | $0.00 |
| django__django-14434 | $0.4271 | $0.00 |
| scikit-learn__scikit-learn-13439 | $0.0988 | $0.00 |
| scikit-learn__scikit-learn-25232 | $0.4557 | $0.00 |
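For orientation, here is a minimal sketch of what per-run virtual key handling could look like for the default agent path. It does not reflect the actual code in benchmarks/utils/acp.py. The function names are hypothetical, and the `/key/generate` endpoint and the `info.spend` field of a `/key/info` response are assumptions based on the LiteLLM proxy key-management API that should be checked against the deployed proxy version. The sketch only builds the request and parses a response; it sends nothing over the network.

```python
import json
import urllib.request


def build_key_generate_request(proxy_url, master_key, key_alias):
    """Build (but do not send) a POST asking the proxy for a new virtual key.

    Assumption: the proxy exposes POST /key/generate and accepts `key_alias`.
    """
    body = json.dumps({"key_alias": key_alias}).encode("utf-8")
    return urllib.request.Request(
        url=f"{proxy_url.rstrip('/')}/key/generate",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
    )


def spend_from_key_info(payload):
    """Read spend in USD from a /key/info-style payload; 0.0 if absent.

    Assumption: the payload has the shape {"info": {"spend": <float>}}.
    """
    return float((payload.get("info") or {}).get("spend") or 0.0)


# Placeholder values for illustration only.
request = build_key_generate_request(
    "http://localhost:4000", "sk-master-placeholder", "eval-default-agent"
)
print(request.get_method(), request.full_url)
```

The key returned by the proxy would then be injected into the default agent's LLM configuration in place of the base API key, so that spend is attributed per instance in the same way as for ACP agents.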
Bug 3: acp-claude — SDK overestimates cost by 5–38% vs proxy
Both sources report non-zero costs, but they diverge. SDK cost (from UsageUpdate.cost reported by claude-agent-acp) is consistently higher than the proxy-tracked cost.
Possible cause: claude-agent-acp reports list price while the proxy applies prompt caching discounts.
| Instance | SDK Cost | Proxy Cost | Diff (proxy vs. SDK) |
|---|---|---|---|
| django__django-14434 | $0.2106 | $0.1978 | -6.1% |
| django__django-13279 | $0.4767 | $0.2931 | -38.5% |
| scikit-learn__scikit-learn-13439 | $0.1784 | $0.1689 | -5.3% |
| scikit-learn__scikit-learn-25232 | $0.2592 | $0.2376 | -8.3% |
| django__django-12155 | $0.2011 | $0.1789 | -11.0% |
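The Diff values are the proxy cost relative to the SDK cost, that is `(proxy - sdk) / sdk`. The short sketch below recomputes them from the table so the convention is unambiguous; `relative_diff_pct` is an illustrative helper, not an existing function.

```python
def relative_diff_pct(sdk_cost, proxy_cost):
    """Proxy cost relative to SDK cost, in percent (negative: proxy is lower)."""
    if sdk_cost == 0:
        raise ValueError("SDK cost is zero; relative difference is undefined")
    return (proxy_cost - sdk_cost) / sdk_cost * 100.0


# (instance, SDK cost, proxy cost) in USD, copied from the table above.
ROWS = [
    ("django__django-14434", 0.2106, 0.1978),
    ("django__django-13279", 0.4767, 0.2931),
    ("scikit-learn__scikit-learn-13439", 0.1784, 0.1689),
    ("scikit-learn__scikit-learn-25232", 0.2592, 0.2376),
    ("django__django-12155", 0.2011, 0.1789),
]

for name, sdk, proxy in ROWS:
    print(f"{name}: {relative_diff_pct(sdk, proxy):+.1f}%")
```

Measured the other way around (SDK relative to proxy), the same rows give larger percentages, so any report should state which cost is the baseline.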
Bug 4: accumulated_token_counts is always zero
The top-level metrics.accumulated_token_counts field (with prompt, completion, cache_read keys) is 0 for ALL agent types. Per-turn token data does exist in the metrics.token_usages array for default and acp-claude; for acp-gemini that array is empty as well.
This field appears to be stale or never aggregated, which means any downstream reporting that reads accumulated_token_counts will see zeros.
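As a rough sketch of the aggregation that appears to be missing, the function below sums per-turn entries into the three top-level keys. The per-turn key names (`prompt_tokens`, `completion_tokens`, `cache_read_tokens`) are assumptions, since this issue only names the keys of the accumulated field; pass a different `key_map` if the serialized entries use other names.

```python
# Assumed mapping: accumulated key -> key in each per-turn token_usages entry.
DEFAULT_KEY_MAP = {
    "prompt": "prompt_tokens",
    "completion": "completion_tokens",
    "cache_read": "cache_read_tokens",
}


def aggregate_token_counts(token_usages, key_map=None):
    """Sum per-turn token usage dicts into accumulated totals.

    Missing or null per-turn values count as zero, so an empty
    token_usages list (the acp-gemini case) yields all zeros.
    """
    key_map = DEFAULT_KEY_MAP if key_map is None else key_map
    totals = {acc_key: 0 for acc_key in key_map}
    for turn in token_usages:
        for acc_key, turn_key in key_map.items():
            totals[acc_key] += int(turn.get(turn_key) or 0)
    return totals


# Made-up per-turn entries, for illustration only.
example_turns = [
    {"prompt_tokens": 1200, "completion_tokens": 300, "cache_read_tokens": 0},
    {"prompt_tokens": 1500, "completion_tokens": 250, "cache_read_tokens": 1100},
]
print(aggregate_token_counts(example_turns))
```

Note that for acp-gemini this still produces zeros, which is correct arithmetic but not a fix: the per-turn data has to exist before it can be aggregated.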
Summary Table
| Agent Type | SDK Cost | Proxy Cost | Token Counts | Which source works? |
|---|---|---|---|---|
| acp-claude | ✓ (overestimates ~15%) | ✓ | Per-turn only | Both (proxy more accurate) |
| acp-gemini | ✗ ($0) | ✓ | ✗ (empty) | Only proxy |
| default | ✓ | ✗ ($0) | Per-turn only | Only SDK |
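Until the underlying bugs are fixed, downstream reporting could pick a cost source per instance instead of trusting one field. The sketch below encodes one possible policy derived from the table: prefer the proxy cost when it is non-zero, otherwise fall back to the SDK cost. `resolve_cost` is a hypothetical helper, and the preference for the proxy is a judgment call that depends on the outcome of the acp-claude divergence investigation.

```python
from typing import Optional, Tuple


def resolve_cost(
    sdk_cost: Optional[float], proxy_cost: Optional[float]
) -> Tuple[Optional[float], str]:
    """Return (cost, source) where source is "proxy", "sdk", or "none".

    A missing or zero value is treated as "not tracked", because in the
    observed runs a zero meant the source was not recording spend at all.
    """
    if proxy_cost:
        return proxy_cost, "proxy"
    if sdk_cost:
        return sdk_cost, "sdk"
    return None, "none"


# Values for django__django-12155 from the three agent types in this issue.
print(resolve_cost(0.0, 7.08))       # acp-gemini
print(resolve_cost(0.2047, 0.0))     # default
print(resolve_cost(0.2011, 0.1789))  # acp-claude
```

Recording the source alongside the cost keeps reports honest about which number came from where.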
Suggested Fixes
- Enable virtual key tracking for default agent: create/inject virtual keys in the default evaluation path, not just ACP
- Fix gemini SDK cost reporting: either parse gemini CLI's cost output or estimate from token counts
- Aggregate accumulated_token_counts: sum token_usages array into the top-level field, or deprecate it
- Investigate acp-claude cost divergence: determine whether SDK or proxy is billing-accurate
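To make the "estimate from token counts" option concrete, and to illustrate how list-price and cache-discounted accounting can diverge (the possible cause named under Bug 3), here is a small sketch. Everything in it is an assumption for illustration: the `Pricing` type, the convention that the prompt count includes cache-read tokens, and the example prices, which are placeholders and not real rates for any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Values are supplied by the caller."""
    input_per_mtok: float
    output_per_mtok: float
    cache_read_per_mtok: float


def estimate_cost(prompt, completion, cache_read, pricing, apply_cache_discount=True):
    """Estimate cost in USD from token counts.

    Assumption: `prompt` already includes the `cache_read` tokens.
    With apply_cache_discount=False every prompt token is billed at the
    full input price, which mimics list-price reporting.
    """
    if cache_read > prompt:
        raise ValueError("cache_read cannot exceed prompt under this convention")
    output_cost = completion / 1e6 * pricing.output_per_mtok
    if not apply_cache_discount:
        return prompt / 1e6 * pricing.input_per_mtok + output_cost
    uncached = prompt - cache_read
    return (
        uncached / 1e6 * pricing.input_per_mtok
        + cache_read / 1e6 * pricing.cache_read_per_mtok
        + output_cost
    )


# Placeholder prices and token counts, for illustration only.
pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0, cache_read_per_mtok=0.3)
list_price = estimate_cost(1_000_000, 100_000, 800_000, pricing, apply_cache_discount=False)
discounted = estimate_cost(1_000_000, 100_000, 800_000, pricing)
print(f"list price: ${list_price:.2f}, cache-aware: ${discounted:.2f}")
```

The size of the gap depends on the cache hit rate, which would be consistent with the divergence varying from instance to instance, but this remains a hypothesis to verify.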