Cost tracking discrepancies: SDK vs LiteLLM proxy virtual key costs diverge by agent type

## Summary

During PR validation runs for OpenHands/software-agent-sdk#2656, we observed three distinct cost tracking failures, one per agent type, plus a top-level token count field that is never populated. No single cost source works correctly across all agent types.

## Evidence

Runs from 2026-04-02 validating `pr/acp-node22-and-defer-init-main` (SDK commit `217c454`), eval_limit=5:

| K8s Job | Benchmark | Agent Type | Model |
|---------|-----------|-----------|-------|
| `eval-23899965586-claude-son` | swebench | acp-claude | claude-sonnet-4-5-20250929 |
| `eval-23899965217-claude-son` | swebench | acp-gemini | claude-sonnet-4-5-20250929 |
| `eval-23899966017-claude-4-6` | swebench | default | claude-4.6-opus |
| `eval-23899971488-claude-son` | swebenchmultimodal | acp-claude | claude-sonnet-4-5-20250929 |
| `eval-23899973560-claude-son` | swebenchmultimodal | acp-gemini | claude-sonnet-4-5-20250929 |
| `eval-23899975488-claude-4-6` | swebenchmultimodal | default | claude-4.6-opus |

GCS results: `gs://openhands-evaluation-results/{benchmark}/{model_slug}/{eval_run_id}/`

## Bug 1: `acp-gemini` — SDK reports $0 cost, only proxy tracks spend

Gemini CLI does not report cost or token usage back to the SDK. `metrics.costs` and `metrics.token_usages` are empty arrays. `metrics.accumulated_cost` is `$0.00`.

Meanwhile, the LiteLLM proxy virtual key does track spend: `test_result.proxy_cost` shows roughly $7–14 per instance.

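**Impact**: Without proxy tracking, `acp-gemini` costs are completely invisible. Also notable: by proxy cost, the `acp-gemini` runs are roughly 40–57× more expensive than the `acp-claude` runs on the same instances (about 47× in aggregate; compare the proxy column in the Bug 3 table).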

```
Instance                                    SDK Cost    Proxy Cost
django__django-12155                        $0.0000     $7.08
django__django-13279                        $0.0000     $12.26
django__django-14434                        $0.0000     $10.18
scikit-learn__scikit-learn-13439            $0.0000     $7.49
scikit-learn__scikit-learn-25232            $0.0000     $13.55
```
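The cost ratio quoted in the impact note can be checked directly from the numbers in this issue. The sketch below divides the `acp-gemini` proxy costs above by the `acp-claude` proxy costs from the Bug 3 table; the dictionary layout and the helper name `cost_ratios` are illustrative only and not part of the SDK or the benchmarks repo.

```python
# Proxy-tracked cost per instance (USD), copied from the tables in this issue.
gemini_proxy = {
    "django__django-12155": 7.08,
    "django__django-13279": 12.26,
    "django__django-14434": 10.18,
    "scikit-learn__scikit-learn-13439": 7.49,
    "scikit-learn__scikit-learn-25232": 13.55,
}
claude_proxy = {
    "django__django-12155": 0.1789,
    "django__django-13279": 0.2931,
    "django__django-14434": 0.1978,
    "scikit-learn__scikit-learn-13439": 0.1689,
    "scikit-learn__scikit-learn-25232": 0.2376,
}


def cost_ratios(numer, denom):
    """Per-instance numer/denom ratio for instances present in both mappings."""
    return {k: numer[k] / denom[k] for k in numer if denom.get(k, 0) > 0}


ratios = cost_ratios(gemini_proxy, claude_proxy)
# Aggregate ratio: total gemini spend over total claude spend.
aggregate = sum(gemini_proxy.values()) / sum(claude_proxy.values())

for name, ratio in sorted(ratios.items()):
    print(f"{name:<40} {ratio:5.1f}x")
print(f"{'aggregate':<40} {aggregate:5.1f}x")
```

Comparing proxy cost to proxy cost keeps both sides on the same measurement; using the `acp-claude` SDK cost as the denominator would give a lower ratio, since the SDK figures are higher.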

## Bug 2: `default` (OpenHands agent) — Proxy reports $0, only SDK tracks cost

The default OpenHands agent has full per-turn SDK cost breakdowns (via LiteLLM response headers), but `test_result.proxy_cost` is `$0.00` for every instance.

**Root cause**: Virtual keys are only created and injected for ACP agents (in `benchmarks/utils/acp.py`), not for the default agent path. The default agent uses the base API key directly, bypassing proxy spend tracking.

```
Instance                                    SDK Cost    Proxy Cost
django__django-12155                        $0.2047     $0.00
django__django-13279                        $0.5177     $0.00
django__django-14434                        $0.4271     $0.00
scikit-learn__scikit-learn-13439            $0.0988     $0.00
scikit-learn__scikit-learn-25232            $0.4557     $0.00
```
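One way to close this gap is to mint a virtual key for the default agent path as well. The sketch below only builds the HTTP requests, without sending them, against the LiteLLM proxy key-management endpoints (`POST /key/generate`, `GET /key/info`). How `benchmarks/utils/acp.py` actually creates and injects keys is not shown in this issue, so the function names, the key alias, the metadata tag, and the proxy URL here are all assumptions.

```python
import json
import urllib.parse
import urllib.request


def build_key_generate_request(proxy_url, master_key, alias):
    """Build (but do not send) a POST /key/generate request for a LiteLLM proxy."""
    body = json.dumps({
        "key_alias": alias,
        "metadata": {"purpose": "eval-cost-tracking"},  # hypothetical tag
    }).encode("utf-8")
    return urllib.request.Request(
        url=proxy_url.rstrip("/") + "/key/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_key_info_request(proxy_url, master_key, virtual_key):
    """Build (but do not send) a GET /key/info request to read a key's spend."""
    query = urllib.parse.urlencode({"key": virtual_key})
    return urllib.request.Request(
        url=proxy_url.rstrip("/") + "/key/info?" + query,
        headers={"Authorization": f"Bearer {master_key}"},
        method="GET",
    )


# Placeholder values; a real run would use the deployment's proxy URL and keys.
gen = build_key_generate_request(
    "http://localhost:4000", "sk-master-placeholder", "eval-default-django__django-12155"
)
info = build_key_info_request(
    "http://localhost:4000", "sk-master-placeholder", "sk-virtual-placeholder"
)
print(gen.get_method(), gen.full_url)
print(info.get_method(), info.full_url)
```

The generated key would then replace the base API key for the default agent's LLM calls, so that per-instance spend shows up in `test_result.proxy_cost` the same way it does for ACP agents.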

## Bug 3: `acp-claude` — SDK overestimates cost by 5–38% vs proxy

Both sources report non-zero costs, but they diverge. SDK cost (from `UsageUpdate.cost` reported by `claude-agent-acp`) is consistently higher than the proxy-tracked cost. The Diff column below is the proxy cost relative to the SDK cost, i.e. (proxy − SDK) / SDK.

Possible cause: `claude-agent-acp` reports list price while the proxy applies prompt caching discounts.

```
Instance                                    SDK Cost    Proxy Cost    Diff
django__django-14434                        $0.2106     $0.1978       -6.1%
django__django-13279                        $0.4767     $0.2931       -38.5%
scikit-learn__scikit-learn-13439            $0.1784     $0.1689       -5.3%
scikit-learn__scikit-learn-25232            $0.2592     $0.2376       -8.3%
django__django-12155                        $0.2011     $0.1789       -11.0%
```
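The Diff column can be reproduced from the two cost columns. The sketch below recomputes it as (proxy − SDK) / SDK and, for comparison, also expresses the same gap relative to the proxy cost; the helper names are illustrative only.

```python
# (instance, SDK cost, proxy cost) in USD, copied from the table above.
rows = [
    ("django__django-14434", 0.2106, 0.1978),
    ("django__django-13279", 0.4767, 0.2931),
    ("scikit-learn__scikit-learn-13439", 0.1784, 0.1689),
    ("scikit-learn__scikit-learn-25232", 0.2592, 0.2376),
    ("django__django-12155", 0.2011, 0.1789),
]


def proxy_vs_sdk_pct(sdk, proxy):
    """Proxy cost relative to SDK cost, in percent (negative: proxy is lower)."""
    return (proxy - sdk) / sdk * 100


def sdk_vs_proxy_pct(sdk, proxy):
    """SDK cost relative to proxy cost, in percent (positive: SDK is higher)."""
    return (sdk - proxy) / proxy * 100


for name, sdk, proxy in rows:
    print(f"{name:<40} {proxy_vs_sdk_pct(sdk, proxy):+6.1f}%  {sdk_vs_proxy_pct(sdk, proxy):+6.1f}%")

# Aggregate over all five instances.
total_sdk = sum(sdk for _, sdk, _ in rows)
total_proxy = sum(proxy for _, _, proxy in rows)
aggregate_pct = proxy_vs_sdk_pct(total_sdk, total_proxy)
print(f"{'aggregate':<40} {aggregate_pct:+6.1f}%")
```

The choice of denominator matters for the outlier: for `django__django-13279` the proxy is 38.5% below the SDK figure, which is the same gap as the SDK being about 63% above the proxy figure.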

## Bug 4: `accumulated_token_counts` is always zero

The top-level `metrics.accumulated_token_counts` field (with `prompt`, `completion`, and `cache_read` keys) is `0` for all agent types. Actual token data exists in the per-turn `metrics.token_usages` array for `default` and `acp-claude`; for `acp-gemini` that array is empty as well.

This field appears to be stale or never aggregated, which means any downstream reporting that reads `accumulated_token_counts` will see zeros.
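A minimal sketch of the missing aggregation, assuming each per-turn entry in `metrics.token_usages` is a dict that carries the same `prompt`, `completion`, and `cache_read` keys as the top-level field. The actual per-turn schema is not shown in this issue, so the key names and the sample numbers below are hypothetical.

```python
# Keys of the top-level field as named in this issue; the per-turn key names
# are ASSUMED to match and may differ in the real SDK schema.
TOKEN_KEYS = ("prompt", "completion", "cache_read")


def aggregate_token_counts(token_usages):
    """Sum per-turn token usage dicts into one accumulated dict."""
    totals = {key: 0 for key in TOKEN_KEYS}
    for turn in token_usages:
        for key in TOKEN_KEYS:
            # Treat missing or None values as zero rather than failing.
            totals[key] += int(turn.get(key) or 0)
    return totals


# Illustrative data only, not taken from a real run.
example_turns = [
    {"prompt": 1200, "completion": 300, "cache_read": 0},
    {"prompt": 1500, "completion": 250, "cache_read": 1100},
]
print(aggregate_token_counts(example_turns))
print(aggregate_token_counts([]))  # acp-gemini case: empty array stays at zero
```

An empty `token_usages` array still yields zeros, so this would fix the field for `default` and `acp-claude` but not for `acp-gemini`, where no per-turn data is reported.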

## Summary Table

| Agent Type | SDK Cost | Proxy Cost | Token Counts | Which source works? |
|-----------|----------|-----------|-------------|-------------------|
| acp-claude | ✓ (overestimates ~15%) | ✓ | Per-turn only | Both (proxy more accurate) |
| acp-gemini | ✗ ($0) | ✓ | ✗ (empty) | **Only proxy** |
| default | ✓ | ✗ ($0) | Per-turn only | **Only SDK** |
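Until the underlying bugs are fixed, downstream reporting could work around the table above by picking whichever source is non-zero for each instance. The sketch below is one possible fallback, not an existing helper: it prefers the proxy figure when both are present (following the table's "proxy more accurate" note, which is itself still under investigation) and flags instances where the two sources disagree by more than a tolerance. The function name and the 5% default tolerance are assumptions.

```python
def effective_cost(sdk_cost, proxy_cost, tolerance=0.05):
    """Return (cost, source, diverges) for one instance.

    sdk_cost   -- value of metrics.accumulated_cost
    proxy_cost -- value of test_result.proxy_cost
    tolerance  -- relative gap (vs proxy) above which the sources are flagged
    """
    if proxy_cost > 0 and sdk_cost > 0:
        gap = abs(sdk_cost - proxy_cost) / proxy_cost
        return proxy_cost, "proxy", gap > tolerance
    if proxy_cost > 0:
        return proxy_cost, "proxy", False
    if sdk_cost > 0:
        return sdk_cost, "sdk", False
    # Neither source reported anything; the caller should treat this as unknown.
    return 0.0, "none", False


# One django__django-13279 row per agent type, using numbers from this issue.
print(effective_cost(0.4767, 0.2931))  # acp-claude
print(effective_cost(0.0, 12.26))      # acp-gemini
print(effective_cost(0.5177, 0.0))     # default
```

Returning the source alongside the cost keeps reports honest about which number was used, which matters while the `acp-claude` divergence is unresolved.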

## Suggested Fixes

1. **Enable virtual key tracking for default agent** — create/inject virtual keys in the default evaluation path, not just ACP
2. **Fix gemini SDK cost reporting** — either parse gemini CLI's cost output or estimate from token counts
3. **Aggregate `accumulated_token_counts`** — sum `token_usages` array into the top-level field, or deprecate it
4. **Investigate acp-claude cost divergence** — determine whether SDK or proxy is billing-accurate

## References

- PR under validation: OpenHands/software-agent-sdk#2656
- ACP failure investigation: OpenHands/runtime-api#458
