Summary
Proposal to update the model-latency-benchmarking tool so it uses current SDK/APIs, benchmarks the latest Bedrock models, and can benchmark prompt caching.
Motivation
- The example dataset and model list point at older models (Llama 3.1 70B, Claude 3.5 Haiku).
- Bedrock now offers newer models (Claude 4.5 family, Amazon Nova) that users want to benchmark.
- Prompt caching is generally available and reduces latency by up to 85%, but the tool cannot benchmark it today.
Proposed changes
-
SDK/API refresh
- Use current
boto3/botocore.
- Keep the Converse/ConverseStream APIs; tidy client setup and error handling.
- Fix an existing inconsistency where
get_body hardcodes temperature=0 and topP=1 and ignores the configured inference parameters.
-
Latest model support
- Update the dummy JSONL and README to current cross-region inference profile IDs (for example Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5, Claude 3.7 Sonnet, and Amazon Nova Pro/Lite/Micro).
-
Configurable prompt caching (opt-in)
- Add optional per-row JSONL fields, defaulting off:
prompt_caching (bool), cache_ttl ("5m"/"1h"), and a field for the cacheable static prefix (system prompt / context).
- Insert Converse
cachePoint checkpoints when caching is enabled.
- Capture
cacheReadInputTokens and cacheWriteInputTokens from the Converse usage block, derive a cache_hit_rate, and report TTFT cached vs uncached.
- Provide a longer sample prompt that meets the per-model cache minimums (1,024-4,096 tokens).
Backward compatibility
- Existing JSONL files run unchanged. All new fields are optional and default to off.
Out of scope
- No change to the core benchmarking methodology beyond adding cache-related columns and plots.
Contribution
I would like to submit a pull request for this. Happy to adjust the scope based on maintainer feedback.
Summary
Proposal to update the
model-latency-benchmarkingtool so it uses current SDK/APIs, benchmarks the latest Bedrock models, and can benchmark prompt caching.Motivation
Proposed changes
SDK/API refresh
boto3/botocore.get_bodyhardcodestemperature=0andtopP=1and ignores the configured inference parameters.Latest model support
Configurable prompt caching (opt-in)
prompt_caching(bool),cache_ttl("5m"/"1h"), and a field for the cacheable static prefix (system prompt / context).cachePointcheckpoints when caching is enabled.cacheReadInputTokensandcacheWriteInputTokensfrom the Converseusageblock, derive acache_hit_rate, and report TTFT cached vs uncached.Backward compatibility
Out of scope
Contribution
I would like to submit a pull request for this. Happy to adjust the scope based on maintainer feedback.