Add timings to server responses#1279

Open
spicyneuron wants to merge 5 commits into
ml-explore:mainfrom
spicyneuron:add-server-timings

Conversation

@spicyneuron
Contributor

@spicyneuron spicyneuron commented May 16, 2026

Adds a timings object to /v1/chat/completions and /v1/completions responses so clients can measure prompt and generation throughput without instrumenting the server themselves. Useful for benchmarking, comparing models, and surfacing tokens per second (tps) in UIs.

The shape mirrors llama.cpp's timings field for ecosystem compatibility:

"timings": {
  "prompt_per_second": 412.3,
  "predicted_per_second": 78.9,
  "prompt_n": 128,
  "predicted_n": 64
}
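For clients, a minimal sketch of reading this object might look like the following. It assumes timings sits at the top level of the decoded response body, which this description does not state explicitly; the Timings dataclass and parse_timings helper are hypothetical names used for illustration, not part of the PR.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Timings:
    """Client-side view of the timings object (hypothetical helper type)."""
    prompt_per_second: float
    predicted_per_second: float
    prompt_n: int
    predicted_n: int


def parse_timings(response: Mapping[str, Any]) -> Optional[Timings]:
    """Return the timings from a decoded JSON body, or None if absent.

    Assumption: the object lives under a top-level "timings" key.
    """
    raw = response.get("timings")
    if not isinstance(raw, Mapping):
        return None
    return Timings(
        prompt_per_second=float(raw.get("prompt_per_second", 0.0)),
        predicted_per_second=float(raw.get("predicted_per_second", 0.0)),
        prompt_n=int(raw.get("prompt_n", 0)),
        predicted_n=int(raw.get("predicted_n", 0)),
    )
```

Returning None when the key is missing keeps the client working against servers that do not emit timings.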

Behavior

Follows the existing usage conventions:

  • Non-streaming: always included.
  • Streaming: included in the final usage chunk when stream_options.include_usage is set.
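The two rules above can be sketched as a single predicate. This is an illustration of the described behavior, not the server's actual code; the function name should_include_timings and its parameters are hypothetical.

```python
from typing import Any, Optional


def should_include_timings(
    stream: bool, stream_options: Optional[dict[str, Any]]
) -> bool:
    """Decide whether a response (or final usage chunk) carries timings.

    Hypothetical helper mirroring the rules in this description.
    """
    if not stream:
        # Non-streaming responses always include timings.
        return True
    # Streaming: only when stream_options.include_usage is set. Using .get
    # treats a missing key (or missing stream_options) as "not set".
    return bool((stream_options or {}).get("include_usage"))
```

With this shape, a streaming request that passes stream_options without include_usage simply gets no timings rather than an error.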

Implementation notes

  • Measured server-side; reusing GenerationResponse.prompt_tps doesn't work because batched generation has no per-sequence equivalent.
  • Both windows are stamped in the generation worker: prompt = prefill start → first generated token; predicted = first → last generated token. Excludes network I/O and pre-worker queue wait; includes a small amount of batch-tick overhead, so values are approximate.
  • prompt_n excludes cached tokens. predicted_per_second is 0 when fewer than two tokens are generated, since the first → last generated token window is then empty.
  • Includes a minor fix: streaming usage check uses .get("include_usage") instead of ["include_usage"] to avoid KeyError when stream_options is passed without that key.
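A rough sketch of how the two windows described above could be turned into rates. The SequenceTimer class is hypothetical and is not the PR's implementation. In particular, dividing predicted_n - 1 token intervals by the first → last window is an assumption of this sketch; the PR may use a different numerator. Timestamps are passed in explicitly so the arithmetic is easy to follow; a real worker would likely stamp them with something like time.perf_counter().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SequenceTimer:
    """Per-sequence timestamps stamped in the generation worker (hypothetical)."""
    prefill_start: float
    prompt_n: int  # prompt tokens actually processed, excluding cached ones
    first_token: Optional[float] = None
    last_token: Optional[float] = None
    predicted_n: int = 0

    def on_token(self, now: float) -> None:
        """Record one generated token at time `now` (seconds)."""
        if self.first_token is None:
            self.first_token = now
        self.last_token = now
        self.predicted_n += 1

    def timings(self) -> dict:
        prompt_s = 0.0
        predicted_s = 0.0
        if self.first_token is not None:
            # prompt window: prefill start -> first generated token
            prompt_s = self.first_token - self.prefill_start
            # predicted window: first -> last generated token
            predicted_s = self.last_token - self.first_token
        prompt_tps = self.prompt_n / prompt_s if prompt_s > 0 else 0.0
        # Assumption: n tokens span n - 1 intervals; fewer than two tokens
        # leaves an empty window, so the rate is reported as 0.
        if self.predicted_n >= 2 and predicted_s > 0:
            predicted_tps = (self.predicted_n - 1) / predicted_s
        else:
            predicted_tps = 0.0
        return {
            "prompt_per_second": prompt_tps,
            "predicted_per_second": predicted_tps,
            "prompt_n": self.prompt_n,
            "predicted_n": self.predicted_n,
        }
```

Keeping one timer per sequence is what lets this work under batched generation, where a single batch-level rate has no per-sequence meaning.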

@spicyneuron spicyneuron marked this pull request as draft May 16, 2026 15:47
@spicyneuron spicyneuron force-pushed the add-server-timings branch 2 times, most recently from d95da48 to 061264e Compare May 16, 2026 23:59
@spicyneuron spicyneuron force-pushed the add-server-timings branch from 061264e to 6ba776c Compare May 17, 2026 00:30
@spicyneuron spicyneuron marked this pull request as ready for review May 17, 2026 00:39