Add timings to server responses by spicyneuron · Pull Request #1279 · ml-explore/mlx-lm

spicyneuron · 2026-05-16T13:34:35Z

Adds a timings object to /v1/chat/completions and /v1/completions responses so clients can measure prompt and generation throughput without instrumenting the server themselves. Useful for benchmarking, comparing models, and surfacing tokens-per-second (tps) figures in UIs.

The shape mirrors llama.cpp's timings field for ecosystem compatibility:

"timings": {
  "prompt_per_second": 412.3,
  "predicted_per_second": 78.9,
  "prompt_n": 128,
  "predicted_n": 64
}
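As a rough client-side illustration, the four fields shown above are enough to back out approximate durations. This is a minimal sketch, not part of the PR: the helper name summarize_timings and the derived keys prompt_seconds and predicted_seconds are hypothetical, and the sample payload simply reuses the numbers from the example.

```python
import json

# Sample payload reusing the numbers from the PR description.
sample = json.loads("""
{
  "timings": {
    "prompt_per_second": 412.3,
    "predicted_per_second": 78.9,
    "prompt_n": 128,
    "predicted_n": 64
  }
}
""")


def summarize_timings(response: dict) -> dict:
    """Derive approximate durations (hypothetical helper, not a server API)."""
    timings = response.get("timings") or {}

    def seconds(count_key: str, rate_key: str):
        rate = timings.get(rate_key, 0)
        count = timings.get(count_key, 0)
        # A rate of 0 means the window could not be measured, so report None.
        return count / rate if rate > 0 else None

    return {
        "prompt_seconds": seconds("prompt_n", "prompt_per_second"),
        "predicted_seconds": seconds("predicted_n", "predicted_per_second"),
    }


summary = summarize_timings(sample)
print(summary)
```

Returning None instead of raising keeps the helper usable against servers or responses that omit timings.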

Behavior

Follows the existing usage conventions:

Non-streaming: always included.
Streaming: included in the final usage chunk when stream_options.include_usage is set.
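For streaming clients, the behavior above can be sketched as a scan over server-sent event lines. This is a sketch under assumptions, not the server's documented wire format: it assumes OpenAI-style "data:" framing with a "[DONE]" sentinel, and that timings appears as a top-level key on the usage chunk. The function name find_timings and the sample stream are hypothetical.

```python
import json
from typing import Iterable, Optional


def find_timings(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the last timings object seen in a stream, or None.

    Assumes each event is a line of the form 'data: <json>' and that the
    stream ends with 'data: [DONE]' (OpenAI-style framing, an assumption).
    """
    timings = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # Assumption: timings is a top-level key on the final usage chunk.
        if "timings" in chunk:
            timings = chunk["timings"]
    return timings


# Hypothetical stream; the request would set
# {"stream": true, "stream_options": {"include_usage": true}}.
sample_stream = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    'data: {"choices": [], "usage": {"completion_tokens": 1}, '
    '"timings": {"prompt_n": 128, "predicted_n": 1}}',
    "data: [DONE]",
]
print(find_timings(sample_stream))
```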

Implementation notes

Timings are measured server-side. Reusing GenerationResponse.prompt_tps doesn't work because batched generation has no per-sequence equivalent.
Both windows are timestamped in the generation worker: the prompt window runs from prefill start to the first generated token, and the predicted window runs from the first to the last generated token. The measurements exclude network I/O and pre-worker queue wait but include a small amount of batch-tick overhead, so the values are approximate.
prompt_n excludes cached tokens. predicted_per_second is 0 when fewer than two tokens are generated, because the first-to-last window is then empty.
Also includes a minor fix: the streaming usage check now uses .get("include_usage") instead of ["include_usage"] to avoid a KeyError when stream_options is passed without that key.
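The two windows and the .get fix described in these notes can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the names compute_timings and wants_usage are hypothetical, timestamps are plain floats in seconds, and dividing the full token count by the window length is an assumption (the PR may count intervals differently).

```python
from typing import Optional


def compute_timings(
    prefill_start: float,
    first_token_time: float,
    last_token_time: float,
    prompt_n: int,
    predicted_n: int,
) -> dict:
    """Turn worker-side timestamps (seconds) into a timings dict."""

    def rate(count: int, seconds: float) -> float:
        # Guard against empty or zero-length windows.
        return count / seconds if seconds > 0 else 0.0

    # Prompt window: prefill start -> first generated token.
    prompt_window = first_token_time - prefill_start
    # Predicted window: first -> last generated token; it is empty when
    # fewer than two tokens exist, so the rate is reported as 0.
    predicted_window = (
        last_token_time - first_token_time if predicted_n >= 2 else 0.0
    )
    return {
        "prompt_per_second": rate(prompt_n, prompt_window),
        "predicted_per_second": rate(predicted_n, predicted_window),
        "prompt_n": prompt_n,  # per the PR, excludes cached tokens
        "predicted_n": predicted_n,
    }


def wants_usage(stream_options: Optional[dict]) -> bool:
    """Use .get so a stream_options dict without the key does not raise."""
    return bool(stream_options and stream_options.get("include_usage"))


timings = compute_timings(0.0, 0.5, 1.5, prompt_n=100, predicted_n=50)
print(timings)
```

Indexing with ["include_usage"] would raise KeyError for a request that passes stream_options as an empty object, which is the case the .get form avoids.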

spicyneuron marked this pull request as draft May 16, 2026 15:47

spicyneuron force-pushed the add-server-timings branch 2 times, most recently from d95da48 to 061264e on May 16, 2026 23:59

Add timings to server responses

6ba776c

spicyneuron force-pushed the add-server-timings branch from 061264e to 6ba776c on May 17, 2026 00:30

spicyneuron marked this pull request as ready for review May 17, 2026 00:39

spicyneuron added 4 commits May 17, 2026 16:20

Add cache_n support

e154220

Report elapsed time in timings

a6b8e59

Update SERVER.md

aad8db3

Hoist cache_n

c072121

spicyneuron wants to merge 5 commits into ml-explore:main from spicyneuron:add-server-timings