Hey everyone,
I wanted to kick off a discussion about bringing prefill/decode disaggregated serving into the core GIE project.
Reference: llm-d/llm-d-inference-scheduler#356
As some of you know, this logic already exists and works in the llm-d project. The idea is to upstream it so we can create a standard protocol that other model servers can hook into. This would be a huge win for interoperability and would directly benefit adopters beyond llm-d.
The "Why"
Disaggregated serving is a key optimization for LLMs. By splitting the compute-heavy prefill from the memory-heavy decode steps, we can scale those resources independently and get much better hardware utilization.
By standardizing this in GIE, we can:
Let different model servers (sglang, vLLM, etc.) use this optimization out-of-the-box with IGW.
Keep the complex orchestration logic inside the EPP, so the model servers/inference frameworks themselves can stay simpler.
The Proposed Protocol
The protocol we've been using in llm-d is pretty straightforward and would be a great starting point:
The EPP picks a pair of servers: one for prefill and one for decode.
The EPP adds a header: It injects the address of the prefill server into the request (e.g., X-Prefill-Endpoint: <host>:<port>).
The EPP routes to the decode server: The gateway sends the request to the chosen decode server.
The decode server coordinates: It's then up to the decode server to read the header and pull the KV cache from the prefill server.
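To make the decode-side step concrete, here's a minimal sketch of how a server could consume the header. Go is used only for brevity; real model servers (vLLM, sglang, ...) would implement the equivalent natively. The header name and the `<host>:<port>` format are just the working convention from the steps above, not a finalized spec.

```go
// Illustrative decode-side handling of the proposed header (not a spec).
package main

import (
	"log"
	"net"
	"net/http"
)

// prefillEndpoint extracts the prefill server address injected by the EPP.
// ok is false when the request was not disaggregated, in which case the
// decode server runs prefill itself.
func prefillEndpoint(r *http.Request) (host, port string, ok bool) {
	v := r.Header.Get("X-Prefill-Endpoint")
	if v == "" {
		return "", "", false
	}
	h, p, err := net.SplitHostPort(v)
	if err != nil {
		return "", "", false
	}
	return h, p, true
}

func main() {
	http.HandleFunc("/v1/completions", func(w http.ResponseWriter, r *http.Request) {
		if host, port, ok := prefillEndpoint(r); ok {
			// Pull the KV cache from the prefill server before decoding.
			// The transfer mechanism itself is out of scope for this protocol.
			log.Printf("pulling KV cache from %s:%s", host, port)
		}
		// ... run decode and stream the response ...
	})
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```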
The great thing is, this should all be doable with the existing scheduler plugin architecture. We won't need to touch the core framework. It would mostly be a new ProfileHandler that orchestrates two separate scheduling profiles (prefill-profile and decode-profile).
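As a strawman, here's roughly the shape such a ProfileHandler could take: always run the decode profile, run the prefill profile when disaggregating, and carry the prefill choice to the decode server via the header. The type and method names below are illustrative stand-ins, not the actual GIE plugins API; the real implementation would plug into the framework's existing ProfileHandler interface and types.

```go
// Strawman sketch of a P/D ProfileHandler (illustrative types, not the real API).
package disagg

import "context"

// Request and Result stand in for the scheduler's request and scheduling
// result types.
type Request struct {
	Headers map[string]string
	Prompt  string
}

type Result struct {
	Endpoint string // "<host>:<port>" of the chosen pod
}

// Profile stands in for a configured scheduling profile (filters + scorers).
type Profile interface {
	Run(ctx context.Context, req *Request) (*Result, error)
}

const prefillHeader = "X-Prefill-Endpoint"

// PDProfileHandler orchestrates the prefill-profile and decode-profile.
type PDProfileHandler struct {
	Prefill Profile
	Decode  Profile
}

// Schedule picks a decode endpoint and, when a prefill endpoint can also be
// picked, injects its address as a header. The gateway then routes the
// request to the decode endpoint.
func (h *PDProfileHandler) Schedule(ctx context.Context, req *Request) (*Result, error) {
	decode, err := h.Decode.Run(ctx, req)
	if err != nil {
		return nil, err
	}
	prefill, err := h.Prefill.Run(ctx, req)
	if err != nil {
		// Fall back to aggregated serving on the decode pod.
		return decode, nil
	}
	if req.Headers == nil {
		req.Headers = map[string]string{}
	}
	req.Headers[prefillHeader] = prefill.Endpoint
	return decode, nil
}
```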
Next Steps
This is the first step toward a formal GIEP. The plan would be to tackle this in two phases:
Milestone 1: Get the protocol defined and upstream the existing llm-d logic.
Milestone 2 (Future): Look into more advanced scheduling algorithms (e.g., deciding when to disaggregate based on sequence length).
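To make Milestone 2 slightly more concrete, the most trivial version of such a policy (extending the hypothetical sketch above) would just be a token-count threshold; the cutoff below is made up and would need to be tuned per model and hardware:

```go
// Toy Milestone 2 policy: only disaggregate when the prompt is long enough
// that remote prefill plus the KV-cache transfer is likely to beat
// prefilling locally on the decode pod.
func shouldDisaggregate(promptTokens int) bool {
	const minPrefillTokens = 2048 // illustrative cutoff, to be tuned (or learned)
	return promptTokens >= minPrefillTokens
}
```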
Greatly appreciate any feedback on this. Does this protocol seem reasonable? Any potential gotchas we should be thinking about?
Thanks!