Hey everyone,
I wanted to kick off a discussion about bringing prefill/decode disaggregated serving into the core GIE project.
Reference: llm-d/llm-d-inference-scheduler#356
As some of you know, this logic already exists and works in the llm-d project. The idea is to upstream it so we can create a standard protocol that other model servers can hook into. This would be a huge win for interoperability and would directly benefit adopters beyond llm-d.
The "Why"
Disaggregated serving is a key optimization for LLMs. By splitting the compute-heavy prefill from the memory-heavy decode steps, we can scale those resources independently and get much better hardware utilization.
By standardizing this in GIE, we can:
Let different model servers (sglang, vLLM, etc.) use this optimization out-of-the-box with IGW.
Keep the complex orchestration logic inside the EPP, so the model servers/inference frameworks themselves can stay simpler.
The Proposed Protocol
The protocol we've been using in llm-d is pretty straightforward and would be a great starting point:
The EPP picks a pair of servers: one for prefill and one for decode.
The EPP adds a header: It injects the address of the prefill server into the request (e.g., X-Prefill-Endpoint: <host>:<port>).
The EPP routes to the decode server: The gateway sends the request to the chosen decode server.
The decode server coordinates: It's then up to the decode server to read the header and pull the KV cache from the prefill server.
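To make the decode-side step concrete, here's a minimal sketch of how a server could consume the header. Go is used only for brevity; real model servers (vLLM, sglang, ...) would implement the equivalent natively. The header name and the `<host>:<port>` format are just the working convention from the steps above, not a finalized spec.

```go
// Illustrative decode-side handling of the proposed header (not a spec).
package main

import (
	"log"
	"net"
	"net/http"
)

// prefillEndpoint extracts the prefill server address injected by the EPP.
// ok is false when the request was not disaggregated, in which case the
// decode server runs prefill itself.
func prefillEndpoint(r *http.Request) (host, port string, ok bool) {
	v := r.Header.Get("X-Prefill-Endpoint")
	if v == "" {
		return "", "", false
	}
	h, p, err := net.SplitHostPort(v)
	if err != nil {
		return "", "", false
	}
	return h, p, true
}

func main() {
	http.HandleFunc("/v1/completions", func(w http.ResponseWriter, r *http.Request) {
		if host, port, ok := prefillEndpoint(r); ok {
			// Pull the KV cache from the prefill server before decoding.
			// The transfer mechanism itself is out of scope for this protocol.
			log.Printf("pulling KV cache from %s:%s", host, port)
		}
		// ... run decode and stream the response ...
	})
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```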
The great thing is, this should all be doable with the existing scheduler plugin architecture. We won't need to touch the core framework. It would mostly be a new ProfileHandler that orchestrates two separate scheduling profiles (prefill-profile and decode-profile).
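As a strawman, here's roughly the shape such a ProfileHandler could take: always run the decode profile, run the prefill profile when disaggregating, and carry the prefill choice to the decode server via the header. The type and method names below are illustrative stand-ins, not the actual GIE plugins API; the real implementation would plug into the framework's existing ProfileHandler interface and types.

```go
// Strawman sketch of a P/D ProfileHandler (illustrative types, not the real API).
package disagg

import "context"

// Request and Result stand in for the scheduler's request and scheduling
// result types.
type Request struct {
	Headers map[string]string
	Prompt  string
}

type Result struct {
	Endpoint string // "<host>:<port>" of the chosen pod
}

// Profile stands in for a configured scheduling profile (filters + scorers).
type Profile interface {
	Run(ctx context.Context, req *Request) (*Result, error)
}

const prefillHeader = "X-Prefill-Endpoint"

// PDProfileHandler orchestrates the prefill-profile and decode-profile.
type PDProfileHandler struct {
	Prefill Profile
	Decode  Profile
}

// Schedule picks a decode endpoint and, when a prefill endpoint can also be
// picked, injects its address as a header. The gateway then routes the
// request to the decode endpoint.
func (h *PDProfileHandler) Schedule(ctx context.Context, req *Request) (*Result, error) {
	decode, err := h.Decode.Run(ctx, req)
	if err != nil {
		return nil, err
	}
	prefill, err := h.Prefill.Run(ctx, req)
	if err != nil {
		// Fall back to aggregated serving on the decode pod.
		return decode, nil
	}
	if req.Headers == nil {
		req.Headers = map[string]string{}
	}
	req.Headers[prefillHeader] = prefill.Endpoint
	return decode, nil
}
```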
Next Steps
This is the first step toward a formal GIEP. The plan would be to tackle this in two phases:
Milestone 1: Get the protocol defined and upstream the existing llm-d logic.
Milestone 2 (Future): Look into more advanced scheduling algorithms (e.g., deciding when to disaggregate based on sequence length).
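To make Milestone 2 slightly more concrete, the most trivial version of such a policy (extending the hypothetical sketch above) would just be a token-count threshold; the cutoff below is made up and would need to be tuned per model and hardware:

```go
// Toy Milestone 2 policy: only disaggregate when the prompt is long enough
// that remote prefill plus the KV-cache transfer is likely to beat
// prefilling locally on the decode pod.
func shouldDisaggregate(promptTokens int) bool {
	const minPrefillTokens = 2048 // illustrative cutoff, to be tuned (or learned)
	return promptTokens >= minPrefillTokens
}
```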
Greatly appreciate any feedback on this. Does this protocol seem reasonable? Any potential gotchas we should be thinking about?
Thanks!