Add SGLang Router Support #3267
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Intro
We want to make it possible to create a gateway which extends the gateway functionality with additional features (all sgl-router features such as cache aware routing, etc) while keeping all the standard gateway features (such as authentication, rate limits).
For the user, using such gateway should be very simple, e.g. setting router to sglang. - for the gateway configurations. The rest for the user should look the same - the same service endpoint, authentication and rate limits working, etc.
While the first change should only bring minimum features - allow to route replicas traffic through the router (dstack’s gateway/ngnix -> sglang-router -> replica workers), in the future this may be extended with router-specific scaling metrics, such as ttft, e2e, dissagregated PD, etc).
As the first experimental version, the most critical is to come up with the minimum changes that are tested thoroughly that would allow embedding the router without breaking any existing functionality.
Key Changes
Add
src/dstack/_internal/core/models/routers.pyDefine router types and configuration models. The RouterType enum identifies available routers. Each router has its own config model (SGLangRouterConfig, VLLMRouterConfig) with router-specific options. AnyRouterConfig allows to select the correct config class based on the type field.
Add
router: AnyRouterConfiginGatewayConfigurationand inGatewayComputeConfigurationEnsure router config flows from user input → server → backend compute layer.
Update
gateway/pyproject.tomlto include router packages as optional dependenciesUpdate
get_dstack_gateway_commands()insrc/dstack/_internal/core/backends/base/compute.pyto accept router configUpdate
_update_gateway()insrc/dstack/_internal/server/services/gateways/__init__.pyto extract router_configAdd abstract Router base class in
src/dstack/_internal/proxy/gateway/model_routers/base.pyHandles lifecycle methods of router.
Extend abstract Router base class and implement SGLangRouter in
src/dstack/_internal/proxy/gateway/model_routers/sglang.pyAdd router register
src/dstack/_internal/proxy/gateway/model_routers/__init__.pyImplement the registry pattern (similar to dstack's backend configurators) for auto-discovery and lookup of available routers.
Update src/dstack/_internal/proxy/gateway/services/nginx.py
Update upstream block of src/dstack/_internal/proxy/gateway/resources/nginx/service.jinja2 to forward request when router is defined.
Add new nginx config to src/dstack/_internal/proxy/gateway/resources/nginx/router_workers.jinja2 to make service replicas's available to TCP port. Later we could avoid this extra proxying layer by switching from Unix sockets to TCP ports when opening SSH tunnels on the gateway.
Serving Concurrent Services with SGLang Router
SGLang allows you to route multiple models through the same single router. It identifies different models using model_id. (Link). We can utilize this to serve multiple services using the single sglang-router process.
How Router Upgrade Works
Steps
Gateway Service Restart and Gateway Instance Reboot
Router has been tested to successfully reconnect to replicas after both a gateway service restart and a full gateway instance reboot.
How to test
Step 1
Apply Below Gateway Config
Step 2
Update DNS
Step 3
We want to test with multiple services therefore, apply below service configs.
Config1
Config2
Step 3
To automate request and test autoscaling, you can use below script:
autoscale_test_sglang.pyStep 6
After updating token and service endpoint, run above script
python autoscale_test_sglang.pyfrom your local machine.Once the automated requests start hitting the service endpoint; dstack submits the job. When the service get's deployed and /health check from sglang-router responds with 200 as shown below, you will start to see response from the model.
As the automated requests continue, first dstack scales up to 2 jobs. If we stop the requests, dstack scales down to 0 jobs.
Note:
This PR uses "https://bihan-test-bucket.s3.eu-west-1.amazonaws.com/dstack_gateway-0.0.1-py3-none-any.whl" for testing. Later once the PR is ready for merge, I will update it in
src/dstack/_internal/core/backends/base/compute.pyFor testing
gateway/pyproject.tomlhas my fork as dependency. I will update it once the PR is ready for merge.