@Bihan Bihan commented Nov 6, 2025

Intro
We want to make it possible to create a gateway that extends the standard gateway functionality with additional features (all sgl-router features, such as cache-aware routing) while keeping all the standard gateway features (such as authentication and rate limits).

For the user, using such a gateway should be very simple, e.g. setting router to sglang in the gateway configuration. The rest should look the same to the user: the same service endpoint, authentication and rate limits working, etc.
While the first change brings only the minimum feature set (allowing replica traffic to be routed through the router: dstack's gateway/nginx -> sglang-router -> replica workers), in the future this may be extended with router-specific scaling metrics such as TTFT, E2E latency, disaggregated PD, etc.

For the first experimental version, the most critical task is to come up with the minimum set of thoroughly tested changes that allows embedding the router without breaking any existing functionality.

Key Changes

  1. Add src/dstack/_internal/core/models/routers.py
    Define router types and configuration models. The RouterType enum identifies available routers. Each router has its own config model (SGLangRouterConfig, VLLMRouterConfig) with router-specific options. AnyRouterConfig selects the correct config class based on the type field.

  2. Add router: AnyRouterConfig in GatewayConfiguration and in GatewayComputeConfiguration
    Ensure router config flows from user input → server → backend compute layer.

  3. Update gateway/pyproject.toml to include router packages as optional dependencies

  4. Update get_dstack_gateway_commands() in src/dstack/_internal/core/backends/base/compute.py to accept router config

  5. Update _update_gateway() in src/dstack/_internal/server/services/gateways/__init__.py to extract router_config

  6. Add abstract Router base class in src/dstack/_internal/proxy/gateway/model_routers/base.py
    Handles the router's lifecycle methods.

  7. Extend abstract Router base class and implement SGLangRouter in src/dstack/_internal/proxy/gateway/model_routers/sglang.py

  8. Add router register src/dstack/_internal/proxy/gateway/model_routers/__init__.py
    Implement the registry pattern (similar to dstack's backend configurators) for auto-discovery and lookup of available routers.

  9. Update src/dstack/_internal/proxy/gateway/services/nginx.py

  10. Update the upstream block of src/dstack/_internal/proxy/gateway/resources/nginx/service.jinja2 to forward requests when a router is defined.

  11. Add a new nginx config, src/dstack/_internal/proxy/gateway/resources/nginx/router_workers.jinja2, to expose service replicas on TCP ports. Later, we could avoid this extra proxying layer by switching from Unix sockets to TCP ports when opening SSH tunnels on the gateway.
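The config models from change 1 and the registry lookup from change 8 can be sketched roughly as follows. This is a simplified, hypothetical sketch using dataclasses (the real models would likely be pydantic-based, as elsewhere in dstack); parse_router_config and _CONFIG_CLASSES are illustrative names, not part of the PR.

```python
# Hypothetical sketch of src/dstack/_internal/core/models/routers.py
from dataclasses import dataclass
from enum import Enum
from typing import Union


class RouterType(str, Enum):
    SGLANG = "sglang"
    VLLM = "vllm"


@dataclass
class SGLangRouterConfig:
    type: RouterType = RouterType.SGLANG
    policy: str = "cache_aware"  # sglang-router load-balancing policy


@dataclass
class VLLMRouterConfig:
    type: RouterType = RouterType.VLLM


AnyRouterConfig = Union[SGLangRouterConfig, VLLMRouterConfig]

# Registry pattern (similar to dstack's backend configurators): map the
# type field to the matching config class.
_CONFIG_CLASSES = {
    RouterType.SGLANG: SGLangRouterConfig,
    RouterType.VLLM: VLLMRouterConfig,
}


def parse_router_config(raw: dict) -> AnyRouterConfig:
    """Select the correct config class based on the type field."""
    raw = dict(raw)
    router_type = RouterType(raw.pop("type"))
    return _CONFIG_CLASSES[router_type](type=router_type, **raw)


cfg = parse_router_config({"type": "sglang", "policy": "cache_aware"})
print(type(cfg).__name__, cfg.policy)  # SGLangRouterConfig cache_aware
```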

Serving Concurrent Services with SGLang Router
SGLang allows you to route multiple models through a single router, identifying different models by model_id (Link). We can use this to serve multiple services with a single sglang-router process.
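The idea can be sketched as a routing table keyed by model id: the router inspects the model field of each request and dispatches it to the matching service's replica pool. This is an illustrative simplification of sglang-router's behavior, not its implementation; the worker addresses are made up, and the model names mirror the service configs used later in this doc.

```python
# Hedged sketch: one router process serving several services, keyed by model id.
workers_by_model = {
    "sglang-service1.meta-llama/Llama-3.2-3B-Instruct": ["127.0.0.1:9001", "127.0.0.1:9002"],
    "sglang-service2.meta-llama/Llama-3.2-3B-Instruct": ["127.0.0.1:9003"],
}


def pick_worker_pool(payload: dict) -> list:
    """Select the replica pool for a /v1/chat/completions request by its model field."""
    return workers_by_model[payload["model"]]


pool = pick_worker_pool({"model": "sglang-service2.meta-llama/Llama-3.2-3B-Instruct"})
print(pool)  # ['127.0.0.1:9003']
```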

How Router Upgrade Works

Steps

  1. Let’s say the gateway currently has dstack-gateway 0.19.34 and sglang-router 0.2.1, and we are releasing dstack-gateway 0.19.35 with a new SGLang feature from the latest sglang-router (0.2.2)
  2. We bump the sglang-router version to 0.2.2 in gateway/pyproject.toml
  3. dstack server restart -> init_gateways() called -> _update_gateway() -> update.sh executes
  4. update.sh flips to the inactive venv; let’s say the new version is “green”
  5. It installs the new versions into the green venv
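The blue/green flip in steps 4–5 can be sketched in shell as below. This is a hedged illustration, not the actual update.sh: the paths are made up, the demo runs in a temp directory, and writing installed.txt stands in for the real pip install.

```shell
# Hedged sketch of the blue/green venv flip (paths illustrative, runs in a temp dir).
cd "$(mktemp -d)"
mkdir -p blue green
ln -sfn blue venv                          # "blue" is currently active
ACTIVE=$(basename "$(readlink venv)")
if [ "$ACTIVE" = "blue" ]; then TARGET=green; else TARGET=blue; fi
# Stand-in for: pip install dstack-gateway==0.19.35 (pulls in sglang-router 0.2.2)
echo "dstack-gateway==0.19.35 sglang-router==0.2.2" > "$TARGET/installed.txt"
ln -sfn "$TARGET" venv                     # switch to the freshly installed version
echo "now active: $(basename "$(readlink venv)")"
```

The symlink swap is what makes the upgrade safe: if the install into the inactive venv fails, the active venv is untouched.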

Gateway Service Restart and Gateway Instance Reboot
Router has been tested to successfully reconnect to replicas after both a gateway service restart and a full gateway instance reboot.

How to test

Step 1
Apply the gateway config below.

#gateway.dstack.yml
type: gateway
name: bihan-gateway


# Gateways are bound to a specific backend and region
backend: aws
region: eu-west-1

# This domain will be used to access the endpoint
domain: example.com
router:
  type: sglang
  policy: cache_aware

Step 2
Update DNS

Step 3
We want to test with multiple services; therefore, apply the service configs below.

Config1

#sglang-service1.yml
type: service
name: sglang-service1

python: 3.12
nvcc: true

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct
  - HF_HUB_DISABLE_XET=1
  - HF_HUB_ENABLE_HF_TRANSFER=0

commands:
  - pip install --upgrade pip
  - pip install uv
  - uv pip install sglang --prerelease=allow
  - python -m sglang.launch_server --model-path $MODEL_ID --host 0.0.0.0 --port 8000 --enable-metrics

port: 8000
model: sglang-service1.meta-llama/Llama-3.2-3B-Instruct

resources:
  gpu: 24GB

replicas: 0..2
scaling:
   metric: rps
   target: 1

Config2

#sglang-service2.yml
type: service
name: sglang-service2

python: 3.12
nvcc: true

env:
  - HF_TOKEN
  - MODEL_ID=meta-llama/Llama-3.2-3B-Instruct
  - HF_HUB_DISABLE_XET=1
  - HF_HUB_ENABLE_HF_TRANSFER=0

commands:
  - pip install --upgrade pip
  - pip install uv
  - uv pip install sglang --prerelease=allow
  - python -m sglang.launch_server --model-path $MODEL_ID --host 0.0.0.0 --port 8000 --enable-metrics

port: 8000
model: sglang-service2.meta-llama/Llama-3.2-3B-Instruct

resources:
  gpu: 24GB

replicas: 0..2
scaling:
   metric: rps
   target: 1

Step 4
To automate requests and test autoscaling, you can use the script below: autoscale_test_sglang.py

import asyncio
import aiohttp
import time

# ==== Configuration ====
URL = "https://sglang-service1.example.com/v1/chat/completions" # <-- replace with your endpoint
TOKEN = "esdfds3263-c36d-41db-ba9b-0d31df4efb15e"   # <-- replace with your token
RPS = 2            # requests per second
DURATION = 1800        # duration in seconds
# =======================

HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {TOKEN}"
}

PAYLOAD = {
    "model": "sglang-service1.meta-llama/Llama-3.2-3B-Instruct", #<--replace with your model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"}
    ]
}


async def send_request(session, idx):
    """Send a single request and print full response"""
    try:
        async with session.post(URL, headers=HEADERS, json=PAYLOAD) as resp:
            text = await resp.text()
            print(f"\n[{idx}] Status: {resp.status}")
            print(f"Response:\n{text}\n")
    except Exception as e:
        print(f"[{idx}] Error: {e}")


async def run_load_test():
    total_requests = RPS * DURATION
    interval = 1.0 / RPS

    async with aiohttp.ClientSession() as session:
        start_time = time.perf_counter()
        tasks = []

        for i in range(total_requests):
            tasks.append(asyncio.create_task(send_request(session, i + 1)))
            await asyncio.sleep(interval)

        await asyncio.gather(*tasks)

        elapsed = time.perf_counter() - start_time
        print(f"\n✅ Sent {total_requests} requests in {elapsed:.2f}s "
              f"(~{total_requests/elapsed:.2f} RPS)")


if __name__ == "__main__":
    asyncio.run(run_load_test())

Step 5
After updating the token and service endpoint, run the script from your local machine: python autoscale_test_sglang.py

Once the automated requests start hitting the service endpoint, dstack submits the job. When the service gets deployed and the /health check from sglang-router responds with 200, you will start to see responses from the model.
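The 200 response from the /health endpoint can be verified in isolation with a short sketch like the one below. A local stub server stands in for sglang-router here, since the real endpoint requires a deployed replica; the handler and URL are illustrative.

```python
# Hedged sketch: checking that a /health endpoint answers 200, as sglang-router's
# health check does once replicas are up. A local stub stands in for the router.
import http.server
import threading
import urllib.request


class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)  # report healthy
        self.end_headers()

    def log_message(self, *args):  # keep the demo output quiet
        pass


server = http.server.HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/health"
with urllib.request.urlopen(url) as resp:
    status = resp.status
print("health:", status)  # health: 200
server.shutdown()
```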

As the automated requests continue, dstack first scales up to 2 jobs. If we stop the requests, dstack scales down to 0 jobs.

Note:

  1. This PR uses "https://bihan-test-bucket.s3.eu-west-1.amazonaws.com/dstack_gateway-0.0.1-py3-none-any.whl" for testing. Once the PR is ready for merge, I will update it in src/dstack/_internal/core/backends/base/compute.py

  2. For testing, gateway/pyproject.toml has my fork as a dependency. I will update it once the PR is ready for merge.
