feat: unify renderer + teacher endpoints onto vLLM 0.20 /inference/v1/generate #2408

Merged

hallerite merged 15 commits into main from feat/unify-inference-generate on May 11, 2026

Conversation

@hallerite (Member) commented May 3, 2026

What changes and why

vLLM 0.20 ships a generic tokens-in / tokens-out endpoint at /inference/v1/generate (vllm.entrypoints.serve.disagg.serving.ServingTokens) that supersedes the bespoke /v1/generate handler prime-rl maintained on top of vLLM 0.19. This PR moves prime-rl onto it.

Server

  • Drop serving_generate.py + the /v1/generate route. vLLM 0.20's build_app already attaches /inference/v1/generate via attach_disagg_router — the bespoke handler is now redundant.
  • Subclass ServingTokens as PrimeRlServingTokens. Two prime-rl features aren't covered by the upstream protocol (a hedged sketch follows this list):
    1. data_parallel_rank routing — read from the X-data-parallel-rank header and forwarded to engine_client.generate. The DP-replicated inference servers we run need this to target a specific replica.
    2. routed_experts per-token export — surfaced on each choice when the engine runs with enable_return_routed_experts=True. The trainer's router-replay path consumes this.
  • custom_init_app_state swaps the upstream serving_tokens instance for our subclass after init_app_state.
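A minimal sketch of the data_parallel_rank delta, assuming FastAPI-style header access and that engine_client.generate accepts a data_parallel_rank keyword (names are taken from the description above; the exact upstream signatures are not verified here):

```python
# Hedged sketch — models only the DP-rank header handling so it runs standalone.
# The real subclass derives from vllm.entrypoints.serve.disagg.serving.ServingTokens.
from typing import Mapping, Optional

DP_RANK_HEADER = "X-data-parallel-rank"

def parse_dp_rank(headers: Mapping[str, str]) -> Optional[int]:
    """Return the DP replica index requested by the client, or None if absent."""
    value = headers.get(DP_RANK_HEADER)
    return int(value) if value is not None else None

# Inside PrimeRlServingTokens.serve_tokens (assumed signature), the parsed rank
# would be forwarded roughly as:
#   dp_rank = parse_dp_rank(raw_request.headers)
#   result_generator = self.engine_client.generate(..., data_parallel_rank=dp_rank)
# The routed_experts export happens on the response side; see the capture
# sketch further down the thread.
```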

Orchestrator

  • compute_teacher_logprobs now points at /inference/v1/generate: it builds the upstream payload (token_ids + nested sampling_params) and re-flattens prompt_logprobs from the upstream list[dict[token_id, Logprob]] back to the list[float] callers expect.
  • Bypasses the SDK's parse layer without losing auth. The OpenAI SDK's cast_to enforces openai.BaseModel; vLLM's GenerateResponse is plain pydantic.BaseModel, so the SDK rejects it with a TypeError. Calling client.post(..., cast_to=httpx.Response, body=...) runs the full request pipeline (auth_headers, retries, timeouts, idempotency keys) and hands back the raw response for us to validate ourselves — going through client._client.post directly would skip auth. See the sketch below.
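A hedged sketch of the resulting call path. The endpoint, payload fields, and the cast_to=httpx.Response behavior follow the description above (per this thread, verified on openai==2.24.0); the host and the exact prompt_logprobs JSON layout are assumptions for illustration:

```python
# Hedged sketch of compute_teacher_logprobs as described above — not the
# literal implementation. HOST and the JSON field layout are assumptions.
import httpx
from openai import AsyncOpenAI

async def compute_teacher_logprobs_sketch(
    client: AsyncOpenAI, token_ids: list[int]
) -> list[float]:
    http_response: httpx.Response = await client.post(
        "http://HOST:8000/inference/v1/generate",  # absolute URL; not merged with base_url
        cast_to=httpx.Response,  # skip only the parse layer; keep auth/retries/timeouts
        body={
            "token_ids": token_ids,
            "sampling_params": {"max_tokens": 1, "prompt_logprobs": 1},
        },
    )
    http_response.raise_for_status()  # surface 4xx/5xx before parsing
    payload = http_response.json()
    # Upstream shape: list[dict[token_id, Logprob] | None], one entry per prompt
    # position (the first is None). Flatten to list[float] with 0.0 in slot 0.
    flat: list[float] = []
    for entry, token_id in zip(payload["prompt_logprobs"], token_ids):
        if entry is None:
            flat.append(0.0)
        else:
            flat.append(float(entry[str(token_id)]["logprob"]))
    return flat
```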

Tests

  • Replace test_serving_generate.py with test_serving_tokens.py covering the prime-rl deltas (routed_experts encoding, response shape stability).
  • Update test_teacher_logprobs.py for the new endpoint URL, payload shape, and response unwrap, and assert the AsyncOpenAI SDK doesn't double-prefix the absolute URL.

Renderers / verifiers pins

  • Pin renderers==0.1.6 (PyPI). First release after the renderers/verifiers monorepo split — same code as 9acdc60, now a published wheel. Declared as a direct prime-rl dependency since it was previously pulled transitively through verifiers' in-tree workspace package.
  • Bump verifiers to 7bdc769 for the post-split main. The matching renderers release carries the client-side switch to /inference/v1/generate.

Test plan

Unit

  • tests/unit/inference/test_serving_tokens.py (4 tests) and tests/unit/orchestrator/test_teacher_logprobs.py (1 test) pass locally against vllm 0.20.
  • Server module imports cleanly (prime_rl.inference.vllm.server, prime_rl.inference.vllm.serving_tokens).
  • Companion verifiers tests (packages/renderers/tests/test_client.py) green.
  • uv.lock regenerated against renderers==0.1.6 (PyPI) and verifiers 7bdc769.

E2E smoke matrix (5 steps each, all green, no NaN, no error)

| Test | Model | Path under test | Transport | Result |
| --- | --- | --- | --- | --- |
| A | Qwen3-4B-Instruct-2507 | TITO /v1/chat/completions/tokens (agentic, multi-turn) | NCCL | ✅ 5/5 steps, 9 960 TITO calls |
| B | Qwen3-30B-A3B-Thinking-2507 (MoE, EP=8) | Renderer → /inference/v1/generate | NCCL | ✅ 5/5 steps, 1 481 generate calls, 3 weight broadcasts |
| C | Qwen3-30B-A3B-Thinking-2507 (MoE, EP=8) | TITO config (single-turn → MITO /v1/chat/completions fallback)¹ | NCCL | ✅ 5/5 steps, 1 156 chat calls |
| D | Qwen3-0.6B (live inference server) | compute_teacher_logprobs → /inference/v1/generate (prompt_logprobs=1) | n/a | ✅ Flat list[float], first slot 0.0, all values ≤ 0 |

¹ reverse-text is single-turn, so the TITO client takes the MITO fallback at openai_chat_completions_token_client.py:114. A and B together cover both the TITO /tokens wire path and the new disagg path; C confirms 30B MoE weight reload doesn't regress on the unchanged TITO config.

WandB:

🤖 Generated with Claude Code


Note

Medium Risk
Moderate risk because it removes the legacy /v1/generate API and swaps in a custom subclass of vLLM’s ServingTokens, changing request/response wire shapes for renderer/teacher paths and generation defaults (max_tokens).

Overview
Unifies prime-rl's token-level generation onto vLLM 0.20's built-in /inference/v1/generate by removing the bespoke /v1/generate handler (serving_generate.py and its route) and swapping state.serving_tokens to a new PrimeRlServingTokens subclass during server init.

PrimeRlServingTokens forwards X-data-parallel-rank into engine_client.generate, preserves routed_experts export in non-streaming responses, and adds server-side defaulting for missing sampling_params.max_tokens to avoid vLLM’s implicit 16-token cap. The chat-with-tokens path reuses the new routed-experts capture base.

The orchestrator’s compute_teacher_logprobs is updated to call /inference/v1/generate with the new payload shape (token_ids + nested sampling_params) and to parse/flatten vLLM’s upstream prompt_logprobs format; tests are updated accordingly. Dependencies are adjusted by pinning renderers==0.1.6 (PyPI) and bumping verifiers to a newer git rev, with uv.lock regenerated.

Reviewed by Cursor Bugbot for commit 8637611.

hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 4, 2026
…package

Now that renderers lives in its own repo
(https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep
directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean
``generate()`` rewrite) and remove ``packages/renderers/`` from the
verifiers tree.

This also drops the ``uv pip install -e packages/renderers`` CI hack
introduced in c969123 — no longer needed once renderers resolves
through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers
v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the
constraint resolve from the trusted publisher.

Companion to:
  - PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
  - PrimeIntellect-ai/prime-rl#2408 (consumer migration)
@hallerite hallerite marked this pull request as ready for review May 5, 2026 10:46
hallerite added 2 commits May 7, 2026 14:15
…/generate

vLLM 0.20 ships a tokens-in / tokens-out endpoint at /inference/v1/generate
(disagg/serving.py) that supersedes the bespoke /v1/generate handler
prime-rl shipped on top of vllm 0.19. Replace it.

Server side:
- Drop src/prime_rl/inference/vllm/serving_generate.py and the /v1/generate
  route in server.py — vLLM 0.20's build_app already attaches
  /inference/v1/generate via attach_disagg_router.
- Subclass upstream's ServingTokens with PrimeRlServingTokens to preserve
  two prime-rl features the upstream protocol doesn't natively cover:
    1. data_parallel_rank routing — read from the X-data-parallel-rank
       header and forwarded to engine_client.generate.
    2. routed_experts per-token export — surfaced on each choice when
       the engine is launched with enable_return_routed_experts=True.
  custom_init_app_state swaps the upstream serving_tokens instance for our
  subclass.

Orchestrator side:
- compute_teacher_logprobs in orchestrator/utils.py points at
  /inference/v1/generate, builds the upstream payload (token_ids +
  nested sampling_params), and re-flattens prompt_logprobs from the
  upstream list[dict[token_id, Logprob]] shape back to the list[float]
  callers expect.

Tests:
- Replace test_serving_generate.py (class deleted) with
  test_serving_tokens.py — exercises the prime-rl deltas
  (routed_experts encoding, response shape stability).
- Update test_teacher_logprobs.py to expect the new endpoint URL,
  payload shape, and response unwrap.

Renderers pin:
- Bump renderers source to 9c0b738e on the verifiers repo so the
  client-side switch to /inference/v1/generate ships together.

Net: -1 endpoint, ~-275 LoC, no functional change for callers (renderer
client emits the same parsed response shape; teacher logprobs return
identical list[float]).
…to 7bdc769

Renderers moved out of the verifiers monorepo into their own repo
(verifiers#1282). Repoint the source from verifiers/packages/renderers
to PrimeIntellect-ai/renderers @ 9acdc60 and declare renderers as a
direct prime-rl dependency since it was previously transitively pulled
via verifiers' in-tree workspace package. Bump verifiers to 7bdc769 to
pick up the post-split main.

Pairs with the /inference/v1/generate switch — the renderer client at
9acdc60 emits the new endpoint shape.
@hallerite hallerite force-pushed the feat/unify-inference-generate branch from 794a588 to 0f16ddc on May 7, 2026 14:22
@hallerite hallerite changed the base branch from feat/vllm-0.20-cu13 to main May 7, 2026 14:22
hallerite and others added 2 commits May 7, 2026 14:30
Renderers 0.1.6 was published on PyPI today (commit 9acdc60 + version
bump). Switch from the git rev source to the canonical PyPI release —
keeps the same code (==0.1.6) but avoids depending on the renderers
git repo at install time.

Keeps `renderers = false` in `[tool.uv.exclude-newer-package]` since
0.1.6 is inside the 7-day cooldown window.
…generate

vLLM 0.20's ServingTokens hands the client-supplied SamplingParams to the
engine verbatim. SamplingParams.max_tokens defaults to 16 (a dataclass-level
default that predates the OpenAI-compat layer), so any caller that omits the
field gets a 16-token completion — long enough to start a sentence and stop
mid-word.

Other vLLM endpoints (/v1/chat/completions, /v1/completions, /v1/responses)
all mask this server-side via vllm.entrypoints.utils.get_max_tokens, which
falls back to max_model_len - prompt_len. The disagg endpoint skips that
path. Mirror it inside PrimeRlServingTokens so callers don't need a
client-side workaround.

Detection: re-read the cached request body to distinguish "client sent
max_tokens=16" from "client sent nothing → SamplingParams default 16".
On read failures, be pessimistic and assume the client did set it.

Drop once vLLM patches upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
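A hedged sketch of the defaulting logic this commit describes. The fallback formula (max_model_len - prompt_len) and the cached-body detection come from the commit text; the function shapes and the body-caching mechanism are assumptions:

```python
# Hedged sketch of the server-side max_tokens defaulting described above.
import json
from typing import Optional

DATACLASS_DEFAULT_MAX_TOKENS = 16  # SamplingParams' implicit default, per the commit

def client_sent_max_tokens(cached_body: Optional[bytes]) -> bool:
    """Re-read the cached request body; pessimistic (True) on any read/parse failure."""
    if cached_body is None:
        return True  # can't tell — assume the client set it
    try:
        payload = json.loads(cached_body)
    except ValueError:
        return True
    return "max_tokens" in (payload.get("sampling_params") or {})

def defaulted_max_tokens(
    current: int, cached_body: Optional[bytes], max_model_len: int, prompt_len: int
) -> int:
    """Mirror get_max_tokens-style defaulting: fall back to the context remainder."""
    if current == DATACLASS_DEFAULT_MAX_TOKENS and not client_sent_max_tokens(cached_body):
        return max(1, max_model_len - prompt_len)
    return current
```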
hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 8, 2026
…rompt_len"

This reverts commit 831f8bc.

The fix moved server-side: prime-rl's PrimeRlServingTokens now applies
get_max_tokens() defaulting in serve_tokens (PrimeIntellect-ai/prime-rl#2408,
commit 913cc4ca), matching every other vLLM endpoint. The client-side
workaround was always a band-aid and is no longer needed for prime-rl
deployments. Other vLLM 0.20 deployments hitting /inference/v1/generate
still need the upstream fix or to apply the prime-rl override locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
hallerite added 2 commits May 10, 2026 15:14
Includes #1307 fix(renderer-client): wrap tools in OpenAI envelope to match
training distribution, plus the v1 Taskset/Harness scaffolding from v0.1.14
and other post-release fixes.

Smoke-tested on Qwen3-4B-Instruct-2507 + general-agent-solver-local agentic:
step 0=25s, step 1=10s. No regressions vs the prior verifiers pin.
@hallerite hallerite force-pushed the feat/unify-inference-generate branch 3 times, most recently from 58d17cb to 30ccf52 on May 10, 2026 10:10
* Drop the dead "_ = (model_name, time.time())" no-op in
  serve_tokens_full_generator; move the Pydantic-silently-drops-id/model/
  usage comment to where it belongs.

* Dedupe encode_routed_experts: now public on serving_tokens;
  serving_chat_with_tokens.py imports it instead of carrying its own
  copy.

* Refactor serve_tokens_full_generator from a copy-paste of upstream
  with a single delta into a thin wrapper-class + super() + post_process,
  matching how serving_chat_with_tokens.chat_completion_full_generator
  is structured. The new _RoutedExpertsCapture wraps the engine result
  generator, captures routed_experts as it streams, then post_process
  rebuilds the response into PrimeRlGenerateResponse so the encoded
  experts surface in the JSON. Tracks future vllm changes for free.
@hallerite hallerite force-pushed the feat/unify-inference-generate branch from 30ccf52 to 1ec4d0b on May 10, 2026 10:15
hallerite added 2 commits May 10, 2026 20:12
…absolute URL

Add an assertion that AsyncOpenAI._prepare_url(absolute_url) returns
the URL unchanged. This guards against the hypothetical regression
where the SDK merges base_url + absolute_path, producing a malformed
URL like `http://h:8000/v1/http://h:8000/inference/v1/generate`.

The fake client used by this test only records the path string and
doesn't exercise URL preparation; the assertion calls _prepare_url
directly to verify the SDK's behavior matches our assumption (verified
on openai==2.24.0: absolute URLs short-circuit the merge via
httpx.URL.is_relative_url).
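A hedged sketch of that assertion. _prepare_url is a private SDK helper; the pass-through behavior is what this commit reports for openai==2.24.0, not a stable public contract:

```python
# Hedged sketch: guard against base_url + absolute-path merging in the SDK.
from openai import AsyncOpenAI

def test_absolute_url_passes_through_unchanged() -> None:
    client = AsyncOpenAI(base_url="http://h:8000/v1", api_key="test")
    absolute = "http://h:8000/inference/v1/generate"
    # Absolute URLs should short-circuit the merge (httpx.URL.is_relative_url
    # is False), never become http://h:8000/v1/http://h:8000/inference/v1/generate.
    assert str(client._prepare_url(absolute)) == absolute
```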
`AsyncOpenAI.post(cast_to=...)` enforces `cast_to` subclasses
`openai.BaseModel`. `vllm.entrypoints.serve.disagg.protocol.GenerateResponse`
is a vanilla `pydantic.BaseModel` and the SDK rejects it with
`TypeError: Pydantic models must subclass our base model type`. The fake
client in the unit test bypassed the SDK's parse pipeline so the bug
slipped through; verified end-to-end against a live vllm server.

Send the request through the underlying httpx client
(`AsyncOpenAI._client`, where auth and connection pool are already
wired) and validate the response JSON ourselves.

Update the unit test to mock at `client._client.post` instead of
`client.post` so it actually exercises the new code path. Comment in
`utils.py` now documents both escape hatches the SDK forces (URL
preservation + parse layer).
Going through ``client._client.post`` skips the SDK's auth pipeline —
``Authorization: Bearer <api_key>`` is added per-request in
``BaseClient._build_headers`` / ``auth_headers``, not on the underlying
httpx client. Silent regression for any deployment that fronts inference
behind an authenticating proxy (Bugbot, medium severity).

openai 2.24.0's ``AsyncAPIClient._process_response`` short-circuits when
``cast_to == httpx.Response`` and returns the raw response. That keeps
the full request pipeline — auth headers, retries, timeouts, idempotency
keys — and skips only the parse layer that rejected the vLLM
``GenerateResponse`` for not subclassing ``openai.BaseModel``.

Verified end-to-end against a live Qwen3-0.6B server with a synthetic
``pit_DEADBEEF_test_key``:
  URL: http://localhost:8123/inference/v1/generate
  Authorization: Bearer pit_DEADBEEF_test_key
  User-Agent: AsyncOpenAI/Python 2.24.0

Test updated to mock at ``client.post(cast_to=, body=)`` and synthesize
an ``httpx.Response`` mirroring the SDK short-circuit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Both serving_tokens.py and serving_chat_with_tokens.py carried byte-
identical __init__/__aiter__ implementations of _RoutedExpertsCapture —
only post_process differed (in-place mutation for chat where
ChatCompletionResponseChoice is extra='allow', vs response rebuild for
generate where GenerateResponseChoice isn't). Lift the streaming logic
into a shared base class so future fixes to capture only need to land
once (Bugbot, low severity).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
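A hedged sketch of the lifted base class. The class and attribute names follow the commit text; the engine output shape is an assumption:

```python
# Hedged sketch of the shared streaming-capture base described above. Only the
# lift of __init__/__aiter__ into one base class is the point; engine output
# objects are modeled as anything carrying an optional routed_experts attribute.
from typing import Any, AsyncIterator

class _RoutedExpertsCaptureBase:
    """Wrap an engine result generator, recording routed_experts as it streams."""

    def __init__(self, results: AsyncIterator[Any]) -> None:
        self._results = results
        self.routed_experts: list[Any] = []

    async def __aiter__(self) -> AsyncIterator[Any]:
        async for output in self._results:
            experts = getattr(output, "routed_experts", None)
            if experts is not None:
                self.routed_experts.append(experts)
            yield output

    def post_process(self, response: Any) -> Any:
        # Only this hook differs between subclasses: chat mutates choices in
        # place (extra='allow'); generate rebuilds into PrimeRlGenerateResponse.
        raise NotImplementedError
```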
…onfig

``getattr(x, "attr", {})`` returns ``None`` when ``x.attr is None`` — the
default only fires for missing attributes, not for attributes that exist
with a ``None`` value. ``mc.override_generation_config`` is declared
``dict[str, Any] = field(default_factory=dict)`` so in practice it's
always ``{}``, never ``None`` — and we're literally mirroring upstream
``OpenAIServingChat.__init__``, which uses the same idiom. But the
``or {}`` costs nothing and guards against a downstream caller ever
mutating the attribute to ``None``.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
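The getattr behavior in question, as a standalone check (plain Python semantics, nothing assumed):

```python
# getattr's default fires only for *missing* attributes, never for None values.
class Cfg:
    override_generation_config = None

assert getattr(Cfg, "override_generation_config", {}) is None   # default NOT used
assert getattr(Cfg, "missing_attr", {}) == {}                   # default used
assert (getattr(Cfg, "override_generation_config", {}) or {}) == {}  # the `or {}` guard
```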
hallerite added 2 commits May 11, 2026 18:12
vllm 0.20.1 -> 0.20.2 from main (#2468). Resolved uv.lock dependency-
metadata conflict by taking main's baseline (vllm 0.20.2) and re-running
`uv lock` so our verifiers `aa428f3` + renderers `==0.1.6` pins are
re-applied on top. Unit tests pass on the new pin (8/8).

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues (commit 8637611).

Context (src/prime_rl/orchestrator/utils.py):

```python
        },
    )
    response = GenerateResponse.model_validate_json(http_response.content)
    return [0.0 if lp is None else float(lp) for lp in response.prompt_logprobs or []]
```

Missing HTTP status code check on raw response

Medium Severity

Using cast_to=httpx.Response bypasses the OpenAI SDK's status-code validation in _process_response (which short-circuits and returns the raw response for this cast type). The comment says "just hands back the raw response for us to validate ourselves," but http_response.status_code is never checked before calling GenerateResponse.model_validate_json(http_response.content). If the server returns a 4xx/5xx error, model_validate_json will attempt to parse the error body as a GenerateResponse, producing a confusing pydantic.ValidationError instead of a clear HTTP error.
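A minimal fix along the lines the finding suggests — surface HTTP errors before parsing (httpx's raise_for_status is the standard call):

```python
# Check the status first so a 4xx/5xx surfaces as httpx.HTTPStatusError
# instead of a confusing pydantic.ValidationError from the error body.
http_response.raise_for_status()
response = GenerateResponse.model_validate_json(http_response.content)
```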

Additional Locations (1)



Context (src/prime_rl/inference/vllm/serving_tokens.py):

```python
return await self.serve_tokens_full_generator(
    request, result_generator, request_id, model_name, request_metadata
)
```

Full upstream method copy creates fragile maintenance burden

Medium Severity

serve_tokens duplicates ~116 lines of the upstream ServingTokens.serve_tokens verbatim to inject just two changes: data_parallel_rank at line 260 and max_tokens defaulting at lines 229-237. The entire engine-input construction, sampling-params setup, logging, tracing, and streaming-vs-full dispatch are copied. Any future upstream change (e.g., new parameters, bug fixes, flow changes) will silently diverge from this fork, making it a high-risk maintenance surface.


@hallerite hallerite merged commit 7c434ec into main May 11, 2026
19 of 22 checks passed
@hallerite hallerite deleted the feat/unify-inference-generate branch May 11, 2026 17:19