feat: unify renderer + teacher endpoints onto vLLM 0.20 /inference/v1/generate #2408

Merged

hallerite merged 15 commits into main from feat/unify-inference-generate on May 11, 2026

Conversation

@hallerite (Member) commented May 3, 2026

What changes and why

vLLM 0.20 ships a generic tokens-in / tokens-out endpoint at /inference/v1/generate (vllm.entrypoints.serve.disagg.serving.ServingTokens) that supersedes the bespoke /v1/generate handler prime-rl maintained on top of vLLM 0.19. This PR moves prime-rl onto it.

Server

  • Drop serving_generate.py + the /v1/generate route. vLLM 0.20's build_app already attaches /inference/v1/generate via attach_disagg_router — the bespoke handler is now redundant.
  • Subclass ServingTokens as PrimeRlServingTokens. Two prime-rl features aren't covered by the upstream protocol (a hedged sketch follows this list):
    1. data_parallel_rank routing — read from the X-data-parallel-rank header and forwarded to engine_client.generate. The DP-replicated inference servers we run need this to target a specific replica.
    2. routed_experts per-token export — surfaced on each choice when the engine runs with enable_return_routed_experts=True. The trainer's router-replay path consumes this.
  • custom_init_app_state swaps the upstream serving_tokens instance for our subclass after init_app_state.
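A minimal sketch of the data_parallel_rank delta, assuming FastAPI-style header access and that engine_client.generate accepts a data_parallel_rank keyword (names are taken from the description above; the exact upstream signatures are not verified here):

```python
# Hedged sketch — models only the DP-rank header handling so it runs standalone.
# The real subclass derives from vllm.entrypoints.serve.disagg.serving.ServingTokens.
from typing import Mapping, Optional

DP_RANK_HEADER = "X-data-parallel-rank"

def parse_dp_rank(headers: Mapping[str, str]) -> Optional[int]:
    """Return the DP replica index requested by the client, or None if absent."""
    value = headers.get(DP_RANK_HEADER)
    return int(value) if value is not None else None

# Inside PrimeRlServingTokens.serve_tokens (assumed signature), the parsed rank
# would be forwarded roughly as:
#   dp_rank = parse_dp_rank(raw_request.headers)
#   result_generator = self.engine_client.generate(..., data_parallel_rank=dp_rank)
# The routed_experts export happens on the response side; see the capture
# sketch further down the thread.
```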

Orchestrator

  • compute_teacher_logprobs now points at /inference/v1/generate: it builds the upstream payload (token_ids + nested sampling_params) and re-flattens prompt_logprobs from the upstream list[dict[token_id, Logprob]] back to the list[float] callers expect.
  • Bypasses the SDK's parse layer without losing auth. The OpenAI SDK's cast_to enforces openai.BaseModel; vLLM's GenerateResponse is plain pydantic.BaseModel, so the SDK rejects it with a TypeError. Calling client.post(..., cast_to=httpx.Response, body=...) runs the full request pipeline (auth_headers, retries, timeouts, idempotency keys) and hands back the raw response for us to validate ourselves — going through client._client.post directly would skip auth. See the sketch below.
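A hedged sketch of the resulting call path. The endpoint, payload fields, and the cast_to=httpx.Response behavior follow the description above (per this thread, verified on openai==2.24.0); the host and the exact prompt_logprobs JSON layout are assumptions for illustration:

```python
# Hedged sketch of compute_teacher_logprobs as described above — not the
# literal implementation. HOST and the JSON field layout are assumptions.
import httpx
from openai import AsyncOpenAI

async def compute_teacher_logprobs_sketch(
    client: AsyncOpenAI, token_ids: list[int]
) -> list[float]:
    http_response: httpx.Response = await client.post(
        "http://HOST:8000/inference/v1/generate",  # absolute URL; not merged with base_url
        cast_to=httpx.Response,  # skip only the parse layer; keep auth/retries/timeouts
        body={
            "token_ids": token_ids,
            "sampling_params": {"max_tokens": 1, "prompt_logprobs": 1},
        },
    )
    http_response.raise_for_status()  # surface 4xx/5xx before parsing
    payload = http_response.json()
    # Upstream shape: list[dict[token_id, Logprob] | None], one entry per prompt
    # position (the first is None). Flatten to list[float] with 0.0 in slot 0.
    flat: list[float] = []
    for entry, token_id in zip(payload["prompt_logprobs"], token_ids):
        if entry is None:
            flat.append(0.0)
        else:
            flat.append(float(entry[str(token_id)]["logprob"]))
    return flat
```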

Tests

  • Replace test_serving_generate.py with test_serving_tokens.py covering the prime-rl deltas (routed_experts encoding, response shape stability).
  • Update test_teacher_logprobs.py for the new endpoint URL, payload shape, and response unwrap, and assert the AsyncOpenAI SDK doesn't double-prefix the absolute URL.

Renderers / verifiers pins

  • Pin renderers==0.1.6 (PyPI). First release after the renderers/verifiers monorepo split — same code as 9acdc60, now a published wheel. Declared as a direct prime-rl dependency since it was previously pulled transitively through verifiers' in-tree workspace package.
  • Bump verifiers to 7bdc769 for the post-split main. The matching renderers release carries the client-side switch to /inference/v1/generate.

Test plan

Unit

  • tests/unit/inference/test_serving_tokens.py (4 tests) and tests/unit/orchestrator/test_teacher_logprobs.py (1 test) pass locally against vllm 0.20.
  • Server module imports cleanly (prime_rl.inference.vllm.server, prime_rl.inference.vllm.serving_tokens).
  • Companion verifiers tests (packages/renderers/tests/test_client.py) green.
  • uv.lock regenerated against renderers==0.1.6 (PyPI) and verifiers 7bdc769.

E2E smoke matrix (5 steps each, all green, no NaN, no error)

| Test | Model | Path under test | Transport | Result |
| --- | --- | --- | --- | --- |
| A | Qwen3-4B-Instruct-2507 | TITO /v1/chat/completions/tokens (agentic, multi-turn) | NCCL | ✅ 5/5 steps, 9 960 TITO calls |
| B | Qwen3-30B-A3B-Thinking-2507 (MoE, EP=8) | Renderer → /inference/v1/generate | NCCL | ✅ 5/5 steps, 1 481 generate calls, 3 weight broadcasts |
| C | Qwen3-30B-A3B-Thinking-2507 (MoE, EP=8) | TITO config (single-turn → MITO /v1/chat/completions fallback)¹ | NCCL | ✅ 5/5 steps, 1 156 chat calls |
| D | Qwen3-0.6B (live inference server) | compute_teacher_logprobs → /inference/v1/generate (prompt_logprobs=1) | n/a | ✅ Flat list[float], first slot 0.0, all values ≤ 0 |

¹ reverse-text is single-turn, so the TITO client takes the MITO fallback at openai_chat_completions_token_client.py:114. A and B together cover both the TITO /tokens wire path and the new disagg path; C confirms 30B MoE weight reload doesn't regress on the unchanged TITO config.

WandB:

🤖 Generated with Claude Code


Note

Medium Risk
Moderate risk because it removes the legacy /v1/generate API and swaps in a custom subclass of vLLM’s ServingTokens, changing request/response wire shapes for renderer/teacher paths and generation defaults (max_tokens).

Overview
Unifies prime-rl's token-level generation onto vLLM 0.20's built-in /inference/v1/generate by removing the bespoke /v1/generate handler (serving_generate.py and its route) and swapping state.serving_tokens to a new PrimeRlServingTokens subclass during server init.

PrimeRlServingTokens forwards X-data-parallel-rank into engine_client.generate, preserves routed_experts export in non-streaming responses, and adds server-side defaulting for missing sampling_params.max_tokens to avoid vLLM’s implicit 16-token cap. The chat-with-tokens path reuses the new routed-experts capture base.

The orchestrator’s compute_teacher_logprobs is updated to call /inference/v1/generate with the new payload shape (token_ids + nested sampling_params) and to parse/flatten vLLM’s upstream prompt_logprobs format; tests are updated accordingly. Dependencies are adjusted by pinning renderers==0.1.6 (PyPI) and bumping verifiers to a newer git rev, with uv.lock regenerated.

Reviewed by Cursor Bugbot for commit 8637611.

hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 4, 2026
…package

Now that renderers lives in its own repo
(https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep
directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean
``generate()`` rewrite) and remove ``packages/renderers/`` from the
verifiers tree.

This also drops the ``uv pip install -e packages/renderers`` CI hack
introduced in c969123 — no longer needed once renderers resolves
through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers
v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the
constraint resolve from the trusted publisher.

Companion to:
  - PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
  - PrimeIntellect-ai/prime-rl#2408 (consumer migration)
@hallerite hallerite marked this pull request as ready for review May 5, 2026 10:46
hallerite added 2 commits May 7, 2026 14:15
…/generate

vLLM 0.20 ships a tokens-in / tokens-out endpoint at /inference/v1/generate
(disagg/serving.py) that supersedes the bespoke /v1/generate handler
prime-rl shipped on top of vllm 0.19. Replace it.

Server side:
- Drop src/prime_rl/inference/vllm/serving_generate.py and the /v1/generate
  route in server.py — vLLM 0.20's build_app already attaches
  /inference/v1/generate via attach_disagg_router.
- Subclass upstream's ServingTokens with PrimeRlServingTokens to preserve
  two prime-rl features the upstream protocol doesn't natively cover:
    1. data_parallel_rank routing — read from the X-data-parallel-rank
       header and forwarded to engine_client.generate.
    2. routed_experts per-token export — surfaced on each choice when
       the engine is launched with enable_return_routed_experts=True.
  custom_init_app_state swaps the upstream serving_tokens instance for our
  subclass.

Orchestrator side:
- compute_teacher_logprobs in orchestrator/utils.py points at
  /inference/v1/generate, builds the upstream payload (token_ids +
  nested sampling_params), and re-flattens prompt_logprobs from the
  upstream list[dict[token_id, Logprob]] shape back to the list[float]
  callers expect.

Tests:
- Replace test_serving_generate.py (class deleted) with
  test_serving_tokens.py — exercises the prime-rl deltas
  (routed_experts encoding, response shape stability).
- Update test_teacher_logprobs.py to expect the new endpoint URL,
  payload shape, and response unwrap.

Renderers pin:
- Bump renderers source to 9c0b738e on the verifiers repo so the
  client-side switch to /inference/v1/generate ships together.

Net: -1 endpoint, ~-275 LoC, no functional change for callers (renderer
client emits the same parsed response shape; teacher logprobs return
identical list[float]).
…to 7bdc769

Renderers moved out of the verifiers monorepo into their own repo
(verifiers#1282). Repoint the source from verifiers/packages/renderers
to PrimeIntellect-ai/renderers @ 9acdc60 and declare renderers as a
direct prime-rl dependency since it was previously transitively pulled
via verifiers' in-tree workspace package. Bump verifiers to 7bdc769 to
pick up the post-split main.

Pairs with the /inference/v1/generate switch — the renderer client at
9acdc60 emits the new endpoint shape.
@hallerite hallerite force-pushed the feat/unify-inference-generate branch from 794a588 to 0f16ddc on May 7, 2026 14:22
@hallerite hallerite changed the base branch from feat/vllm-0.20-cu13 to main May 7, 2026 14:22
hallerite and others added 2 commits May 7, 2026 14:30
Renderers 0.1.6 was published on PyPI today (commit 9acdc60 + version
bump). Switch from the git rev source to the canonical PyPI release —
keeps the same code (==0.1.6) but avoids depending on the renderers
git repo at install time.

Keeps `renderers = false` in `[tool.uv.exclude-newer-package]` since
0.1.6 is inside the 7-day cooldown window.
…generate

vLLM 0.20's ServingTokens hands the client-supplied SamplingParams to the
engine verbatim. SamplingParams.max_tokens defaults to 16 (a dataclass-level
default that predates the OpenAI-compat layer), so any caller that omits the
field gets a 16-token completion — long enough to start a sentence and stop
mid-word.

Other vLLM endpoints (/v1/chat/completions, /v1/completions, /v1/responses)
all mask this server-side via vllm.entrypoints.utils.get_max_tokens, which
falls back to max_model_len - prompt_len. The disagg endpoint skips that
path. Mirror it inside PrimeRlServingTokens so callers don't need a
client-side workaround.

Detection: re-read the cached request body to distinguish "client sent
max_tokens=16" from "client sent nothing → SamplingParams default 16".
On read failures, be pessimistic and assume the client did set it.

Drop once vLLM patches upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
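A hedged sketch of the defaulting logic this commit describes. The fallback formula (max_model_len - prompt_len) and the cached-body detection come from the commit text; the function shapes and the body-caching mechanism are assumptions:

```python
# Hedged sketch of the server-side max_tokens defaulting described above.
import json
from typing import Optional

DATACLASS_DEFAULT_MAX_TOKENS = 16  # SamplingParams' implicit default, per the commit

def client_sent_max_tokens(cached_body: Optional[bytes]) -> bool:
    """Re-read the cached request body; pessimistic (True) on any read/parse failure."""
    if cached_body is None:
        return True  # can't tell — assume the client set it
    try:
        payload = json.loads(cached_body)
    except ValueError:
        return True
    return "max_tokens" in (payload.get("sampling_params") or {})

def defaulted_max_tokens(
    current: int, cached_body: Optional[bytes], max_model_len: int, prompt_len: int
) -> int:
    """Mirror get_max_tokens-style defaulting: fall back to the context remainder."""
    if current == DATACLASS_DEFAULT_MAX_TOKENS and not client_sent_max_tokens(cached_body):
        return max(1, max_model_len - prompt_len)
    return current
```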
hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 8, 2026
…rompt_len"

This reverts commit 831f8bc.

The fix moved server-side: prime-rl's PrimeRlServingTokens now applies
get_max_tokens() defaulting in serve_tokens (PrimeIntellect-ai/prime-rl#2408,
commit 913cc4ca), matching every other vLLM endpoint. The client-side
workaround was always a band-aid and is no longer needed for prime-rl
deployments. Other vLLM 0.20 deployments hitting /inference/v1/generate
still need the upstream fix or to apply the prime-rl override locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
hallerite added 2 commits May 10, 2026 15:14
Includes #1307 fix(renderer-client): wrap tools in OpenAI envelope to match
training distribution, plus the v1 Taskset/Harness scaffolding from v0.1.14
and other post-release fixes.

Smoke-tested on Qwen3-4B-Instruct-2507 + general-agent-solver-local agentic:
step 0=25s, step 1=10s. No regressions vs the prior verifiers pin.
@hallerite hallerite force-pushed the feat/unify-inference-generate branch 3 times, most recently from 58d17cb to 30ccf52 on May 10, 2026 10:10
* Drop the dead "_ = (model_name, time.time())" no-op in
  serve_tokens_full_generator; move the Pydantic-silently-drops-id/model/
  usage comment to where it belongs.

* Dedupe encode_routed_experts: now public on serving_tokens;
  serving_chat_with_tokens.py imports it instead of carrying its own
  copy.

* Refactor serve_tokens_full_generator from a copy-paste of upstream
  with a single delta into a thin wrapper-class + super() + post_process,
  matching how serving_chat_with_tokens.chat_completion_full_generator
  is structured. The new _RoutedExpertsCapture wraps the engine result
  generator, captures routed_experts as it streams, then post_process
  rebuilds the response into PrimeRlGenerateResponse so the encoded
  experts surface in the JSON. Tracks future vllm changes for free.
@hallerite hallerite force-pushed the feat/unify-inference-generate branch from 30ccf52 to 1ec4d0b on May 10, 2026 10:15
hallerite added 2 commits May 10, 2026 20:12
…absolute URL

Add an assertion that AsyncOpenAI._prepare_url(absolute_url) returns
the URL unchanged. This guards against the hypothetical regression
where the SDK merges base_url + absolute_path, producing a malformed
URL like `http://h:8000/v1/http://h:8000/inference/v1/generate`.

The fake client used by this test only records the path string and
doesn't exercise URL preparation; the assertion calls _prepare_url
directly to verify the SDK's behavior matches our assumption (verified
on openai==2.24.0: absolute URLs short-circuit the merge via
httpx.URL.is_relative_url).
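A hedged sketch of that assertion. _prepare_url is a private SDK helper; the pass-through behavior is what this commit reports for openai==2.24.0, not a stable public contract:

```python
# Hedged sketch: guard against base_url + absolute-path merging in the SDK.
from openai import AsyncOpenAI

def test_absolute_url_passes_through_unchanged() -> None:
    client = AsyncOpenAI(base_url="http://h:8000/v1", api_key="test")
    absolute = "http://h:8000/inference/v1/generate"
    # Absolute URLs should short-circuit the merge (httpx.URL.is_relative_url
    # is False), never become http://h:8000/v1/http://h:8000/inference/v1/generate.
    assert str(client._prepare_url(absolute)) == absolute
```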
`AsyncOpenAI.post(cast_to=...)` enforces `cast_to` subclasses
`openai.BaseModel`. `vllm.entrypoints.serve.disagg.protocol.GenerateResponse`
is a vanilla `pydantic.BaseModel` and the SDK rejects it with
`TypeError: Pydantic models must subclass our base model type`. The fake
client in the unit test bypassed the SDK's parse pipeline so the bug
slipped through; verified end-to-end against a live vllm server.

Send the request through the underlying httpx client
(`AsyncOpenAI._client`, where auth and connection pool are already
wired) and validate the response JSON ourselves.

Update the unit test to mock at `client._client.post` instead of
`client.post` so it actually exercises the new code path. Comment in
`utils.py` now documents both escape hatches the SDK forces (URL
preservation + parse layer).
Going through ``client._client.post`` skips the SDK's auth pipeline —
``Authorization: Bearer <api_key>`` is added per-request in
``BaseClient._build_headers`` / ``auth_headers``, not on the underlying
httpx client. Silent regression for any deployment that fronts inference
behind an authenticating proxy (Bugbot, medium severity).

openai 2.24.0's ``AsyncAPIClient._process_response`` short-circuits when
``cast_to == httpx.Response`` and returns the raw response. That keeps
the full request pipeline — auth headers, retries, timeouts, idempotency
keys — and skips only the parse layer that rejected the vLLM
``GenerateResponse`` for not subclassing ``openai.BaseModel``.

Verified end-to-end against a live Qwen3-0.6B server with a synthetic
``pit_DEADBEEF_test_key``:
  URL: http://localhost:8123/inference/v1/generate
  Authorization: Bearer pit_DEADBEEF_test_key
  User-Agent: AsyncOpenAI/Python 2.24.0

Test updated to mock at ``client.post(cast_to=, body=)`` and synthesize
an ``httpx.Response`` mirroring the SDK short-circuit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Both serving_tokens.py and serving_chat_with_tokens.py carried byte-
identical __init__/__aiter__ implementations of _RoutedExpertsCapture —
only post_process differed (in-place mutation for chat where
ChatCompletionResponseChoice is extra='allow', vs response rebuild for
generate where GenerateResponseChoice isn't). Lift the streaming logic
into a shared base class so future fixes to capture only need to land
once (Bugbot, low severity).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
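A hedged sketch of the lifted base class. The class and attribute names follow the commit text; the engine output shape is an assumption:

```python
# Hedged sketch of the shared streaming-capture base described above. Only the
# lift of __init__/__aiter__ into one base class is the point; engine output
# objects are modeled as anything carrying an optional routed_experts attribute.
from typing import Any, AsyncIterator

class _RoutedExpertsCaptureBase:
    """Wrap an engine result generator, recording routed_experts as it streams."""

    def __init__(self, results: AsyncIterator[Any]) -> None:
        self._results = results
        self.routed_experts: list[Any] = []

    async def __aiter__(self) -> AsyncIterator[Any]:
        async for output in self._results:
            experts = getattr(output, "routed_experts", None)
            if experts is not None:
                self.routed_experts.append(experts)
            yield output

    def post_process(self, response: Any) -> Any:
        # Only this hook differs between subclasses: chat mutates choices in
        # place (extra='allow'); generate rebuilds into PrimeRlGenerateResponse.
        raise NotImplementedError
```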
…onfig

``getattr(x, "attr", {})`` returns ``None`` when ``x.attr is None`` — the
default only fires for missing attributes, not for attributes that exist
with a ``None`` value. ``mc.override_generation_config`` is declared
``dict[str, Any] = field(default_factory=dict)`` so in practice it's
always ``{}``, never ``None`` — and we're literally mirroring upstream
``OpenAIServingChat.__init__``, which uses the same idiom. But the
``or {}`` costs nothing and guards against a downstream caller ever
mutating the attribute to ``None``.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
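The getattr behavior in question, as a standalone check (plain Python semantics, nothing assumed):

```python
# getattr's default fires only for *missing* attributes, never for None values.
class Cfg:
    override_generation_config = None

assert getattr(Cfg, "override_generation_config", {}) is None   # default NOT used
assert getattr(Cfg, "missing_attr", {}) == {}                   # default used
assert (getattr(Cfg, "override_generation_config", {}) or {}) == {}  # the `or {}` guard
```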
hallerite added 2 commits May 11, 2026 18:12
vllm 0.20.1 -> 0.20.2 from main (#2468). Resolved uv.lock dependency-
metadata conflict by taking main's baseline (vllm 0.20.2) and re-running
`uv lock` so our verifiers `aa428f3` + renderers `==0.1.6` pins are
re-applied on top. Unit tests pass on the new pin (8/8).

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues (commit 8637611).

Context (src/prime_rl/orchestrator/utils.py):

```python
        },
    )
    response = GenerateResponse.model_validate_json(http_response.content)
    return [0.0 if lp is None else float(lp) for lp in response.prompt_logprobs or []]
```

Missing HTTP status code check on raw response

Medium Severity

Using cast_to=httpx.Response bypasses the OpenAI SDK's status-code validation in _process_response (which short-circuits and returns the raw response for this cast type). The comment says "just hands back the raw response for us to validate ourselves," but http_response.status_code is never checked before calling GenerateResponse.model_validate_json(http_response.content). If the server returns a 4xx/5xx error, model_validate_json will attempt to parse the error body as a GenerateResponse, producing a confusing pydantic.ValidationError instead of a clear HTTP error.
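A minimal fix along the lines the finding suggests — surface HTTP errors before parsing (httpx's raise_for_status is the standard call):

```python
# Check the status first so a 4xx/5xx surfaces as httpx.HTTPStatusError
# instead of a confusing pydantic.ValidationError from the error body.
http_response.raise_for_status()
response = GenerateResponse.model_validate_json(http_response.content)
```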

Additional Locations (1)



Context (src/prime_rl/inference/vllm/serving_tokens.py):

```python
return await self.serve_tokens_full_generator(
    request, result_generator, request_id, model_name, request_metadata
)
```

Full upstream method copy creates fragile maintenance burden

Medium Severity

serve_tokens duplicates ~116 lines of the upstream ServingTokens.serve_tokens verbatim to inject just two changes: data_parallel_rank at line 260 and max_tokens defaulting at lines 229-237. The entire engine-input construction, sampling-params setup, logging, tracing, and streaming-vs-full dispatch are copied. Any future upstream change (e.g., new parameters, bug fixes, flow changes) will silently diverge from this fork, making it a high-risk maintenance surface.


@hallerite hallerite merged commit 7c434ec into main May 11, 2026
19 of 22 checks passed
@hallerite hallerite deleted the feat/unify-inference-generate branch May 11, 2026 17:19