feat: unify renderer + teacher endpoints onto vLLM 0.20 /inference/v1/generate #2408
Conversation
…package

Now that renderers lives in its own repo (https://github.com/PrimeIntellect-ai/renderers), pin verifiers' renderers dep directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean ``generate()`` rewrite) and remove ``packages/renderers/`` from the verifiers tree. This also drops the ``uv pip install -e packages/renderers`` CI hack introduced in c969123 — no longer needed once renderers resolves through ``[tool.uv.sources]``.

Bump the version constraint to ``renderers>=0.1.6``. Once renderers v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the constraint resolve from the trusted publisher.

Companion to:
- PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
- PrimeIntellect-ai/prime-rl#2408 (consumer migration)
…/generate
vLLM 0.20 ships a tokens-in / tokens-out endpoint at /inference/v1/generate
(disagg/serving.py) that supersedes the bespoke /v1/generate handler
prime-rl shipped on top of vllm 0.19. Replace it.
Server side:
- Drop src/prime_rl/inference/vllm/serving_generate.py and the /v1/generate
route in server.py — vLLM 0.20's build_app already attaches
/inference/v1/generate via attach_disagg_router.
- Subclass upstream's ServingTokens with PrimeRlServingTokens to preserve
two prime-rl features the upstream protocol doesn't natively cover:
1. data_parallel_rank routing — read from the X-data-parallel-rank
header and forwarded to engine_client.generate.
2. routed_experts per-token export — surfaced on each choice when
the engine is launched with enable_return_routed_experts=True.
custom_init_app_state swaps the upstream serving_tokens instance for our
subclass.
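For orientation, a rough sketch of the header-routing delta — the names and the thin super() call are illustrative only; the real override inlines far more of upstream serve_tokens (see Bugbot's maintenance note further down):

```python
from vllm.entrypoints.serve.disagg.serving import ServingTokens  # module path per this PR

class PrimeRlServingTokens(ServingTokens):
    async def serve_tokens(self, request, raw_request):
        # Route the request to a specific DP replica when the caller asks for one.
        dp_rank = raw_request.headers.get("X-data-parallel-rank")
        if dp_rank is not None:
            request.data_parallel_rank = int(dp_rank)
        return await super().serve_tokens(request, raw_request)
```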
Orchestrator side:
- compute_teacher_logprobs in orchestrator/utils.py points at
/inference/v1/generate, builds the upstream payload (token_ids +
nested sampling_params), and re-flattens prompt_logprobs from the
upstream list[dict[token_id, Logprob]] shape back to the list[float]
callers expect.
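A hedged sketch of that flattening, assuming JSON round-tripping leaves each position as `{str(token_id): {"logprob": ...}}` with `None` for the first token (helper name is ours, not the codebase's):

```python
def flatten_prompt_logprobs(
    prompt_logprobs: list[dict | None], prompt_token_ids: list[int]
) -> list[float]:
    """Collapse vLLM's per-position {token_id: Logprob} dicts to list[float]."""
    flat: list[float] = []
    for token_id, entry in zip(prompt_token_ids, prompt_logprobs):
        if entry is None:  # position 0 has no preceding context -> no logprob
            flat.append(0.0)
        else:
            # JSON keys arrive as strings; each value carries a "logprob" field.
            flat.append(float(entry[str(token_id)]["logprob"]))
    return flat
```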
Tests:
- Replace test_serving_generate.py (class deleted) with
test_serving_tokens.py — exercises the prime-rl deltas
(routed_experts encoding, response shape stability).
- Update test_teacher_logprobs.py to expect the new endpoint URL,
payload shape, and response unwrap.
Renderers pin:
- Bump renderers source to 9c0b738e on the verifiers repo so the
client-side switch to /inference/v1/generate ships together.
Net: -1 endpoint, ~-275 LoC, no functional change for callers (renderer
client emits the same parsed response shape; teacher logprobs return
identical list[float]).
…to 7bdc769

Renderers moved out of the verifiers monorepo into their own repo (verifiers#1282). Repoint the source from verifiers/packages/renderers to PrimeIntellect-ai/renderers @ 9acdc60 and declare renderers as a direct prime-rl dependency, since it was previously pulled transitively via verifiers' in-tree workspace package. Bump verifiers to 7bdc769 to pick up the post-split main.

Pairs with the /inference/v1/generate switch — the renderer client at 9acdc60 emits the new endpoint shape.
Force-pushed from 794a588 to 0f16ddc.
Renderers 0.1.6 was published on PyPI today (commit 9acdc60 + version bump). Switch from the git rev source to the canonical PyPI release — keeps the same code (==0.1.6) but avoids depending on the renderers git repo at install time. Keeps `renderers = false` in `[tool.uv.exclude-newer-package]` since 0.1.6 is inside the 7-day cooldown window.
…generate

vLLM 0.20's ServingTokens hands the client-supplied SamplingParams to the engine verbatim. SamplingParams.max_tokens defaults to 16 (a dataclass-level default that predates the OpenAI-compat layer), so any caller that omits the field gets a 16-token completion — long enough to start a sentence and stop mid-word.

Other vLLM endpoints (/v1/chat/completions, /v1/completions, /v1/responses) all mask this server-side via vllm.entrypoints.utils.get_max_tokens, which falls back to max_model_len - prompt_len. The disagg endpoint skips that path. Mirror it inside PrimeRlServingTokens so callers don't need a client-side workaround.

Detection: re-read the cached request body to tell "client sent max_tokens=16" apart from "client sent nothing → SamplingParams default 16". Pessimistic on read failures (assume the client did set it). Drop once vLLM patches upstream.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
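Roughly the shape of the defaulting + detection (a sketch under the assumptions above; the function name and body layout are illustrative, not vLLM internals):

```python
import json

SAMPLING_PARAMS_DEFAULT_MAX_TOKENS = 16  # SamplingParams dataclass default

def resolve_max_tokens(raw_body: bytes, requested: int,
                       max_model_len: int, prompt_len: int) -> int:
    if requested != SAMPLING_PARAMS_DEFAULT_MAX_TOKENS:
        return requested  # cannot be the silent dataclass default
    try:
        body = json.loads(raw_body)
        explicit = "max_tokens" in body.get("sampling_params", {})
    except (ValueError, TypeError, AttributeError):
        explicit = True  # pessimistic: on read failure, assume the client set 16
    if explicit:
        return requested
    # Mirror the get_max_tokens()-style fallback the other endpoints use.
    return max(1, max_model_len - prompt_len)
```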
…rompt_len"

This reverts commit 831f8bc.

The fix moved server-side: prime-rl's PrimeRlServingTokens now applies get_max_tokens() defaulting in serve_tokens (PrimeIntellect-ai/prime-rl#2408, commit 913cc4ca), matching every other vLLM endpoint. The client-side workaround was always a band-aid and is no longer needed for prime-rl deployments. Other vLLM 0.20 deployments hitting /inference/v1/generate still need the upstream fix or to apply the prime-rl override locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
# Conflicts:
#   uv.lock
Includes #1307 fix(renderer-client): wrap tools in OpenAI envelope to match training distribution, plus the v1 Taskset/Harness scaffolding from v0.1.14 and other post-release fixes. Smoke-tested on Qwen3-4B-Instruct-2507 + general-agent-solver-local agentic: step 0=25s, step 1=10s. No regressions vs the prior verifiers pin.
Force-pushed from 58d17cb to 30ccf52.
* Drop the dead "_ = (model_name, time.time())" no-op in serve_tokens_full_generator; move the Pydantic-silently-drops-id/model usage comment to where it belongs.
* Dedupe encode_routed_experts: now public on serving_tokens; serving_chat_with_tokens.py imports it instead of carrying its own copy.
* Refactor serve_tokens_full_generator from a copy-paste of upstream with a single delta into a thin wrapper-class + super() + post_process, matching how serving_chat_with_tokens.chat_completion_full_generator is structured. The new _RoutedExpertsCapture wraps the engine result generator, captures routed_experts as it streams, then post_process rebuilds the response into PrimeRlGenerateResponse so the encoded experts surface in the JSON. Tracks future vllm changes for free.
Force-pushed from 30ccf52 to 1ec4d0b.
…absolute URL

Add an assertion that AsyncOpenAI._prepare_url(absolute_url) returns the URL unchanged. This guards against the hypothetical regression where the SDK merges base_url + absolute_path, producing a malformed URL like `http://h:8000/v1/http://h:8000/inference/v1/generate`. The fake client used by this test only records the path string and doesn't exercise URL preparation; the assertion calls _prepare_url directly to verify the SDK's behavior matches our assumption (verified on openai==2.24.0: absolute URLs short-circuit the merge via httpx.URL.is_relative_url).
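The assertion amounts to something like this, leaning on the same private `_prepare_url` the commit message cites (behavior verified there against openai==2.24.0):

```python
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://h:8000/v1", api_key="test")
url = "http://h:8000/inference/v1/generate"
# An absolute URL must short-circuit the base_url merge, not concatenate onto it.
assert str(client._prepare_url(url)) == url
```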
`AsyncOpenAI.post(cast_to=...)` enforces that `cast_to` subclasses `openai.BaseModel`. `vllm.entrypoints.serve.disagg.protocol.GenerateResponse` is a vanilla `pydantic.BaseModel`, so the SDK rejects it with `TypeError: Pydantic models must subclass our base model type`. The fake client in the unit test bypassed the SDK's parse pipeline, so the bug slipped through; verified end-to-end against a live vllm server.

Send the request through the underlying httpx client (`AsyncOpenAI._client`, where auth and connection pool are already wired) and validate the response JSON ourselves. Update the unit test to mock at `client._client.post` instead of `client.post` so it actually exercises the new code path. The comment in `utils.py` now documents both escape hatches the SDK forces (URL preservation + parse layer).
Going through ``client._client.post`` skips the SDK's auth pipeline — ``Authorization: Bearer <api_key>`` is added per-request in ``BaseClient._build_headers`` / ``auth_headers``, not on the underlying httpx client. Silent regression for any deployment that fronts inference behind an authenticating proxy (Bugbot, medium severity).

openai 2.24.0's ``AsyncAPIClient._process_response`` short-circuits when ``cast_to == httpx.Response`` and returns the raw response. That keeps the full request pipeline — auth headers, retries, timeouts, idempotency keys — and skips only the parse layer that rejected the vLLM ``GenerateResponse`` for not subclassing ``openai.BaseModel``.

Verified end-to-end against a live Qwen3-0.6B server with a synthetic ``pit_DEADBEEF_test_key``:

    URL: http://localhost:8123/inference/v1/generate
    Authorization: Bearer pit_DEADBEEF_test_key
    User-Agent: AsyncOpenAI/Python 2.24.0

Test updated to mock at ``client.post(cast_to=, body=)`` and synthesize an ``httpx.Response`` mirroring the SDK short-circuit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
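Roughly what the surviving call site looks like — a sketch; the host, helper name, and exact payload fields follow this PR's description rather than any published API:

```python
import httpx
from openai import AsyncOpenAI
from vllm.entrypoints.serve.disagg.protocol import GenerateResponse  # path per this PR

async def fetch_generate(client: AsyncOpenAI, prompt_ids: list[int]) -> GenerateResponse:
    http_response = await client.post(
        "http://server:8000/inference/v1/generate",  # absolute URL; see _prepare_url note above
        cast_to=httpx.Response,  # SDK short-circuits: full pipeline, parse layer skipped
        body={
            "token_ids": prompt_ids,
            "sampling_params": {"max_tokens": 1, "prompt_logprobs": 1},
        },
    )
    # The parse layer was skipped, so validate the JSON ourselves.
    return GenerateResponse.model_validate_json(http_response.content)
```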
Both serving_tokens.py and serving_chat_with_tokens.py carried byte-identical __init__/__aiter__ implementations of _RoutedExpertsCapture — only post_process differed (in-place mutation for chat, where ChatCompletionResponseChoice is extra='allow', vs. response rebuild for generate, where GenerateResponseChoice isn't). Lift the streaming logic into a shared base class so future fixes to capture only need to land once (Bugbot, low severity).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
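The lifted base plausibly looks like this (names illustrative; only post_process stays per-subclass):

```python
class _RoutedExpertsCaptureBase:
    """Wrap an engine result stream and record routed_experts as it drains."""

    def __init__(self, result_generator):
        self._results = result_generator
        self.routed_experts = []  # filled while the stream is consumed

    def __aiter__(self):
        return self._capture()

    async def _capture(self):
        async for result in self._results:
            experts = getattr(result, "routed_experts", None)
            if experts is not None:
                self.routed_experts.append(experts)
            yield result

    def post_process(self, response):
        # chat: mutate choices in place (extra='allow');
        # generate: rebuild the response into PrimeRlGenerateResponse.
        raise NotImplementedError
```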
…onfig
``getattr(x, "attr", {})`` returns ``None`` when ``x.attr is None`` — the
default only fires for missing attributes, not for attributes that exist
with a ``None`` value. ``mc.override_generation_config`` is declared
``dict[str, Any] = field(default_factory=dict)`` so in practice it's
always ``{}``, never ``None`` — and we're literally mirroring upstream
``OpenAIServingChat.__init__``, which uses the same idiom. But the
``or {}`` costs nothing and guards against a downstream caller ever
mutating the attribute to ``None``.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
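The footgun, as a runnable two-assert illustration (toy class, not project code):

```python
class ModelConfig:
    override_generation_config = None  # imagine a downstream caller nulled it out

mc = ModelConfig()
# The getattr default only fires for *missing* attributes, not None-valued ones:
assert getattr(mc, "override_generation_config", {}) is None
# The `or {}` guard restores the intended "always a dict" contract:
assert (getattr(mc, "override_generation_config", {}) or {}) == {}
```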
vllm 0.20.1 -> 0.20.2 from main (#2468). Resolved uv.lock dependency-metadata conflict by taking main's baseline (vllm 0.20.2) and re-running `uv lock` so our verifiers `aa428f3` + renderers `==0.1.6` pins are re-applied on top. Unit tests pass on the new pin (8/8).
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 8637611.
    },
    )
    return [0.0 if lp is None else float(lp) for lp in response.prompt_logprobs or []]
    response = GenerateResponse.model_validate_json(http_response.content)
Missing HTTP status code check on raw response
Medium Severity
Using cast_to=httpx.Response bypasses the OpenAI SDK's status-code validation in _process_response (which short-circuits and returns the raw response for this cast type). The comment says "just hands back the raw response for us to validate ourselves," but http_response.status_code is never checked before calling GenerateResponse.model_validate_json(http_response.content). If the server returns a 4xx/5xx error, model_validate_json will attempt to parse the error body as a GenerateResponse, producing a confusing pydantic.ValidationError instead of a clear HTTP error.
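The guard Bugbot is asking for would slot in just before the parse, e.g.:

```python
# Sketch: fail loudly on HTTP errors instead of feeding an error body to pydantic.
http_response.raise_for_status()  # httpx raises HTTPStatusError on 4xx/5xx
response = GenerateResponse.model_validate_json(http_response.content)
```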
Additional Locations (1)
    return await self.serve_tokens_full_generator(
        request, result_generator, request_id, model_name, request_metadata
    )
Full upstream method copy creates fragile maintenance burden
Medium Severity
serve_tokens duplicates ~116 lines of the upstream ServingTokens.serve_tokens verbatim to inject just two changes: data_parallel_rank at line 260 and max_tokens defaulting at lines 229-237. The entire engine-input construction, sampling-params setup, logging, tracing, and streaming-vs-full dispatch are copied. Any future upstream change (e.g., new parameters, bug fixes, flow changes) will silently diverge from this fork, making it a high-risk maintenance surface.
What changes and why
vLLM 0.20 ships a generic tokens-in / tokens-out endpoint at `/inference/v1/generate` (`vllm.entrypoints.serve.disagg.serving.ServingTokens`) that supersedes the bespoke `/v1/generate` handler prime-rl maintained on top of vllm 0.19. This PR moves prime-rl onto it.

Server
- Drop `serving_generate.py` + the `/v1/generate` route. vLLM 0.20's `build_app` already attaches `/inference/v1/generate` via `attach_disagg_router` — the bespoke handler is now redundant.
- Subclass `ServingTokens` as `PrimeRlServingTokens`. Two prime-rl features aren't covered by the upstream protocol:
  - `data_parallel_rank` routing — read from the `X-data-parallel-rank` header and forwarded to `engine_client.generate`. The DP-replicated inference servers we run need this to target a specific replica.
  - `routed_experts` per-token export — surfaced on each choice when the engine runs with `enable_return_routed_experts=True`. The trainer's router-replay path consumes this.
- `custom_init_app_state` swaps the upstream `serving_tokens` instance for our subclass after `init_app_state`.

Orchestrator
- `compute_teacher_logprobs` points at `/inference/v1/generate`. Builds the upstream payload (`token_ids` + nested `sampling_params`) and re-flattens `prompt_logprobs` from the upstream `list[dict[token_id, Logprob]]` back to the `list[float]` callers expect.
- `cast_to` enforces `openai.BaseModel`; vLLM's `GenerateResponse` is plain `pydantic.BaseModel` and the SDK rejects it with `TypeError`. Calling `client.post(..., cast_to=httpx.Response, body=...)` runs the full request pipeline (`auth_headers`, retries, timeouts, idempotency keys) and just hands back the raw response for us to validate ourselves — going through `client._client.post` directly would skip auth.

Tests
- Replace `test_serving_generate.py` with `test_serving_tokens.py` covering the prime-rl deltas (`routed_experts` encoding, response shape stability).
- Update `test_teacher_logprobs.py` for the new endpoint URL, payload shape, and response unwrap, and assert the AsyncOpenAI SDK doesn't double-prefix the absolute URL.

Renderers / verifiers pins
- Pin `renderers==0.1.6` (PyPI). First release after the renderers/verifiers monorepo split — same code as `9acdc60`, now a published wheel. Declared as a direct prime-rl dependency since it was previously pulled transitively through verifiers' in-tree workspace package.
- Bump `verifiers` to `7bdc769` for the post-split main. The matching renderers release carries the client-side switch to `/inference/v1/generate`.

Test plan
Unit
- `tests/unit/inference/test_serving_tokens.py` (4 tests) and `tests/unit/orchestrator/test_teacher_logprobs.py` (1 test) pass locally against vllm 0.20.
- (`prime_rl.inference.vllm.server`, `prime_rl.inference.vllm.serving_tokens`).
- (`packages/renderers/tests/test_client.py`) green.
- `uv.lock` regenerated against `renderers==0.1.6` (PyPI) and verifiers `7bdc769`.

E2E smoke matrix (5 steps each, all green, no NaN, no error)
- `/v1/chat/completions/tokens` (agentic, multi-turn)
- `/inference/v1/generate` (`/v1/chat/completions` fallback)¹
- (`inference` server)
- `compute_teacher_logprobs` → `/inference/v1/generate` (`prompt_logprobs=1`): `list[float]`, first slot `0.0`, all values ≤ 0¹
¹ `reverse-text` is single-turn, so the TITO client takes the MITO fallback at `openai_chat_completions_token_client.py:114`. A and B together cover both the TITO `/tokens` wire path and the new disagg path; C confirms 30B MoE weight reload doesn't regress on the unchanged TITO config.

WandB:
- `4b-pr-smoke-agentic-5step`
- `pr-smoke-30b-rendered-rt-1010`
- `pr-smoke-30b-tito-rt-1010`

🤖 Generated with Claude Code
Note
Medium Risk
Moderate risk because it removes the legacy `/v1/generate` API and swaps in a custom subclass of vLLM's `ServingTokens`, changing request/response wire shapes for renderer/teacher paths and generation defaults (`max_tokens`).

Overview
Unifies prime-RL's token-level generation onto vLLM 0.20's built-in `/inference/v1/generate` by removing the bespoke `/v1/generate` handler (`serving_generate.py` and route) and swapping `state.serving_tokens` to a new `PrimeRlServingTokens` subclass during server init. `PrimeRlServingTokens` forwards `X-data-parallel-rank` into `engine_client.generate`, preserves `routed_experts` export in non-streaming responses, and adds server-side defaulting for missing `sampling_params.max_tokens` to avoid vLLM's implicit 16-token cap. The chat-with-tokens path reuses the new routed-experts capture base.

The orchestrator's `compute_teacher_logprobs` is updated to call `/inference/v1/generate` with the new payload shape (`token_ids` + nested `sampling_params`) and to parse/flatten vLLM's upstream `prompt_logprobs` format; tests are updated accordingly. Dependencies are adjusted by pinning `renderers==0.1.6` (PyPI) and bumping `verifiers` to a newer git rev, with `uv.lock` regenerated.