feat: switch client to vLLM 0.20 /inference/v1/generate #1
Merged
hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request on May 4, 2026:
…package

Now that renderers lives in its own repo (https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean ``generate()`` rewrite) and remove ``packages/renderers/`` from the verifiers tree. This also drops the ``uv pip install -e packages/renderers`` CI hack introduced in c969123 — no longer needed once renderers resolves through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the constraint resolve from the trusted publisher.

Companion to:
- PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
- PrimeIntellect-ai/prime-rl#2408 (consumer migration)
Replace the OpenAI-chat-completions-shaped ``completions_request`` with a lean ``generate()`` built around what /inference/v1/generate actually exposes:

- Structured ``sampling_params: dict`` arg, forwarded to vLLM verbatim. No more ``extra_body`` fallback, no ``_SAMPLING_KEYS`` allowlist, no ``max_completion_tokens`` ↔ ``max_tokens`` aliasing — those are OpenAI-SDK habits that don't apply here.
- Top-level ``cache_salt`` / ``priority`` / ``extra_headers`` as named args (matching the wire shape, no rummaging through extra_body).
- Result dict drops the ChatCompletion-shaped fillers (``id``, ``created``, ``model``, ``usage``); keeps ``request_id`` (the actual field /inference/v1/generate returns) and the renderer-specific fields (content, reasoning_content, tool_calls, finish_reason, prompt/completion_ids, completion_logprobs, routed_experts).
- ``stop_token_ids`` (from the renderer) and ``logprobs=1`` are forced by us; everything else flows through.

Kept: the ``finish_reason: stop → tool_calls`` promotion when the renderer extracts tool calls client-side (downstream agent loops genuinely depend on it), the AsyncOpenAI transport (auth + retries), and the overlong-prompt 4xx diagnostic.

Bump version 0.1.5 → 0.1.6 — the wire format change is a break against v0.1.5 (which targets the legacy /generate route). Tag renderers-v0.1.6 to publish.

Lifted from PrimeIntellect-ai/verifiers#1282's packages/renderers/ now that this package lives in its own repo.
Force-pushed from 40bc2a6 to 4a16ecc.
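The promotion the commit message above keeps is small; a minimal sketch, assuming the renderer hands back any client-side-extracted tool calls alongside the raw finish reason (the function name and signature are illustrative, not the package's actual internals):

```python
def promote_finish_reason(finish_reason: str, tool_calls: list | None) -> str:
    """Promote vLLM's `stop` to `tool_calls` when the renderer extracted
    tool calls client-side; downstream agent loops branch on this value."""
    if finish_reason == "stop" and tool_calls:
        return "tool_calls"
    return finish_reason
```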
Summary
vLLM 0.20 ships a unified tokens-in / tokens-out endpoint at `/inference/v1/generate` that supersedes the bespoke `/v1/generate` handler the prior client targeted. Migrate `renderers.client` onto the new endpoint and shed the OpenAI-chat-completions DNA accumulated in the old `completions_request`. This is the renderers-side companion to the verifiers PR (PrimeIntellect-ai/verifiers#1282) and the prime-rl PR (PrimeIntellect-ai/prime-rl#2408). Lifted from verifiers#1282's `packages/renderers/` now that renderers lives in its own repo.

What changed
- `renderers.client.generate()` (was `completions_request`; a hedged sketch follows this list):
  - `sampling_params: dict` forwarded to vLLM verbatim. No more `extra_body` fallback, `_SAMPLING_KEYS` allowlist, or `max_completion_tokens` ↔ `max_tokens` aliasing — those are OpenAI-SDK habits that don't apply here.
  - `cache_salt` / `priority` / `extra_headers` as named args (matching the wire shape; no rummaging through `extra_body`).
  - Drops `id` / `created` / `model` / `usage` (ChatCompletion-shaped fillers — the endpoint returns none of them); emits `request_id` (the field `/inference/v1/generate` actually returns) plus the renderer fields (content, reasoning_content, tool_calls, finish_reason, prompt/completion_ids, completion_logprobs, routed_experts).
  - `stop_token_ids` (from the renderer) and `logprobs=1` are forced by us; everything else flows through.
- Kept: the `finish_reason: stop → tool_calls` promotion when the renderer extracts tool calls client-side (downstream agent loops genuinely depend on it), the `AsyncOpenAI` transport (auth + retries), and the overlong-prompt 4xx diagnostic.
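To make the new shape concrete, here is a rough sketch of a `generate()` along these lines, reconstructed from the bullets above rather than copied from the package: the `prompt_ids` field name, the request-body nesting, and the raw-post wiring through `AsyncOpenAI` are all assumptions.

```python
from typing import Any

import httpx
from openai import AsyncOpenAI  # transport kept for auth + retries


async def generate(
    client: AsyncOpenAI,
    prompt_ids: list[int],            # tokens in (field name assumed)
    sampling_params: dict[str, Any],
    *,
    stop_token_ids: list[int],        # supplied by the renderer
    cache_salt: str | None = None,
    priority: int | None = None,
    extra_headers: dict[str, str] | None = None,
) -> dict[str, Any]:
    # stop_token_ids and logprobs=1 are forced; every other sampling
    # param flows through to vLLM verbatim (no allowlist, no aliasing).
    body: dict[str, Any] = {
        "prompt_ids": prompt_ids,
        "sampling_params": {
            **sampling_params,
            "stop_token_ids": stop_token_ids,
            "logprobs": 1,
        },
    }
    if cache_salt is not None:
        body["cache_salt"] = cache_salt
    if priority is not None:
        body["priority"] = priority

    # Raw post to a route outside the OpenAI schema; extra headers ride
    # along as request options.
    resp = await client.post(
        "/inference/v1/generate",
        cast_to=httpx.Response,
        body=body,
        options={"headers": extra_headers or {}},
    )
    # No ChatCompletion fillers (id/created/model/usage): the response keeps
    # request_id plus whatever the renderer layer derives from the output.
    return resp.json()
```

The `cast_to=httpx.Response` raw post is openai-python's escape hatch for undocumented routes; in practice the client would be constructed with the vLLM server's base URL, keeping the SDK's auth and retry handling without its chat-completions schema.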
Version bump

0.1.5 → 0.1.6. The wire format change is a break against v0.1.5 (which targets the legacy `/generate` route). Tag renderers-v0.1.6 once merged so the publish workflow (when added) picks it up.

Test plan
- `tests/test_client.py` rewritten against the lean API; passes locally (`uv run pytest tests/test_client.py`).
- `ruff format` / `ruff check` clean.
- `ty check renderers` exits 0 (51 advisories surfaced as warnings, no errors).
- `multi_reverse_text` RL run, 2688 calls to `/inference/v1/generate`, eval Avg@4 = 0.83.