
feat: switch client to vLLM 0.20 /inference/v1/generate #1

Merged

hallerite merged 1 commit into main from feat/inference-v1-generate on May 5, 2026

Conversation

@hallerite
Member

Summary

vLLM 0.20 ships a unified tokens-in / tokens-out endpoint at /inference/v1/generate that supersedes the bespoke /v1/generate handler the prior client targeted. Migrate renderers.client onto the new endpoint and shed the OpenAI-chat-completions DNA accumulated in the old completions_request.

This is the renderers-side companion to the verifiers PR (PrimeIntellect-ai/verifiers#1282) and the prime-rl PR (PrimeIntellect-ai/prime-rl#2408). Lifted from verifiers#1282's packages/renderers/ now that renderers lives in its own repo.

What changed

renderers.client.generate() (was completions_request); a call sketch follows this list:

  • Structured sampling_params: dict forwarded to vLLM verbatim. No more extra_body fallback, _SAMPLING_KEYS allowlist, or max_completion_tokens ↔ max_tokens aliasing — those are OpenAI-SDK habits that don't apply here.
  • Top-level cache_salt / priority / extra_headers as named args (matching the wire shape; no rummaging through extra_body).
  • Lean result dict: drops id/created/model/usage (ChatCompletion-shaped fillers — the endpoint returns none of them); emits request_id (the field /inference/v1/generate actually returns) plus the renderer fields (content, reasoning_content, tool_calls, finish_reason, prompt/completion_ids, completion_logprobs, routed_experts).
  • stop_token_ids (from the renderer) and logprobs=1 are forced by us; everything else flows through.

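A minimal call sketch, assuming the shape described above. The renderers.client.generate name, the sampling_params / cache_salt / priority / extra_headers arguments, and the result keys come from this PR; the model and prompt_ids argument names and the async scaffolding are illustrative assumptions, not the actual signature.

```python
# Hypothetical usage of the new client call (names marked below are guesses).
import asyncio

from renderers.client import generate


async def main() -> None:
    result = await generate(
        model="Qwen/Qwen3-4B",              # assumed argument name
        prompt_ids=[151644, 872, 198],      # tokens in (assumed argument name)
        sampling_params={                   # forwarded to vLLM verbatim
            "temperature": 0.7,
            "top_p": 0.95,
            "max_tokens": 256,
        },
        cache_salt="rollout-0",             # top-level named args,
        priority=0,                         # matching the wire shape
        extra_headers={"x-run-id": "demo"},
    )

    # Lean result dict: request_id plus the renderer fields, none of the
    # ChatCompletion-shaped id/created/model/usage fillers.
    print(result["request_id"], result["finish_reason"])
    print(result["completion_ids"][:5], result["completion_logprobs"][:5])


asyncio.run(main())
```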
Kept: the finish_reason: stop → tool_calls promotion when the renderer extracts tool calls client-side (downstream agent loops genuinely depend on it), the AsyncOpenAI transport (auth + retries), and the overlong-prompt 4xx diagnostic.
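
In code terms, the retained promotion amounts to roughly the following (a sketch with assumed names, not the actual helper):

```python
# If the renderer parsed tool calls out of the completion text client-side,
# report "tool_calls" instead of a plain "stop" so downstream agent loops
# keep branching correctly. Function and parameter names are assumptions.
def promote_finish_reason(finish_reason: str, tool_calls: list) -> str:
    if tool_calls and finish_reason == "stop":
        return "tool_calls"
    return finish_reason
```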

Version bump

0.1.5 → 0.1.6. The wire format change is a break against v0.1.5 (which targets the legacy /generate route). Tag renderers-v0.1.6 once merged so the publish workflow (when added) picks it up.

Test plan

  • tests/test_client.py rewritten against the lean API; passes locally (uv run pytest tests/test_client.py).
  • Full test suite: 784 passed, 43 skipped, 1 xfailed.
  • ruff format / ruff check clean.
  • ty check renderers exits 0 (51 advisories surfaced as warnings, no errors).
  • e2e renderer rollout against a live vLLM 0.20 server (verifiers#1282 + prime-rl#2408 pinned to this client): 20-step multi_reverse_text RL run, 2688 calls to /inference/v1/generate, eval Avg@4 = 0.83.

hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 4, 2026
…package

Now that renderers lives in its own repo
(https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep
directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean
``generate()`` rewrite) and remove ``packages/renderers/`` from the
verifiers tree.

This also drops the ``uv pip install -e packages/renderers`` CI hack
introduced in c969123 — no longer needed once renderers resolves
through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers
v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the
constraint resolve from the trusted publisher.
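
For reference, the pin described above presumably looks something like this in the verifiers pyproject.toml (a sketch only; the renderers>=0.1.6 constraint and the 40bc2a6 rev are taken from this commit message, the table layout is assumed):

```toml
[project]
dependencies = ["renderers>=0.1.6"]

# Interim source override until renderers v0.1.6 is on PyPI.
[tool.uv.sources]
renderers = { git = "https://github.com/PrimeIntellect-ai/renderers", rev = "40bc2a6" }
```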

Companion to:
  - PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
  - PrimeIntellect-ai/prime-rl#2408 (consumer migration)

Replace the OpenAI-chat-completions-shaped ``completions_request`` with
a lean ``generate()`` built around what /inference/v1/generate actually
exposes:

- Structured ``sampling_params: dict`` arg, forwarded to vLLM verbatim.
  No more ``extra_body`` fallback, no ``_SAMPLING_KEYS`` allowlist, no
  ``max_completion_tokens`` ↔ ``max_tokens`` aliasing — those are
  OpenAI-SDK habits that don't apply here.
- Top-level ``cache_salt`` / ``priority`` / ``extra_headers`` as named
  args (matching the wire shape, no rummaging through extra_body).
- Result dict drops the ChatCompletion-shaped fillers (``id``,
  ``created``, ``model``, ``usage``); keeps ``request_id`` (the actual
  field /inference/v1/generate returns) and the renderer-specific
  fields (content, reasoning_content, tool_calls, finish_reason,
  prompt/completion_ids, completion_logprobs, routed_experts).
- ``stop_token_ids`` (from the renderer) and ``logprobs=1`` are forced
  by us; everything else flows through.

Kept: the ``finish_reason: stop → tool_calls`` promotion when the
renderer extracts tool calls client-side (downstream agent loops
genuinely depend on it), the AsyncOpenAI transport (auth + retries),
and the overlong-prompt 4xx diagnostic.

Bump version 0.1.5 → 0.1.6 — the wire format change is a break against
v0.1.5 (which targets the legacy /generate route). Tag renderers-v0.1.6
to publish.

Lifted from PrimeIntellect-ai/verifiers#1282 packages/renderers/ now
that this package lives in its own repo.
hallerite force-pushed the feat/inference-v1-generate branch from 40bc2a6 to 4a16ecc on May 5, 2026, 12:52
hallerite marked this pull request as ready for review on May 5, 2026, 12:52
hallerite merged commit 9acdc60 into main on May 5, 2026
6 checks passed
hallerite deleted the feat/inference-v1-generate branch on May 5, 2026, 12:57