
feat: switch client to vLLM 0.20 /inference/v1/generate #1

Merged

hallerite merged 1 commit into main from feat/inference-v1-generate on May 5, 2026

Conversation

@hallerite
Member

Summary

vLLM 0.20 ships a unified tokens-in / tokens-out endpoint at /inference/v1/generate that supersedes the bespoke /v1/generate handler the prior client targeted. Migrate renderers.client onto the new endpoint and shed the OpenAI-chat-completions DNA accumulated in the old completions_request.

This is the renderers-side companion to the verifiers PR (PrimeIntellect-ai/verifiers#1282) and the prime-rl PR (PrimeIntellect-ai/prime-rl#2408). Lifted from verifiers#1282's packages/renderers/ now that renderers lives in its own repo.

What changed

renderers.client.generate() (was completions_request); a call sketch follows this list:

  • Structured sampling_params: dict forwarded to vLLM verbatim. No more extra_body fallback, _SAMPLING_KEYS allowlist, or max_completion_tokens ↔ max_tokens aliasing — those are OpenAI-SDK habits that don't apply here.
  • Top-level cache_salt / priority / extra_headers as named args (matching the wire shape; no rummaging through extra_body).
  • Lean result dict: drops id/created/model/usage (ChatCompletion-shaped fillers — the endpoint returns none of them); emits request_id (the field /inference/v1/generate actually returns) plus the renderer fields (content, reasoning_content, tool_calls, finish_reason, prompt/completion_ids, completion_logprobs, routed_experts).
  • stop_token_ids (from the renderer) and logprobs=1 are forced by us; everything else flows through.

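A minimal call sketch, assuming the shape described above. The renderers.client.generate name, the sampling_params / cache_salt / priority / extra_headers arguments, and the result keys come from this PR; the model and prompt_ids argument names and the async scaffolding are illustrative assumptions, not the actual signature.

```python
# Hypothetical usage of the new client call (names marked below are guesses).
import asyncio

from renderers.client import generate


async def main() -> None:
    result = await generate(
        model="Qwen/Qwen3-4B",              # assumed argument name
        prompt_ids=[151644, 872, 198],      # tokens in (assumed argument name)
        sampling_params={                   # forwarded to vLLM verbatim
            "temperature": 0.7,
            "top_p": 0.95,
            "max_tokens": 256,
        },
        cache_salt="rollout-0",             # top-level named args,
        priority=0,                         # matching the wire shape
        extra_headers={"x-run-id": "demo"},
    )

    # Lean result dict: request_id plus the renderer fields, none of the
    # ChatCompletion-shaped id/created/model/usage fillers.
    print(result["request_id"], result["finish_reason"])
    print(result["completion_ids"][:5], result["completion_logprobs"][:5])


asyncio.run(main())
```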
Kept: the finish_reason: stop → tool_calls promotion when the renderer extracts tool calls client-side (downstream agent loops genuinely depend on it), the AsyncOpenAI transport (auth + retries), and the overlong-prompt 4xx diagnostic.
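
In code terms, the retained promotion amounts to roughly the following (a sketch with assumed names, not the actual helper):

```python
# If the renderer parsed tool calls out of the completion text client-side,
# report "tool_calls" instead of a plain "stop" so downstream agent loops
# keep branching correctly. Function and parameter names are assumptions.
def promote_finish_reason(finish_reason: str, tool_calls: list) -> str:
    if tool_calls and finish_reason == "stop":
        return "tool_calls"
    return finish_reason
```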

Version bump

0.1.5 → 0.1.6. The wire format change is a break against v0.1.5 (which targets the legacy /generate route). Tag renderers-v0.1.6 once merged so the publish workflow (when added) picks it up.

Test plan

  • tests/test_client.py rewritten against the lean API; passes locally (uv run pytest tests/test_client.py).
  • Full test suite: 784 passed, 43 skipped, 1 xfailed.
  • ruff format / ruff check clean.
  • ty check renderers exits 0 (51 advisories surfaced as warnings, no errors).
  • e2e renderer rollout against a live vLLM 0.20 server (verifiers#1282 + prime-rl#2408 pinned to this client): 20-step multi_reverse_text RL run, 2688 calls to /inference/v1/generate, eval Avg@4 = 0.83.

hallerite added a commit to PrimeIntellect-ai/verifiers that referenced this pull request May 4, 2026
…package

Now that renderers lives in its own repo
(https://github.com/PrimeIntellect-ai/renderers), pin the verifiers dep
directly at PrimeIntellect-ai/renderers#1's head (40bc2a6 — the lean
``generate()`` rewrite) and remove ``packages/renderers/`` from the
verifiers tree.

This also drops the ``uv pip install -e packages/renderers`` CI hack
introduced in c969123 — no longer needed once renderers resolves
through ``[tool.uv.sources]``.

Bump the version constraints to ``renderers>=0.1.6``. Once renderers
v0.1.6 publishes to PyPI, drop ``[tool.uv.sources]`` and let the
constraint resolve from the trusted publisher.
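
For reference, the pin described above presumably looks something like this in the verifiers pyproject.toml (a sketch only; the renderers>=0.1.6 constraint and the 40bc2a6 rev are taken from this commit message, the table layout is assumed):

```toml
[project]
dependencies = ["renderers>=0.1.6"]

# Interim source override until renderers v0.1.6 is on PyPI.
[tool.uv.sources]
renderers = { git = "https://github.com/PrimeIntellect-ai/renderers", rev = "40bc2a6" }
```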

Companion to:
  - PrimeIntellect-ai/renderers#1 (lean ``generate()`` rewrite)
  - PrimeIntellect-ai/prime-rl#2408 (consumer migration)

Replace the OpenAI-chat-completions-shaped ``completions_request`` with
a lean ``generate()`` built around what /inference/v1/generate actually
exposes:

- Structured ``sampling_params: dict`` arg, forwarded to vLLM verbatim.
  No more ``extra_body`` fallback, no ``_SAMPLING_KEYS`` allowlist, no
  ``max_completion_tokens`` ↔ ``max_tokens`` aliasing — those are
  OpenAI-SDK habits that don't apply here.
- Top-level ``cache_salt`` / ``priority`` / ``extra_headers`` as named
  args (matching the wire shape, no rummaging through extra_body).
- Result dict drops the ChatCompletion-shaped fillers (``id``,
  ``created``, ``model``, ``usage``); keeps ``request_id`` (the actual
  field /inference/v1/generate returns) and the renderer-specific
  fields (content, reasoning_content, tool_calls, finish_reason,
  prompt/completion_ids, completion_logprobs, routed_experts).
- ``stop_token_ids`` (from the renderer) and ``logprobs=1`` are forced
  by us; everything else flows through.

Kept: the ``finish_reason: stop → tool_calls`` promotion when the
renderer extracts tool calls client-side (downstream agent loops
genuinely depend on it), the AsyncOpenAI transport (auth + retries),
and the overlong-prompt 4xx diagnostic.

Bump version 0.1.5 → 0.1.6 — the wire format change is a break against
v0.1.5 (which targets the legacy /generate route). Tag renderers-v0.1.6
to publish.

Lifted from PrimeIntellect-ai/verifiers#1282 packages/renderers/ now
that this package lives in its own repo.
hallerite force-pushed the feat/inference-v1-generate branch from 40bc2a6 to 4a16ecc on May 5, 2026, 12:52
hallerite marked this pull request as ready for review on May 5, 2026, 12:52
hallerite merged commit 9acdc60 into main on May 5, 2026
6 checks passed
hallerite deleted the feat/inference-v1-generate branch on May 5, 2026, 12:57