fix(gpt-oss): route harmony prompts through openai-harmony (refs #568) by CBribiescas · Pull Request #581 · waybarrios/vllm-mlx

CBribiescas · 2026-05-25T08:42:30Z

Closes #568.

Following @Thump604's recommended PR shape on #568 — you mentioned that the right answer is a focused harmony-rendering workstream, with a regression test, scoped to the harmony path, routed through the official openai-harmony library, and leaving non-harmony unchanged. This PR is exactly that shape. Happy to swap the rendering for an in-tree template if you'd prefer the no-extra-dep approach instead; everything else (the gate, the test coverage, the scope) stays the same either way.

Apologies for the volume of recent activity in this repo — #556, #562, #563, #564, #580, and now this. After this one I have nothing else in flight.

The bug, in one paragraph

extract_multimodal_content() (vllm_mlx/api/utils.py) text-flattens prior assistant tool_calls to bracket strings like [Calling tool: X(args)] whenever the active parser doesn't preserve native format. HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT = False, so GPT-OSS hits this path. The model then sees a malformed history (no commentary-channel tool calls, no functions.X to=assistant tool results) and converges to a one-shot final-channel response on multi-turn tasks. Locally measured: avg_turns_taken=1.0, zero tool_calls_made, on every CC task.

Flipping SUPPORTS_NATIVE_TOOL_FORMAT alone (which I tried as a first probe) makes things worse — the Jinja template renders structurally-correct-but-wire-format-mismatched harmony and the model emits a final answer in ~2.7 turns instead of ~5. Confirms that the template path itself is fragile to drift from what the weights expect, which is why the right fix is to route through OpenAI's canonical renderer.

Patch shape (matching @Thump604's bullets)

1. Regression test

tests/test_harmony_render.py — 9 tests:

single-turn user message renders with system + generation-prompt suffix
developer block renders the function namespace
section order: system precedes developer precedes the first user turn
assistant tool_calls render in the commentary channel addressed to functions.X (and bracket-text fallback does not appear — explicit anti-assertion)
role=tool messages render as <|start|>functions.X to=assistant… with the function name resolved by tracing back the most recent assistant tool_call_id
prior thinking (or reasoning_content) on an assistant turn renders in the analysis channel before the commentary tool call
arguments as dict gets JSON-serialized
prior assistant content without tool_calls renders in the final channel
prompt ends with <|start|>assistant

All 9 pass alongside the 124 existing tool-parser tests (no regression).

2. Native tool-call preservation only for the harmony path

HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT is left at False — non-harmony behavior in extract_multimodal_content doesn't change. The harmony path doesn't go through that function for prompt rendering at all; it constructs openai_harmony.Conversation objects directly from the OpenAI messages, with full tool_calls + role=tool plumbing happening inside vllm_mlx/utils/harmony_render.py.

3. Route through `openai-harmony`

vllm_mlx/utils/harmony_render.py (new ~200 LOC) does the conversion + rendering. Added openai-harmony as an optional extra in pyproject.toml (pip install vllm-mlx[harmony]); a missing install logs a warning at server startup and falls back to apply_chat_template, so the default install footprint is unchanged.

The gate: engine.use_harmony_rendering (default False, set by _detect_harmony_rendering() next to the existing engine.preserve_native_tool_format = _detect_native_tool_support() calls in server.py). Returns True only when all of:

--enable-auto-tool-choice is on
--tool-call-parser harmony or gpt-oss is set
openai-harmony is importable

In engine/simple.py's LLM stream_chat path, one if at the existing render site:

if getattr(self, "use_harmony_rendering", False):
    prompt = render_messages(safe_messages, tools=template_tools, ...)
else:
    prompt = tokenizer.apply_chat_template(safe_messages, **template_kwargs)

The system-prefix KV cache probe (which assumes the Jinja path) adds "harmony_rendering" to cache_blocking_controls so the probe/actual-prompt strings can't desynchronize.

4. Non-harmony behavior unchanged

No edits to any non-harmony parser, no edits to the legacy apply_chat_template codepath, no flag flips on parsers other than the deliberate gate. The 124 existing tool-parser tests pass.

Empirical signal

Locally, on a multi-turn Claude-Code-style agentic benchmark (20 tasks, T=0, gpt-oss-120b MXFP4-Q8):

variant	pass	avg_turns_taken	score
current upstream (text-flatten)	8/20	5.05	0.400
`SUPPORTS_NATIVE=True` only (Jinja renders mismatched harmony)	5/20	2.70	0.250
this PR (`use_harmony_rendering=True`, via openai-harmony lib)	12/20	6.30	0.600
ollama (reference, runs its own harmony template)	13/20	5.95	0.650

Effectively closes the gap to ollama on the same model weights.

What I'm asking for

Review. Happy to:

Swap to an in-tree template instead of the openai-harmony dependency if you'd prefer (the gate stays the same, the renderer is the only thing that swaps).
Add more test cases — the 9 I added cover the cases I know about; suggest others if there's a known edge case.
Split into smaller commits if that helps review.

The fork-side patch has been running my own production gpt-oss-120b traffic for ~24 hours without issue.

@Thump604

Addresses waybarrios#568. GPT-OSS prompts go through ``tokenizer.apply_chat_template`` today, but ``api.utils.extract_multimodal_content()`` text-flattens prior assistant ``tool_calls`` to ``[Calling tool: X(args)]`` bracket strings when the active tool parser doesn't preserve native format. That breaks multi-turn agentic workloads on GPT-OSS — the model sees a malformed conversation and falls back to one-shot final-channel responses (avg_turns ~1, no tool use). This patch adds a separate prompt-rendering path that, only when active, routes through OpenAI's canonical ``openai-harmony`` renderer instead of the Jinja template. The renderer accepts structured ``Conversation`` objects (messages + tool definitions) and emits the wire format the GPT-OSS weights were trained on, sidestepping the text-flattening upstream entirely. Per @Thump604's suggested patch shape on waybarrios#568: 1. **Regression test** — ``tests/test_harmony_render.py`` (9 tests) covers the multi-turn assistant-tool/tool-result rendering, section order, channel placement, generation-prompt suffix, and explicitly asserts the bracket-text fallback NEVER appears. 2. **Scoped narrowly** — engine reads ``self.use_harmony_rendering`` (default ``False``); set by the server only when ``--tool-call-parser harmony``/``gpt-oss`` is active AND ``openai-harmony`` is importable. The flag controls one ``if`` at the prompt-render site; non-harmony parsers and templates are untouched. 3. **Routed through openai-harmony** — new ``vllm_mlx/utils/harmony_render.py`` converts OpenAI-format messages + tools to a ``Conversation`` and calls ``HarmonyEncoding.render_conversation_for_completion``. Tool-call-id -> function-name resolution is done locally so the OpenAI ``role=tool`` shape (which lacks the function name field) flows through correctly. 4. **Non-harmony behavior unchanged** — when the harmony path is inactive (parser is anything other than harmony/gpt-oss, OR the optional ``openai-harmony`` package is not installed) the existing ``apply_chat_template`` codepath runs verbatim. The system-prefix KV cache probe (which assumes the Jinja path) is gated off for harmony requests so the probe/actual-prompt strings can't desynchronize. ``openai-harmony`` is added as an optional extra (``pip install vllm-mlx[harmony]``) so the default install footprint is unchanged. All 124 existing parser tests still pass alongside the 9 new harmony tests.

Thump604

Thanks for putting this into the narrower Harmony-rendering shape. I agree with the direction, but I see a served-path blocker before this can close #568.

Blocking items:

The new renderer is wired too late to avoid the existing flattening path. _prepare_chat_messages() still computes preserve_native = engine.preserve_native_tool_format, and for harmony / gpt-oss that remains false because HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT is still false. The LLM path then calls extract_multimodal_content(..., preserve_native_format=False), which converts assistant tool_calls into [Calling tool: ...] text and converts role=tool into a user text fallback before engine.simple.stream_chat() calls render_messages(). So the direct render_messages() tests prove the helper works on already-structured messages, but they do not prove the actual /v1/chat/completions path preserves the structures that #568 is about. Please either make the prep path preserve native messages when use_harmony_rendering is active, or route Harmony rendering before this flattening step, and add a regression through the server/prepare path that fails if [Calling tool: or [Tool Result reaches the Harmony renderer.
The new Harmony renderer is effectively untested in CI right now. In the 3.11 matrix, vllm_mlx/utils/harmony_render.py shows 0% coverage and the run reports skipped tests, because openai-harmony is not installed. Since this PR depends on an external API, at least one CI job should install the harmony extra and run tests/test_harmony_render.py against the real package. Otherwise import/API drift in the optional dependency can merge unnoticed.
Closes #568 is premature while the actual served path above is still unproven. Please use Refs #568 until the server-path regression is in place and passing.
CI lint is also failing because Black would reformat vllm_mlx/engine/simple.py and vllm_mlx/utils/harmony_render.py.

Not claiming the openai-harmony direction is wrong; the issue is that the currently patched path can still receive already-flattened messages and the CI job is not exercising the new renderer.

@Thump604

…age + black Per @Thump604's review: 1. The harmony renderer was wired too late: `_prepare_chat_messages()` ran `extract_multimodal_content(..., preserve_native_format=False)` first (because `HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT` is still False), flattening prior assistant `tool_calls` to `[Calling tool: ...]` bracket text + `role=tool` to text-stuffed `role=user` BEFORE `render_messages()` could see them. Fix: when `engine.use_harmony_rendering` is True, the prep path forces `preserve_native = True` locally so the structural shape survives into the harmony renderer. `HarmonyToolParser` keeps its public `SUPPORTS_NATIVE_TOOL_FORMAT=False` setting — the override is harmony-scoped and lives in server prep. 2. The new renderer is now exercised in CI: - `tests/test_harmony_render.py` added to the explicit pytest list under the no-MLX `test-matrix` job (covers Python 3.10/3.11/3.12/3.13). - `openai-harmony` added to that job's pip install line so the skipif gate doesn't silently skip. - Three new regression tests assert the end-to-end server prep path keeps tool_calls structured AND that the rendered prompt never contains `[Calling tool:` or `[Tool Result`. They fail loudly if the prep-path coupling ever regresses. 3. Will update the PR description on push: "Closes waybarrios#568" → "Refs waybarrios#568" until the maintainer merges and confirms the served path works in their environment. 4. Black would have reformatted `vllm_mlx/engine/simple.py` and `vllm_mlx/utils/harmony_render.py`; ran with project default style. All 127 parser + native-format + harmony tests pass.

CBribiescas · 2026-05-26T13:44:45Z

Pushed the review fixes in f827eee:

Prep path now preserves native when engine.use_harmony_rendering=True. The override lives in _prepare_chat_messages() (server.py) so HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT stays False and only the harmony-active path changes behavior.
Three new server-path regression tests (tests/test_harmony_render.py::TestServerPathPreservesNativeForHarmony):
- test_prep_path_preserves_tool_calls_for_harmony — tool_calls survive structurally
- test_rendered_prompt_after_prep_has_no_bracket_text — asserts [Calling tool: and [Tool Result never reach the renderer
- test_non_harmony_engine_falls_through_unchanged — guards against accidentally flipping native preservation for unrelated parsers
Title updated Closes #568 → refs #568.
Black-formatted vllm_mlx/engine/simple.py and vllm_mlx/utils/harmony_render.py.

All 127 parser + native-format + harmony tests pass locally.

One thing I couldn't push directly: the CI workflow change to install openai-harmony and add tests/test_harmony_render.py to the matrix job. My token doesn't have workflow scope (refusing to allow an OAuth App to create or update workflow ... without workflow scope). The change needed is small — feel free to apply this yourself when reviewing:

--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -63,7 +63,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install pytest anyio pytest-cov pydantic fastapi jsonschema httpx psutil transformers requests
+          pip install pytest anyio pytest-cov pydantic fastapi jsonschema httpx psutil transformers requests openai-harmony
 
       - name: Run unit tests (no MLX required)
         run: |
@@ -83,6 +83,7 @@ jobs:
             tests/test_anthropic_models.py \
             tests/test_anthropic_adapter.py \
             tests/test_harmony_parsers.py \
+            tests/test_harmony_render.py \
             tests/test_endpoint_model_policies.py \
             tests/test_gemma4_openai_format.py \
             tests/test_gemma4_streaming_edge.py \

Or, if you'd rather, I'm happy to grant my token the workflow scope and re-push myself — let me know.

…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).

Thump604 · 2026-05-30T21:55:24Z

I reran the focused coverage two ways:

python -m pytest tests/test_harmony_render.py tests/test_harmony_parsers.py tests/test_native_tool_format.py -q
# 54 passed, 12 skipped  (openai-harmony absent)

uv pip install --python /Users/David/code/vllm-mlx/.venv/bin/python \
  --target /tmp/openai-harmony-pr581 'openai-harmony>=0.0.8'
PYTHONPATH=/tmp/openai-harmony-pr581 \
  python -m pytest tests/test_harmony_render.py tests/test_harmony_parsers.py tests/test_native_tool_format.py -q
# 66 passed

ruff check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/utils/harmony_render.py tests/test_harmony_render.py pyproject.toml
black --check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/utils/harmony_render.py tests/test_harmony_render.py
git diff --check
# pass

Two merge-readiness points remain from my side:

The PR body still starts with Closes #568. Please change that to Refs #568 unless the maintainers want this PR to close the whole issue.
Current CI does not install openai-harmony, so tests/test_harmony_render.py is skipped in the green CI run. The renderer tests do pass when the optional dependency is present, but this should be exercised in CI before merge, either by adding openai-harmony to the relevant test job or by adding a separate optional-extra test job.

Once those are addressed, the focused local test result above looks good to me.

Thump604 requested changes May 25, 2026

View reviewed changes

CBribiescas changed the title ~~fix(gpt-oss): route harmony prompts through openai-harmony (closes #568)~~ fix(gpt-oss): route harmony prompts through openai-harmony (refs #568) May 26, 2026

lint(tests): drop unused json import from test_harmony_render

6376e47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gpt-oss): route harmony prompts through openai-harmony (refs #568)#581

fix(gpt-oss): route harmony prompts through openai-harmony (refs #568)#581
CBribiescas wants to merge 3 commits into
waybarrios:mainfrom
CBribiescas:fix/gpt-oss-harmony-rendering

CBribiescas commented May 25, 2026

Uh oh!

Thump604 left a comment

Uh oh!

CBribiescas commented May 26, 2026

Uh oh!

Thump604 commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CBribiescas commented May 25, 2026

The bug, in one paragraph

Patch shape (matching @Thump604's bullets)

1. Regression test

2. Native tool-call preservation only for the harmony path

3. Route through openai-harmony

4. Non-harmony behavior unchanged

Empirical signal

What I'm asking for

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Uh oh!

CBribiescas commented May 26, 2026

Uh oh!

Thump604 commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

3. Route through `openai-harmony`