Skip to content

fix(gpt-oss): route harmony prompts through openai-harmony (refs #568)#581

Open
CBribiescas wants to merge 3 commits into
waybarrios:mainfrom
CBribiescas:fix/gpt-oss-harmony-rendering
Open

fix(gpt-oss): route harmony prompts through openai-harmony (refs #568)#581
CBribiescas wants to merge 3 commits into
waybarrios:mainfrom
CBribiescas:fix/gpt-oss-harmony-rendering

Conversation

@CBribiescas

Copy link
Copy Markdown
Contributor

Closes #568.

Following @Thump604's recommended PR shape on #568 — you mentioned that the right answer is a focused harmony-rendering workstream, with a regression test, scoped to the harmony path, routed through the official openai-harmony library, and leaving non-harmony unchanged. This PR is exactly that shape. Happy to swap the rendering for an in-tree template if you'd prefer the no-extra-dep approach instead; everything else (the gate, the test coverage, the scope) stays the same either way.

Apologies for the volume of recent activity in this repo — #556, #562, #563, #564, #580, and now this. After this one I have nothing else in flight.

The bug, in one paragraph

extract_multimodal_content() (vllm_mlx/api/utils.py) text-flattens prior assistant tool_calls to bracket strings like [Calling tool: X(args)] whenever the active parser doesn't preserve native format. HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT = False, so GPT-OSS hits this path. The model then sees a malformed history (no commentary-channel tool calls, no functions.X to=assistant tool results) and converges to a one-shot final-channel response on multi-turn tasks. Locally measured: avg_turns_taken=1.0, zero tool_calls_made, on every CC task.

Flipping SUPPORTS_NATIVE_TOOL_FORMAT alone (which I tried as a first probe) makes things worse — the Jinja template renders structurally-correct-but-wire-format-mismatched harmony and the model emits a final answer in ~2.7 turns instead of ~5. Confirms that the template path itself is fragile to drift from what the weights expect, which is why the right fix is to route through OpenAI's canonical renderer.

Patch shape (matching @Thump604's bullets)

1. Regression test

tests/test_harmony_render.py — 9 tests:

  • single-turn user message renders with system + generation-prompt suffix
  • developer block renders the function namespace
  • section order: system precedes developer precedes the first user turn
  • assistant tool_calls render in the commentary channel addressed to functions.X (and bracket-text fallback does not appear — explicit anti-assertion)
  • role=tool messages render as <|start|>functions.X to=assistant… with the function name resolved by tracing back the most recent assistant tool_call_id
  • prior thinking (or reasoning_content) on an assistant turn renders in the analysis channel before the commentary tool call
  • arguments as dict gets JSON-serialized
  • prior assistant content without tool_calls renders in the final channel
  • prompt ends with <|start|>assistant

All 9 pass alongside the 124 existing tool-parser tests (no regression).

2. Native tool-call preservation only for the harmony path

HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT is left at False — non-harmony behavior in extract_multimodal_content doesn't change. The harmony path doesn't go through that function for prompt rendering at all; it constructs openai_harmony.Conversation objects directly from the OpenAI messages, with full tool_calls + role=tool plumbing happening inside vllm_mlx/utils/harmony_render.py.

3. Route through openai-harmony

vllm_mlx/utils/harmony_render.py (new ~200 LOC) does the conversion + rendering. Added openai-harmony as an optional extra in pyproject.toml (pip install vllm-mlx[harmony]); a missing install logs a warning at server startup and falls back to apply_chat_template, so the default install footprint is unchanged.

The gate: engine.use_harmony_rendering (default False, set by _detect_harmony_rendering() next to the existing engine.preserve_native_tool_format = _detect_native_tool_support() calls in server.py). Returns True only when all of:

  • --enable-auto-tool-choice is on
  • --tool-call-parser harmony or gpt-oss is set
  • openai-harmony is importable

In engine/simple.py's LLM stream_chat path, one if at the existing render site:

if getattr(self, "use_harmony_rendering", False):
    prompt = render_messages(safe_messages, tools=template_tools, ...)
else:
    prompt = tokenizer.apply_chat_template(safe_messages, **template_kwargs)

The system-prefix KV cache probe (which assumes the Jinja path) adds "harmony_rendering" to cache_blocking_controls so the probe/actual-prompt strings can't desynchronize.

4. Non-harmony behavior unchanged

No edits to any non-harmony parser, no edits to the legacy apply_chat_template codepath, no flag flips on parsers other than the deliberate gate. The 124 existing tool-parser tests pass.

Empirical signal

Locally, on a multi-turn Claude-Code-style agentic benchmark (20 tasks, T=0, gpt-oss-120b MXFP4-Q8):

variant pass avg_turns_taken score
current upstream (text-flatten) 8/20 5.05 0.400
SUPPORTS_NATIVE=True only (Jinja renders mismatched harmony) 5/20 2.70 0.250
this PR (use_harmony_rendering=True, via openai-harmony lib) 12/20 6.30 0.600
ollama (reference, runs its own harmony template) 13/20 5.95 0.650

Effectively closes the gap to ollama on the same model weights.

What I'm asking for

Review. Happy to:

  • Swap to an in-tree template instead of the openai-harmony dependency if you'd prefer (the gate stays the same, the renderer is the only thing that swaps).
  • Add more test cases — the 9 I added cover the cases I know about; suggest others if there's a known edge case.
  • Split into smaller commits if that helps review.

The fork-side patch has been running my own production gpt-oss-120b traffic for ~24 hours without issue.

Addresses waybarrios#568. GPT-OSS prompts go through ``tokenizer.apply_chat_template``
today, but ``api.utils.extract_multimodal_content()`` text-flattens prior
assistant ``tool_calls`` to ``[Calling tool: X(args)]`` bracket strings when
the active tool parser doesn't preserve native format. That breaks multi-turn
agentic workloads on GPT-OSS — the model sees a malformed conversation and
falls back to one-shot final-channel responses (avg_turns ~1, no tool use).

This patch adds a separate prompt-rendering path that, only when active,
routes through OpenAI's canonical ``openai-harmony`` renderer instead of the
Jinja template. The renderer accepts structured ``Conversation`` objects
(messages + tool definitions) and emits the wire format the GPT-OSS weights
were trained on, sidestepping the text-flattening upstream entirely.

Per @Thump604's suggested patch shape on waybarrios#568:

1. **Regression test** — ``tests/test_harmony_render.py`` (9 tests)
   covers the multi-turn assistant-tool/tool-result rendering, section
   order, channel placement, generation-prompt suffix, and explicitly
   asserts the bracket-text fallback NEVER appears.
2. **Scoped narrowly** — engine reads ``self.use_harmony_rendering``
   (default ``False``); set by the server only when
   ``--tool-call-parser harmony``/``gpt-oss`` is active AND ``openai-harmony``
   is importable. The flag controls one ``if`` at the prompt-render site;
   non-harmony parsers and templates are untouched.
3. **Routed through openai-harmony** — new ``vllm_mlx/utils/harmony_render.py``
   converts OpenAI-format messages + tools to a ``Conversation`` and calls
   ``HarmonyEncoding.render_conversation_for_completion``. Tool-call-id ->
   function-name resolution is done locally so the OpenAI ``role=tool``
   shape (which lacks the function name field) flows through correctly.
4. **Non-harmony behavior unchanged** — when the harmony path is inactive
   (parser is anything other than harmony/gpt-oss, OR the optional
   ``openai-harmony`` package is not installed) the existing
   ``apply_chat_template`` codepath runs verbatim. The system-prefix KV
   cache probe (which assumes the Jinja path) is gated off for harmony
   requests so the probe/actual-prompt strings can't desynchronize.

``openai-harmony`` is added as an optional extra (``pip install vllm-mlx[harmony]``)
so the default install footprint is unchanged.

All 124 existing parser tests still pass alongside the 9 new harmony tests.

@Thump604 Thump604 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this into the narrower Harmony-rendering shape. I agree with the direction, but I see a served-path blocker before this can close #568.

Blocking items:

  1. The new renderer is wired too late to avoid the existing flattening path. _prepare_chat_messages() still computes preserve_native = engine.preserve_native_tool_format, and for harmony / gpt-oss that remains false because HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT is still false. The LLM path then calls extract_multimodal_content(..., preserve_native_format=False), which converts assistant tool_calls into [Calling tool: ...] text and converts role=tool into a user text fallback before engine.simple.stream_chat() calls render_messages(). So the direct render_messages() tests prove the helper works on already-structured messages, but they do not prove the actual /v1/chat/completions path preserves the structures that #568 is about. Please either make the prep path preserve native messages when use_harmony_rendering is active, or route Harmony rendering before this flattening step, and add a regression through the server/prepare path that fails if [Calling tool: or [Tool Result reaches the Harmony renderer.

  2. The new Harmony renderer is effectively untested in CI right now. In the 3.11 matrix, vllm_mlx/utils/harmony_render.py shows 0% coverage and the run reports skipped tests, because openai-harmony is not installed. Since this PR depends on an external API, at least one CI job should install the harmony extra and run tests/test_harmony_render.py against the real package. Otherwise import/API drift in the optional dependency can merge unnoticed.

  3. Closes #568 is premature while the actual served path above is still unproven. Please use Refs #568 until the server-path regression is in place and passing.

  4. CI lint is also failing because Black would reformat vllm_mlx/engine/simple.py and vllm_mlx/utils/harmony_render.py.

Not claiming the openai-harmony direction is wrong; the issue is that the currently patched path can still receive already-flattened messages and the CI job is not exercising the new renderer.

…age + black

Per @Thump604's review:

1. The harmony renderer was wired too late: `_prepare_chat_messages()`
   ran `extract_multimodal_content(..., preserve_native_format=False)`
   first (because `HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT` is
   still False), flattening prior assistant `tool_calls` to
   `[Calling tool: ...]` bracket text + `role=tool` to text-stuffed
   `role=user` BEFORE `render_messages()` could see them. Fix: when
   `engine.use_harmony_rendering` is True, the prep path forces
   `preserve_native = True` locally so the structural shape survives
   into the harmony renderer. `HarmonyToolParser` keeps its public
   `SUPPORTS_NATIVE_TOOL_FORMAT=False` setting — the override is
   harmony-scoped and lives in server prep.

2. The new renderer is now exercised in CI:
   - `tests/test_harmony_render.py` added to the explicit pytest list
     under the no-MLX `test-matrix` job (covers Python 3.10/3.11/3.12/3.13).
   - `openai-harmony` added to that job's pip install line so the
     skipif gate doesn't silently skip.
   - Three new regression tests assert the end-to-end server prep path
     keeps tool_calls structured AND that the rendered prompt never
     contains `[Calling tool:` or `[Tool Result`. They fail loudly if
     the prep-path coupling ever regresses.

3. Will update the PR description on push: "Closes waybarrios#568" → "Refs waybarrios#568"
   until the maintainer merges and confirms the served path works in
   their environment.

4. Black would have reformatted `vllm_mlx/engine/simple.py` and
   `vllm_mlx/utils/harmony_render.py`; ran with project default style.

All 127 parser + native-format + harmony tests pass.
@CBribiescas CBribiescas changed the title fix(gpt-oss): route harmony prompts through openai-harmony (closes #568) fix(gpt-oss): route harmony prompts through openai-harmony (refs #568) May 26, 2026
@CBribiescas

Copy link
Copy Markdown
Contributor Author

Pushed the review fixes in f827eee:

  1. Prep path now preserves native when engine.use_harmony_rendering=True. The override lives in _prepare_chat_messages() (server.py) so HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT stays False and only the harmony-active path changes behavior.
  2. Three new server-path regression tests (tests/test_harmony_render.py::TestServerPathPreservesNativeForHarmony):
    • test_prep_path_preserves_tool_calls_for_harmony — tool_calls survive structurally
    • test_rendered_prompt_after_prep_has_no_bracket_text — asserts [Calling tool: and [Tool Result never reach the renderer
    • test_non_harmony_engine_falls_through_unchanged — guards against accidentally flipping native preservation for unrelated parsers
  3. Title updated Closes #568refs #568.
  4. Black-formatted vllm_mlx/engine/simple.py and vllm_mlx/utils/harmony_render.py.

All 127 parser + native-format + harmony tests pass locally.

One thing I couldn't push directly: the CI workflow change to install openai-harmony and add tests/test_harmony_render.py to the matrix job. My token doesn't have workflow scope (refusing to allow an OAuth App to create or update workflow ... without workflow scope). The change needed is small — feel free to apply this yourself when reviewing:

--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -63,7 +63,7 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install pytest anyio pytest-cov pydantic fastapi jsonschema httpx psutil transformers requests
+          pip install pytest anyio pytest-cov pydantic fastapi jsonschema httpx psutil transformers requests openai-harmony
 
       - name: Run unit tests (no MLX required)
         run: |
@@ -83,6 +83,7 @@ jobs:
             tests/test_anthropic_models.py \
             tests/test_anthropic_adapter.py \
             tests/test_harmony_parsers.py \
+            tests/test_harmony_render.py \
             tests/test_endpoint_model_policies.py \
             tests/test_gemma4_openai_format.py \
             tests/test_gemma4_streaming_edge.py \

Or, if you'd rather, I'm happy to grant my token the workflow scope and re-push myself — let me know.

CBribiescas added a commit to CBribiescas/vllm-mlx that referenced this pull request May 27, 2026
…g/bare-JSON

Production-running state at 2026-05-27. Not a PR target — this branch is
a snapshot for reproducing my local setup; the equivalent changes are in
upstream review as separate PRs:

- llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580.
  Applied here directly so this snapshot doesn't depend on waybarrios#580 landing.

- gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony
  is superseded by PR waybarrios#581 (route through openai-harmony lib); the
  gemma4 flag-flip is what the production launchd start-vllm-gemma
  service depends on for tool-call extraction.

The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 +
waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this
branch as an editable install gives you the same vllm-mlx serving
behavior I'm running for the 3-slot production stack (qwen-coder-30b,
gemma-4-E4B-it, gpt-oss-120b).
CBribiescas added a commit to CBribiescas/vllm-mlx that referenced this pull request May 27, 2026
…g/bare-JSON

Production-running state at 2026-05-27. Not a PR target — this branch is
a snapshot for reproducing my local setup; the equivalent changes are in
upstream review as separate PRs:

- llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580.
  Applied here directly so this snapshot doesn't depend on waybarrios#580 landing.

- gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony
  is superseded by PR waybarrios#581 (route through openai-harmony lib); the
  gemma4 flag-flip is what the production launchd start-vllm-gemma
  service depends on for tool-call extraction.

The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 +
waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this
branch as an editable install gives you the same vllm-mlx serving
behavior I'm running for the 3-slot production stack (qwen-coder-30b,
gemma-4-E4B-it, gpt-oss-120b).
@Thump604

Copy link
Copy Markdown
Collaborator

I reran the focused coverage two ways:

python -m pytest tests/test_harmony_render.py tests/test_harmony_parsers.py tests/test_native_tool_format.py -q
# 54 passed, 12 skipped  (openai-harmony absent)

uv pip install --python /Users/David/code/vllm-mlx/.venv/bin/python \
  --target /tmp/openai-harmony-pr581 'openai-harmony>=0.0.8'
PYTHONPATH=/tmp/openai-harmony-pr581 \
  python -m pytest tests/test_harmony_render.py tests/test_harmony_parsers.py tests/test_native_tool_format.py -q
# 66 passed

ruff check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/utils/harmony_render.py tests/test_harmony_render.py pyproject.toml
black --check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/utils/harmony_render.py tests/test_harmony_render.py
git diff --check
# pass

Two merge-readiness points remain from my side:

  1. The PR body still starts with Closes #568. Please change that to Refs #568 unless the maintainers want this PR to close the whole issue.
  2. Current CI does not install openai-harmony, so tests/test_harmony_render.py is skipped in the green CI run. The renderer tests do pass when the optional dependency is present, but this should be exercised in CI before merge, either by adding openai-harmony to the relevant test job or by adding a separate optional-extra test job.

Once those are addressed, the focused local test result above looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: gpt-oss chat-template diverges from training format → ~25% CC quality loss vs ollama

2 participants