fix(gpt-oss): route harmony prompts through openai-harmony (refs #568)#581
fix(gpt-oss): route harmony prompts through openai-harmony (refs #568)#581CBribiescas wants to merge 3 commits into
Conversation
Addresses waybarrios#568. GPT-OSS prompts go through ``tokenizer.apply_chat_template`` today, but ``api.utils.extract_multimodal_content()`` text-flattens prior assistant ``tool_calls`` to ``[Calling tool: X(args)]`` bracket strings when the active tool parser doesn't preserve native format. That breaks multi-turn agentic workloads on GPT-OSS — the model sees a malformed conversation and falls back to one-shot final-channel responses (avg_turns ~1, no tool use). This patch adds a separate prompt-rendering path that, only when active, routes through OpenAI's canonical ``openai-harmony`` renderer instead of the Jinja template. The renderer accepts structured ``Conversation`` objects (messages + tool definitions) and emits the wire format the GPT-OSS weights were trained on, sidestepping the text-flattening upstream entirely. Per @Thump604's suggested patch shape on waybarrios#568: 1. **Regression test** — ``tests/test_harmony_render.py`` (9 tests) covers the multi-turn assistant-tool/tool-result rendering, section order, channel placement, generation-prompt suffix, and explicitly asserts the bracket-text fallback NEVER appears. 2. **Scoped narrowly** — engine reads ``self.use_harmony_rendering`` (default ``False``); set by the server only when ``--tool-call-parser harmony``/``gpt-oss`` is active AND ``openai-harmony`` is importable. The flag controls one ``if`` at the prompt-render site; non-harmony parsers and templates are untouched. 3. **Routed through openai-harmony** — new ``vllm_mlx/utils/harmony_render.py`` converts OpenAI-format messages + tools to a ``Conversation`` and calls ``HarmonyEncoding.render_conversation_for_completion``. Tool-call-id -> function-name resolution is done locally so the OpenAI ``role=tool`` shape (which lacks the function name field) flows through correctly. 4. **Non-harmony behavior unchanged** — when the harmony path is inactive (parser is anything other than harmony/gpt-oss, OR the optional ``openai-harmony`` package is not installed) the existing ``apply_chat_template`` codepath runs verbatim. The system-prefix KV cache probe (which assumes the Jinja path) is gated off for harmony requests so the probe/actual-prompt strings can't desynchronize. ``openai-harmony`` is added as an optional extra (``pip install vllm-mlx[harmony]``) so the default install footprint is unchanged. All 124 existing parser tests still pass alongside the 9 new harmony tests.
Thump604
left a comment
There was a problem hiding this comment.
Thanks for putting this into the narrower Harmony-rendering shape. I agree with the direction, but I see a served-path blocker before this can close #568.
Blocking items:
-
The new renderer is wired too late to avoid the existing flattening path.
_prepare_chat_messages()still computespreserve_native = engine.preserve_native_tool_format, and forharmony/gpt-ossthat remains false becauseHarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMATis still false. The LLM path then callsextract_multimodal_content(..., preserve_native_format=False), which converts assistanttool_callsinto[Calling tool: ...]text and convertsrole=toolinto a user text fallback beforeengine.simple.stream_chat()callsrender_messages(). So the directrender_messages()tests prove the helper works on already-structured messages, but they do not prove the actual/v1/chat/completionspath preserves the structures that #568 is about. Please either make the prep path preserve native messages whenuse_harmony_renderingis active, or route Harmony rendering before this flattening step, and add a regression through the server/prepare path that fails if[Calling tool:or[Tool Resultreaches the Harmony renderer. -
The new Harmony renderer is effectively untested in CI right now. In the 3.11 matrix,
vllm_mlx/utils/harmony_render.pyshows 0% coverage and the run reports skipped tests, becauseopenai-harmonyis not installed. Since this PR depends on an external API, at least one CI job should install the harmony extra and runtests/test_harmony_render.pyagainst the real package. Otherwise import/API drift in the optional dependency can merge unnoticed. -
Closes #568is premature while the actual served path above is still unproven. Please useRefs #568until the server-path regression is in place and passing. -
CI lint is also failing because Black would reformat
vllm_mlx/engine/simple.pyandvllm_mlx/utils/harmony_render.py.
Not claiming the openai-harmony direction is wrong; the issue is that the currently patched path can still receive already-flattened messages and the CI job is not exercising the new renderer.
…age + black Per @Thump604's review: 1. The harmony renderer was wired too late: `_prepare_chat_messages()` ran `extract_multimodal_content(..., preserve_native_format=False)` first (because `HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT` is still False), flattening prior assistant `tool_calls` to `[Calling tool: ...]` bracket text + `role=tool` to text-stuffed `role=user` BEFORE `render_messages()` could see them. Fix: when `engine.use_harmony_rendering` is True, the prep path forces `preserve_native = True` locally so the structural shape survives into the harmony renderer. `HarmonyToolParser` keeps its public `SUPPORTS_NATIVE_TOOL_FORMAT=False` setting — the override is harmony-scoped and lives in server prep. 2. The new renderer is now exercised in CI: - `tests/test_harmony_render.py` added to the explicit pytest list under the no-MLX `test-matrix` job (covers Python 3.10/3.11/3.12/3.13). - `openai-harmony` added to that job's pip install line so the skipif gate doesn't silently skip. - Three new regression tests assert the end-to-end server prep path keeps tool_calls structured AND that the rendered prompt never contains `[Calling tool:` or `[Tool Result`. They fail loudly if the prep-path coupling ever regresses. 3. Will update the PR description on push: "Closes waybarrios#568" → "Refs waybarrios#568" until the maintainer merges and confirms the served path works in their environment. 4. Black would have reformatted `vllm_mlx/engine/simple.py` and `vllm_mlx/utils/harmony_render.py`; ran with project default style. All 127 parser + native-format + harmony tests pass.
|
Pushed the review fixes in
All 127 parser + native-format + harmony tests pass locally. One thing I couldn't push directly: the CI workflow change to install --- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -63,7 +63,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
- pip install pytest anyio pytest-cov pydantic fastapi jsonschema httpx psutil transformers requests
+ pip install pytest anyio pytest-cov pydantic fastapi jsonschema httpx psutil transformers requests openai-harmony
- name: Run unit tests (no MLX required)
run: |
@@ -83,6 +83,7 @@ jobs:
tests/test_anthropic_models.py \
tests/test_anthropic_adapter.py \
tests/test_harmony_parsers.py \
+ tests/test_harmony_render.py \
tests/test_endpoint_model_policies.py \
tests/test_gemma4_openai_format.py \
tests/test_gemma4_streaming_edge.py \Or, if you'd rather, I'm happy to grant my token the |
…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).
…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).
|
I reran the focused coverage two ways: python -m pytest tests/test_harmony_render.py tests/test_harmony_parsers.py tests/test_native_tool_format.py -q
# 54 passed, 12 skipped (openai-harmony absent)
uv pip install --python /Users/David/code/vllm-mlx/.venv/bin/python \
--target /tmp/openai-harmony-pr581 'openai-harmony>=0.0.8'
PYTHONPATH=/tmp/openai-harmony-pr581 \
python -m pytest tests/test_harmony_render.py tests/test_harmony_parsers.py tests/test_native_tool_format.py -q
# 66 passed
ruff check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/utils/harmony_render.py tests/test_harmony_render.py pyproject.toml
black --check vllm_mlx/engine/simple.py vllm_mlx/server.py vllm_mlx/utils/harmony_render.py tests/test_harmony_render.py
git diff --check
# passTwo merge-readiness points remain from my side:
Once those are addressed, the focused local test result above looks good to me. |
Closes #568.
Following @Thump604's recommended PR shape on #568 — you mentioned that the right answer is a focused harmony-rendering workstream, with a regression test, scoped to the harmony path, routed through the official
openai-harmonylibrary, and leaving non-harmony unchanged. This PR is exactly that shape. Happy to swap the rendering for an in-tree template if you'd prefer the no-extra-dep approach instead; everything else (the gate, the test coverage, the scope) stays the same either way.Apologies for the volume of recent activity in this repo — #556, #562, #563, #564, #580, and now this. After this one I have nothing else in flight.
The bug, in one paragraph
extract_multimodal_content()(vllm_mlx/api/utils.py) text-flattens prior assistanttool_callsto bracket strings like[Calling tool: X(args)]whenever the active parser doesn't preserve native format.HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMAT = False, so GPT-OSS hits this path. The model then sees a malformed history (no commentary-channel tool calls, nofunctions.X to=assistanttool results) and converges to a one-shot final-channel response on multi-turn tasks. Locally measured:avg_turns_taken=1.0, zerotool_calls_made, on every CC task.Flipping
SUPPORTS_NATIVE_TOOL_FORMATalone (which I tried as a first probe) makes things worse — the Jinja template renders structurally-correct-but-wire-format-mismatched harmony and the model emits a final answer in ~2.7 turns instead of ~5. Confirms that the template path itself is fragile to drift from what the weights expect, which is why the right fix is to route through OpenAI's canonical renderer.Patch shape (matching @Thump604's bullets)
1. Regression test
tests/test_harmony_render.py— 9 tests:systemprecedesdeveloperprecedes the first user turntool_callsrender in the commentary channel addressed tofunctions.X(and bracket-text fallback does not appear — explicit anti-assertion)role=toolmessages render as<|start|>functions.X to=assistant…with the function name resolved by tracing back the most recent assistanttool_call_idthinking(orreasoning_content) on an assistant turn renders in the analysis channel before the commentary tool callargumentsas dict gets JSON-serialized<|start|>assistantAll 9 pass alongside the 124 existing tool-parser tests (no regression).
2. Native tool-call preservation only for the harmony path
HarmonyToolParser.SUPPORTS_NATIVE_TOOL_FORMATis left atFalse— non-harmony behavior inextract_multimodal_contentdoesn't change. The harmony path doesn't go through that function for prompt rendering at all; it constructsopenai_harmony.Conversationobjects directly from the OpenAI messages, with fulltool_calls+role=toolplumbing happening insidevllm_mlx/utils/harmony_render.py.3. Route through
openai-harmonyvllm_mlx/utils/harmony_render.py(new ~200 LOC) does the conversion + rendering. Addedopenai-harmonyas an optional extra inpyproject.toml(pip install vllm-mlx[harmony]); a missing install logs a warning at server startup and falls back toapply_chat_template, so the default install footprint is unchanged.The gate:
engine.use_harmony_rendering(defaultFalse, set by_detect_harmony_rendering()next to the existingengine.preserve_native_tool_format = _detect_native_tool_support()calls inserver.py). ReturnsTrueonly when all of:--enable-auto-tool-choiceis on--tool-call-parser harmonyorgpt-ossis setopenai-harmonyis importableIn
engine/simple.py's LLMstream_chatpath, oneifat the existing render site:The system-prefix KV cache probe (which assumes the Jinja path) adds
"harmony_rendering"tocache_blocking_controlsso the probe/actual-prompt strings can't desynchronize.4. Non-harmony behavior unchanged
No edits to any non-harmony parser, no edits to the legacy
apply_chat_templatecodepath, no flag flips on parsers other than the deliberate gate. The 124 existing tool-parser tests pass.Empirical signal
Locally, on a multi-turn Claude-Code-style agentic benchmark (20 tasks, T=0, gpt-oss-120b MXFP4-Q8):
SUPPORTS_NATIVE=Trueonly (Jinja renders mismatched harmony)use_harmony_rendering=True, via openai-harmony lib)Effectively closes the gap to ollama on the same model weights.
What I'm asking for
Review. Happy to:
openai-harmonydependency if you'd prefer (the gate stays the same, the renderer is the only thing that swaps).The fork-side patch has been running my own production gpt-oss-120b traffic for ~24 hours without issue.