fix(llama-tool-parser): recognize Llama 3.1+ / 3.3 tool-call formats#580
fix(llama-tool-parser): recognize Llama 3.1+ / 3.3 tool-call formats#580CBribiescas wants to merge 2 commits into
Conversation
The Llama tool parser only matched the legacy
`<function=name>{"json"}</function>` XML format. Llama 3.1+ and Llama 4
actually emit calls with a `<|python_tag|>{"name":...,"parameters":...}`
JSON envelope per Meta's model card, and Llama 3.3 emits the same JSON
shape bare (no leading marker). With the old parser, both formats leak
through to `content` untouched and downstream callers see no tool_calls.
This adds two extraction helpers — `_extract_python_tag_calls` and
`_extract_bare_json_call` — and chains them ahead of the existing XML
matcher. Existing `<function=...>` behavior is unchanged.
Tests (8 new + 3 existing, all passing):
- python_tag JSON, with `parameters` and the `arguments` alias
- multiple python_tag calls in one response
- bare-JSON envelope, with and without the `type:function` field
- precedence: python_tag short-circuits the bare-JSON path
- negative cases: plain text and non-tool JSON must not be flagged
Empirical: on a multi-turn Claude-Code-style agentic benchmark
(20 tasks, T=0), Llama-3.3-70B-Instruct-4bit went from 0/20 to 4/20
after this patch (and avg_turns from 1.0 -> 3.4 — the model was
emitting valid bare-JSON calls all along, they were just being missed).
Same shape of bug reported (input-side) for gpt-oss in waybarrios#568.
Thump604
left a comment
There was a problem hiding this comment.
Thanks for narrowing this. The complete-response extraction path looks like the right direction, but I think one served path needs to be covered before merge.
Blocking items:
-
The streaming parser still only recognizes the legacy XML marker.
extract_tool_calls_streaming()returns{"content": delta_text}whenever"<function=" not in current_text, so streamed Llama 3.1+/3.3/4 outputs using<|python_tag|>{...}or the bare JSON envelope will still be emitted as assistant content and never becometool_calls. Since the server calls this streaming parser for streamed chat completions, this leavesstream=Truetool-use clients in the old failure mode. Please add streaming coverage for the new formats, or otherwise make the streaming path use equivalent detection once the tagged/bare JSON object is complete. -
CI lint is currently failing because Black would reformat
vllm_mlx/tool_parsers/llama_tool_parser.py.
Not claiming the non-stream extraction fix is wrong; this is about making the existing streaming parser path match the new non-stream behavior before merge.
Per @Thump604's review: 1. Streaming parser now recognizes all three formats (legacy XML, python-tag JSON, bare-JSON envelope). Before this commit `extract_tool_calls_streaming` only checked `"<function="` so streamed Llama 3.1+/3.3/4 tool calls leaked through to assistant content. The new logic gates on any of the three shape markers (`<function=`, `<|python_tag|>`, opening brace) and reuses the complete-response extractor once the call(s) parse end-to-end. 2. Black would have reformatted llama_tool_parser.py + test_tool_parsers.py; ran with project default style. 4 new streaming tests cover python-tag, bare-JSON, plain-text, and legacy XML paths. 15 total Llama parser tests pass; 124 across all tool-parser suites green; no regression in test_native_tool_format.
…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).
…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).
|
Pushed the fix in
16 Llama parser tests + 124 tool-parser suite green; no regression in |
First — apologies for another PR; I know I've been busy in the issue tracker lately. After this one, the only thing I'm still investigating locally is the gpt-oss harmony rendering side from #568 (waiting on maintainer direction there per @Thump604's reply). Putting this up because it's a narrow, well-scoped fix in a different parser.
The bug
LlamaToolParseronly matches the legacy<function=name>{"json"}</function>XML format. But Llama 3.1+ and Llama 4 emit calls with the<|python_tag|>{"name":..., "parameters":...}JSON envelope per Meta's model card, and Llama 3.3 emits the same JSON shape bare, without any marker:Both currently leak through to
message.contentuntouched and downstream callers seetool_calls: [], treating the assistant turn as a clean stop.Fix shape
Two extraction helpers added ahead of the existing XML matcher:
_extract_python_tag_calls— scans for<|python_tag|>markers and usesjson.JSONDecoder().raw_decode()to extract each following JSON object (so multiple tool calls in one response work)._extract_bare_json_call— only fires when the whole stripped content is a JSON object withnameAND (parametersORarguments). Intentionally strict to avoid mis-parsing user-visible JSON the model might emit for unrelated reasons.Existing
<function=...>{...}</function>XML behavior is unchanged.Empirical signal
On a multi-turn Claude-Code-style agentic benchmark (20 tasks, T=0):
Same model weights, same harness, same prompt. The avg_turns lift from 1.00 -> 3.40 is the signal: the model was emitting valid bare-JSON tool calls on every turn — they were just being silently dropped.
Test coverage
8 new tests + 3 existing — all passing:
parametersand theargumentsaliastype:functionfieldRelation to #568
Same class of bug as the gpt-oss rendering issue I reported in #568 — vllm-mlx code paths haven't kept up with formats the models actually emit at inference time. This one is on the output side (parsing) rather than the input side (prompt rendering), which is why it's a much cleaner, smaller patch: no template-format design choices, no
openai-harmonyvs in-tree decision, no maintainer-direction blocker. Just two regex/decoder helpers and tests.Happy to address review feedback.