fix(llama-tool-parser): recognize Llama 3.1+ / 3.3 tool-call formats by CBribiescas · Pull Request #580 · waybarrios/vllm-mlx

CBribiescas · 2026-05-25T07:12:51Z

First — apologies for another PR; I know I've been busy in the issue tracker lately. After this one, the only thing I'm still investigating locally is the gpt-oss harmony rendering side from #568 (waiting on maintainer direction there per @Thump604's reply). Putting this up because it's a narrow, well-scoped fix in a different parser.

The bug

LlamaToolParser only matches the legacy <function=name>{"json"}</function> XML format. But Llama 3.1+ and Llama 4 emit calls with the <|python_tag|>{"name":..., "parameters":...} JSON envelope per Meta's model card, and Llama 3.3 emits the same JSON shape bare, without any marker:

{"type": "function", "name": "read_file", "parameters": {"path": "foo.py"}}

Both currently leak through to message.content untouched and downstream callers see tool_calls: [], treating the assistant turn as a clean stop.

Fix shape

Two extraction helpers added ahead of the existing XML matcher:

_extract_python_tag_calls — scans for <|python_tag|> markers and uses json.JSONDecoder().raw_decode() to extract each following JSON object (so multiple tool calls in one response work).
_extract_bare_json_call — only fires when the whole stripped content is a JSON object with name AND (parameters OR arguments). Intentionally strict to avoid mis-parsing user-visible JSON the model might emit for unrelated reasons.

Existing <function=...>{...}</function> XML behavior is unchanged.

Empirical signal

On a multi-turn Claude-Code-style agentic benchmark (20 tasks, T=0):

variant	tasks_passed	avg_turns_taken	score
Llama-3.3-70B-Instruct-4bit, pre-patch	0/20	1.00	0.000
Llama-3.3-70B-Instruct-4bit, post-patch	4/20	3.40	0.200

Same model weights, same harness, same prompt. The avg_turns lift from 1.00 -> 3.40 is the signal: the model was emitting valid bare-JSON tool calls on every turn — they were just being silently dropped.

Test coverage

8 new tests + 3 existing — all passing:

python_tag JSON, with parameters and the arguments alias
multiple python_tag calls in one response
bare-JSON envelope, with and without the type:function field
precedence: python_tag short-circuits the bare-JSON path (no double-count)
negative cases: plain text and non-tool JSON must NOT be flagged

$ pytest tests/test_tool_parsers.py::TestLlamaToolParser -v
============================== 11 passed in 0.48s ==============================

Relation to #568

Same class of bug as the gpt-oss rendering issue I reported in #568 — vllm-mlx code paths haven't kept up with formats the models actually emit at inference time. This one is on the output side (parsing) rather than the input side (prompt rendering), which is why it's a much cleaner, smaller patch: no template-format design choices, no openai-harmony vs in-tree decision, no maintainer-direction blocker. Just two regex/decoder helpers and tests.

Happy to address review feedback.

The Llama tool parser only matched the legacy `<function=name>{"json"}</function>` XML format. Llama 3.1+ and Llama 4 actually emit calls with a `<|python_tag|>{"name":...,"parameters":...}` JSON envelope per Meta's model card, and Llama 3.3 emits the same JSON shape bare (no leading marker). With the old parser, both formats leak through to `content` untouched and downstream callers see no tool_calls. This adds two extraction helpers — `_extract_python_tag_calls` and `_extract_bare_json_call` — and chains them ahead of the existing XML matcher. Existing `<function=...>` behavior is unchanged. Tests (8 new + 3 existing, all passing): - python_tag JSON, with `parameters` and the `arguments` alias - multiple python_tag calls in one response - bare-JSON envelope, with and without the `type:function` field - precedence: python_tag short-circuits the bare-JSON path - negative cases: plain text and non-tool JSON must not be flagged Empirical: on a multi-turn Claude-Code-style agentic benchmark (20 tasks, T=0), Llama-3.3-70B-Instruct-4bit went from 0/20 to 4/20 after this patch (and avg_turns from 1.0 -> 3.4 — the model was emitting valid bare-JSON calls all along, they were just being missed). Same shape of bug reported (input-side) for gpt-oss in waybarrios#568.

Thump604

Thanks for narrowing this. The complete-response extraction path looks like the right direction, but I think one served path needs to be covered before merge.

Blocking items:

The streaming parser still only recognizes the legacy XML marker. extract_tool_calls_streaming() returns {"content": delta_text} whenever "<function=" not in current_text, so streamed Llama 3.1+/3.3/4 outputs using <|python_tag|>{...} or the bare JSON envelope will still be emitted as assistant content and never become tool_calls. Since the server calls this streaming parser for streamed chat completions, this leaves stream=True tool-use clients in the old failure mode. Please add streaming coverage for the new formats, or otherwise make the streaming path use equivalent detection once the tagged/bare JSON object is complete.
CI lint is currently failing because Black would reformat vllm_mlx/tool_parsers/llama_tool_parser.py.

Not claiming the non-stream extraction fix is wrong; this is about making the existing streaming parser path match the new non-stream behavior before merge.

@Thump604

Per @Thump604's review: 1. Streaming parser now recognizes all three formats (legacy XML, python-tag JSON, bare-JSON envelope). Before this commit `extract_tool_calls_streaming` only checked `"<function="` so streamed Llama 3.1+/3.3/4 tool calls leaked through to assistant content. The new logic gates on any of the three shape markers (`<function=`, `<|python_tag|>`, opening brace) and reuses the complete-response extractor once the call(s) parse end-to-end. 2. Black would have reformatted llama_tool_parser.py + test_tool_parsers.py; ran with project default style. 4 new streaming tests cover python-tag, bare-JSON, plain-text, and legacy XML paths. 15 total Llama parser tests pass; 124 across all tool-parser suites green; no regression in test_native_tool_format.

…g/bare-JSON Production-running state at 2026-05-27. Not a PR target — this branch is a snapshot for reproducing my local setup; the equivalent changes are in upstream review as separate PRs: - llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580. Applied here directly so this snapshot doesn't depend on waybarrios#580 landing. - gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony is superseded by PR waybarrios#581 (route through openai-harmony lib); the gemma4 flag-flip is what the production launchd start-vllm-gemma service depends on for tool-call extraction. The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 + waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this branch as an editable install gives you the same vllm-mlx serving behavior I'm running for the 3-slot production stack (qwen-coder-30b, gemma-4-E4B-it, gpt-oss-120b).

CBribiescas · 2026-06-05T10:17:48Z

Pushed the fix in 9222c4e:

Streaming parser now detects <|python_tag|> and bare-JSON envelopes alongside legacy <function=. 4 new tests in TestLlamaToolParserStreaming cover both new shapes plus regression guards for plain-text and legacy-XML paths.
Black-formatted vllm_mlx/tool_parsers/llama_tool_parser.py and tests/test_tool_parsers.py.

16 Llama parser tests + 124 tool-parser suite green; no regression in test_native_tool_format. Ready for another look.

CBribiescas mentioned this pull request May 25, 2026

fix(gpt-oss): route harmony prompts through openai-harmony (refs #568) #581

Open

Thump604 requested changes May 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(llama-tool-parser): recognize Llama 3.1+ / 3.3 tool-call formats#580

fix(llama-tool-parser): recognize Llama 3.1+ / 3.3 tool-call formats#580
CBribiescas wants to merge 2 commits into
waybarrios:mainfrom
CBribiescas:fix/llama-tool-parser-python-tag-and-bare-json

CBribiescas commented May 25, 2026

Uh oh!

Thump604 left a comment

Uh oh!

CBribiescas commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

CBribiescas commented May 25, 2026

The bug

Fix shape

Empirical signal

Test coverage

Relation to #568

Uh oh!

Thump604 left a comment

Choose a reason for hiding this comment

Uh oh!

CBribiescas commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants