Skip to content

fix(llama-tool-parser): recognize Llama 3.1+ / 3.3 tool-call formats#580

Open
CBribiescas wants to merge 2 commits into
waybarrios:mainfrom
CBribiescas:fix/llama-tool-parser-python-tag-and-bare-json
Open

fix(llama-tool-parser): recognize Llama 3.1+ / 3.3 tool-call formats#580
CBribiescas wants to merge 2 commits into
waybarrios:mainfrom
CBribiescas:fix/llama-tool-parser-python-tag-and-bare-json

Conversation

@CBribiescas

Copy link
Copy Markdown
Contributor

First — apologies for another PR; I know I've been busy in the issue tracker lately. After this one, the only thing I'm still investigating locally is the gpt-oss harmony rendering side from #568 (waiting on maintainer direction there per @Thump604's reply). Putting this up because it's a narrow, well-scoped fix in a different parser.

The bug

LlamaToolParser only matches the legacy <function=name>{"json"}</function> XML format. But Llama 3.1+ and Llama 4 emit calls with the <|python_tag|>{"name":..., "parameters":...} JSON envelope per Meta's model card, and Llama 3.3 emits the same JSON shape bare, without any marker:

{"type": "function", "name": "read_file", "parameters": {"path": "foo.py"}}

Both currently leak through to message.content untouched and downstream callers see tool_calls: [], treating the assistant turn as a clean stop.

Fix shape

Two extraction helpers added ahead of the existing XML matcher:

  • _extract_python_tag_calls — scans for <|python_tag|> markers and uses json.JSONDecoder().raw_decode() to extract each following JSON object (so multiple tool calls in one response work).
  • _extract_bare_json_call — only fires when the whole stripped content is a JSON object with name AND (parameters OR arguments). Intentionally strict to avoid mis-parsing user-visible JSON the model might emit for unrelated reasons.

Existing <function=...>{...}</function> XML behavior is unchanged.

Empirical signal

On a multi-turn Claude-Code-style agentic benchmark (20 tasks, T=0):

variant tasks_passed avg_turns_taken score
Llama-3.3-70B-Instruct-4bit, pre-patch 0/20 1.00 0.000
Llama-3.3-70B-Instruct-4bit, post-patch 4/20 3.40 0.200

Same model weights, same harness, same prompt. The avg_turns lift from 1.00 -> 3.40 is the signal: the model was emitting valid bare-JSON tool calls on every turn — they were just being silently dropped.

Test coverage

8 new tests + 3 existing — all passing:

  • python_tag JSON, with parameters and the arguments alias
  • multiple python_tag calls in one response
  • bare-JSON envelope, with and without the type:function field
  • precedence: python_tag short-circuits the bare-JSON path (no double-count)
  • negative cases: plain text and non-tool JSON must NOT be flagged
$ pytest tests/test_tool_parsers.py::TestLlamaToolParser -v
============================== 11 passed in 0.48s ==============================

Relation to #568

Same class of bug as the gpt-oss rendering issue I reported in #568 — vllm-mlx code paths haven't kept up with formats the models actually emit at inference time. This one is on the output side (parsing) rather than the input side (prompt rendering), which is why it's a much cleaner, smaller patch: no template-format design choices, no openai-harmony vs in-tree decision, no maintainer-direction blocker. Just two regex/decoder helpers and tests.

Happy to address review feedback.

The Llama tool parser only matched the legacy
`<function=name>{"json"}</function>` XML format. Llama 3.1+ and Llama 4
actually emit calls with a `<|python_tag|>{"name":...,"parameters":...}`
JSON envelope per Meta's model card, and Llama 3.3 emits the same JSON
shape bare (no leading marker). With the old parser, both formats leak
through to `content` untouched and downstream callers see no tool_calls.

This adds two extraction helpers — `_extract_python_tag_calls` and
`_extract_bare_json_call` — and chains them ahead of the existing XML
matcher. Existing `<function=...>` behavior is unchanged.

Tests (8 new + 3 existing, all passing):
  - python_tag JSON, with `parameters` and the `arguments` alias
  - multiple python_tag calls in one response
  - bare-JSON envelope, with and without the `type:function` field
  - precedence: python_tag short-circuits the bare-JSON path
  - negative cases: plain text and non-tool JSON must not be flagged

Empirical: on a multi-turn Claude-Code-style agentic benchmark
(20 tasks, T=0), Llama-3.3-70B-Instruct-4bit went from 0/20 to 4/20
after this patch (and avg_turns from 1.0 -> 3.4 — the model was
emitting valid bare-JSON calls all along, they were just being missed).
Same shape of bug reported (input-side) for gpt-oss in waybarrios#568.

@Thump604 Thump604 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for narrowing this. The complete-response extraction path looks like the right direction, but I think one served path needs to be covered before merge.

Blocking items:

  1. The streaming parser still only recognizes the legacy XML marker. extract_tool_calls_streaming() returns {"content": delta_text} whenever "<function=" not in current_text, so streamed Llama 3.1+/3.3/4 outputs using <|python_tag|>{...} or the bare JSON envelope will still be emitted as assistant content and never become tool_calls. Since the server calls this streaming parser for streamed chat completions, this leaves stream=True tool-use clients in the old failure mode. Please add streaming coverage for the new formats, or otherwise make the streaming path use equivalent detection once the tagged/bare JSON object is complete.

  2. CI lint is currently failing because Black would reformat vllm_mlx/tool_parsers/llama_tool_parser.py.

Not claiming the non-stream extraction fix is wrong; this is about making the existing streaming parser path match the new non-stream behavior before merge.

Per @Thump604's review:

1. Streaming parser now recognizes all three formats (legacy XML, python-tag
   JSON, bare-JSON envelope). Before this commit `extract_tool_calls_streaming`
   only checked `"<function="` so streamed Llama 3.1+/3.3/4 tool calls leaked
   through to assistant content. The new logic gates on any of the three
   shape markers (`<function=`, `<|python_tag|>`, opening brace) and reuses
   the complete-response extractor once the call(s) parse end-to-end.

2. Black would have reformatted llama_tool_parser.py + test_tool_parsers.py;
   ran with project default style.

4 new streaming tests cover python-tag, bare-JSON, plain-text, and
legacy XML paths. 15 total Llama parser tests pass; 124 across all
tool-parser suites green; no regression in test_native_tool_format.
CBribiescas added a commit to CBribiescas/vllm-mlx that referenced this pull request May 27, 2026
…g/bare-JSON

Production-running state at 2026-05-27. Not a PR target — this branch is
a snapshot for reproducing my local setup; the equivalent changes are in
upstream review as separate PRs:

- llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580.
  Applied here directly so this snapshot doesn't depend on waybarrios#580 landing.

- gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony
  is superseded by PR waybarrios#581 (route through openai-harmony lib); the
  gemma4 flag-flip is what the production launchd start-vllm-gemma
  service depends on for tool-call extraction.

The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 +
waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this
branch as an editable install gives you the same vllm-mlx serving
behavior I'm running for the 3-slot production stack (qwen-coder-30b,
gemma-4-E4B-it, gpt-oss-120b).
CBribiescas added a commit to CBribiescas/vllm-mlx that referenced this pull request May 27, 2026
…g/bare-JSON

Production-running state at 2026-05-27. Not a PR target — this branch is
a snapshot for reproducing my local setup; the equivalent changes are in
upstream review as separate PRs:

- llama parser changes (python_tag + bare-JSON envelopes) are PR waybarrios#580.
  Applied here directly so this snapshot doesn't depend on waybarrios#580 landing.

- gemma4 + harmony `SUPPORTS_NATIVE_TOOL_FORMAT = True` flips. Harmony
  is superseded by PR waybarrios#581 (route through openai-harmony lib); the
  gemma4 flag-flip is what the production launchd start-vllm-gemma
  service depends on for tool-call extraction.

The branch base (fix/gemma4-shared-kv-batching) already has PR waybarrios#562 +
waybarrios#563 + waybarrios#564 merged locally on top of upstream/main, so installing this
branch as an editable install gives you the same vllm-mlx serving
behavior I'm running for the 3-slot production stack (qwen-coder-30b,
gemma-4-E4B-it, gpt-oss-120b).
@CBribiescas

Copy link
Copy Markdown
Contributor Author

Pushed the fix in 9222c4e:

  • Streaming parser now detects <|python_tag|> and bare-JSON envelopes alongside legacy <function=. 4 new tests in TestLlamaToolParserStreaming cover both new shapes plus regression guards for plain-text and legacy-XML paths.
  • Black-formatted vllm_mlx/tool_parsers/llama_tool_parser.py and tests/test_tool_parsers.py.

16 Llama parser tests + 124 tool-parser suite green; no regression in test_native_tool_format. Ready for another look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants