Skip to content

Keep assistant tool_calls and tool messages on the MLLM chat path#611

Open
waybarrios wants to merge 2 commits into
mainfrom
fix/mllm-tool-messages
Open

Keep assistant tool_calls and tool messages on the MLLM chat path#611
waybarrios wants to merge 2 commits into
mainfrom
fix/mllm-tool-messages

Conversation

@waybarrios

Copy link
Copy Markdown
Owner

Summary

Fixes #608. On the MLLM path, _build_mllm_chat_messages rebuilt every message as {role, content} and dropped any message without renderable text. An OpenAI assistant message with content: null + tool_calls was removed entirely, and role:"tool" messages lost their tool_call_id. The rendered continuation prompt came out byte-identical to the fresh prompt, so agent clients looped on the same tool call forever.

What changed

All in _build_mllm_chat_messages (shared by chat() and stream_chat(), so both routes are covered with no call-site changes):

  • Assistant messages carrying tool_calls survive even with empty text. Arguments are normalized str -> mapping via the same normalize_messages_for_chat_template helper the batched engine path wraps, so both paths render identically. reasoning_content is forwarded when present.
  • role:"tool" messages keep their tool_call_id. Empty tool results (legit for action tools) survive with content: "" instead of being dropped, so template forward-scans can still pair calls to responses.
  • Everything else is byte-identical to before: image/audio/video part ordering, plain-text messages, and the drop rule for genuinely empty messages (verified pre/post on the same inputs).

SUPPORTS_NATIVE_TOOL_FORMAT on Gemma4ToolParser (issue suggestion 3) is intentionally left out of scope.

Tests

5 new regression tests in tests/test_mllm_message_ordering.py (written first, 3 failed on main for the exact bug reasons):

  • assistant {content: null, tool_calls} survives with normalized arguments
  • tool messages keep tool_call_id + content
  • empty tool results survive with their anchor
  • a 4-message tool exchange rebuilds strictly more messages than the fresh 3-message turn
  • genuinely empty messages without tool_calls are still dropped
pytest tests/test_mllm_message_ordering.py -q   -> 8 passed
pytest tests/test_mllm*.py tests/test_server.py -> 256 passed
pytest tests/ -k "not slow"                     -> 2185 passed, 0 failed
ruff check                                       -> clean

Live verification (issue repro)

Served mlx-community/Qwen3-VL-8B-Instruct-4bit with --mllm --enable-auto-tool-choice --tool-call-parser hermes, M4 Max:

  • Probe A (fresh: system + user + 1 tool): tool_call emitted, prompt_tokens = 166
  • Probe B (continuation: + assistant{content:null, tool_calls} + tool{result}): prompt_tokens = 213 > 166 (history now contributes tokens), no repeated tool_call, final answer uses the tool result: "The weather in Paris is currently 19°C with sunny conditions."

On main, probe B renders the same prompt as probe A and re-issues the identical tool call.

@waybarrios

Copy link
Copy Markdown
Owner Author

Full test logs from the branch:

Regression tests (verbose) — 8 passed
$ pytest tests/test_mllm_message_ordering.py -v
tests/test_mllm_message_ordering.py::test_mllm_chat_messages_preserve_text_only_shape PASSED [ 12%]
tests/test_mllm_message_ordering.py::test_mllm_chat_messages_preserve_image_text_order PASSED [ 25%]
tests/test_mllm_message_ordering.py::test_mllm_chat_messages_preserve_audio_text_order PASSED [ 37%]
tests/test_mllm_message_ordering.py::test_mllm_chat_messages_keep_assistant_tool_calls_without_text PASSED [ 50%]
tests/test_mllm_message_ordering.py::test_mllm_chat_messages_keep_tool_call_id_on_tool_messages PASSED [ 62%]
tests/test_mllm_message_ordering.py::test_mllm_chat_messages_keep_empty_tool_results PASSED [ 75%]
tests/test_mllm_message_ordering.py::test_mllm_chat_messages_keep_tool_exchange PASSED [ 87%]
tests/test_mllm_message_ordering.py::test_mllm_chat_messages_still_drop_empty_messages_without_tool_calls PASSED [100%]
============================== 8 passed in 0.25s ===============================
MLLM + server suites — 250 passed
$ pytest tests/test_mllm_cache.py tests/test_mllm_continuous_batching.py \
    tests/test_mllm_message_ordering.py tests/test_mllm_mtp_routing.py \
    tests/test_mllm.py tests/test_server.py -q
====================== 250 passed, 13 deselected in 5.18s ======================
Full fast suite — 2185 passed, 0 failed
$ pytest tests/ -q -k "not slow"
=============== 2185 passed, 11 skipped, 23 deselected in 30.95s ===============

$ ruff check vllm_mlx/models/mllm.py tests/test_mllm_message_ordering.py
All checks passed!
Live repro verification (Qwen3-VL-8B-Instruct-4bit, M4 Max)
$ vllm-mlx serve mlx-community/Qwen3-VL-8B-Instruct-4bit --mllm \
    --enable-auto-tool-choice --tool-call-parser hermes --port 8077

# Probe A — fresh turn (system + user + 1 tool), temperature 0
prompt_tokens: 166
tool_calls: [{"id": "call_0a25c249", "type": "function",
              "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]
content: ''

# Probe B — continuation (+ assistant{content:null, tool_calls} + tool{result})
prompt_tokens: 213        # > 166: history now contributes tokens
tool_calls: null          # no repeated call — loop broken
content: 'The weather in Paris is currently 19°C with sunny conditions.'

On main, probe B renders byte-identical to probe A (166 tokens) and re-issues the same tool call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] MLLM chat path drops assistant tool_calls + tool-role messages — multi-turn tool use loops forever

1 participant