feat(local): Qwen3.5 V2 hybrid runtime support #394
Draft
nuri-yoo wants to merge 1 commit into
Conversation
Adds support in the local runtime for V2-bytecode artifacts that the matching mlc-llm release emits — Qwen3.5 (Gated DeltaNet hybrid: PagedKVCache + RNNState) in particular, plus parity with V1 artifacts (Qwen3 KvCache, BAAI/bge-m3 embedding). Six load-bearing changes plus build-infra and dependency bumps:

- `chat_template.rs`: inject a `content` alias so jinja templates that read `message.content` (singular) see the body our struct serializes as `contents`.
- `inferencer.rs`: prefill returns the final-chunk logits so the caller can skip a redundant decode pass on the prompt's last token; V2 hybrid params are loaded via the global tensor cache (`vm.builtin.tensor_cache.*`) so the packed prefill function gets the Tensor instances it expects; the `logit_positions` buffer is allocated CPU-first, then copied to device.
- `local_language_model.rs`: sample the first token from the prefill logits when present; decide reasoning/content mode by scanning the prompt's last token ids for the most recent `<think>`/`</think>` marker (handles open `<think>\n` and closed `<think>...</think>` trailers alike).
- `tokenizer.rs`: add a `token_to_id` helper used by mode init.
- `build.rs`: optional `MLC_LLM_LIB_DIR` knob to link `libmlc_llm_module` so a cohabiting `mlc_llm` Python package shares a single GlobalFunctionTable in process; `@loader_path` rpath so the cdylib is portable next to its dylibs.
- `Cargo.toml`: bump tvm-runtime / tvm-ffi to brekkylab/tvm-runtime-rs `feat/qwen35-v2-runtime` (`cea927a`; depends on brekkylab/tvm-runtime-rs#2, which transitively depends on brekkylab/relax#3).

Adds 10 `#[ignore]`-gated integration tests covering Qwen3-{0.6B,8B} (V1 KvCache), Qwen3.5-{0.8B,2B,4B,9B} (V2 hybrid), and BAAI/bge-m3 (V1 embedding). All require locally-compiled artifacts and are skipped in default test runs.
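The mode decision from the `local_language_model.rs` bullet can be sketched as pure token-scan logic. This is a minimal sketch, not the PR's actual code: the enum and function names are hypothetical, and the real marker ids come from the tokenizer.

```rust
/// Starting decode mode, per the `local_language_model.rs` bullet above.
/// (Names here are illustrative, not the PR's actual types.)
#[derive(Debug, PartialEq)]
enum Mode {
    Reasoning,
    Content,
}

/// Scan the prompt's token ids for the most recent `<think>` / `</think>`
/// marker. An unclosed trailing `<think>` means the template expects the
/// model to stream reasoning tokens first; a closed pair (or no marker at
/// all) means generation starts in plain content mode.
fn initial_mode(input_tokens: &[u32], think_id: u32, end_think_id: u32) -> Mode {
    let last_open = input_tokens.iter().rposition(|&t| t == think_id);
    let last_close = input_tokens.iter().rposition(|&t| t == end_think_id);
    match (last_open, last_close) {
        // `<think>` with no `</think>` after it: still inside reasoning.
        (Some(_), None) => Mode::Reasoning,
        (Some(o), Some(c)) if o > c => Mode::Reasoning,
        // Closed pair, bare `</think>`, or no markers: plain content.
        _ => Mode::Content,
    }
}
```

Scanning by id rather than by rendered string matters because the decoded text may not contain the special tokens at all (see the tokenizer note further down).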
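The `chat_template.rs` alias from the bullet list above can be sketched with a plain map standing in for `serde_json::Value` (the real code re-serializes each `Message` with serde_json; only the `content`/`contents` field names follow the PR, everything else is illustrative):

```rust
use std::collections::BTreeMap;

/// One chat message as the jinja context will see it, with String values
/// standing in for arbitrary JSON (the real code uses serde_json::Value).
type MessageMap = BTreeMap<String, String>;

/// Widen each message: if it has a `contents` field (our serialized name),
/// mirror it under `content` (the singular name HF-style templates read).
/// The original `contents` key is left untouched, so the on-the-wire
/// schema is unchanged; only what the template sees is widened.
fn alias_content(messages: &[MessageMap]) -> Vec<MessageMap> {
    messages
        .iter()
        .map(|m| {
            let mut widened = m.clone();
            if let Some(body) = m.get("contents") {
                widened
                    .entry("content".to_string())
                    .or_insert_with(|| body.clone());
            }
            widened
        })
        .collect()
}
```

The widened list is what would be handed to the jinja context in place of the original messages.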
Summary
Adds support in the local runtime for V2-bytecode artifacts that the matching mlc-llm release emits — Qwen3.5 dense (Gated DeltaNet hybrid: PagedKVCache + RNNState) in particular, plus parity with V1 artifacts (Qwen3 KvCache, BAAI/bge-m3 embedding). Prior to this change the local runtime was wired for V1-only artifacts; loading a V2 hybrid `rt.dylib` produced semantically incorrect output even though it loaded successfully and ran without runtime errors. The fix is six small, independent code changes in `src/model/local/`, plus an opt-in build-infra knob and a dependency bump that pulls in the matching changes on `tvm-runtime-rs` (brekkylab/tvm-runtime-rs#2) and `relax` (brekkylab/relax#3).

Verified models

End-to-end on macOS arm64 + Metal:

- 1+1=2 ✅
- The capital of France is Paris. ✅
- Paris is the capital of France. ✅
- The capital of France is Paris. ✅

Multi-turn (LCP rewind across PagedKVCache + RNNState) verified with `think_effort=disable` over 3 cumulative turns; context propagation correct (Paris → Paris population → Lyon).

Changes
`src/model/local/chat_template.rs` — `content` alias for HF-style templates

Why. HuggingFace-style chat templates address message bodies as `message.content` (singular, e.g. `{{ message.content }}`), but our `Message` struct serializes the field as `contents` (plural). When the template asked for `message.content`, jinja saw `undefined` and emitted empty user/system bodies — the model received a near-empty prompt and returned content unrelated to the actual conversation.

What. Re-serialize each `Message` into a `serde_json::Value` and inject `content` as an alias mirroring `contents` before passing the messages array into the jinja context. The on-the-wire schema is unchanged; only what jinja sees is widened. The previously passed `messages` is replaced by `messages_for_template` in the jinja `context!` macro.

`src/model/local/inferencer.rs` — prefill returns logits, V2 params via global cache, host-built `logit_positions`

Three independent fixes in the V2 hybrid forward path.
a) `prefill` returns `Option<Tensor>` so the caller can skip a redundant decode pass

Why. The previous `prefill(&mut self, tokens) -> Result<()>` discarded the prefill function's return value; the caller obtained logits for the first generated token by calling `decode(input_tokens.last())` separately. For V1 KvCache that worked, because the additional decode step replayed an already-known token through a deterministic, non-recurrent attention path. For V2 hybrid, the same redundant decode pushed the prompt's last token into the RNNState a second time — RNNState being recurrent, the resulting hidden state diverged from where the prefill left it, and every sampled token from that point on was drawn from a position that no longer corresponded to the prompt tail.

What. Change the signature to `prefill(...) -> Result<Option<Tensor>>`. On the final prefill chunk, extract `logits` from the prefill function's heterogeneous return array (using `tvm_runtime::get_from_any_array`, newly added on `tvm-runtime-rs`) and return it. Earlier chunks return `None`.

b) V2 hybrid: load params via the global runtime tensor cache
Why. V2 hybrid's compiled `batch_prefill` packed function correlates the `Array<Tensor>` it receives with entries the global runtime tensor cache has just loaded. Building params through the in-tree `TensorCache::from` helper produced `ObjectRef`s that look identical at the dlpack level (same shape, dtype, data pointer) but cause subtle forward-output drift in V2 hybrid, because the prefill function picks a different code path internally. V1 prefill/decode does not exhibit this because it does not tie params to the global cache instance.

What. Dispatch on the cache descriptor file name:

- `tensor-cache.json` (V2 hybrid artifacts): call `vm.builtin.tensor_cache.load(model_dir, dev_type, dev_id)`, then `vm.builtin.param_array_from_cache_by_name(names)` → `Array<Tensor>`, then `vm.builtin.tensor_cache.clear()`. This matches what mlc-llm's `FunctionTable::LoadParams` does for non-disco models.
- `ndarray-cache.json` (V1 artifacts: Qwen3, BAAI/bge-m3, …): keep the existing in-tree `TensorCache::from` path, so older artifacts take no dependency on the new builtins.

c) `make_logit_positions` allocates on CPU first, then copies to device

Why. Mirrors the `CopyToWorker0` pattern mlc-llm uses. The behavioural effect on Metal is small, but the new shape removes one source of divergence between the two stacks while we are debugging V2 hybrid forward outputs — keeping the allocation pattern aligned makes future apples-to-apples comparison with mlc-llm straightforward.

What. Allocate a `[1]` int32 tensor on `kDLCPU`, write the position via `data_as_slice_mut`, then `copy_from` into a same-shape device-side buffer.

`src/model/local/local_language_model.rs` — first token from prefill, mode init by token id, integration tests

a) Sample the first generated token from the prefill logits
Why. Counterpart to the inferencer.rs change — without it the `Option<Tensor>` return value would be unused.

What. When `prefill` returns `Some(logits)`, sample the first token from those logits and stash the id in `prefilled_first_token: Option<u32>`; the decode loop's first iteration consumes it instead of running its own `decode(last_token) → sample(...)`. Subsequent iterations are unchanged.

b) Decode-mode init by scanning the last `<think>`/`</think>` token id

Why. The decode loop tracks a mode of `"reasoning"`/`"content"`/`"tool_call"` and flips on emitted `<think>`/`</think>` tokens. The starting mode used to be hard-coded to `"content"`, which was correct for templates that close `<think>\n\n</think>\n\n` before generation begins (older Qwen). Newer templates (Qwen3.5) leave `<think>\n` open and expect the model to stream reasoning tokens before emitting `</think>` itself; with the hard-coded `"content"` start, those reasoning tokens were classified as final content and surfaced as the assistant's answer.

What. Resolve the marker token ids via `tokenizer.token_to_id("<think>")`/`("</think>")` (Rust's `tokenizers` crate suppresses special tokens through `decode` even with `skip_special_tokens=false`, so the rendered string is unreliable for this check). Scan `input_tokens` for the last position of each marker and pick the starting mode:

| most recent trailing marker | starting mode |
| --- | --- |
| `<think>` (still open) | `"reasoning"` |
| `</think>` (closed), or no marker | `"content"` |

c) Integration tests (10, all `#[ignore]`-gated)

Adds a `#[cfg(test)] mod tests` block exercising the full pipeline. Each test loads a real `rt.dylib` from `~/.cache/ailoy/<MODEL>/` and runs `LocalLangModel::infer_delta` with a representative prompt, asserting on `FinishReason` and that the assistant message ends up non-empty. Coverage:

- `local_infer_qwen3_0_6b` / `local_infer_qwen3_8b` — V1 KvCache parity.
- `local_infer_qwen35_0_8b_hybrid` / `_2b_hybrid` / `_4b_hybrid` / `_9b_hybrid` — V2 hybrid Gated DeltaNet across the dense Qwen3.5 size matrix.
- `local_infer_qwen3_8b_multi_turn` — multi-turn LCP rewind across `popN(...)` for both PagedKVCache and RNNState.
- `local_infer_throughput_measurement` — emits tok/s + peak RSS for both V1 and V2 hybrid; runs only with `--ignored`, asserts no specific number (host-dependent).

`src/model/local/tokenizer.rs` — `token_to_id` helper

Why. Required by the mode-init scan in
`local_language_model.rs`. The Rust `tokenizers` crate exposes `Tokenizer::token_to_id`, but our wrapper did not.

What. Thin pass-through:
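The pass-through might look like the following sketch. A stub stands in for `tokenizers::Tokenizer` (whose real `token_to_id(&self, token: &str) -> Option<u32>` the wrapper delegates to); the field name and the stub vocabulary are illustrative, not the PR's verbatim code.

```rust
use std::collections::HashMap;

/// Stub standing in for `tokenizers::Tokenizer`, which exposes
/// `token_to_id(&self, token: &str) -> Option<u32>`.
struct InnerTokenizer {
    vocab: HashMap<String, u32>,
}

impl InnerTokenizer {
    fn token_to_id(&self, token: &str) -> Option<u32> {
        self.vocab.get(token).copied()
    }
}

/// Our wrapper type; `inner` would be the real `tokenizers::Tokenizer`.
struct Tokenizer {
    inner: InnerTokenizer,
}

impl Tokenizer {
    /// Thin pass-through to the underlying tokenizer's lookup.
    pub fn token_to_id(&self, token: &str) -> Option<u32> {
        self.inner.token_to_id(token)
    }
}
```

Returning `Option<u32>` unchanged lets the mode-init caller decide how to handle templates whose vocabulary has no `<think>` marker at all.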
`build.rs` — optional `MLC_LLM_LIB_DIR` + `@loader_path` rpath

a) `MLC_LLM_LIB_DIR` (opt-in)

Why. When ailoy is loaded into a Python process that also imports `mlc_llm` (e.g. for ad-hoc model compilation in the same session), both packages must share a single in-process GlobalFunctionTable and a single tvm-ffi static singleton. Otherwise, whichever init block loads second re-registers `ffi.Module.load_from_bytes.const_loader` and aborts at process load. Linking `libmlc_llm_module.dylib` (the same dylib `mlc_llm`'s Python package loads via `tvm.ffi.load_module`) makes the runtime resolve to the single instance.

What. When the env var is set at build time, emit a `cargo:rustc-link-search`/`-link-lib` pair targeting `libmlc_llm_module`. The variant is deliberate: `libmlc_llm_module.dylib` (not plain `libmlc_llm.dylib`), because `mlc_llm`'s Python package loads the `_module` variant; linking plain `libmlc_llm` would put two copies of every `mlc.json_ffi.*` function into the global registry and trip the duplicate-registration assertion at process load. When the env var is unset, the link line is unchanged from before.

b) `@loader_path` rpath

Why. Makes the resulting `_core.abi3.so` portable: dyld looks for `@rpath/libmlc_llm_module.dylib` etc. relative to the cdylib's own directory (`bindings/python/ailoy/`) rather than against an absolute venv path baked in at build time.

What. A single `cargo:rustc-link-arg=-Wl,-rpath,@loader_path` line.

`Cargo.toml` / `Cargo.lock` — bump `tvm-runtime` / `tvm-ffi`

Why. Pulls in the new `tvm_runtime::get_from_any_array` Any-array helper plus the V2-bytecode reader and flat C-ABI shim that the local runtime needs.

What. Both `tvm-runtime` (metal / vulkan target dependencies) and `tvm-ffi` move to brekkylab/tvm-runtime-rs `feat/qwen35-v2-runtime` at `cea927a`. `cea927a` is the head of brekkylab/tvm-runtime-rs#2 (feat: Qwen3.5 V2 hybrid runtime support), which transitively bumps `3rdparty/tvm` to brekkylab/relax#3. `Cargo.lock` regenerates accordingly; no changes to other dependencies.

Cross-PR dependency
This PR is the third in a chain that needs to land in submodule order:

1. feat: Qwen3.5 V2 VM bytecode runtime support — the runtime accepts V2 VM bytecode (`tirx`/`tir` alias) and exposes a flat C-ABI shim for Rust consumers.
2. feat: Qwen3.5 V2 hybrid runtime support — the Rust shell adds the Any-array helper and points `3rdparty/tvm` at (1) above (brekkylab/relax#3).
3. feat(local): Qwen3.5 V2 hybrid runtime support — this PR.

Each PR can be reviewed independently; (3) cannot be merged until (1) and (2) are merged, because the `Cargo.toml` rev pins the squash-commit head of (2), which in turn references the squash-commit head of (1) via its submodule pointer.

Tests

- `cargo build` on macOS arm64 — succeeds with the new dependency rev.
- `cargo test` (default, no `--ignored`) — unchanged from `main`; nothing in this PR touches non-`#[ignore]` tests.
- `cargo test -- --ignored local_infer_qwen35_2b_hybrid` — passes locally with a Qwen3.5-2B `rt.dylib` in `~/.cache/ailoy/`.

Out of scope
The `bindings/python/ailoy/` layout is unchanged. A separate follow-up will copy `libmlc_llm_module` / `libtvm{,_runtime,_ffi[_testing]}` next to `_core.abi3.so` automatically at wheel-build time so `@loader_path` lookup is fully self-contained.
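The `@loader_path` lookup this follow-up completes is set up by the `build.rs` changes described above. As a minimal sketch of that build-script logic (the helper name is hypothetical; the directive strings follow cargo's standard build-script protocol, and this is not the PR's verbatim code):

```rust
/// Sketch of the build.rs decision logic described in this PR.
/// `mlc_llm_lib_dir` carries the optional MLC_LLM_LIB_DIR env var value;
/// a real build.rs would read it with std::env::var(...).ok() and
/// println! each returned directive.
fn link_directives(mlc_llm_lib_dir: Option<&str>) -> Vec<String> {
    // Always emitted: make dyld resolve @rpath/... entries relative to the
    // cdylib's own directory, so the wheel layout is relocatable.
    let mut out = vec!["cargo:rustc-link-arg=-Wl,-rpath,@loader_path".to_string()];
    if let Some(dir) = mlc_llm_lib_dir {
        // Opt-in: link the _module variant so a cohabiting mlc_llm Python
        // package and this cdylib resolve to one GlobalFunctionTable.
        out.push(format!("cargo:rustc-link-search=native={dir}"));
        out.push("cargo:rustc-link-lib=dylib=mlc_llm_module".to_string());
    }
    out
}
```

With the env var unset, only the rpath directive is emitted, matching the "link line is unchanged from before" behaviour described above.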