
feat(local): Qwen3.5 V2 hybrid runtime support #394

Draft

nuri-yoo wants to merge 1 commit into main from feat/qwen35-v2-runtime

Conversation

nuri-yoo commented May 1, 2026

Summary

Adds support in the local runtime for V2-bytecode artifacts that the matching mlc-llm release emits — Qwen3.5 dense (Gated DeltaNet hybrid: PagedKVCache + RNNState) in particular, plus parity with V1 artifacts (Qwen3 KvCache, BAAI/bge-m3 embedding). Prior to this change the local runtime was wired for V1-only artifacts; loading a V2 hybrid rt.dylib produced semantically incorrect output even though it loaded successfully and ran without runtime errors.

The fix is six small, independent code changes in src/model/local/, plus an opt-in build-infra knob and a dependency bump that pulls in the matching changes on tvm-runtime-rs (brekkylab/tvm-runtime-rs#2) and relax (brekkylab/relax#3).

Verified models

End-to-end on macOS arm64 + Metal:

| Model | Kind | Result |
| --- | --- | --- |
| Qwen3-0.6B | V1 KvCache | 1+1=2 |
| Qwen3-8B | V1 KvCache | loads & runs ✅ |
| Qwen3.5-0.8B | V2 hybrid (Gated DeltaNet) | The capital of France is Paris. |
| Qwen3.5-2B | V2 hybrid | Paris is the capital of France. |
| Qwen3.5-4B | V2 hybrid | The capital of France is Paris. |
| Qwen3.5-9B | V2 hybrid | loads (1-token, 24GB ceiling) ✅ |
| BAAI/bge-m3 | V1 embedding | 1024-d L2-normalized vector ✅ |

Multi-turn (LCP rewind across PagedKVCache + RNNState) verified with think_effort=disable over 3 cumulative turns; context propagation correct (Paris → Paris population → Lyon).


Changes

src/model/local/chat_template.rs — content alias for HF-style templates

Why. HuggingFace-style chat templates address message bodies as message.content (singular, e.g. {{ message.content }}), but our Message struct serializes the field as contents (plural). When the template asked for message.content, jinja saw undefined and emitted empty user / system bodies — the model received a near-empty prompt and returned content unrelated to the actual conversation.

What. Re-serialize each Message into a serde_json::Value and inject content as an alias mirroring contents before passing the messages array into the jinja context. The on-the-wire schema is unchanged; only what jinja sees is widened.

let messages_for_template: Vec<serde_json::Value> = messages
    .iter()
    .map(|m| {
        let mut v = serde_json::to_value(m).unwrap_or(serde_json::Value::Null);
        if let Some(obj) = v.as_object_mut() {
            if let Some(c) = obj.get("contents").cloned() {
                obj.insert("content".to_string(), c);
            }
        }
        v
    })
    .collect();

The previously passed messages value is replaced by messages_for_template in the jinja context! macro.

src/model/local/inferencer.rs — prefill returns logits, V2 params via global cache, host-built logit_positions

Three independent fixes in the V2 hybrid forward path.

a) prefill returns Option<Tensor> so the caller can skip a redundant decode pass

Why. The previous prefill(&mut self, tokens) -> Result<()> discarded the prefill function's return value; the caller obtained logits for the first generated token by calling decode(input_tokens.last()) separately. For V1 KvCache that worked because the additional decode step replayed an already-known token through a deterministic, non-recurrent attention path. For V2 hybrid, the same redundant decode pushed the prompt's last token into the RNNState a second time — RNNState being recurrent, the resulting hidden state diverged from where the prefill left it, and every sampled token from that point on was drawn from a position that no longer corresponded to the prompt tail.

What. Change the signature to prefill(...) -> Result<Option<Tensor>>. On the final prefill chunk, extract logits from the prefill function's heterogeneous return array (using tvm_runtime::get_from_any_array newly added on tvm-runtime-rs) and return it. Earlier chunks return None.

// Final prefill chunk: surface its logits to the caller instead of
// discarding them (earlier chunks fall through and return None).
if j == new_tokens.len() {
    let logits: Tensor = unsafe {
        tvm_runtime::get_from_any_array(output, 0)
            .map_err(|e| anyhow!("Failed to get prefill logits: {:?}", e))?
    };
    return Ok(Some(logits));
}

b) V2 hybrid: load params via the global runtime tensor cache

Why. V2 hybrid's compiled batch_prefill packed function correlates the Array<Tensor> it receives with entries the global runtime tensor cache has just loaded. Building params through the in-tree TensorCache::from helper produced ObjectRefs that look identical at the dlpack level (same shape, dtype, data pointer) but cause subtle forward-output drift in V2 hybrid because the prefill function picks a different code path internally. V1 prefill/decode does not exhibit this because it does not tie params to the global cache instance.

What. Dispatch on the cache descriptor file name (a rough sketch follows the list):

  • tensor-cache.json (V2 hybrid artifacts): call vm.builtin.tensor_cache.load(model_dir, dev_type, dev_id), then vm.builtin.param_array_from_cache_by_name(names) → Array<Tensor>, then vm.builtin.tensor_cache.clear(). This matches what mlc-llm's FunctionTable::LoadParams does for non-disco models.
  • ndarray-cache.json (V1 artifacts: Qwen3, BAAI/bge-m3, …): keep the existing in-tree TensorCache::from path. This keeps older artifacts free of any dependency on the new builtins.
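
The dispatch, roughly. In this sketch get_global_func, call, and load_params are placeholders for whatever packed-function / loader API tvm-runtime-rs actually exposes; the vm.builtin.* names and the load → read → clear sequence are the ones described above.

// Sketch only — placeholder API around the real packed-function calls.
let params = if model_dir.join("tensor-cache.json").exists() {
    // V2 hybrid artifacts: load through the global runtime tensor cache,
    // mirroring mlc-llm's FunctionTable::LoadParams for non-disco models.
    get_global_func("vm.builtin.tensor_cache.load")?
        .call((model_dir.to_str().unwrap(), dev_type, dev_id))?;
    let params = get_global_func("vm.builtin.param_array_from_cache_by_name")?
        .call((param_names.clone(),))?;                     // -> Array<Tensor>
    get_global_func("vm.builtin.tensor_cache.clear")?.call(())?;
    params
} else {
    // V1 artifacts (ndarray-cache.json): keep the existing in-tree path.
    TensorCache::from(model_dir)?.load_params(&param_names)?
};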

c) make_logit_positions allocates on CPU first, then copies to device

Why. Mirrors the CopyToWorker0 pattern mlc-llm uses. The behavioural effect on Metal is small, but keeping the allocation pattern aligned removes one source of divergence between the two stacks while we are debugging V2 hybrid forward outputs, and it makes future apples-to-apples comparisons with mlc-llm straightforward.

What. Allocate a [1] int32 tensor on kDLCPU, write the position via data_as_slice_mut, then copy_from into a same-shape device-side buffer.
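
Roughly (a sketch only — data_as_slice_mut and copy_from are the calls named above, while Tensor::empty and the constructor arguments are stand-ins for the wrapper's real API):

// Stage the logit position on the host, then do one explicit host->device copy.
let mut host = Tensor::empty(&[1], DataType::int32(), Device::cpu())?;      // kDLCPU staging buffer
host.data_as_slice_mut::<i32>()?[0] = position as i32;                      // write the position on host
let mut device_buf = Tensor::empty(&[1], DataType::int32(), self.device)?;  // same-shape device buffer
device_buf.copy_from(&host)?;                                               // device-side logit_positions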

src/model/local/local_language_model.rs — first token from prefill, mode init by token id, integration tests

a) Sample the first generated token from the prefill logits

Why. Counterpart to the inferencer.rs change — without it the Option<Tensor> return value would be unused.

What. When prefill returns Some(logits), sample the first token from those logits and stash the id in prefilled_first_token: Option<u32>; the decode loop's first iteration consumes it instead of running its own decode(last_token) → sample(...). Subsequent iterations are unchanged.

// When prefill hands back the final-chunk logits, sample the first generated
// token from them here instead of running an extra decode pass.
let mut prefilled_first_token: Option<u32> = {
    let prefill_logits = self.inferencer.prefill(&input_tokens).unwrap();
    let temperature = config.temperature.unwrap_or(0.6);
    let top_p = config.top_p.unwrap_or(0.9);
    prefill_logits.map(|l| self.inferencer.sample(l, temperature, top_p).unwrap())
};

b) Decode-mode init by scanning the last <think> / </think> token id

Why. The decode loop tracks a mode of "reasoning" / "content" / "tool_call" and flips on emitted <think> / </think> tokens. The starting mode used to be hard-coded to "content", which was correct for templates that close <think>\n\n</think>\n\n before generation begins (older Qwen). Newer templates (Qwen3.5) leave <think>\n open and expect the model to stream reasoning tokens before emitting </think> itself; with the hard-coded "content" start, those reasoning tokens were classified as final content and surfaced as the assistant's answer.

What. Resolve the marker token ids via tokenizer.token_to_id("<think>") / ("</think>") (Rust's tokenizers crate suppresses special tokens through decode even with skip_special_tokens=false, so the rendered string is unreliable for this check). Scan input_tokens for the last position of each marker and pick the starting mode (a sketch follows the table):

| Last <think> | Last </think> | Starting mode |
| --- | --- | --- |
| present, after </think> | present | "reasoning" |
| present | absent | "reasoning" |
| otherwise (incl. closed pair) | | "content" |
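
A minimal sketch of the scan (the function name and signature are illustrative; the marker ids come from the token_to_id lookups described above):

// Pick the starting decode mode from the last <think>/</think> marker in the prompt.
fn initial_mode(tokens: &[u32], think_id: Option<u32>, end_think_id: Option<u32>) -> &'static str {
    let last_open = think_id.and_then(|id| tokens.iter().rposition(|&t| t == id));
    let last_close = end_think_id.and_then(|id| tokens.iter().rposition(|&t| t == id));
    match (last_open, last_close) {
        // <think> left open with no </think>: the model streams reasoning first.
        (Some(_), None) => "reasoning",
        // <think> re-opened after the last </think>: same thing.
        (Some(open), Some(close)) if open > close => "reasoning",
        // Closed pair, or no markers at all: start in content mode.
        _ => "content",
    }
}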

c) Integration tests (10, all #[ignore]-gated)

Adds a #[cfg(test)] mod tests block exercising the full pipeline. Each test loads a real rt.dylib from ~/.cache/ailoy/<MODEL>/ and runs LocalLangModel::infer_delta with a representative prompt, asserting on FinishReason and that the assistant message ends up non-empty. Coverage:

  • local_infer_qwen3_0_6b / local_infer_qwen3_8b — V1 KvCache parity.
  • local_infer_qwen35_0_8b_hybrid / _2b_hybrid / _4b_hybrid / _9b_hybrid — V2 hybrid Gated DeltaNet across the dense Qwen3.5 size matrix.
  • local_infer_qwen3_8b_multi_turn — multi-turn LCP rewind across popN(...) for both PagedKVCache and RNNState.
  • local_infer_throughput_measurement — emits tok/s + peak RSS for both V1 and V2 hybrid; runs only with --ignored, asserts no specific number (host-dependent).

⚠️ All tests are gated by #[ignore = "requires locally-compiled <MODEL> artifacts"] and require the corresponding rt.dylib under ~/.cache/ailoy/<MODEL>--<TARGET>--<DEVICE>/. They never run in the default cargo test invocation, so CI is not affected. Run individually with cargo test -- --ignored local_infer_qwen35_2b_hybrid.

src/model/local/tokenizer.rs — token_to_id helper

Why. Required by the mode-init scan in local_language_model.rs. The Rust tokenizers crate exposes Tokenizer::token_to_id but our wrapper did not.

What. Thin pass-through:

/// Resolve a (special or regular) token literal to its id.
pub fn token_to_id(&self, token: &str) -> Option<u32> {
    self.inner.token_to_id(token)
}

build.rs — optional MLC_LLM_LIB_DIR + @loader_path rpath

a) MLC_LLM_LIB_DIR (opt-in)

Why. When ailoy is loaded into a Python process that also imports mlc_llm (e.g. for ad-hoc model compilation in the same session), both packages must share a single in-process GlobalFunctionTable and a single tvm-ffi static singleton. Otherwise the second init-block to load re-registers ffi.Module.load_from_bytes.const_loader and aborts at process load. Linking libmlc_llm_module.dylib (the same dylib mlc_llm's Python package loads via tvm.ffi.load_module) makes the runtime resolve to the single instance.

What. When the env var is set at build time, emit a cargo:rustc-link-search/-link-lib pair targeting libmlc_llm_module. The variant is deliberate: libmlc_llm_module.dylib (not plain libmlc_llm.dylib), because mlc_llm's Python package loads the _module variant; linking plain libmlc_llm would put two copies of every mlc.json_ffi.* function into the global registry and trip the duplicate-registration assertion at process load. When the env var is unset, the link line is unchanged from before.

b) @loader_path rpath

Why. Makes the resulting _core.abi3.so portable: dyld looks for @rpath/libmlc_llm_module.dylib etc. relative to the cdylib's own directory (bindings/python/ailoy/) rather than against an absolute venv path baked in at build time.

What. A single cargo:rustc-link-arg=-Wl,-rpath,@loader_path line.
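
Roughly, the two additions together (a sketch; the existing link lines in build.rs are omitted):

// a) opt-in: only emitted when MLC_LLM_LIB_DIR is set at build time.
if let Ok(dir) = std::env::var("MLC_LLM_LIB_DIR") {
    println!("cargo:rustc-link-search=native={dir}");
    println!("cargo:rustc-link-lib=dylib=mlc_llm_module"); // the _module variant, not plain libmlc_llm
}
// b) let dyld resolve @rpath/libmlc_llm_module.dylib etc. next to the cdylib itself.
println!("cargo:rustc-link-arg=-Wl,-rpath,@loader_path");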

📌 Compatibility note for reviewers. This adds a single LC_RPATH entry of @loader_path to the cdylib. The change is a no-op for users who don't co-locate dylibs next to _core.abi3.so (dyld silently ignores rpath entries it cannot resolve), and macOS-only in effect even though the linker flag is emitted on Linux too — -Wl,-rpath is honored on both, but only macOS uses @loader_path semantics; Linux's $ORIGIN would be the equivalent and is not affected here. No change to the cdylib's exported symbols, no change to the existing LC_LOAD_DYLIB table.

Cargo.toml / Cargo.lock — bump tvm-runtime / tvm-ffi

Why. Pulls in the new tvm_runtime::get_from_any_array Any-array helper plus the V2-bytecode reader and flat C-ABI shim that the local runtime needs.

What. Both tvm-runtime (metal / vulkan target dependencies) and tvm-ffi move to:

{ git = "https://github.com/brekkylab/tvm-runtime-rs", rev = "cea927a" }

cea927a is the head of brekkylab/tvm-runtime-rs#2 (feat: Qwen3.5 V2 hybrid runtime support), which transitively bumps 3rdparty/tvm to brekkylab/relax#3. Cargo.lock regenerates accordingly; no changes to other dependencies.


Cross-PR dependency

This PR is the third in a chain that needs to land in submodule order:

  1. brekkylab/relax#3 — feat: Qwen3.5 V2 VM bytecode runtime support
    Runtime accepts V2 VM bytecode (tirx / tir alias) and exposes a flat C-ABI shim for Rust consumers.
  2. brekkylab/tvm-runtime-rs#2 — feat: Qwen3.5 V2 hybrid runtime support
    Rust shell adds the Any-array helper and points 3rdparty/tvm at (1) above.
  3. this PR — bumps the Rust dependency to (2) and ships the inferencer / chat-template / mode-init / build-infra changes.

Each PR can be reviewed independently; (3) cannot be merged until (1) and (2) are merged because the Cargo.toml rev pins the squash-commit head of (2), which in turn references the squash-commit head of (1) via its submodule pointer.

Tests

  • cargo build on macOS arm64 — succeeds with the new dependency rev.
  • cargo test (default, no --ignored) — unchanged from main; nothing in this PR touches non-#[ignore] tests.
  • cargo test -- --ignored local_infer_qwen35_2b_hybrid — passes locally with a Qwen3.5-2B rt.dylib in ~/.cache/ailoy/.
  • Manual end-to-end across the verified-models matrix above.

Out of scope

  • Wheel packaging changes — bindings/python/ailoy/ layout is unchanged. A separate follow-up will copy libmlc_llm_module / libtvm{,_runtime,_ffi[_testing]} next to _core.abi3.so automatically at wheel-build time so @loader_path lookup is fully self-contained.

Adds support in the local runtime for V2-bytecode artifacts that the
matching mlc-llm release emits — Qwen3.5 (Gated DeltaNet hybrid:
PagedKVCache + RNNState) in particular, plus parity with V1
artifacts (Qwen3 KvCache, BAAI/bge-m3 embedding).

Six load-bearing changes plus build-infra and dependency bumps:

- chat_template.rs: inject `content` alias so jinja templates that
  read `message.content` (singular) see the body our struct
  serializes as `contents`.
- inferencer.rs: prefill returns the final-chunk logits so the
  caller can skip a redundant decode pass on the prompt's last
  token; V2 hybrid params are loaded via the global tensor cache
  (vm.builtin.tensor_cache.*) so the packed prefill function gets
  the Tensor instances it expects; logit_positions buffer is
  allocated CPU-first then copied to device.
- local_language_model.rs: sample the first token from prefill
  logits when present; decide reasoning/content mode by scanning
  the prompt's last token ids for the most recent <think>/</think>
  marker (handles open `<think>\n` and closed `<think>...</think>`
  trailers alike).
- tokenizer.rs: add `token_to_id` helper used by mode init.
- build.rs: optional MLC_LLM_LIB_DIR knob to link
  libmlc_llm_module so a cohabiting mlc_llm Python package shares a
  single GlobalFunctionTable in process; @loader_path rpath so the
  cdylib is portable next to its dylibs.
- Cargo.toml: bump tvm-runtime / tvm-ffi to brekkylab/tvm-runtime-rs
  feat/qwen35-v2-runtime (cea927a, the head of
  brekkylab/tvm-runtime-rs#2, which transitively depends on
  brekkylab/relax#3).

Adds 10 #[ignore]-gated integration tests covering Qwen3-{0.6B,8B}
(V1 KvCache), Qwen3.5-{0.8B,2B,4B,9B} (V2 hybrid), and BAAI/bge-m3
(V1 embedding). All require locally-compiled artifacts and are
skipped in default test runs.