
feat(local): Qwen3.5 V2 hybrid runtime support #394

Draft

nuri-yoo wants to merge 1 commit into main from feat/qwen35-v2-runtime

Conversation

nuri-yoo commented May 1, 2026

Summary

Adds support in the local runtime for V2-bytecode artifacts that the matching mlc-llm release emits — Qwen3.5 dense (Gated DeltaNet hybrid: PagedKVCache + RNNState) in particular, plus parity with V1 artifacts (Qwen3 KvCache, BAAI/bge-m3 embedding). Prior to this change the local runtime was wired for V1-only artifacts; loading a V2 hybrid rt.dylib produced semantically incorrect output even though it loaded successfully and ran without runtime errors.

The fix is six small, independent code changes in src/model/local/, plus an opt-in build-infra knob and a dependency bump that pulls in the matching changes on tvm-runtime-rs (brekkylab/tvm-runtime-rs#2) and relax (brekkylab/relax#3).

Verified models

End-to-end on macOS arm64 + Metal:

| Model | Kind | Result |
| --- | --- | --- |
| Qwen3-0.6B | V1 KvCache | 1+1=2 |
| Qwen3-8B | V1 KvCache | loads & runs ✅ |
| Qwen3.5-0.8B | V2 hybrid (Gated DeltaNet) | The capital of France is Paris. |
| Qwen3.5-2B | V2 hybrid | Paris is the capital of France. |
| Qwen3.5-4B | V2 hybrid | The capital of France is Paris. |
| Qwen3.5-9B | V2 hybrid | loads (1-token, 24GB ceiling) ✅ |
| BAAI/bge-m3 | V1 embedding | 1024-d L2-normalized vector ✅ |

Multi-turn (LCP rewind across PagedKVCache + RNNState) verified with think_effort=disable over 3 cumulative turns; context propagation correct (Paris → Paris population → Lyon).


Changes

src/model/local/chat_template.rs — content alias for HF-style templates

Why. HuggingFace-style chat templates address message bodies as message.content (singular, e.g. {{ message.content }}), but our Message struct serializes the field as contents (plural). When the template asked for message.content, jinja saw undefined and emitted empty user / system bodies — the model received a near-empty prompt and returned content unrelated to the actual conversation.

What. Re-serialize each Message into a serde_json::Value and inject content as an alias mirroring contents before passing the messages array into the jinja context. The on-the-wire schema is unchanged; only what jinja sees is widened.

let messages_for_template: Vec<serde_json::Value> = messages
    .iter()
    .map(|m| {
        let mut v = serde_json::to_value(m).unwrap_or(serde_json::Value::Null);
        if let Some(obj) = v.as_object_mut() {
            if let Some(c) = obj.get("contents").cloned() {
                obj.insert("content".to_string(), c);
            }
        }
        v
    })
    .collect();

The previously passed messages value is replaced by messages_for_template in the jinja context! macro.

src/model/local/inferencer.rs — prefill returns logits, V2 params via global cache, host-built logit_positions

Three independent fixes in the V2 hybrid forward path.

a) prefill returns Option<Tensor> so the caller can skip a redundant decode pass

Why. The previous prefill(&mut self, tokens) -> Result<()> discarded the prefill function's return value; the caller obtained logits for the first generated token by calling decode(input_tokens.last()) separately. For V1 KvCache that worked because the additional decode step replayed an already-known token through a deterministic, non-recurrent attention path. For V2 hybrid, the same redundant decode pushed the prompt's last token into the RNNState a second time — RNNState being recurrent, the resulting hidden state diverged from where the prefill left it, and every sampled token from that point on was drawn from a position that no longer corresponded to the prompt tail.

What. Change the signature to prefill(...) -> Result<Option<Tensor>>. On the final prefill chunk, extract logits from the prefill function's heterogeneous return array (using tvm_runtime::get_from_any_array newly added on tvm-runtime-rs) and return it. Earlier chunks return None.

// Final prefill chunk: surface its logits to the caller instead of
// discarding them (earlier chunks fall through and return None).
if j == new_tokens.len() {
    let logits: Tensor = unsafe {
        tvm_runtime::get_from_any_array(output, 0)
            .map_err(|e| anyhow!("Failed to get prefill logits: {:?}", e))?
    };
    return Ok(Some(logits));
}

b) V2 hybrid: load params via the global runtime tensor cache

Why. V2 hybrid's compiled batch_prefill packed function correlates the Array<Tensor> it receives with entries the global runtime tensor cache has just loaded. Building params through the in-tree TensorCache::from helper produced ObjectRefs that look identical at the dlpack level (same shape, dtype, data pointer) but cause subtle forward-output drift in V2 hybrid because the prefill function picks a different code path internally. V1 prefill/decode does not exhibit this because it does not tie params to the global cache instance.

What. Dispatch on the cache descriptor file name (a rough sketch follows the list):

  • tensor-cache.json (V2 hybrid artifacts): call vm.builtin.tensor_cache.load(model_dir, dev_type, dev_id), then vm.builtin.param_array_from_cache_by_name(names) → Array<Tensor>, then vm.builtin.tensor_cache.clear(). This matches what mlc-llm's FunctionTable::LoadParams does for non-disco models.
  • ndarray-cache.json (V1 artifacts: Qwen3, BAAI/bge-m3, …): keep the existing in-tree TensorCache::from path. This keeps older artifacts free of any dependency on the new builtins.
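
The dispatch, roughly. In this sketch get_global_func, call, and load_params are placeholders for whatever packed-function / loader API tvm-runtime-rs actually exposes; the vm.builtin.* names and the load → read → clear sequence are the ones described above.

// Sketch only — placeholder API around the real packed-function calls.
let params = if model_dir.join("tensor-cache.json").exists() {
    // V2 hybrid artifacts: load through the global runtime tensor cache,
    // mirroring mlc-llm's FunctionTable::LoadParams for non-disco models.
    get_global_func("vm.builtin.tensor_cache.load")?
        .call((model_dir.to_str().unwrap(), dev_type, dev_id))?;
    let params = get_global_func("vm.builtin.param_array_from_cache_by_name")?
        .call((param_names.clone(),))?;                     // -> Array<Tensor>
    get_global_func("vm.builtin.tensor_cache.clear")?.call(())?;
    params
} else {
    // V1 artifacts (ndarray-cache.json): keep the existing in-tree path.
    TensorCache::from(model_dir)?.load_params(&param_names)?
};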

c) make_logit_positions allocates on CPU first, then copies to device

Why. Mirrors the CopyToWorker0 pattern mlc-llm uses. The behavioural effect on Metal is small, but keeping the allocation pattern aligned removes one source of divergence between the two stacks while we are debugging V2 hybrid forward outputs, and it makes future apples-to-apples comparisons with mlc-llm straightforward.

What. Allocate a [1] int32 tensor on kDLCPU, write the position via data_as_slice_mut, then copy_from into a same-shape device-side buffer.
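
Roughly (a sketch only — data_as_slice_mut and copy_from are the calls named above, while Tensor::empty and the constructor arguments are stand-ins for the wrapper's real API):

// Stage the logit position on the host, then do one explicit host->device copy.
let mut host = Tensor::empty(&[1], DataType::int32(), Device::cpu())?;      // kDLCPU staging buffer
host.data_as_slice_mut::<i32>()?[0] = position as i32;                      // write the position on host
let mut device_buf = Tensor::empty(&[1], DataType::int32(), self.device)?;  // same-shape device buffer
device_buf.copy_from(&host)?;                                               // device-side logit_positions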

src/model/local/local_language_model.rs — first token from prefill, mode init by token id, integration tests

a) Sample the first generated token from the prefill logits

Why. Counterpart to the inferencer.rs change — without it the Option<Tensor> return value would be unused.

What. When prefill returns Some(logits), sample the first token from those logits and stash the id in prefilled_first_token: Option<u32>; the decode loop's first iteration consumes it instead of running its own decode(last_token) → sample(...). Subsequent iterations are unchanged.

// When prefill hands back the final-chunk logits, sample the first generated
// token from them here instead of running an extra decode pass.
let mut prefilled_first_token: Option<u32> = {
    let prefill_logits = self.inferencer.prefill(&input_tokens).unwrap();
    let temperature = config.temperature.unwrap_or(0.6);
    let top_p = config.top_p.unwrap_or(0.9);
    prefill_logits.map(|l| self.inferencer.sample(l, temperature, top_p).unwrap())
};

b) Decode-mode init by scanning the last <think> / </think> token id

Why. The decode loop tracks a mode of "reasoning" / "content" / "tool_call" and flips on emitted <think> / </think> tokens. The starting mode used to be hard-coded to "content", which was correct for templates that close <think>\n\n</think>\n\n before generation begins (older Qwen). Newer templates (Qwen3.5) leave <think>\n open and expect the model to stream reasoning tokens before emitting </think> itself; with the hard-coded "content" start, those reasoning tokens were classified as final content and surfaced as the assistant's answer.

What. Resolve the marker token ids via tokenizer.token_to_id("<think>") / ("</think>") (Rust's tokenizers crate suppresses special tokens through decode even with skip_special_tokens=false, so the rendered string is unreliable for this check). Scan input_tokens for the last position of each marker and pick the starting mode (a sketch follows the table):

| Last <think> | Last </think> | Starting mode |
| --- | --- | --- |
| present, after </think> | present | "reasoning" |
| present | absent | "reasoning" |
| otherwise (incl. closed pair) | | "content" |
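
A minimal sketch of the scan (the function name and signature are illustrative; the marker ids come from the token_to_id lookups described above):

// Pick the starting decode mode from the last <think>/</think> marker in the prompt.
fn initial_mode(tokens: &[u32], think_id: Option<u32>, end_think_id: Option<u32>) -> &'static str {
    let last_open = think_id.and_then(|id| tokens.iter().rposition(|&t| t == id));
    let last_close = end_think_id.and_then(|id| tokens.iter().rposition(|&t| t == id));
    match (last_open, last_close) {
        // <think> left open with no </think>: the model streams reasoning first.
        (Some(_), None) => "reasoning",
        // <think> re-opened after the last </think>: same thing.
        (Some(open), Some(close)) if open > close => "reasoning",
        // Closed pair, or no markers at all: start in content mode.
        _ => "content",
    }
}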

c) Integration tests (10, all #[ignore]-gated)

Adds a #[cfg(test)] mod tests block exercising the full pipeline. Each test loads a real rt.dylib from ~/.cache/ailoy/<MODEL>/ and runs LocalLangModel::infer_delta with a representative prompt, asserting on FinishReason and that the assistant message ends up non-empty. Coverage:

  • local_infer_qwen3_0_6b / local_infer_qwen3_8b — V1 KvCache parity.
  • local_infer_qwen35_0_8b_hybrid / _2b_hybrid / _4b_hybrid / _9b_hybrid — V2 hybrid Gated DeltaNet across the dense Qwen3.5 size matrix.
  • local_infer_qwen3_8b_multi_turn — multi-turn LCP rewind across popN(...) for both PagedKVCache and RNNState.
  • local_infer_throughput_measurement — emits tok/s + peak RSS for both V1 and V2 hybrid; runs only with --ignored, asserts no specific number (host-dependent).

⚠️ All tests are gated by #[ignore = "requires locally-compiled <MODEL> artifacts"] and require the corresponding rt.dylib under ~/.cache/ailoy/<MODEL>--<TARGET>--<DEVICE>/. They never run in the default cargo test invocation, so CI is not affected. Run individually with cargo test -- --ignored local_infer_qwen35_2b_hybrid.

src/model/local/tokenizer.rs — token_to_id helper

Why. Required by the mode-init scan in local_language_model.rs. The Rust tokenizers crate exposes Tokenizer::token_to_id but our wrapper did not.

What. Thin pass-through:

/// Resolve a (special or regular) token literal to its id.
pub fn token_to_id(&self, token: &str) -> Option<u32> {
    self.inner.token_to_id(token)
}

build.rs — optional MLC_LLM_LIB_DIR + @loader_path rpath

a) MLC_LLM_LIB_DIR (opt-in)

Why. When ailoy is loaded into a Python process that also imports mlc_llm (e.g. for ad-hoc model compilation in the same session), both packages must share a single in-process GlobalFunctionTable and a single tvm-ffi static singleton. Otherwise the second init-block to load re-registers ffi.Module.load_from_bytes.const_loader and aborts at process load. Linking libmlc_llm_module.dylib (the same dylib mlc_llm's Python package loads via tvm.ffi.load_module) makes the runtime resolve to the single instance.

What. When the env var is set at build time, emit a cargo:rustc-link-search/-link-lib pair targeting libmlc_llm_module. The variant is deliberate: libmlc_llm_module.dylib (not plain libmlc_llm.dylib), because mlc_llm's Python package loads the _module variant; linking plain libmlc_llm would put two copies of every mlc.json_ffi.* function into the global registry and trip the duplicate-registration assertion at process load. When the env var is unset, the link line is unchanged from before.

b) @loader_path rpath

Why. Makes the resulting _core.abi3.so portable: dyld looks for @rpath/libmlc_llm_module.dylib etc. relative to the cdylib's own directory (bindings/python/ailoy/) rather than against an absolute venv path baked in at build time.

What. A single cargo:rustc-link-arg=-Wl,-rpath,@loader_path line.
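
Roughly, the two additions together (a sketch; the existing link lines in build.rs are omitted):

// a) opt-in: only emitted when MLC_LLM_LIB_DIR is set at build time.
if let Ok(dir) = std::env::var("MLC_LLM_LIB_DIR") {
    println!("cargo:rustc-link-search=native={dir}");
    println!("cargo:rustc-link-lib=dylib=mlc_llm_module"); // the _module variant, not plain libmlc_llm
}
// b) let dyld resolve @rpath/libmlc_llm_module.dylib etc. next to the cdylib itself.
println!("cargo:rustc-link-arg=-Wl,-rpath,@loader_path");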

📌 Compatibility note for reviewers. This adds a single LC_RPATH entry of @loader_path to the cdylib. The change is a no-op for users who don't co-locate dylibs next to _core.abi3.so (dyld silently ignores rpath entries it cannot resolve), and macOS-only in effect even though the linker flag is emitted on Linux too — -Wl,-rpath is honored on both, but only macOS uses @loader_path semantics; Linux's $ORIGIN would be the equivalent and is not affected here. No change to the cdylib's exported symbols, no change to the existing LC_LOAD_DYLIB table.

Cargo.toml / Cargo.lock — bump tvm-runtime / tvm-ffi

Why. Pulls in the new tvm_runtime::get_from_any_array Any-array helper plus the V2-bytecode reader and flat C-ABI shim that the local runtime needs.

What. Both tvm-runtime (metal / vulkan target dependencies) and tvm-ffi move to:

{ git = "https://github.com/brekkylab/tvm-runtime-rs", rev = "cea927a" }

cea927a is the head of brekkylab/tvm-runtime-rs#2 (feat: Qwen3.5 V2 hybrid runtime support), which transitively bumps 3rdparty/tvm to brekkylab/relax#3. Cargo.lock regenerates accordingly; no changes to other dependencies.


Cross-PR dependency

This PR is the third in a chain that needs to land in submodule order:

  1. brekkylab/relax#3 — feat: Qwen3.5 V2 VM bytecode runtime support
    Runtime accepts V2 VM bytecode (tirx / tir alias) and exposes a flat C-ABI shim for Rust consumers.
  2. brekkylab/tvm-runtime-rs#2 — feat: Qwen3.5 V2 hybrid runtime support
    Rust shell adds the Any-array helper and points 3rdparty/tvm at (1) above.
  3. this PR — bumps the Rust dependency to (2) and ships the inferencer / chat-template / mode-init / build-infra changes.

Each PR can be reviewed independently; (3) cannot be merged until (1) and (2) are merged because the Cargo.toml rev pins the squash-commit head of (2), which in turn references the squash-commit head of (1) via its submodule pointer.

Tests

  • cargo build on macOS arm64 — succeeds with the new dependency rev.
  • cargo test (default, no --ignored) — unchanged from main; nothing in this PR touches non-#[ignore] tests.
  • cargo test -- --ignored local_infer_qwen35_2b_hybrid — passes locally with a Qwen3.5-2B rt.dylib in ~/.cache/ailoy/.
  • Manual end-to-end across the verified-models matrix above.

Out of scope

  • Wheel packaging changes — bindings/python/ailoy/ layout is unchanged. A separate follow-up will copy libmlc_llm_module / libtvm{,_runtime,_ffi[_testing]} next to _core.abi3.so automatically at wheel-build time so @loader_path lookup is fully self-contained.

Adds support in the local runtime for V2-bytecode artifacts that the
matching mlc-llm release emits — Qwen3.5 (Gated DeltaNet hybrid:
PagedKVCache + RNNState) in particular, plus parity with V1
artifacts (Qwen3 KvCache, BAAI/bge-m3 embedding).

Six load-bearing changes plus build-infra and dependency bumps:

- chat_template.rs: inject `content` alias so jinja templates that
  read `message.content` (singular) see the body our struct
  serializes as `contents`.
- inferencer.rs: prefill returns the final-chunk logits so the
  caller can skip a redundant decode pass on the prompt's last
  token; V2 hybrid params are loaded via the global tensor cache
  (vm.builtin.tensor_cache.*) so the packed prefill function gets
  the Tensor instances it expects; logit_positions buffer is
  allocated CPU-first then copied to device.
- local_language_model.rs: sample the first token from prefill
  logits when present; decide reasoning/content mode by scanning
  the prompt's last token ids for the most recent <think>/</think>
  marker (handles open `<think>\n` and closed `<think>...</think>`
  trailers alike).
- tokenizer.rs: add `token_to_id` helper used by mode init.
- build.rs: optional MLC_LLM_LIB_DIR knob to link
  libmlc_llm_module so a cohabiting mlc_llm Python package shares a
  single GlobalFunctionTable in process; @loader_path rpath so the
  cdylib is portable next to its dylibs.
- Cargo.toml: bump tvm-runtime / tvm-ffi to brekkylab/tvm-runtime-rs
  feat/qwen35-v2-runtime (cea927a, the head of
  brekkylab/tvm-runtime-rs#2, which transitively depends on
  brekkylab/relax#3).

Adds 10 #[ignore]-gated integration tests covering Qwen3-{0.6B,8B}
(V1 KvCache), Qwen3.5-{0.8B,2B,4B,9B} (V2 hybrid), and BAAI/bge-m3
(V1 embedding). All require locally-compiled artifacts and are
skipped in default test runs.