Sync upstream: speculative checkpointing for hybrid models by chad-loder · Pull Request #100 · TheTom/llama-cpp-turboquant

chad-loder · 2026-04-21T23:01:05Z

Summary

Merge upstream llama.cpp master to pick up speculative checkpointing
(ggml-org#19493), which enables speculative decoding for hybrid
MoE+SSM models like Qwen3.6-35B-A3B.

Without this, llama_memory_recurrent::seq_rm() returns false for partial
sequence removal, blocking all speculative decoding on hybrid architectures.
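
To illustrate why checkpointing unblocks this, here is a minimal toy sketch in Python. It is not llama.cpp code: `ToyRecurrentState`, `step`, `save`, `restore`, and `speculate` are hypothetical names, and replaying the accepted prefix after a restore is one possible strategy assumed for illustration. The point is that a recurrent state folds every token into a single value, so rejected draft tokens cannot be trimmed off the end the way trailing KV-cache entries can.

```python
from dataclasses import dataclass


@dataclass
class ToyRecurrentState:
    """Hypothetical stand-in for a recurrent/SSM layer state (not llama.cpp API)."""
    h: int = 0         # single running value; all tokens are folded into it
    n_tokens: int = 0

    def step(self, token: int) -> None:
        # The state keeps no per-token entries, so there is nothing to truncate.
        self.h = (self.h * 31 + token) % 1_000_003
        self.n_tokens += 1

    def save(self):
        # Checkpoint: a full copy of the state taken before speculating.
        return (self.h, self.n_tokens)

    def restore(self, ckpt) -> None:
        self.h, self.n_tokens = ckpt


def speculate(state, draft, n_accepted):
    """Process all draft tokens, then roll back so only the accepted prefix remains."""
    ckpt = state.save()
    for tok in draft:
        state.step(tok)
    if n_accepted < len(draft):
        # Partial removal is impossible, so restore and replay the accepted tokens.
        state.restore(ckpt)
        for tok in draft[:n_accepted]:
            state.step(tok)
    return state


# Reference state that only ever saw the accepted tokens.
ref = ToyRecurrentState()
for tok in [5, 7, 11, 13]:
    ref.step(tok)

# Speculative path: two committed tokens, four drafted, two accepted.
spec = ToyRecurrentState()
for tok in [5, 7]:
    spec.step(tok)
speculate(spec, draft=[11, 13, 17, 19], n_accepted=2)
```

After the rollback, `spec` is indistinguishable from `ref`, which is the property speculative decoding needs from the recurrent part of a hybrid model.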

What changed

- 73 upstream commits merged (ggml-org/llama.cpp master as of 2026-04-21)
- Key upstream PRs included:
  - ggml-org/llama.cpp#19493: server: speculative checkpointing (save/restore recurrent state)
  - ggml-org/llama.cpp#22114: server: refactor "use checkpoint" logic
  - ggml-org/llama.cpp#22223: arg: add the --spec-default argument
  - ggml-org/llama.cpp#22168: ngram-mod: reset i_last when a low acceptance streak occurs
- One conflict in ggml/src/ggml-cuda/vendors/hip.h (HIP shfl macros), resolved by keeping the fork's variadic overloads

No regressions (M2 Pro, 32 GB)

Qwen3.6-35B-A3B-UD-IQ4_NL, -ctk q8_0 -ctv turbo4 -fa on -ngl 99:

| Test | Pre-merge | Post-merge | Delta |
|---|---|---|---|
| pp512 (prefill) | 548.20 ± 2.84 | 550.96 ± 0.50 | +0.5% |
| pp2048 (prefill) | 524.64 ± 0.64 | 524.39 ± 2.30 | −0.05% |
| tg128 (decode) | 31.51 ± 0.07 | 31.61 ± 0.05 | +0.3% |
| PPL (wikitext-2) | 6.3498 | 6.3498 | 0.0% |

All deltas within noise. PPL bit-identical.
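
As a rough cross-check of the "within noise" claim, the following Python sketch recomputes the relative deltas from the table and expresses each shift in units of the combined spread. Two things are assumptions here, not statements from the PR: that the ± values are standard deviations, and that pre- and post-merge runs are independent so their spreads can be combined in quadrature.

```python
import math

# (mean, spread) pairs copied from the table above; PPL has no reported spread.
runs = {
    "pp512":  ((548.20, 2.84), (550.96, 0.50)),
    "pp2048": ((524.64, 0.64), (524.39, 2.30)),
    "tg128":  ((31.51, 0.07), (31.61, 0.05)),
}


def relative_delta_pct(pre, post):
    """Percentage change from the pre-merge mean to the post-merge mean."""
    return (post - pre) / pre * 100.0


def delta_in_sigmas(pre, pre_sd, post, post_sd):
    """Absolute shift divided by the combined spread (assumes independent runs)."""
    return abs(post - pre) / math.hypot(pre_sd, post_sd)


deltas = {}
sigmas = {}
for name, ((pre, pre_sd), (post, post_sd)) in runs.items():
    deltas[name] = relative_delta_pct(pre, post)
    sigmas[name] = delta_in_sigmas(pre, pre_sd, post, post_sd)
    print(f"{name}: {deltas[name]:+.2f}% ({sigmas[name]:.2f} combined spreads)")
```

Under those assumptions all three shifts come out below 1.2 combined standard deviations, which is consistent with reading them as run-to-run variation.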


…gml-org#21638)

* [SYCL] Fix Q8_0 reorder: add missing dequantize path for GEMM

The Q8_0 reorder optimization (ggml-org#21527) was missing a reorder-aware dequantizer for the GEMM code path used during prompt processing. After token generation reordered Q8_0 weights (via DMMV/MMVQ), the next prompt processing pass would read them with the standard dequantizer, producing garbage output.

Add dequantize_block_q8_0_reorder() and wire it into both ggml_get_to_fp16_sycl() and ggml_get_to_fp32_sycl(), matching the pattern already used by Q4_0, Q4_K, and Q6_K.

Fixes ggml-org#21589

AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware.

* SYCL: fix reorder crash when device memory is full

The reorder optimization allocates a temporary buffer the full size of the weight tensor on the device. When VRAM is nearly full (large models on a single GPU), this allocation fails and the subsequent memcpy crashes on a NULL pointer.

Fix: try device allocation first, fall back to host memory if device memory is full. The reorder kernel still works correctly reading from host memory over PCIe. This is slower for the one-time reorder (~21 t/s vs ~38 t/s on Intel Arc Pro B70), but the optimization is preserved for all subsequent inference. If both device and host allocation fail, skip the reorder and fall back to the unoptimized kernel path.

Also fixes a bug where opt_for_reorder() marked tensors as reordered even when the reorder was skipped due to allocation failure. This caused DMMV/MMVQ kernels to read the original AoS data as if it were SoA, producing garbage output or NaN results.

Tested on Intel Arc Pro B70 (32GB) with Q8_0, Q4_K_M models. Coding was AI-assisted (Claude), reviewed and tested on hardware by a human.

Fixes ggml-org#20478

* SYCL: add RAII temp buffer class + macro guard for host fallback

Replace sycl_ext_malloc_with_fallback/sycl_ext_free_fallback free functions with sycl_reorder_temp_buffer RAII class. The host_fallback bool is now a private member, and cleanup happens automatically at scope exit.

Add GGML_SYCL_HOST_MEM_FALLBACK cmake option (default ON) to guard the host memory fallback code path. Device access to host memory requires Linux kernel 6.8+ (Ubuntu 26.04+); users on older kernels can set -DGGML_SYCL_HOST_MEM_FALLBACK=OFF to disable it.

Addresses arthw's review on PR ggml-org#21638.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: document GGML_SYCL_HOST_MEM_FALLBACK build option in SYCL.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* SYCL: add reorder-aware DMMV dequantizers for Q4_K and Q6_K

Q4_K and Q6_K had reorder support for MMVQ and GEMM paths but not DMMV. When the DMMV path encountered reordered data it would abort. Add DMMV kernels that read from the SOA reorder layout for both types. Same math as the non-reorder versions, different memory access pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…gml-org#21873) * Update register tiling matmul to use f32 accumulation * fix profiling code * Fix register tiling matmul for chrome, i'm blaming dawn * Update batch tuning value for iOS * compile fix * Fix use of new load function * Move to a single query set for GPU profiling * Move to batching compute passes when not profiling * Refactor build_multi * remove iOS throttling now that we're batching compute passes


…gml-org#20633) * ggml-cpu: add 128-bit impls for i-quants, ternary quants * ggml-cpu: add 128-bit impls for iq2_xs, iq3_s, iq3_xxs, tq2_0 Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: refactor; add rvv checks --------- Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai> Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>


* ggml: add graph_reused * use versioning instead of reuse flag * increment version with atomic * use top bits for split numbering * add assert * move counter to ggml.c * set uid in split_graph only * fix windows * address further review comments * get next_uid rather than doing bit manipulation * rename + add comment about uid


…g#21962) (ggml-org#21980)

* server: tests: fetch random media marker via /apply-template (ggml-org#21962 fix)

* server: allow pinning media marker via LLAMA_MEDIA_MARKER env var

get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it as-is if set, falling back to the random marker otherwise.

Tests no longer need to fetch the marker dynamically via /apply-template: the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts work as before.

Address review feedback from ngxson

* server: make get_media_marker() thread-safe via magic statics

Use a C++11 static local with a lambda initializer instead of a global static with an empty-check. The runtime guarantees initialization exactly once without explicit locking.

Address review feedback from ggerganov

* nits

* nits
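
The lookup order described in this commit can be sketched in Python as follows. This is an illustration only, not the server's C++ code: the random marker format below is hypothetical (the source does not give it), and because Python has no equivalent of C++ magic statics, an explicit lock stands in for the once-only initialization guarantee.

```python
import os
import secrets
import threading

_marker_lock = threading.Lock()
_marker = None  # decided on the first call, then reused


def get_media_marker() -> str:
    """Return the media marker, choosing it exactly once per process."""
    global _marker
    with _marker_lock:
        if _marker is None:
            pinned = os.environ.get("LLAMA_MEDIA_MARKER")
            if pinned:
                # Pinned via the environment: use the value as-is.
                _marker = pinned
            else:
                # Hypothetical random format, assumed for this sketch only.
                _marker = f"<__media_{secrets.token_hex(4)}__>"
        return _marker
```

Because the value is decided on the first call, a test fixture has to set `LLAMA_MEDIA_MARKER` before anything calls `get_media_marker()`; changing the variable afterwards has no effect in this sketch.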


* optimize hmx_mat_mul functions by calculating row and column tiles upfront * refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability * wip * set scale outside of loop * wip * refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts * wip * wip * refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions * refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation * wip * refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization * refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking * refactor core_dot_chunk_fp16 to improve tile stride calculations for output * refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization * fix compiling error * wip * optimize row and column tile indexing in core_mma_chunk_fp16 function * wip * Revert "wip" This reverts commit cde679e. * Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions * wip


* cmake : allow libcommon to be shared * cmake : rename libcommon to libllama-common * cont : set -fPIC for httplib * cont : export all symbols * cont : fix build_info exports * libs : add libllama-common-base * log : add common_log_get_verbosity_thold()

* server: respect the ignore eos flag * ci: add android arm64 build and release * patch * pin android-setup actions to v4 * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * lf in the suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>


…ng (ggml-org#21052) * Update workflows to remove dependence on llvmpipe * Try setting Dawn_DIR * remove c++20 initializers * Move to proper guid * Try avoiding segfaults on vulkan backend process exit * Remove compiler warnings on parameter casting * Fix soft_max and update reg_tile accumulation to f32 for better precision * Refactor flash_attn a bit * remove c++20 initializers and format * Increase div precision for NVIDIA * revert div precision and comment out ggml-ci node for now * Formatting * Try debugging on a failing CI node * Revert "Try debugging on a failing CI node" This reverts commit 1971e33.


* feat: (vocab) fix stray text appended in llama_decode_text

Remove accidental concatenation of the full `text` string when formatting UNK_BYTE hex escapes. Only the closing "]" should be appended.

* feat(mtmd): add Yasa2 vision encoder support

Add a Yasa2 (ConvNeXtV2-based) vision encoder for reka-edge:
- Register PROJECTOR_TYPE_YASA2 and tensor name definitions
- Add yasa2_block/yasa2_stage model structs
- Implement graph builder with ConvNeXt stages, GRN, adaptive pooling
- Wire into clip.cpp switch statements and mtmd.cpp init_vision
- Use mtmd_image_preprocessor_fixed_size for image preprocessing

* feat(chat): add reka-edge template handler (tools, thinking)

- Add chat-reka.cpp/h implementing PEG-based parser for reka-edge format
- Add Reka-Edge.jinja chat template
- Detect reka-edge template in try_specialized_template()
- Add LLAMA_EXAMPLE_MTMD to chat-template-file arg

* feat: add reka vlm to gguf conversion script

Converts Reka Yasa2 hf checkpoints to GGUF format:
- Text decoder: Llama-arch with tiktoken/BPE vocab
- Mmproj (--mmproj): ConvNeXt vision backbone + language_projection
- Generates 2D sincos positional embeddings for vision encoder

* test: add Reka Edge chat template and parser tests

- test-chat-template: oracle tests comparing Jinja engine output vs common_chat_templates_apply for text, tools, thinking, images, video
- test-chat: PEG parser tests for Reka Edge format, round-trip tests for image/video content parts, common path integration tests

* scripts: add Reka Edge mixed quantization helper

Q4_0 base quantization with Q8_0 override for the last 8 transformer blocks (layers 24-31) via --tensor-type regex.

* fix: adapt chat-reka and tests to upstream API

- Use autoparser::generation_params (not templates_params)
- Add p.prefix(generation_prompt) to PEG parser
- Simplify reasoning parser to match LFM2 pattern
- Remove image/video oracle tests (unsupported by oaicompat parser; no other multimodal models test this path)

* fix: avoid duplicate tensor loading in yasa2 vision encoder

TN_YASA_PATCH_W and TN_PATCH_EMBD both resolve to "v.patch_embd.weight", causing the same tensor to be loaded twice into ctx_data and overflowing the memory pool. Reuse the tensors already loaded by the common section.

* chore: update image pre-processing settings

The reka-edge model depends on the following settings in an older fork of llama.cpp:
1. Fixed square resize
2. BICUBIC
3. add_padding=false

In current llama.cpp, this means setting:
- image_resize_algo = RESIZE_ALGO_BICUBIC
- image_resize_pad = false

* chore: remove reka gguf conversion script

* chore: remove reka quantization script

* chore: remove unnecessary changes from PR scope

This commit removes a couple of unnecessary changes for the PR scope:
1. BPE decoder bug fix - this affects reka edge because there's a bug in our tokenization that doesn't represent <think> tokens as special tokens. However this isn't meant to be a thinking model so when run with --reasoning off the edge case does not affect us
2. --chat-template-file support from llama-mtmd-cli - the focus is on llama-server and the reka edge gguf contains the necessary metadata to detect the chat template
3. reka edge oracle test cases - no other model has similar test cases, so I removed it for standardization

* chore: remove unnecessary ggml_cast

This commit removes unnecessary ggml_cast after updating the reka vlm -> gguf conversion script on hugging face.

* chore: remove redundant code

* chore: remove unnecessary ggml_cont calls

This commit removes all ggml_cont calls except the four that precede ggml_reshape_3d/ggml_reshape_4d. Those are necessary because ggml_reshape recomputes strides assuming contiguous layout and asserts ggml_is_contiguous. Other operations (ggml_mean, ggml_add, ggml_mul etc.) use stride-based indexing and handle non-contiguous inputs correctly and so we are ok to remove ggml_cont for those.

* chore: remove unnecessary ggml_repeat calls

This commit removes unnecessary ggml_repeat calls because the underlying ops already broadcast automatically. Every ggml_repeat in yasa2.cpp was expanding a smaller tensor to match a larger one's shape before passing both to an elementwise op (ggml_add, ggml_sub, ggml_mul, or ggml_div). This is unnecessary because all four of these ops already support broadcasting internally.

* chore: restore ggml_cont needed for cpu operations

* refactor: locate reka chat template handler in chat.cpp

* chore: remove unnecessary warmup tokens

* chore: add code comments on image_resize_pad

* chore: remove custom reka parsing code

* chore: revert common/chat.cpp

* Uncomment debug logging for PEG input parsing

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>


chad-loder · 2026-04-30T03:37:09Z

Closing as superseded — PR #101 (upstream sync to b8871) brought in the speculative checkpointing commits from ggml-org#19493 along with the related spec refactoring (ggml-org#22114, ggml-org#22168, ggml-org#22223, ggml-org#22227). Everything this PR aimed to deliver is now in feature/turboquant-kv-cache.

Thanks for the great work on the upstream sync!

Xuan-Son Nguyen and others added 30 commits April 15, 2026 23:52

server: use random media marker (ggml-org#21962)

408225b

* server: use random media marker * nits * remove legacy <__image__> token * revert special char in random

ci : Use ggml-org/ccache-action on RISC-V as well (ggml-org#21632)

8612ed1

devops : added spirv-headers to nix (ggml-org#21965)

90fb96a

ggml : implemented simd_gemm kernel for riscv vector extension (ggml-…

5637536

…org#20627) Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

metal: Implement ROLL op (ggml-org#21946)

ae2d348

* nix: support unified apple-sdk * Impl roll op for Metal * Revert "nix: support unified apple-sdk" This reverts commit abfa473. * update ops.md * update op docs

Convert: Fix NemotronH Config Parsing (ggml-org#21664)

03b3d07

* fix NemotronH vocab loading by using trust_remote_code for unsupported config patterns * fix NemotronH tokenizer loading by overriding set_vocab with trust_remote_code

codeowners: add team member comments (ggml-org#21714)

b572d1e

model : support NVFP4 tensors for Gemma4 (ggml-org#21971)

f772f6e

* support nvfp4 tensors for Gemma4 * add wo_s to build_attn * add wo_s to build_attn * fix glm4

model : refactor QKV into common build_qkv and create_tensor_qkv help…

9db77a0

…ers (ggml-org#21245) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s

opencl: add q5_K gemm and gemv kernels for Adreno (ggml-org#21595)

e45dbde

model: using single llm_build per arch (ggml-org#21970)

4fbdabd

* model: using single llm_build per arch * fix merge * nits

cmake: use glob to collect src/models sources (ggml-org#22005)

089dd41

cli : use get_media_marker (ggml-org#22017)

30dce2c

opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for A…

5e6c0e1

…dreno (ggml-org#21938) * opencl: refactor q8_0 gemm/gemv Adreno dispatch * opencl: refactor q8_0 set_tensor * opencl: fix whitespace

model : Gemma4 model type detection (ggml-org#22027)

fcc7508

* model : Gemma4 model type detection * model : Gemma4 model type detection

mtmd: add missing struct tag (ggml-org#22023)

268d61e

CUDA: use LRU based eviction for cuda graphs (ggml-org#21611)

b94050e

* CUDA: use a ring-buffer for cuda graphs * bump limit to 128 * use LRU eviction * better naming * do periodic clean-up
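
The eviction policy named in this commit can be illustrated with a small Python sketch. It is a generic LRU cache, not the CUDA backend's implementation: `GraphCache`, its keys, and its values are hypothetical stand-ins for captured CUDA graphs, the default of 128 simply mirrors the "bump limit to 128" note, and the periodic clean-up step is not modeled.

```python
from collections import OrderedDict


class GraphCache:
    """Toy LRU cache; entries stand in for captured graphs (hypothetical)."""

    def __init__(self, limit: int = 128):
        self.limit = limit
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        """Return the cached entry and mark it as most recently used."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, graph) -> None:
        """Insert or refresh an entry, evicting the least recently used on overflow."""
        self._items[key] = graph
        self._items.move_to_end(key)
        while len(self._items) > self.limit:
            self._items.popitem(last=False)  # drop the least recently used entry

    def __len__(self) -> int:
        return len(self._items)
```

Compared with a plain ring buffer, which overwrites entries in insertion order, LRU keeps an entry alive as long as it keeps being reused, so frequently replayed graphs are not evicted just because they were captured early.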

llama: fit ctx size for CPU only (ggml-org#21568)

fd1c0ec

convert : fix (ignore for now) typings errors (ggml-org#22002)

89a5474

ci : free disk space for rocm release (ggml-org#22012)

83d58e0

ggml-backend-meta: add multi-segment read support in get_tensor (ggml…

59accc8

…-org#22063)

ggerganov and others added 7 commits April 21, 2026 19:52

arg : add --spec-default (ggml-org#22223)

84652b8

spec : reset i_last when low acceptance streak occurs (ggml-org#22168)

72d693e

By resetting i_last to zero, we will include the current context when rebuilding the speculative map.

hexagon: fix missing v79 entry in libggml-htp.inf (ggml-org#22194)

2248799

Hexagon: DAIG op (ggml-org#22195)

5a4cd67

* hexagon: Add DIAG op * hexagon: add HVX support and DMA double buffering * hexagon: fix fatal error * hexagon: remove as many pragma(s) as possible

server: allow cancel loading model (ggml-org#21814)

04fe84b

Merge remote-tracking branch 'upstream/master' into experiment/upstre…

f903fe2

…am-spec-checkpoint # Conflicts: # ggml/src/ggml-cuda/vendors/hip.h

github-actions bot added labels on Apr 21, 2026: documentation, Nvidia GPU, ggml, examples, server, Apple Metal, Vulkan, testing, devops, python, script, model, OpenCL, SYCL, build, nix, Hexagon, WebGPU, OpenVINO, android

chad-loder closed this Apr 30, 2026

chad-loder deleted the experiment/upstream-spec-checkpoint branch April 30, 2026 03:39

chad-loder wants to merge 73 commits into TheTom:feature/turboquant-kv-cache from chad-loder:experiment/upstream-spec-checkpoint
