UPSTREAM PR #22101: mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) #1359

Open

loci-dev wants to merge 1 commit into main from loci/pr-22101-add-granite-speech-support

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#22101

Overview

Adds support for ibm-granite/granite-4.0-1b-speech.

  • Conformer encoder + QFormer projector (graph builder in granite-speech.cpp)
  • Audio preprocessor: log-mel spectrogram, dynamic range compression, frame stacking
  • GGUF converter: batch norm folding, K/V split, Conv1d reshape
  • Follows existing conformer/whisper patterns

Tested with greedy decoding on 30s/60s/120s/180s/360s clips; the output is a token-for-token match against HF transformers (following the script on the model card) for 30s and 60s. Running the longer clips through HF was too heavy for my setup, but at 120s/180s there is noticeable degradation and at 360s the output loops completely.

Test command:

```
ffmpeg -i input.wav -t 30 -ar 16000 -ac 1 test.wav

python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16
python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16 --mmproj

./build/bin/llama-mtmd-cli -m models/granite-4.0-1b-speech/granite-4.0-1B-speech-F16.gguf --mmproj models/granite-4.0-1b-speech/mmproj-granite-4.0-1b-speech-F16.gguf --audio test.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0 -c 4096
```

Notes: --jinja is required and the prompt "can you transcribe the speech into a written format?" is taken from the model card.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. Used as a search tool to find relevant implementations in similar models and to find similar models unknown to me; for example, I'd ask Codex to find where and how the conformer calls the log-mel spectrogram. I copy-pasted as much as possible from the existing architectures where it made sense. I also used Codex to find the different integration points across the multimodal part of the codebase; I'm familiar with the text-only part, but not as much with the multimodal part and how it's structured.

Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.

Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.
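For illustration, a minimal sketch of the windowed-compression arithmetic (assumed from the window=15, queries=3 description; names and structure are illustrative, not the actual granite-speech.cpp code): each window of encoder frames is attended to by a fixed set of learned queries, so the sequence length seen by the LLM shrinks by roughly 5x.

```cpp
// Illustrative sketch: how many LLM-space embeddings the QFormer produces
// for a given number of encoder frames, assuming a window of 15 frames and
// 3 learned query vectors per window.
#include <cstdio>

constexpr int kWindowSize = 15; // encoder frames per cross-attention window
constexpr int kNumQueries = 3;  // learned queries (output embeddings) per window

int qformer_output_tokens(int n_encoder_frames) {
    int n_windows = (n_encoder_frames + kWindowSize - 1) / kWindowSize; // ceil
    return n_windows * kNumQueries;                                     // ~5x compression
}

int main() {
    int frame_counts[] = {150, 1500, 3000};
    for (int frames : frame_counts) {
        printf("%4d encoder frames -> %4d projector tokens\n",
               frames, qformer_output_tokens(frames));
    }
}
```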

Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).
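As a rough sketch of the shape changes in this preprocessing step (not the actual mtmd implementation; the log-compression floor is an assumption modeled on Whisper-style pipelines): the 80-bin mel frames are log-compressed and then concatenated in pairs into 160-dim features, halving the sequence length fed to the encoder.

```cpp
// Illustrative sketch: dynamic range compression followed by 2x frame stacking.
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// mel: [n_frames][80] energies. Returns [n_frames/2][160] stacked features.
std::vector<std::vector<float>> preprocess(std::vector<std::vector<float>> mel) {
    // 1. Dynamic range compression: log10 with a small floor so silence
    //    does not produce -inf (floor value is an assumption).
    for (auto & frame : mel)
        for (float & v : frame)
            v = std::log10(std::max(v, 1e-10f));

    // 2. 2x frame stacking: concatenate frames (2i, 2i+1) into one 160-dim
    //    frame, halving the sequence length seen by the conformer encoder.
    std::vector<std::vector<float>> stacked;
    for (size_t i = 0; i + 1 < mel.size(); i += 2) {
        std::vector<float> f = mel[i];
        f.insert(f.end(), mel[i + 1].begin(), mel[i + 1].end()); // 80 + 80 = 160
        stacked.push_back(std::move(f));
    }
    return stacked;
}
```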

GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.
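A minimal sketch of the batch-norm folding idea (generic formula, not the converter's actual code): a BatchNorm that follows a linear/conv layer can be absorbed into that layer's weight and bias at export time, so no BN op is needed at inference.

```cpp
// Illustrative sketch: fold BatchNorm statistics into the preceding layer.
#include <cmath>
#include <vector>

struct BatchNorm {
    std::vector<float> gamma, beta, mean, var;
    float eps = 1e-5f;
};

// weight: [n_out][n_in] row-major, bias: [n_out]. Folds bn in place:
//   w' = w * gamma / sqrt(var + eps)
//   b' = (b - mean) * gamma / sqrt(var + eps) + beta
void fold_batch_norm(std::vector<std::vector<float>> & weight,
                     std::vector<float> & bias, const BatchNorm & bn) {
    for (size_t o = 0; o < weight.size(); ++o) {
        float scale = bn.gamma[o] / std::sqrt(bn.var[o] + bn.eps);
        for (float & w : weight[o]) w *= scale;
        bias[o] = (bias[o] - bn.mean[o]) * scale + bn.beta[o];
    }
}
```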

Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.
@loci-review

loci-review Bot commented Apr 19, 2026

Overview

Performance Impact: Minor regression isolated to multi-modal library initialization (+0.651% power consumption in libmtmd.so). Core inference libraries unaffected.

Function Counts: 46,989 total (52 modified, 118 new, 0 removed, 46,819 unchanged)

Power Consumption by Binary:

  • build.bin.libmtmd.so: 207,850 → 209,202 nJ (+0.651%)
  • build.bin.libllama.so: 264,506 nJ (0% change)
  • build.bin.libggml-cpu.so: 179,881 nJ (0% change)
  • build.bin.libggml-base.so: 87,465 nJ (0% change)
  • All other binaries (llama-bench, llama-cvector-generator, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-tts, libggml.so): 0% change

Commit: 7b313dc - "mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)"

Function Analysis

clip_model_loader::load_tensors (clip.cpp:1563):

  • Response: 1,042,228 → 1,515,365 ns (+473,137 ns, +45.4%)
  • Throughput: 5,798 → 5,871 ns (+73 ns, +1.3%)
  • Loads 94+ new tensors for Granite Speech (input/output projections, attention, FFN, convolution layers). Calls std::vector::resize twice instead of once, causing doubled memory reallocation overhead.

clip_model::clip_model (clip-model.h:305):

  • Response: 2,158 → 2,301 ns (+143 ns, +6.6%)
  • Throughput: 727 → 783 ns (+55 ns, +7.6%)
  • Initializes new std::vector<granite_speech_proj_layer> (+88 ns) and 16 tensor pointers for audio processing.

clip_model::~clip_model (clip-model.h:305):

  • Response: 2,980 → 3,204 ns (+225 ns, +7.6%)
  • Throughput: 49 → 55 ns (+6 ns, +12.2%)
  • Cleans up new vector container (+239 ns destructor overhead).

mtmd_context::init_audio (mtmd.cpp:446):

  • Response: 3,229 → 3,300 ns (+71 ns, +2.2%)
  • Throughput: 516 → 594 ns (+78 ns, +15.1%)
  • Adds PROJECTOR_TYPE_GRANITE_SPEECH case with bitwise-masking logic, increasing switch statement overhead.

Compiler Optimizations (positive):

  • std::__make_move_if_noexcept_iterator: -66% response time (-185 ns)
  • std::vector<clip_layer>::operator[]: -9.3% (-1.43 ns per access)
  • std::__new_allocator<clip_layer>::deallocate: -24.7% (-12 ns)

Other analyzed functions showed negligible changes from standard library operations and compiler optimizations.

Flame Graph Comparison

Function: clip_model_loader::load_tensors (largest regression, +45.4% response time)

Base version:
Flame Graph

Target version:
Flame Graph

The flame graphs show the target version performs two std::vector::resize operations (3,844 ns + 3,981 ns) versus one in the base version (3,668 ns), explaining the 45% regression. Self-time remains stable (+1.3%), confirming the overhead is from doubled memory reallocation during Granite Speech layer loading.

Additional Findings

Critical Path Assessment: All regressions occur in model loading and initialization (one-time costs). Core inference paths (matrix operations, attention, KV cache, sampling) show zero change, confirmed by unchanged power consumption in libllama.so and GGML backends.

Optimization Opportunity: Pre-calculate total layer count in load_tensors to perform a single resize operation, eliminating ~4,000 ns overhead.
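A minimal sketch of that pattern (names are hypothetical, not the actual clip.cpp symbols): compute the final layer count up front and resize the vector once, instead of growing it separately for the base layers and the new Granite Speech layers.

```cpp
// Illustrative sketch of the suggested fix: one resize covering all layers.
#include <vector>

struct clip_layer_stub { /* tensor pointers ... */ };

void load_layers(std::vector<clip_layer_stub> & layers,
                 int n_base_layers, int n_granite_speech_layers) {
    // Before: layers.resize(n_base_layers); ... later ...
    //         layers.resize(n_base_layers + n_granite_speech_layers);
    // After: a single allocation covering both groups of layers.
    layers.resize(n_base_layers + n_granite_speech_layers);
    // ... populate tensors for all layers ...
}
```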

💬 Questions? Tag @loci-dev
