UPSTREAM PR #22101: mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech) #1359

Open

loci-dev wants to merge 1 commit into main from loci/pr-22101-add-granite-speech-support

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#22101

Overview

Adds support for ibm-granite/granite-4.0-1b-speech.

  • Conformer encoder + QFormer projector (graph builder in granite-speech.cpp)
  • Audio preprocessor: log-mel spectrogram, dynamic range compression, frame stacking
  • GGUF converter: batch norm folding, K/V split, Conv1d reshape
  • Follows existing conformer/whisper patterns

Tested with greedy decoding on 30s/60s/120s/180s/360s clips; the output is a token-for-token match against HF transformers (following the script on the model card) for 30s and 60s. Running the longer clips through HF was too heavy for my setup, but at 120s/180s there is noticeable degradation and at 360s the output loops completely.

Test command:

```
ffmpeg -i input.wav -t 30 -ar 16000 -ac 1 test.wav

python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16
python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16 --mmproj

./build/bin/llama-mtmd-cli -m models/granite-4.0-1b-speech/granite-4.0-1B-speech-F16.gguf --mmproj models/granite-4.0-1b-speech/mmproj-granite-4.0-1b-speech-F16.gguf --audio test.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0 -c 4096
```

Notes: --jinja is required and the prompt "can you transcribe the speech into a written format?" is taken from the model card.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. Used as a search tool to find relevant implementations in similar models and to find similar models unknown to me; for example, I'd ask Codex to find where and how the conformer calls the log-mel spectrogram. I copy-pasted as much as possible from the existing architectures where it made sense. I also used Codex to find the different integration points across the multimodal part of the codebase; I'm familiar with the text-only part, but not as much with the multimodal part and how it's structured.

Conformer encoder with Shaw relative position encoding,
QFormer projector, log-mel spectrogram with frame stacking.

Encoder uses GLU gating, folded batch norm, and SSM depthwise
conv. QFormer compresses encoder output via windowed
cross-attention (window=15, queries=3) into the LLM embedding
space.
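For illustration, a minimal sketch of the windowed-compression arithmetic (assumed from the window=15, queries=3 description; names and structure are illustrative, not the actual granite-speech.cpp code): each window of encoder frames is attended to by a fixed set of learned queries, so the sequence length seen by the LLM shrinks by roughly 5x.

```cpp
// Illustrative sketch: how many LLM-space embeddings the QFormer produces
// for a given number of encoder frames, assuming a window of 15 frames and
// 3 learned query vectors per window.
#include <cstdio>

constexpr int kWindowSize = 15; // encoder frames per cross-attention window
constexpr int kNumQueries = 3;  // learned queries (output embeddings) per window

int qformer_output_tokens(int n_encoder_frames) {
    int n_windows = (n_encoder_frames + kWindowSize - 1) / kWindowSize; // ceil
    return n_windows * kNumQueries;                                     // ~5x compression
}

int main() {
    int frame_counts[] = {150, 1500, 3000};
    for (int frames : frame_counts) {
        printf("%4d encoder frames -> %4d projector tokens\n",
               frames, qformer_output_tokens(frames));
    }
}
```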

Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank,
dynamic range compression, 2x frame stacking (80->160 mel).
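As a rough sketch of the shape changes in this preprocessing step (not the actual mtmd implementation; the log-compression floor is an assumption modeled on Whisper-style pipelines): the 80-bin mel frames are log-compressed and then concatenated in pairs into 160-dim features, halving the sequence length fed to the encoder.

```cpp
// Illustrative sketch: dynamic range compression followed by 2x frame stacking.
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// mel: [n_frames][80] energies. Returns [n_frames/2][160] stacked features.
std::vector<std::vector<float>> preprocess(std::vector<std::vector<float>> mel) {
    // 1. Dynamic range compression: log10 with a small floor so silence
    //    does not produce -inf (floor value is an assumption).
    for (auto & frame : mel)
        for (float & v : frame)
            v = std::log10(std::max(v, 1e-10f));

    // 2. 2x frame stacking: concatenate frames (2i, 2i+1) into one 160-dim
    //    frame, halving the sequence length seen by the conformer encoder.
    std::vector<std::vector<float>> stacked;
    for (size_t i = 0; i + 1 < mel.size(); i += 2) {
        std::vector<float> f = mel[i];
        f.insert(f.end(), mel[i + 1].begin(), mel[i + 1].end()); // 80 + 80 = 160
        stacked.push_back(std::move(f));
    }
    return stacked;
}
```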

GGUF converter handles batch norm folding at export time,
fused K/V split, and Conv1d weight reshaping.
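A minimal sketch of the batch-norm folding idea (generic formula, not the converter's actual code): a BatchNorm that follows a linear/conv layer can be absorbed into that layer's weight and bias at export time, so no BN op is needed at inference.

```cpp
// Illustrative sketch: fold BatchNorm statistics into the preceding layer.
#include <cmath>
#include <vector>

struct BatchNorm {
    std::vector<float> gamma, beta, mean, var;
    float eps = 1e-5f;
};

// weight: [n_out][n_in] row-major, bias: [n_out]. Folds bn in place:
//   w' = w * gamma / sqrt(var + eps)
//   b' = (b - mean) * gamma / sqrt(var + eps) + beta
void fold_batch_norm(std::vector<std::vector<float>> & weight,
                     std::vector<float> & bias, const BatchNorm & bn) {
    for (size_t o = 0; o < weight.size(); ++o) {
        float scale = bn.gamma[o] / std::sqrt(bn.var[o] + bn.eps);
        for (float & w : weight[o]) w *= scale;
        bias[o] = (bias[o] - bn.mean[o]) * scale + bn.beta[o];
    }
}
```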

Tested against HF transformers reference: token-for-token match
on 30s/60s audio clips with greedy decoding.
@loci-review

loci-review Bot commented Apr 19, 2026

Overview

Performance Impact: Minor regression isolated to multi-modal library initialization (+0.651% power consumption in libmtmd.so). Core inference libraries unaffected.

Function Counts: 46,989 total (52 modified, 118 new, 0 removed, 46,819 unchanged)

Power Consumption by Binary:

  • build.bin.libmtmd.so: 207,850 → 209,202 nJ (+0.651%)
  • build.bin.libllama.so: 264,506 nJ (0% change)
  • build.bin.libggml-cpu.so: 179,881 nJ (0% change)
  • build.bin.libggml-base.so: 87,465 nJ (0% change)
  • All other binaries (llama-bench, llama-cvector-generator, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-tts, libggml.so): 0% change

Commit: 7b313dc - "mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)"

Function Analysis

clip_model_loader::load_tensors (clip.cpp:1563):

  • Response: 1,042,228 → 1,515,365 ns (+473,137 ns, +45.4%)
  • Throughput: 5,798 → 5,871 ns (+73 ns, +1.3%)
  • Loads 94+ new tensors for Granite Speech (input/output projections, attention, FFN, convolution layers). Calls std::vector::resize twice instead of once, causing doubled memory reallocation overhead.

clip_model::clip_model (clip-model.h:305):

  • Response: 2,158 → 2,301 ns (+143 ns, +6.6%)
  • Throughput: 727 → 783 ns (+55 ns, +7.6%)
  • Initializes new std::vector<granite_speech_proj_layer> (+88 ns) and 16 tensor pointers for audio processing.

clip_model::~clip_model (clip-model.h:305):

  • Response: 2,980 → 3,204 ns (+225 ns, +7.6%)
  • Throughput: 49 → 55 ns (+6 ns, +12.2%)
  • Cleans up new vector container (+239 ns destructor overhead).

mtmd_context::init_audio (mtmd.cpp:446):

  • Response: 3,229 → 3,300 ns (+71 ns, +2.2%)
  • Throughput: 516 → 594 ns (+78 ns, +15.1%)
  • Adds PROJECTOR_TYPE_GRANITE_SPEECH case with bitwise-masking logic, increasing switch statement overhead.

Compiler Optimizations (positive):

  • std::__make_move_if_noexcept_iterator: -66% response time (-185 ns)
  • std::vector<clip_layer>::operator[]: -9.3% (-1.43 ns per access)
  • std::__new_allocator<clip_layer>::deallocate: -24.7% (-12 ns)

Other analyzed functions showed negligible changes from standard library operations and compiler optimizations.

Flame Graph Comparison

Function: clip_model_loader::load_tensors (largest regression, +45.4% response time)

Base version:
Flame Graph

Target version:
Flame Graph

The flame graphs show the target version performs two std::vector::resize operations (3,844 ns + 3,981 ns) versus one in the base version (3,668 ns), explaining the 45% regression. Self-time remains stable (+1.3%), confirming the overhead is from doubled memory reallocation during Granite Speech layer loading.

Additional Findings

Critical Path Assessment: All regressions occur in model loading and initialization (one-time costs). Core inference paths (matrix operations, attention, KV cache, sampling) show zero change, confirmed by unchanged power consumption in libllama.so and GGML backends.

Optimization Opportunity: Pre-calculate total layer count in load_tensors to perform a single resize operation, eliminating ~4,000 ns overhead.
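A minimal sketch of that pattern (names are hypothetical, not the actual clip.cpp symbols): compute the final layer count up front and resize the vector once, instead of growing it separately for the base layers and the new Granite Speech layers.

```cpp
// Illustrative sketch of the suggested fix: one resize covering all layers.
#include <vector>

struct clip_layer_stub { /* tensor pointers ... */ };

void load_layers(std::vector<clip_layer_stub> & layers,
                 int n_base_layers, int n_granite_speech_layers) {
    // Before: layers.resize(n_base_layers); ... later ...
    //         layers.resize(n_base_layers + n_granite_speech_layers);
    // After: a single allocation covering both groups of layers.
    layers.resize(n_base_layers + n_granite_speech_layers);
    // ... populate tensors for all layers ...
}
```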

💬 Questions? Tag @loci-dev
