UPSTREAM PR #22101: mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)#1359
- Conformer encoder with Shaw relative position encoding; uses GLU gating, folded batch norm, and SSM depthwise conv.
- QFormer projector: compresses encoder output via windowed cross-attention (window=15, queries=3) into the LLM embedding space.
- Audio preprocessing: reflect-padded STFT, 80-bin mel filterbank, dynamic range compression, 2x frame stacking (80 -> 160 mel features).
- Log-mel spectrogram with frame stacking feeds the encoder.
- GGUF converter handles batch norm folding at export time, fused K/V split, and Conv1d weight reshaping.
- Tested against the HF transformers reference: token-for-token match on 30s/60s audio clips with greedy decoding.
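The preprocessing and compression arithmetic above can be sketched with a minimal NumPy illustration (not the actual implementation; the function name, hop length, and shapes are assumptions for the example):

```python
import numpy as np

def stack_frames(mel: np.ndarray, stack: int = 2) -> np.ndarray:
    """Stack consecutive mel frames: (T, 80) -> (T // stack, 80 * stack)."""
    T, n_mel = mel.shape
    T = T - (T % stack)  # drop any trailing partial group
    return mel[:T].reshape(T // stack, n_mel * stack)

# Assuming a 10 ms hop, 30 s of 16 kHz audio gives 3000 mel frames of 80 bins.
mel = np.random.randn(3000, 80).astype(np.float32)

stacked = stack_frames(mel)  # 2x frame stacking: (1500, 160)
print(stacked.shape)         # (1500, 160)

# The QFormer then compresses each window of 15 encoder frames into
# 3 learned queries, a further 5x reduction in sequence length.
window, n_queries = 15, 3
n_windows = stacked.shape[0] // window
n_audio_tokens = n_windows * n_queries
print(n_audio_tokens)        # 300 audio embeddings handed to the LLM
```

The shape arithmetic shows why long clips still produce a manageable number of LLM tokens: stacking and windowed cross-attention together shrink the sequence by 10x.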
Overview
Performance Impact: Minor regression isolated to multi-modal library initialization (+0.651% power consumption).
Function Counts: 46,989 total (52 modified, 118 new, 0 removed, 46,819 unchanged).
Commit: 7b313dc - "mtmd: add granite-speech support (ibm-granite/granite-4.0-1b-speech)"

Function Analysis
Other analyzed functions showed negligible changes from standard library operations and compiler optimizations.

Additional Findings
Critical Path Assessment: All regressions occur in model loading and initialization (one-time costs). Core inference paths (matrix operations, attention, KV cache, sampling) show zero change, confirmed by unchanged power consumption.
Optimization Opportunity: Pre-calculate the total layer count.

💬 Questions? Tag @loci-dev


Note
Source pull request: ggml-org/llama.cpp#22101
Overview
Adds support for ibm-granite/granite-4.0-1b-speech.
Implementation in granite-speech.cpp.
Tested with greedy decoding on 30s/60s/120s/180s/360s clips; token-for-token match against HF transformers (following the script on the model card) for 30s and 60s. Running longer clips through HF was too heavy for my setup, but at 120s/180s there is noticeable degradation, and at 360s the output loops completely.
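The token-for-token check described above can be automated with a small comparison helper (a hypothetical sketch; in practice the two token lists would come from llama-mtmd-cli and the HF reference script):

```python
def first_divergence(ref_tokens, test_tokens):
    """Return the index of the first mismatching token, or -1 if the
    shorter sequence is a prefix of the longer one."""
    for i, (a, b) in enumerate(zip(ref_tokens, test_tokens)):
        if a != b:
            return i
    return -1

# Identical greedy decodes match token-for-token.
print(first_divergence([1, 5, 9, 2], [1, 5, 9, 2]))  # -1
# A degraded long-clip transcript diverges partway through.
print(first_divergence([1, 5, 9, 2], [1, 5, 7, 2]))  # 2
```

Reporting the first divergence index (rather than a bare pass/fail) makes it easy to see whether long-clip degradation starts early or only near the end of the transcript.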
Test command:
```shell
ffmpeg -i input.wav -t 30 -ar 16000 -ac 1 test.wav
python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16
python convert_hf_to_gguf.py models/granite-4.0-1b-speech --outtype f16 --mmproj
./build/bin/llama-mtmd-cli -m models/granite-4.0-1b-speech/granite-4.0-1B-speech-F16.gguf --mmproj models/granite-4.0-1b-speech/mmproj-granite-4.0-1b-speech-F16.gguf --audio test.wav -p "can you transcribe the speech into a written format?" --jinja --temp 0 -c 4096
```
Notes:
--jinja is required, and the prompt "can you transcribe the speech into a written format?" is taken from the model card.
Requirements