cactus_complete with Gemma 4 E4B + audio input throws unordered_map::at: key not found #584

@imitation-alpha

Description

Summary

Calling cactus_complete with the google/gemma-4-E4B-it model and an audio input (either via pcm_data or an "audio": ["path"] field on a user message) consistently fails with:

[WARN] [npu] [gemma4] model.mlpackage not found; using CPU prefill
[WARN] [npu] [gemma4-audio] unsupported NPU input shape; falling back to CPU audio encoder
[ERROR] [complete] Exception: unordered_map::at: key not found

Text-only cactus_complete on the same loaded handle works fine. The error is reproducible across two API shapes and multiple audio durations (1.64 s, 4.54 s, 30 s padded), all int16 mono 16 kHz PCM.
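The format claim above (int16 mono 16 kHz PCM) can be checked with the stdlib `wave` module. This is a standalone sketch; the in-memory synthetic WAV stands in for the real `/tmp/q.wav`:

```python
import io
import wave

def describe_wav(path_or_file):
    """Return (channels, sample_width_bytes, frame_rate, n_frames) for a WAV."""
    with wave.open(path_or_file, "rb") as w:
        return (w.getnchannels(), w.getsampwidth(),
                w.getframerate(), w.getnframes())

# Build 1 s of silence in the format the issue uses:
# int16 (2-byte samples), mono, 16 kHz.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
buf.seek(0)

channels, width, rate, frames = describe_wav(buf)
print(channels, width, rate, frames)  # 1 2 16000 16000
```

Running `describe_wav("/tmp/q.wav")` after the `afconvert` step should print `1 2 16000 <n>`; anything else would point at a conversion problem rather than an engine bug.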

Environment

  • cactus CLI installed via brew install cactus-compute/cactus/cactus, version reported by brew: cactus 1.13_1
  • macOS 26.3.1, Apple Silicon (M4 Pro), 64 GB RAM
  • Xcode command line tools present; Python 3.12.x venv created via source ./setup
  • Weights downloaded via cactus download google/gemma-4-E4B-it; source zip: Cactus-Compute/gemma-4-E4B-it/gemma-4-e4b-it-int4-apple.zip. Extracted directory contains 2088 files, including audio_encoder.mlpackage and audio_encoder.mlmodelc, but no model.mlpackage at the root of the weights dir.
  • libcactus.dylib built from source with cactus build --python (commit HEAD of main, fetched 2026-04-14).
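For completeness, here is the check I used to confirm which Core ML artifacts are present in the weights directory. The artifact names come from the warnings above; the directory path is whatever `cactus download` produced locally (the demo below runs against a temporary directory mimicking my layout, i.e. everything except the top-level model.mlpackage):

```python
import tempfile
from pathlib import Path

def check_weights_dir(weights_dir):
    """Report presence/absence of the Core ML artifacts named in the warnings."""
    weights = Path(weights_dir)
    expected = ["model.mlpackage", "audio_encoder.mlpackage",
                "audio_encoder.mlmodelc", "vision_encoder.mlpackage",
                "vision_encoder.mlmodelc"]
    return {name: (weights / name).exists() for name in expected}

# Demo: fake weights dir with the layout I actually see on disk.
with tempfile.TemporaryDirectory() as d:
    for name in ["audio_encoder.mlpackage", "audio_encoder.mlmodelc",
                 "vision_encoder.mlpackage", "vision_encoder.mlmodelc"]:
        (Path(d) / name).mkdir()
    report = check_weights_dir(d)
    print(report["model.mlpackage"])  # False -- matches the NPU prefill warning
```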

Reproducer (Python, minimal)

import sys, json, wave, subprocess
sys.path.insert(0, "cactus/python/src")
from cactus import cactus_init, cactus_complete, cactus_destroy

# 1. Generate a test WAV using macOS `say`
subprocess.run(["say", "-v", "Samantha", "-o", "/tmp/q.aiff",
                "What is the capital of France?"], check=True)
subprocess.run(["afconvert", "-f", "WAVE", "-d", "LEI16@16000", "-c", "1",
                "/tmp/q.aiff", "/tmp/q.wav"], check=True)

# 2. Load Gemma 4 E4B
model = cactus_init("/opt/homebrew/Cellar/cactus/1.13_1/libexec/weights/gemma-4-e4b-it",
                    None, False)

# 3a. Try native audio-in via pcm_data (6th arg)
with wave.open("/tmp/q.wav", "rb") as w:
    pcm = w.readframes(w.getnframes())

msgs = json.dumps([{"role": "user",
                    "content": "Answer the question I just spoke in one short sentence."}])
opts = json.dumps({"max_tokens": 40})

try:
    raw = cactus_complete(model, msgs, opts, None, None, pcm)
    print("pcm_data path:", json.loads(raw).get("response"))
except Exception as e:
    print("pcm_data path FAILED:", e)

# 3b. Try native audio-in via "audio" field on the message
msgs_audio = json.dumps([{
    "role": "user",
    "content": "Answer the question I just spoke in one short sentence.",
    "audio": ["/tmp/q.wav"]
}])
try:
    raw = cactus_complete(model, msgs_audio, opts, None, None)
    print("audio-field path:", json.loads(raw).get("response"))
except Exception as e:
    print("audio-field path FAILED:", e)

# 4. Control: text-only call on the same handle succeeds
msgs_text = json.dumps([{"role": "user", "content": "What is the capital of France?"}])
raw = cactus_complete(model, msgs_text, opts, None, None)
print("text-only path:", json.loads(raw).get("response"))

cactus_destroy(model)

Output on my machine:

[WARN] [npu] [gemma4] model.mlpackage not found; using CPU prefill
[WARN] [npu] [gemma4-audio] unsupported NPU input shape; falling back to CPU audio encoder
[ERROR] [complete] Exception: unordered_map::at: key not found
pcm_data path FAILED: Completion failed
[ERROR] [complete] Exception: unordered_map::at: key not found
audio-field path FAILED: Completion failed
text-only path: Paris.

What I checked

  • The weights dir contains audio_encoder.mlpackage, audio_encoder.mlmodelc, vision_encoder.mlpackage, vision_encoder.mlmodelc, and ~2088 per-tensor .weights files. It does not contain a top-level model.mlpackage, which I suspect is related to the first warning ([npu] [gemma4] model.mlpackage not found; using CPU prefill). Re-running cactus download google/gemma-4-E4B-it --reconvert did not add it.
  • Padding the audio to exactly 30 × 16000 × 2 = 960000 bytes did not change the outcome.
  • Changing the system/user prompt, adding or removing a system message, and adjusting temperature/max_tokens did not change the outcome.
  • cactus_transcribe(whisper-small, ...) on the same WAV returns the expected text, and feeding that transcript to cactus_complete(gemma-4-e4b-it, ...) as plain text also works — so the transcribe-then-complete cascade path is fine; only native audio-in fails.
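The 30 s padding experiment from the second bullet was done with a helper like this (zero-padding int16 mono 16 kHz PCM out to a fixed window; the constants are just the arithmetic from that bullet):

```python
SAMPLE_RATE = 16000      # Hz
BYTES_PER_SAMPLE = 2     # int16 mono

def pad_pcm_to_seconds(pcm: bytes, seconds: int = 30) -> bytes:
    """Zero-pad raw int16 mono 16 kHz PCM out to a fixed duration."""
    target = seconds * SAMPLE_RATE * BYTES_PER_SAMPLE
    if len(pcm) > target:
        raise ValueError("clip longer than target duration")
    return pcm + b"\x00" * (target - len(pcm))

# 1.64 s clip padded out to the 30 s window:
short_clip = b"\x00" * (int(1.64 * SAMPLE_RATE) * BYTES_PER_SAMPLE)
padded = pad_pcm_to_seconds(short_clip)
print(len(padded))  # 960000 == 30 * 16000 * 2
```

The padded buffer was passed as the pcm_data argument exactly as in the reproducer; the exception was identical.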

What I'd like help with

  1. Is native audio input to Gemma 4 via cactus_complete supported in 1.13_1, or only in a later release?
  2. If supported, is the "audio": [path] field the correct message shape, or is there a different prompt template (e.g., with an <audio> placeholder token)?
  3. Is the missing model.mlpackage a known issue with the -apple zip, and is there a way to get the full NPU path for Gemma 4 E4B?

Happy to collect more logs, run with extra tracing, or try a different precision. I'm building a voice agent for the Gemma 4 Voice Agents Hackathon and the on-device audio-in latency story is central to the demo.

Thanks for the excellent engine — looking forward to hacking on it this weekend.
