Mtmd implementation #1261

SignalRT · 2025-09-27T16:47:24Z

Prototype implementation:

Minimally tested on macOS.
Tested unsuccessfully with CUDA 13 (seems to be an issue in llama.cpp itself).
Unit test
The test does not render images.

Copilot

Pull Request Overview

This PR implements a comprehensive migration from the existing LLaVA multimodal architecture to a new MTMD (Multi-Modal Text+Data) implementation. The change introduces a more unified approach to handling multimodal inputs (images, audio, video) by replacing specialized LLaVA components with generic MTMD helpers that support multiple media types through a consistent tokenization and evaluation pipeline.

Migration from LLaVA-specific classes to generic MTMD wrapper classes
Introduction of new native API surface for MTMD tokenization and chunk-based evaluation
Updated executors to use MTMD tokenization instead of direct image embedding evaluation
Comprehensive test coverage for the new MTMD functionality

Reviewed Changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
SafeMtmdWeights.cs	New wrapper class for MTMD multimodal weights replacing LLavaWeights
NativeApi.Mtmd.cs	Native P/Invoke surface for MTMD helper functions
SafeMtmdModelHandle.cs	Native handle management for MTMD models with tokenization and evaluation
SafeMtmdInputChunks.cs	Managed wrapper for native chunk collections returned by tokenizer
SafeMtmdInputChunk.cs	Individual chunk wrapper with metadata access and token span views
SafeMtmdEmbed.cs	Media embedding wrapper supporting images, audio, and raw data buffers
LLamaInteractExecutor.cs	Updated interactive executor to use MTMD tokenization workflow
LLamaInstructExecutor.cs	Updated instruct executor with MTMD preprocessing logic
BatchedExecutor.cs	Added MTMD batch evaluation support for batched inference
Conversation.cs	Extended conversation class with multimodal prompting and media queueing

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

LLama/Native/NativeApi.cs

LLama/Native/MtmdContextParams.cs

LLama/MtmdWeights.cs

LLama/Native/SafeMtmdModelHandle.cs

Copilot · 2025-09-28T15:53:12Z

LLama/LLamaInteractExecutor.cs

+                if (inferenceParams.MaxTokens == 0)
+                {
+                    _embeds.Clear();
+                    args.WaitForInput = true;
+                    args.ReturnValue = false;
+                    return;
+                }


This MaxTokens == 0 check and its logic is duplicated in InstructExecutor. Consider extracting this into a shared method in the base class.

LLama/Batched/Conversation.cs

martindevans · 2025-10-19T16:14:40Z

LLama/Batched/BatchedExecutor.cs

    : IDisposable
 {
    private int _nextSequenceId;
-    private readonly List<IBatch> _batchQueue = [];


It looks like the changes from #1262 have been undone here? Probably an accident that requires rebasing?

I’ve rebased the branch.

martindevans · 2025-10-19T16:17:35Z

LLama/Batched/BatchedExecutor.cs

        }
    }
+
+    private class MtmdChunkBatch : IBatch


The way the batched executor worked with llava previously was llava embedded the image, and then you could prompt with the raw embeddings (EmbeddingBatch). That was nice, since the batched executor was totally independent from llava. Is that no longer possible with MTMD?

martindevans · 2025-10-19T16:24:33Z

LLama/Batched/Conversation.cs

    /// </summary>
    private bool _forked;
+    private readonly List<SafeMtmdEmbed> _mtmdEmbeds = new();
+    private int? _mtmdLogitsIndex;


This only ever seems to be set to -1 or null?

martindevans · 2025-10-19T16:27:48Z

LLama/Batched/Conversation.cs

+        _mtmdLogitsIndex = null;
+    }
+
+    public void QueueMedia(SafeMtmdEmbed embed)


Could we add an overload of Prompt that takes a Span<SafeMtmdEmbed> instead of enqueuing them? This removes some state and makes the ownership of the embed objects clearer (e.g. at the moment you could call QueueMedia, then dispose the media, which would trigger an error on the next call to Prompt when you tried to use that disposed object).

martindevans · 2025-10-19T16:30:37Z

LLama/Batched/ConversationExtensions.cs

-        return sampler.Sample(conversation.Executor.Context.NativeHandle, conversation.GetSampleIndex(offset));
+        var ctx = conversation.Executor.Context.NativeHandle;
+        if (conversation.MtmdLogitsIndex == -1)
+            return sampler.Sample(ctx, -1);


This looks very odd to me - if you had multiple active conversations and prompted them all with an image then next time you would be sampling from index=-1 for all of the conversations! Maybe that's right, it just caught my eye.

LLama/Native/MtmdContextParams.cs

LLama/Native/NativeApi.Mtmd.cs

LLama/Native/SafeMtmdEmbed.cs

LLama/Native/SafeMtmdInputChunk.cs

LLama/Native/SafeMtmdInputChunks.cs

LLama/SafeMtmdWeights.cs

martindevans

Thanks for all the hard work putting this together! Lots of small review nitpicks, but overall this looks really solid 👍

Webslug · 2025-10-25T17:09:39Z

Version 25.0 just breaks multi modal capabilities.

Qwen2.5-VL-3B won't work at all.

How do we load the weights from other multimodal models?

System.DllNotFoundException: 'Unable to load DLL 'llava_shared' or one of its dependencies: The specified module could not be found. (0x8007007E)'

        string multiModalProj = "F:\\AI\\models\\Qwen2.5-VL-3B-Instruct-mmproj-f16.gguf";
        string modelPath = "F:\\AI\\models\\Qwen2.5-VL-3B-Instruct-q4_k_m.gguf";

        var parameters = new ModelParams(modelPath);
        NativeApi.llama_log_set((level, message) => { });
        Environment.SetEnvironmentVariable("LLAMA_LOG", "0");
        using var model = LLamaWeights.LoadFromFile(parameters);
        using var context = model.CreateContext(parameters);
        using var clipModel = LLavaWeights.LoadFromFile(multiModalProj);
        var executor = new InteractiveExecutor(context, clipModel, logger: null);

        var inferenceParams = new InferenceParams()
        {
            MaxTokens = 512,
            AntiPrompts = new List<string> { "\nUSER:" },
            SamplingPipeline = new DefaultSamplingPipeline { Temperature = 0.1f }
        };

Co-authored-by: Copilot <[email protected]>

To manage PtrToString conversions

SignalRT · 2025-10-26T21:36:52Z

@martindevans , At this point, it would be helpful to have the final binaries so that we can perform a complete test and check for any issues with the rebase

This was referenced Sep 27, 2025

System.DllNotFoundException: 'Unable to load DLL 'llava_shared' or one of its dependencies: The specified module could not be found. (0x8007007E)' #1255

Open

Multimodal embedding #1193

Open

SignalRT force-pushed the mtmd_implementation branch from fcce175 to 9931d0e Compare September 28, 2025 14:54

SignalRT requested a review from Copilot September 28, 2025 15:51

Copilot AI reviewed Sep 28, 2025

View reviewed changes

SignalRT added a commit to SignalRT/LLamaSharp that referenced this pull request Sep 29, 2025

Resolve comment: SciSharp#1261 (comment)

3c92b07

SignalRT mentioned this pull request Sep 30, 2025

Qwen2.5-VL gguf model output garbled code #1194

Open

SignalRT marked this pull request as ready for review October 5, 2025 12:27

SignalRT mentioned this pull request Oct 6, 2025

[BUG]: Error in version 0.25.0 - LLama.Exceptions.RuntimeError: Failed to load the native library. #1275

Open

SignalRT requested a review from martindevans October 19, 2025 16:05