
Conversation

SignalRT (Collaborator) commented Sep 27, 2025

Prototype implementation:

  • Minimally tested on macOS.
  • Tested unsuccessfully with CUDA 13 (the failure appears to be an issue in llama.cpp itself).
  • Unit test added.
  • The test does not render images.
  • Audio and video handling not tested.
  • Batch management may be overcomplicated.

SignalRT requested a review from Copilot on September 28, 2025 at 15:51
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR implements a comprehensive migration from the existing LLaVA multimodal architecture to a new MTMD (Multi-Modal Text+Data) implementation. The change introduces a more unified approach to handling multimodal inputs (images, audio, video) by replacing specialized LLaVA components with generic MTMD helpers that support multiple media types through a consistent tokenization and evaluation pipeline.

  • Migration from LLaVA-specific classes to generic MTMD wrapper classes
  • Introduction of new native API surface for MTMD tokenization and chunk-based evaluation
  • Updated executors to use MTMD tokenization instead of direct image embedding evaluation
  • Comprehensive test coverage for the new MTMD functionality

Reviewed Changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.

Summary of changes per file:

SafeMtmdWeights.cs: New wrapper class for MTMD multimodal weights, replacing LLavaWeights
NativeApi.Mtmd.cs: Native P/Invoke surface for the MTMD helper functions
SafeMtmdModelHandle.cs: Native handle management for MTMD models, including tokenization and evaluation
SafeMtmdInputChunks.cs: Managed wrapper for the native chunk collections returned by the tokenizer
SafeMtmdInputChunk.cs: Individual chunk wrapper with metadata access and token span views
SafeMtmdEmbed.cs: Media embedding wrapper supporting images, audio, and raw data buffers
LLamaInteractExecutor.cs: Interactive executor updated to use the MTMD tokenization workflow
LLamaInstructExecutor.cs: Instruct executor updated with MTMD preprocessing logic
BatchedExecutor.cs: MTMD batch evaluation support added for batched inference
Conversation.cs: Conversation class extended with multimodal prompting and media queueing
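
The overall flow the PR describes can be pictured roughly as below. The types are the ones listed above, but the method names and signatures in this sketch are assumptions for illustration, not the PR's actual API:

// Conceptual sketch of the chunk-based MTMD pipeline (illustrative names only).
void EvaluateMultimodalPrompt(SafeMtmdModelHandle mtmd, string prompt, SafeMtmdEmbed image)
{
    // 1. Tokenize the text and the queued media together into an ordered chunk collection.
    using SafeMtmdInputChunks chunks = mtmd.Tokenize(prompt, new[] { image });

    // 2. Evaluate each chunk (plain text tokens or a media embedding) through the same batch path.
    foreach (SafeMtmdInputChunk chunk in chunks)
        mtmd.EvaluateChunk(chunk);
}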


Comment on lines +150 to +155
if (result != 0)
{
    foreach (var media in _pendingMedia)
        media.Dispose();
    _pendingMedia.Clear();
}

Copilot AI Sep 28, 2025


This error handling block duplicates the cleanup logic from lines 141-143. Consider extracting this into a private method to avoid code duplication.
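
A minimal sketch of that extraction (the helper name is hypothetical, based only on the excerpt above):

// Hypothetical private helper consolidating the duplicated cleanup.
private void DisposePendingMedia()
{
    foreach (var media in _pendingMedia)
        media.Dispose();
    _pendingMedia.Clear();
}

Both call sites would then reduce to:

if (result != 0)
    DisposePendingMedia();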


Comment on lines +457 to +463
if (inferenceParams.MaxTokens == 0)
{
    _embeds.Clear();
    args.WaitForInput = true;
    args.ReturnValue = false;
    return;
}

Copilot AI Sep 28, 2025


This MaxTokens == 0 check and its handling are duplicated in InstructExecutor. Consider extracting them into a shared method in the base class.
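
A possible shape for that shared helper, as a sketch only (the parameter types are taken from the excerpt and the existing executor signatures; the exact base class and naming are assumptions):

// Hypothetical protected helper on the executors' shared base class.
// Returns true when MaxTokens == 0, meaning there is nothing to generate.
protected bool TryHandleZeroMaxTokens(IInferenceParams inferenceParams, InferStateArgs args)
{
    if (inferenceParams.MaxTokens != 0)
        return false;

    _embeds.Clear();
    args.WaitForInput = true;
    args.ReturnValue = false;
    return true;
}

Each executor could then replace the duplicated block with:

if (TryHandleZeroMaxTokens(inferenceParams, args))
    return;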


Comment on lines +288 to 298
public void Prompt(string promptText, bool addBos = true, bool special = true)
{
    if (Executor.ClipModel != null && _mtmdEmbeds.Count > 0)
    {
        PromptMultimodal(promptText, addBos);
        return;
    }

    var tokens = Executor.Context.Tokenize(promptText, addBos, special);
    Prompt(tokens);
}

Copilot AI Sep 28, 2025


The special parameter is ignored when ClipModel is available and multimodal processing is used. This inconsistency could confuse API consumers. Consider passing the special parameter to PromptMultimodal or documenting this behavior.
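
One possible fix, shown as a sketch: thread the flag through to the multimodal path (this assumes PromptMultimodal can accept a special parameter, which is not part of the excerpt above):

public void Prompt(string promptText, bool addBos = true, bool special = true)
{
    if (Executor.ClipModel != null && _mtmdEmbeds.Count > 0)
    {
        // Forward 'special' so the multimodal and plain-text paths tokenize consistently.
        PromptMultimodal(promptText, addBos, special);
        return;
    }

    var tokens = Executor.Context.Tokenize(promptText, addBos, special);
    Prompt(tokens);
}

Alternatively, the current behaviour could simply be documented on the method's XML doc comment.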

