MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals) is an agentic medical AI system composed of three specialized expert models and a coordinating orchestrator. Each expert is a 3-billion-parameter vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct on a single cardiac imaging modality. The orchestrator is a stateless agentic layer that interfaces with all three experts via an OpenAI-compatible chat-completion API, decomposes clinical queries, aggregates expert responses, and applies counterfactual probing to detect and suppress mirage reasoning.
┌────────────────────────────────────────┐
│ Clinical Input │
│ (ECG PNG / Echo video / CMR video / │
│ free-text query / combined) │
└──────────────────┬─────────────────────┘
│
┌──────────────────▼─────────────────────┐
│ Agentic Orchestrator │
│ │
│ 1. Query Decomposition │
│ Break clinical question into │
│ modality-specific sub-queries │
│ │
│ 2. Expert Routing │
│ Route each sub-query + media to │
│ appropriate expert via REST API │
│ │
│ 3. Mirage Probing │
│ Counterfactual (image-absent) │
│ consistency check per expert │
│ │
│ 4. Confidence Scoring │
│ Per-modality confidence from │
│ rephrase agreement + image delta │
│ │
│ 5. Answer Aggregation │
│ Confidence-weighted synthesis of │
│ all expert responses │
└───┬──────────────┬──────────────┬──────┘
│ │ │
┌────────────────▼──┐ ┌────────▼───────┐ ┌──▼────────────────┐
│ ECG Expert │ │ Echo Expert │ │ CMR Expert │
│ │ │ │ │ │
│ SigLIP Encoder │ │ ViT Encoder │ │ ViT Encoder │
│ 2-layer MLP proj │ │ Temporal agg │ │ Metadata routing │
│ Qwen2 LM │ │ Cross-view │ │ Qwen2 LM │
│ │ │ Qwen2 LM │ │ │
│ Port: 8020 │ │ Port: 8010 │ │ Port: 8000 │
└───────────────────┘ └────────────────┘ └───────────────────┘
All three expert models share the same base: Qwen/Qwen2.5-VL-3B-Instruct. They differ in their visual preprocessing pipeline, training data, and fine-tuning recipe.
Checkpoint: saves/Qwen2.5-VL-3B-Instruct/full/ecg_sft
HuggingFace: stanford-cardiac-ai/MARCUS-ECG
Training: SFT only (Stages 1 and 2; no GRPO)
12-lead ECG waveform (12 × N float array, mV)
│
▼
ECG Renderer (25 mm/s, 10 mm/mV, 10 s)
4-row × 3-lead hospital-style PNG
224 × 224 px per lead patch
│
▼
SigLIP Vision Encoder
(patch-based, ViT-style, pretrained on biomedical images)
│
▼
2-Layer MLP Projection Head
(maps vision tokens to LLM embedding dimension)
│
▼
Qwen2 Language Model
(3B parameter decoder; generates free-text clinical interpretation)
- File format: .npy (shape (12, N), float32, units mV) or XML (Philips / MUSE)
- Rendered format: Hospital-style 12-lead ECG PNG
- Paper speed: 25 mm/s (standard clinical convention)
- Gain: 10 mm/mV
- Duration: 10 seconds
- Layout: 4 rows × 3 leads (standard clinical layout)
- Patch size: 224 × 224 px per lead
- 249,785 12-lead ECGs with paired physician interpretation reports
- Source: Stanford University Medical Center
- Preprocessing: XML/numeric → hospital-style PNG via video_chat_ui.preprocessing.ecg (sketched below)
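The rendering step can be approximated with matplotlib. A minimal sketch, assuming a (12, N) mV array sampled at 500 Hz; the render_ecg_png name, the row-major lead ordering, and the axis scaling are illustrative simplifications (the real renderer produces true 25 mm/s / 10 mm/mV geometry and 224 × 224 px lead patches):

```python
import numpy as np
import matplotlib.pyplot as plt

LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]

def render_ecg_png(waveform: np.ndarray, out_path: str, fs: int = 500) -> None:
    """Render a (12, N) mV waveform as a 4-row x 3-lead grid PNG."""
    t = np.arange(waveform.shape[1]) / fs
    fig, axes = plt.subplots(4, 3, figsize=(12, 8), sharex=True, sharey=True)
    for i, ax in enumerate(axes.flat):
        ax.plot(t, waveform[i], linewidth=0.6, color="black")
        ax.set_title(LEADS[i], fontsize=8, loc="left")
        ax.set_ylim(-2, 2)                                  # ~10 mm/mV gain window
        ax.grid(True, linewidth=0.3, color="lightcoral")    # ECG-paper-style grid
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)

# Usage: render_ecg_png(np.load("ecg.npy"), "ecg.png")
```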
| Stage | Method | Details |
|---|---|---|
| 1 | Encoder pretraining | Freeze LLM; train vision encoder + MLP on (ECG, report) pairs |
| 2 | SFT | Fine-tune full model on MCQ + VQA instruction pairs |
| Metric | Stanford | UCSF |
|---|---|---|
| MCQ Accuracy | 87% | 91% |
| VQA Likert (1–5) | 3.65 | — |
Checkpoint: saves/Qwen2.5-VL-3B-Instruct/full/echo_grpo
HuggingFace: stanford-cardiac-ai/MARCUS-Echo
Training: SFT + GRPO (all 3 stages)
Multi-view Echo DICOM study (.tgz)
│
▼
DICOM Extractor + View Selector
(attention-based, no manual annotation)
│
▼
Multi-view Video Grid Assembly
(key views assembled into grid MP4)
│
▼
Visual Encoder (ViT, 16 × 16 patches)
per-frame spatial encoding
│
▼
Temporal Aggregation Module
(pools frame-level features across time)
│
▼
Cross-View Fusion Module
(combines features across views via cross-attention)
│
▼
Adapter Layer (Cross-Attention)
vision embeddings cross-attend to text tokens
residual connections between ViT layers and LLM blocks
│
▼
Qwen2 Language Model
- File format: Multi-view DICOM .tgz (standard clinical export)
- Rendered format: Multi-view grid video (MP4 or AVI); see the grid-assembly sketch after this list
- View selection: Attention-based automatic selection; no manual labeling required
- Patch decomposition: 16 × 16 spatial patches per frame
- 1,266,144 echocardiography images from 10,823 studies
- Source: Stanford University Medical Center
- Views include: parasternal long axis, parasternal short axis, apical 4-chamber, apical 2-chamber, subcostal, and others
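A minimal sketch of the grid-assembly step, assuming each selected view has already been decoded to a (T, H, W, 3) BGR uint8 array; the write_grid_video name and the single-row tiling are illustrative:

```python
import cv2
import numpy as np

def write_grid_video(views: list[np.ndarray], out_path: str, fps: int = 30) -> None:
    """Tile the selected views side by side, frame by frame, and write an MP4
    using OpenCV's built-in encoder (no external FFmpeg binary)."""
    n_frames = min(v.shape[0] for v in views)
    h, w = views[0].shape[1:3]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (w * len(views), h))
    for t in range(n_frames):
        row = np.hstack([cv2.resize(v[t], (w, h)) for v in views])
        writer.write(row)
    writer.release()
```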
| Stage | Method | Details |
|---|---|---|
| 1 | Encoder pretraining | Freeze LLM; train visual encoder on (echo frame, report) pairs |
| 2 | SFT | Full fine-tuning on MCQ + VQA pairs from physician reports |
| 3 | GRPO | Group Relative Policy Optimization with MCQ binary correctness reward |
| Metric | Stanford | UCSF |
|---|---|---|
| MCQ Accuracy | 67% | 86% |
| VQA Likert (1–5) | 2.41 | — |
Checkpoint: saves/Qwen2.5-VL-3B-Instruct/full/cmr_grpo
HuggingFace: stanford-cardiac-ai/MARCUS-CMR
Training: SFT + GRPO (all 3 stages)
Multi-sequence CMR DICOM study (.tgz)
│
▼
DICOM Extractor + Metadata Parser
(reads SeriesDescription, SequenceName, etc.)
│
▼
Metadata-Driven Sequence Router
(selects cine / LGE / T2 / T1 / other based on query)
│
▼
Multi-Slice Grid Assembly
(selected sequences assembled into grid MP4 via OpenCV)
│
▼
Visual Encoder (ViT, 16 × 16 patches)
│
▼
Adapter Layer (Cross-Attention)
vision embeddings → cross-attention → concat with text tokens
residual connections between ViT layers and LLM blocks
│
▼
Qwen2 Language Model
- File format: Multi-sequence, multi-slice DICOM .tgz
- Rendered format: Multi-slice grid video (MP4, OpenCV-only; no FFmpeg required)
- Sequences supported: Cine (SSFP), LGE, T2-weighted, T1-mapping, T1-weighted, phase-contrast
- Metadata routing: SeriesDescription and SequenceName DICOM tags drive sequence selection (see the routing sketch after this list)
- 12,191,751 CMR images from 9,473 studies
- Source: Stanford University Medical Center
- Sequences: cine short-axis, cine long-axis, LGE, T2 STIR, T1 mapping, and others
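A minimal sketch of the metadata-driven sequence routing, assuming pydicom for tag access; the keyword lists and the select_series helper are illustrative, not the production routing rules:

```python
import pydicom

# Illustrative keyword map from sequence type to SeriesDescription/SequenceName cues.
SEQUENCE_KEYWORDS = {
    "cine": ["cine", "ssfp", "trufi"],
    "lge":  ["lge", "late gad", "psir"],
    "t2":   ["t2", "stir"],
    "t1":   ["t1 map", "molli", "t1w"],
}

def select_series(dicom_paths: list[str], wanted: str) -> list[str]:
    """Return DICOM files whose SeriesDescription or SequenceName matches the
    requested sequence type (e.g. 'lge')."""
    keywords = SEQUENCE_KEYWORDS[wanted]
    selected = []
    for path in dicom_paths:
        ds = pydicom.dcmread(path, stop_before_pixels=True)   # headers only
        text = " ".join([str(getattr(ds, "SeriesDescription", "")),
                         str(getattr(ds, "SequenceName", ""))]).lower()
        if any(k in text for k in keywords):
            selected.append(path)
    return selected
```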
Identical to Echo: Stage 1 encoder pretraining → Stage 2 SFT → Stage 3 GRPO.
| Metric | Stanford | UCSF |
|---|---|---|
| MCQ Accuracy | 88% | 85% |
| VQA Likert (1–5) | 2.91 | — |
Both the Echo and CMR experts use a cross-attention adapter between the visual encoder and the language model:
Vision Encoder output (sequence of patch embeddings)
│
▼
Cross-Attention Layer
┌─────────────────────────────────────────┐
│ Query: text token embeddings │
│ Key/Value: vision patch embeddings │
│ Output: vision-enriched text tokens │
└─────────────────────────────────────────┘
│
▼ (+ residual from ViT layer output)
Concatenated with text token sequence
│
▼
Qwen2 Transformer Blocks
Residual connections carry visual information from each ViT block into the corresponding LLM block, enabling layer-wise visual grounding.
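A minimal PyTorch sketch of such an adapter, with illustrative layer sizes (not the actual MARCUS configuration):

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Text tokens attend to vision patch embeddings; the attended output is
    added back to the text stream through a residual connection."""

    def __init__(self, text_dim: int = 2048, vision_dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, text_dim)   # match embedding widths
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_tokens: torch.Tensor, vision_patches: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T, text_dim); vision_patches: (B, P, vision_dim)
        kv = self.vision_proj(vision_patches)
        attended, _ = self.attn(query=text_tokens, key=kv, value=kv)
        return self.norm(text_tokens + attended)   # vision-enriched text tokens
```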
The orchestrator is a Python process that coordinates the three experts via HTTP. It does not have its own learned weights; its behavior is governed by prompt engineering and deterministic logic.
When a user submits a multimodal clinical query (e.g., "Summarize this patient's cardiac workup"), the orchestrator uses a chain-of-thought decomposition prompt to break it into modality-specific sub-queries:
Input: "What are the key findings from this patient's ECG, echo, and CMR?"
Output:
ecg_query: "Identify the rhythm, rate, axis, and any interval abnormalities."
echo_query: "Report ventricular function, wall motion, and valve status."
cmr_query: "Describe myocardial morphology, LGE pattern, and any fibrosis."
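A minimal sketch of the decomposition call, assuming the orchestrator sends the prompt to a chat model behind the same OpenAI-compatible API and the model replies with bare JSON; the prompt text, model name, and decompose_query helper are illustrative:

```python
import json
import requests

DECOMPOSE_PROMPT = (
    "Break the clinical question below into three sub-queries, one per modality, "
    "and reply with a JSON object containing the keys ecg_query, echo_query, "
    "and cmr_query.\n\nQuestion: {question}"
)

def decompose_query(question: str, api_url: str) -> dict:
    """Ask an OpenAI-compatible endpoint to split a question into modality-specific sub-queries."""
    payload = {
        "model": "default",   # illustrative; use the id reported by /v1/models
        "messages": [{"role": "user",
                      "content": DECOMPOSE_PROMPT.format(question=question)}],
        "temperature": 0.0,
    }
    resp = requests.post(f"{api_url}/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    return json.loads(text)   # {"ecg_query": ..., "echo_query": ..., "cmr_query": ...}
```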
Each sub-query and its associated media file are routed to the appropriate expert via the OpenAI-compatible /v1/chat/completions endpoint:
# Pseudocode
ecg_response = post_to_api(ecg_url, ecg_query, ecg_image)
echo_response = post_to_api(echo_url, echo_query, echo_video)
cmr_response = post_to_api(cmr_url, cmr_query, cmr_video)
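One possible implementation of the post_to_api helper, shown here for a PNG attachment; video payloads and the exact multimodal message schema depend on the serving stack, so treat this as a sketch rather than the actual client:

```python
import base64
import requests

def post_to_api(api_url: str, query: str, media_path: str) -> str:
    """Send one sub-query plus its media file to an expert's
    /v1/chat/completions endpoint and return the text answer."""
    with open(media_path, "rb") as f:
        media_b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": "default",   # illustrative; use the id reported by /v1/models
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{media_b64}"}},
                {"type": "text", "text": query},
            ],
        }],
    }
    resp = requests.post(f"{api_url}/v1/chat/completions", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```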
Per-modality confidence is a scalar in [0, 1] computed from two components:

- Rephrase agreement: Three semantically equivalent rephrasings of each sub-query are sent to the expert. Agreement across the responses is measured by token-level n-gram overlap (ROUGE-L). High agreement → high confidence.
- Image-delta score: The same query is sent again without the image, and the cosine distance between the image-present and image-absent response embeddings is computed. A large distance means the model is actually using the image → high confidence.

confidence = α * rephrase_agreement + (1 - α) * image_delta_score

where α is a fixed mixing weight between the two components.
If confidence < threshold, the answer is flagged as low-confidence and a disclaimer is prepended.
See Mirage Resistance below.
Final answer = confidence-weighted synthesis of expert responses, composed via a final LLM call with a synthesis prompt that combines the sub-answers and their confidence levels into a coherent clinical summary.
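A minimal sketch of that synthesis step; the prompt wording and the synthesize_answer and llm_call names are illustrative:

```python
def synthesize_answer(sub_answers: dict, confidences: dict, llm_call) -> str:
    """Compose the final clinical summary from per-modality answers and their
    confidence scores via one last chat-completion call."""
    sections = []
    for modality, answer in sub_answers.items():
        sections.append(f"[{modality.upper()} | confidence {confidences[modality]:.2f}]\n{answer}")
    prompt = (
        "Synthesize the following modality-specific findings into a single coherent "
        "clinical summary. Weight high-confidence findings more heavily and flag "
        "low-confidence findings explicitly.\n\n" + "\n\n".join(sections)
    )
    return llm_call(prompt)   # llm_call wraps a /v1/chat/completions request
```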
┌────────────────────────────────────────────────────────┐
│ Data: (image/video, physician report) pairs │
│ │
│ Trainable: Vision encoder + MLP projection head │
│ Frozen: Qwen2 language model │
│ │
│ Objective: Next-token prediction on report text │
│ conditioned on visual tokens │
│ │
│ Purpose: Align visual representation space with │
│ medical report vocabulary and concepts │
└────────────────────────────────────────────────────────┘
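A minimal sketch of the Stage-1 freezing logic, assuming a stock Hugging Face Qwen2.5-VL checkpoint whose vision tower and projector live under the "visual." parameter prefix; the ECG expert swaps in a SigLIP encoder, so the actual class and prefixes differ:

```python
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Stage 1: train only the vision encoder + projection head; keep the LLM frozen.
VISION_PREFIXES = ("visual.",)   # assumed prefix; verify with model.named_parameters()
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(VISION_PREFIXES)
```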
┌────────────────────────────────────────────────────────┐
│ Data: (image/video, question, answer) instruction │
│ triples — MCQ and VQA formats │
│ │
│ Trainable: Full model (encoder + projection + LLM) │
│ │
│ Objective: Cross-entropy loss on answer tokens │
│ │
│ Purpose: Task-specific instruction following; │
│ adapt to MCQ/VQA answer format │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ Data: MCQ questions with ground-truth labels │
│ │
│ Algorithm: Group Relative Policy Optimization (GRPO) │
│ │
│ Reward: +1 if MCQ answer matches ground truth │
│ 0 otherwise │
│ │
│ Purpose: Improve calibration, reduce overconfidence │
│ on unseen questions, suppress mirage rate │
└────────────────────────────────────────────────────────┘
GRPO samples multiple completions per question (the "group"), estimates a baseline from the group mean reward, and updates the policy with a clipped ratio objective — analogous to PPO but without a learned value function.
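A minimal sketch of the group-relative advantage computation (the clipped policy-ratio update itself is omitted; group size and epsilon are illustrative):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_questions, group_size) binary MCQ rewards.
    Each completion's advantage is its reward minus the group mean,
    normalized by the group standard deviation."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One question, a group of 4 sampled completions, 2 answered correctly:
print(group_relative_advantages(torch.tensor([[1.0, 0.0, 1.0, 0.0]])))
# -> positive advantages for the correct completions, negative for the incorrect ones
```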
Mirage reasoning is the phenomenon where a VLM produces confidently stated, clinically plausible findings that are not actually supported by the input image. MARCUS achieves a 0% mirage rate on the benchmark through a three-step protocol.
Each clinical query is rephrased three times using a paraphrase prompt. All four variants (original + 3 rephrasings) are sent to the expert with the image. A mirage-prone model will give inconsistent answers across rephrasings even when the image is the same — because it is responding to spurious text cues rather than visual content.
Consistency score = mean pairwise ROUGE-L over the four responses.
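A minimal sketch of that score using the rouge-score package (the package choice is an assumption; any ROUGE-L implementation works):

```python
from itertools import combinations
from rouge_score import rouge_scorer

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise ROUGE-L F1 over the four responses (original + 3 rephrasings)."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    pairs = list(combinations(responses, 2))
    return sum(scorer.score(a, b)["rougeL"].fmeasure for a, b in pairs) / len(pairs)
```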
The original query is sent to the expert without attaching the image. A model that produces a nearly identical answer with and without the image is not grounding its response in the visual input — it is hallucinating (mirage).
image_delta = cosine_distance(
embed(response_with_image),
embed(response_without_image)
)
High image_delta means the model's answer meaningfully changes when the image is removed — a sign of genuine visual grounding.
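One possible realization of embed and cosine_distance, assuming a general-purpose sentence-transformers model for response embeddings (the embedding model actually used is not specified here):

```python
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

def embed(text: str):
    return _encoder.encode(text)

def image_delta_score(with_image: str, without_image: str) -> float:
    """Cosine distance between the two response embeddings; larger means the
    answer depends more strongly on the image."""
    return float(cosine(embed(with_image), embed(without_image)))
```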
confidence = alpha * consistency_score + (1 - alpha) * image_delta
if confidence < MIRAGE_THRESHOLD:
response = prepend_disclaimer(response)
# "Note: confidence in this response is low. ..."alpha and MIRAGE_THRESHOLD are set empirically on a held-out calibration set. The combination of rephrase consistency and image-delta eliminates mirages while preserving high-confidence genuine responses.
| Expert | API Port | UI Port | Checkpoint |
|---|---|---|---|
| CMR | 8000 | 8765 | cmr_grpo |
| Echo | 8010 | 8770 | echo_grpo |
| ECG | 8020 | 8775 | ecg_sft |
All APIs expose the OpenAI-compatible /v1/chat/completions and /v1/models endpoints served by LLaMA-Factory.
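For reference, a minimal smoke test against the ECG expert; the host is assumed to be localhost, and the model id should be taken from whatever /v1/models reports:

```python
import requests

BASE = "http://localhost:8020"   # ECG expert API port

# List the models the server registers.
models = requests.get(f"{BASE}/v1/models", timeout=10).json()
model_id = models["data"][0]["id"]

# Send a text-only question through the OpenAI-compatible endpoint.
payload = {
    "model": model_id,
    "messages": [{"role": "user", "content": "What does a prolonged QTc suggest?"}],
}
resp = requests.post(f"{BASE}/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```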