💭🤖 MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models 🏥📊
📄 Paper: MedHEval on arXiv
🧠 What’s inside? A comprehensive suite of evaluation tools, curated datasets, inference code for (Med)-LVLMs, and implementations of hallucination mitigation baselines for several Med-LVLMs.
Medical Large Vision-Language Models (Med-LVLMs) offer great promise in clinical AI by combining image understanding and language generation. However, they frequently generate hallucinations—plausible but ungrounded or incorrect outputs—which can undermine trust and safety in medical applications.
MedHEval addresses this challenge by introducing a comprehensive benchmark to:
- Categorize hallucinations in Med-LVLMs,
- Evaluate model behavior across hallucination types,
- Compare mitigation strategies on multiple Med-LVLMs.
MedHEval provides a modular and extensible codebase. Here's a high-level breakdown of the repository:
MedHEval/
│
├── benchmark_data/                      # Benchmark VQA and report data (excluding raw images)
│
├── code/
│   ├── baselines/
│   │   ├── (Med)-LVLMs/                 # Inference code for baseline LVLMs (e.g., LLaVA, LLM-CXR, etc.)
│   │   └── mitigation/                  # Implementations of hallucination mitigation strategies
│   │
│   ├── data_generation/                 # Scripts to generate benchmark data, split by dataset and hallucination type
│   │
│   └── evaluation/
│       ├── close_ended_evaluation/      # Evaluation pipeline for close-ended tasks (all hallucination types)
│       ├── open_ended_evaluation/       # Knowledge hallucination evaluation (open-ended)
│       └── report_eval/                 # Visual hallucination evaluation from generated reports
│
└── README.md
Each component may include its own README with detailed instructions.
MedHEval classifies hallucinations into three interpretable types:
- **Visual Misinterpretation**: Misunderstanding or inaccurate reading of visual input (e.g., identifying a lesion that doesn’t exist).
- **Knowledge Deficiency**: Errors stemming from gaps or inaccuracies in medical knowledge (e.g., incorrect visual feature-disease associations).
- **Context Misalignment**: Failure to align visual understanding with the medical context (e.g., answering without considering the medical history).
MedHEval consists of the following key components:
- 📊 **Diverse Medical VQA Datasets and Fine-Grained Metrics**: Includes both close-ended (yes/no, multiple choice) and open-ended (free-text, report generation) tasks. The benchmark provides structured metrics for each hallucination category.
- 🧠 **Comprehensive Evaluation on Diverse (Med)-LVLMs**: MedHEval supports a broad range of models, including:
  - Generalist models (e.g., LLaVA, MiniGPT-4)
  - Medical-domain models (e.g., LLaVA-Med, LLM-CXR, CheXagent)
- 🛠️ **Evaluation of Hallucination Mitigation Strategies**: Benchmarked techniques span diverse mitigation methods targeting visual bias or LLM bias (a decoding-time sketch follows this list):
  - OPERA
  - VCD
  - DoLa
  - AVISC
  - M3ID (no official code released; our implementation follows AVISC; see the M3ID paper)
  - DAMRO (no official code released; our implementation is based on the DAMRO paper)
  - PAI
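For intuition, here is a minimal sketch of the contrastive logit adjustment behind visual-contrastive methods such as VCD: next-token logits conditioned on a distorted copy of the image are subtracted from logits conditioned on the real image, down-weighting tokens driven mainly by language priors. The function name, alpha value, and toy tensors are illustrative; the benchmarked implementations live under `code/baselines/mitigation/`.

```python
# Illustrative sketch of VCD-style visual contrastive decoding (not the repo's implementation).
import torch


def contrastive_logits(logits_clean: torch.Tensor,
                       logits_distorted: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Amplify image-grounded evidence: (1 + alpha) * clean - alpha * distorted."""
    return (1 + alpha) * logits_clean - alpha * logits_distorted


# Toy next-token logits over a 4-token vocabulary
clean = torch.tensor([2.0, 0.5, 1.0, 0.1])      # conditioned on the real image
distorted = torch.tensor([1.9, 1.3, 0.4, 0.1])  # conditioned on a noised image
print(contrastive_logits(clean, distorted).argmax().item())
```

The full methods add further machinery (e.g., VCD's adaptive plausibility constraint), which this sketch omits.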
To get started, clone the repository:
git clone https://github.com/yourusername/MedHEval.git
cd MedHEval
MedHEval supports a modular and flexible setup. You can selectively evaluate any Med-LVLM and any hallucination type.
Please refer to the links below for the environment setup corresponding to the model you wish to evaluate. Some baseline models, such as LLM-CXR, involve complex installation steps; in those cases, we recommend setting up the environment and downloading the required model checkpoints according to the official instructions. Once the environment is ready, you can run inference using our implementations under `./code/baselines/(Med)-LVLMs/`.
Each Med-LVLM has specific requirements. Please follow the official or customized setup instructions accordingly, including environment configuration and model checkpoint preparation.
- LLaVA-Med
- LLaVA-Med-1.5: We provide our customized environment setup and implementation, which includes modifications to the `transformers` package for the hallucination mitigation baselines. See `code/baselines/(Med)-LVLMs/llava-med-1.5` for detailed instructions.
- MiniGPT-4
- LLM-CXR
- CheXagent: Loaded directly via Hugging Face `transformers`, since no model or training code is provided in the official repo. No special environment setup is needed beyond installing `torch` and `transformers`; we use the same environment as LLaVA-Med-1.5 (a loading sketch follows this list).
- RadFM
- XrayGPT: Shares the same structure and environment setup as MiniGPT-4.

Each model’s folder under `code/baselines/(Med)-LVLMs/` contains:
- Inference scripts
- Config files
- Notes on modified packages (e.g., `transformers`)
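Since CheXagent is used directly through Hugging Face `transformers` (see the CheXagent bullet above), loading it looks roughly like the sketch below. The checkpoint id and fp16 dtype are assumptions on our part; consult the model card on the Hub for the official usage and generation settings.

```python
# Sketch of loading CheXagent through Hugging Face transformers.
# Assumption: the Hub checkpoint id "StanfordAIMI/CheXagent-8b"; check the model card
# for the exact generation recipe.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "StanfordAIMI/CheXagent-8b"
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).eval()
```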
Note: The evaluation setup is separate from the (Med)-LVLMs and does not require installing the full environment of any specific Med-LVLM. Refer to `code/evaluation` for how to set up the environments and run the evaluations.
- **Close-ended evaluation**: Lightweight, easy to use, and model-agnostic (see the scoring sketch after this list). → `code/evaluation/close_ended_evaluation`
- **Open-ended (Report)**: More involved, as some tools depend on older versions of Python and other packages. → `code/evaluation/report_eval`
- **Open-ended (Knowledge)**: Straightforward setup. Required packages: `langchain`, `pydantic`, and access to an LLM (e.g., Claude 3.5 Sonnet, used in our experiments); a judging sketch also follows this list. → `code/evaluation/open_ended_evaluation`
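To illustrate why the close-ended pipeline is model-agnostic, a minimal scoring loop looks like the sketch below. The JSONL field names (`question_id`, `answer`, `prediction`) are assumptions for illustration only; the actual metrics and file formats are defined in `code/evaluation/close_ended_evaluation`.

```python
# Minimal, model-agnostic sketch of scoring close-ended (yes/no) predictions.
import json


def normalize(text: str) -> str:
    """Map a free-form model response onto a yes/no label."""
    return "yes" if text.strip().lower().startswith("yes") else "no"


def accuracy(gt_path: str, pred_path: str) -> float:
    """Exact-match accuracy of yes/no predictions against the ground-truth split."""
    with open(gt_path) as f:
        gold = {item["question_id"]: normalize(item["answer"])
                for item in map(json.loads, f)}

    correct = total = 0
    with open(pred_path) as f:
        for item in map(json.loads, f):
            if item["question_id"] in gold:
                total += 1
                correct += normalize(item["prediction"]) == gold[item["question_id"]]
    return correct / max(total, 1)


if __name__ == "__main__":
    print(f"accuracy: {accuracy('gt.jsonl', 'preds.jsonl'):.3f}")
```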
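For the knowledge (open-ended) evaluation, the moving parts are just `langchain`, `pydantic`, and an LLM judge. The sketch below shows one way to wire them together with a structured verdict; the `Judgement` schema, prompt, and model id are illustrative assumptions, not MedHEval's actual scoring prompt or code.

```python
# Illustrative LLM-judge wiring for the knowledge (open-ended) evaluation.
# Assumptions: langchain-anthropic is installed, ANTHROPIC_API_KEY is set, and the
# Judgement schema / prompt are examples only.
from pydantic import BaseModel, Field
from langchain_anthropic import ChatAnthropic


class Judgement(BaseModel):
    """Structured verdict returned by the LLM judge."""
    is_hallucinated: bool = Field(description="Whether the answer conflicts with medical knowledge")
    explanation: str = Field(description="Brief justification for the verdict")


judge = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0)
structured_judge = judge.with_structured_output(Judgement)

verdict = structured_judge.invoke(
    "Question: What finding typically accompanies pleural effusion on a chest X-ray?\n"
    "Model answer: Hyperlucency of the lung apices.\n"
    "Decide whether the model answer is consistent with established medical knowledge."
)
print(verdict.is_hallucinated, verdict.explanation)
```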
The `benchmark_data/` folder includes the annotation and split files.
Note: Due to license restrictions, the raw image data must be downloaded separately.
If you use MedHEval in your research, please cite:
@article{chang2025medheval,
title={MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models},
author={Chang, Aofei and Huang, Le and Bhatia, Parminder and Kass-Hout, Taha and Ma, Fenglong and Xiao, Cao},
journal={arXiv preprint arXiv:2503.02157},
year={2025}
}