💭🤖 MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models 🏥📊
📄 Paper: MedHEval on arXiv
🧠 What’s inside? A comprehensive suite of evaluation tools, curated datasets, inference code for (Med)-LVLMs, and implementations of hallucination mitigation baselines for several Med-LVLMs.
Medical Large Vision-Language Models (Med-LVLMs) offer great promise in clinical AI by combining image understanding and language generation. However, they frequently generate hallucinations—plausible but ungrounded or incorrect outputs—which can undermine trust and safety in medical applications.
MedHEval addresses this challenge by introducing a comprehensive benchmark to:
- Categorize hallucinations in Med-LVLMs,
- Evaluate model behavior across hallucination types,
- Compare mitigation strategies on multiple Med-LVLMs.
MedHEval provides a modular and extensible codebase. Here's a high-level breakdown of the repository:
MedHEval/
│
├── benchmark_data/                      # Benchmark VQA and report data (excluding raw images)
│
├── code/
│   ├── baselines/
│   │   ├── (Med)-LVLMs/                 # Inference code for baseline LVLMs (e.g., LLaVA, LLM-CXR, etc.)
│   │   └── mitigation/                  # Implementations of hallucination mitigation strategies
│   │
│   ├── data_generation/                 # Scripts to generate benchmark data, split by dataset and hallucination type
│   │
│   └── evaluation/
│       ├── close_ended_evaluation/      # Evaluation pipeline for close-ended tasks (all hallucination types)
│       ├── open_ended_evaluation/       # Knowledge hallucination evaluation (open-ended)
│       └── report_eval/                 # Visual hallucination evaluation from generated reports
│
└── README.md
Each component may include its own README with detailed instructions.
MedHEval classifies hallucinations into three interpretable types:
- **Visual Misinterpretation**: Misunderstanding or inaccurate reading of visual input (e.g., identifying a lesion that doesn’t exist).
- **Knowledge Deficiency**: Errors stemming from gaps or inaccuracies in medical knowledge (e.g., incorrect visual feature-disease associations).
- **Context Misalignment**: Failure to align visual understanding with the medical context (e.g., answering without considering the medical history).
MedHEval consists of the following key components:
- 📊 **Diverse Medical VQA Datasets and Fine-Grained Metrics**: Includes both close-ended (yes/no, multiple choice) and open-ended (free-text, report generation) tasks. The benchmark provides structured metrics for each hallucination category.
- 🧠 **Comprehensive Evaluation on Diverse (Med)-LVLMs**: MedHEval supports a broad range of models, including:
  - Generalist models (e.g., LLaVA, MiniGPT-4)
  - Medical-domain models (e.g., LLaVA-Med, LLM-CXR, CheXagent)
- 🛠️ **Evaluation of Hallucination Mitigation Strategies**: Benchmarked techniques span diverse mitigation methods targeting visual bias or LLM bias (a decoding-time sketch follows this list):
  - OPERA
  - VCD
  - DoLa
  - AVISC
  - M3ID (no official code released; our implementation follows AVISC; see the M3ID paper)
  - DAMRO (no official code released; our implementation is based on the DAMRO paper)
  - PAI
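For intuition, here is a minimal sketch of the contrastive logit adjustment behind visual-contrastive methods such as VCD: next-token logits conditioned on a distorted copy of the image are subtracted from logits conditioned on the real image, down-weighting tokens driven mainly by language priors. The function name, alpha value, and toy tensors are illustrative; the benchmarked implementations live under `code/baselines/mitigation/`.

```python
# Illustrative sketch of VCD-style visual contrastive decoding (not the repo's implementation).
import torch


def contrastive_logits(logits_clean: torch.Tensor,
                       logits_distorted: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Amplify image-grounded evidence: (1 + alpha) * clean - alpha * distorted."""
    return (1 + alpha) * logits_clean - alpha * logits_distorted


# Toy next-token logits over a 4-token vocabulary
clean = torch.tensor([2.0, 0.5, 1.0, 0.1])      # conditioned on the real image
distorted = torch.tensor([1.9, 1.3, 0.4, 0.1])  # conditioned on a noised image
print(contrastive_logits(clean, distorted).argmax().item())
```

The full methods add further machinery (e.g., VCD's adaptive plausibility constraint), which this sketch omits.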
To get started, clone the repository:
git clone https://github.com/yourusername/MedHEval.git
cd MedHEval
MedHEval supports a modular and flexible setup. You can selectively evaluate any Med-LVLM and any hallucination type.
Please refer to the links below for the environment setup corresponding to the model you wish to evaluate. Some baseline models, such as LLM-CXR, involve complex installation steps; in those cases, we recommend setting up the environment and downloading the required model checkpoints according to the official instructions. Once the environment is ready, you can run inference using our implementations under `./code/baselines/(Med)-LVLMs/`.
Each Med-LVLM has specific requirements. Please follow the official or customized setup instructions accordingly, including environment configuration and model checkpoint preparation.
- LLaVA-Med
- LLaVA-Med-1.5: We provide our customized environment setup and implementation, which includes modifications to the `transformers` package for the hallucination mitigation baselines. See `code/baselines/(Med)-LVLMs/llava-med-1.5` for detailed instructions.
- MiniGPT-4
- LLM-CXR
- CheXagent: Loaded directly via Hugging Face `transformers`, since no model or training code is provided in the official repo. No special environment setup is needed beyond installing `torch` and `transformers`; we use the same environment as LLaVA-Med-1.5 (a loading sketch follows this list).
- RadFM
- XrayGPT: Shares the same structure and environment setup as MiniGPT-4.

Each model’s folder under `code/baselines/(Med)-LVLMs/` contains:
- Inference scripts
- Config files
- Notes on modified packages (e.g., `transformers`)
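Since CheXagent is used directly through Hugging Face `transformers` (see the CheXagent bullet above), loading it looks roughly like the sketch below. The checkpoint id and fp16 dtype are assumptions on our part; consult the model card on the Hub for the official usage and generation settings.

```python
# Sketch of loading CheXagent through Hugging Face transformers.
# Assumption: the Hub checkpoint id "StanfordAIMI/CheXagent-8b"; check the model card
# for the exact generation recipe.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "StanfordAIMI/CheXagent-8b"
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).eval()
```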
Note: The evaluation setup is separate from the (Med)-LVLMs and does not require installing the full environment of any specific Med-LVLM. Refer to `code/evaluation` for how to set up the environments and run the evaluations.
- **Close-ended evaluation**: Lightweight, easy to use, and model-agnostic (see the scoring sketch after this list). → `code/evaluation/close_ended_evaluation`
- **Open-ended (Report)**: More involved, as some tools depend on older versions of Python and other packages. → `code/evaluation/report_eval`
- **Open-ended (Knowledge)**: Straightforward setup. Required packages: `langchain`, `pydantic`, and access to an LLM (e.g., Claude 3.5 Sonnet, used in our experiments); a judging sketch also follows this list. → `code/evaluation/open_ended_evaluation`
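To illustrate why the close-ended pipeline is model-agnostic, a minimal scoring loop looks like the sketch below. The JSONL field names (`question_id`, `answer`, `prediction`) are assumptions for illustration only; the actual metrics and file formats are defined in `code/evaluation/close_ended_evaluation`.

```python
# Minimal, model-agnostic sketch of scoring close-ended (yes/no) predictions.
import json


def normalize(text: str) -> str:
    """Map a free-form model response onto a yes/no label."""
    return "yes" if text.strip().lower().startswith("yes") else "no"


def accuracy(gt_path: str, pred_path: str) -> float:
    """Exact-match accuracy of yes/no predictions against the ground-truth split."""
    with open(gt_path) as f:
        gold = {item["question_id"]: normalize(item["answer"])
                for item in map(json.loads, f)}

    correct = total = 0
    with open(pred_path) as f:
        for item in map(json.loads, f):
            if item["question_id"] in gold:
                total += 1
                correct += normalize(item["prediction"]) == gold[item["question_id"]]
    return correct / max(total, 1)


if __name__ == "__main__":
    print(f"accuracy: {accuracy('gt.jsonl', 'preds.jsonl'):.3f}")
```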
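For the knowledge (open-ended) evaluation, the moving parts are just `langchain`, `pydantic`, and an LLM judge. The sketch below shows one way to wire them together with a structured verdict; the `Judgement` schema, prompt, and model id are illustrative assumptions, not MedHEval's actual scoring prompt or code.

```python
# Illustrative LLM-judge wiring for the knowledge (open-ended) evaluation.
# Assumptions: langchain-anthropic is installed, ANTHROPIC_API_KEY is set, and the
# Judgement schema / prompt are examples only.
from pydantic import BaseModel, Field
from langchain_anthropic import ChatAnthropic


class Judgement(BaseModel):
    """Structured verdict returned by the LLM judge."""
    is_hallucinated: bool = Field(description="Whether the answer conflicts with medical knowledge")
    explanation: str = Field(description="Brief justification for the verdict")


judge = ChatAnthropic(model="claude-3-5-sonnet-20241022", temperature=0)
structured_judge = judge.with_structured_output(Judgement)

verdict = structured_judge.invoke(
    "Question: What finding typically accompanies pleural effusion on a chest X-ray?\n"
    "Model answer: Hyperlucency of the lung apices.\n"
    "Decide whether the model answer is consistent with established medical knowledge."
)
print(verdict.is_hallucinated, verdict.explanation)
```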
The `benchmark_data/` folder includes the annotation and split files.
Note: Due to license restrictions, the raw image data must be downloaded separately.
If you use MedHEval in your research, please cite:
@article{chang2025medheval,
title={MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models},
author={Chang, Aofei and Huang, Le and Bhatia, Parminder and Kass-Hout, Taha and Ma, Fenglong and Xiao, Cao},
journal={arXiv preprint arXiv:2503.02157},
year={2025}
}