
💭🤖 MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models 🏥📊

📄 Paper: MedHEval on arXiv (arXiv:2503.02157)
🧠 What’s inside? A comprehensive suite of evaluation tools, curated datasets, and implementations of hallucination mitigation baselines for several (Med)-LVLMs.


Overview

Medical Large Vision-Language Models (Med-LVLMs) offer great promise in clinical AI by combining image understanding and language generation. However, they frequently generate hallucinations—plausible but ungrounded or incorrect outputs—which can undermine trust and safety in medical applications.

MedHEval addresses this challenge by introducing a comprehensive benchmark to:

  • Categorize hallucinations in Med-LVLMs,
  • Evaluate model behavior across hallucination types,
  • Compare mitigation strategies on multiple Med-LVLMs.

Code Structure

MedHEval provides a modular and extensible codebase. Here's a high-level breakdown of the repository:

MedHEval/
│
├── benchmark_data/                  # Benchmark VQA and report data (excluding raw images)
│
├── code/
│   ├── baselines/
│   │   ├── (Med)-LVLMs/             # Inference code for baseline LVLMs (e.g., LLaVA, LLM-CXR, etc.)
│   │   └── mitigation/              # Implementations of hallucination mitigation strategies
│   │
│   ├── data_generation/            # Scripts to generate benchmark data split by dataset and hallucination type
│   │
│   └── evaluation/
│       ├── close_ended_evaluation/ # Evaluation pipeline for close-ended tasks (all hallucination types)
│       ├── open_ended_evaluation/  # Knowledge hallucination (open-ended)
│       └── report_eval/            # Visual hallucination evaluation from generated reports
│
└── README.md

Each component may include its own README with detailed instructions.


Hallucination Categories

MedHEval classifies hallucinations into three interpretable types:

  1. Visual Misinterpretation
    Misunderstanding or inaccurate reading of visual input (e.g., identifying a lesion that doesn’t exist).

  2. Knowledge Deficiency
    Errors stemming from gaps or inaccuracies in medical knowledge (e.g., incorrect visual feature-disease associations).

  3. Context Misalignment
    Failure to align visual understanding with medical contexts (e.g., answering without considering the medical history).


Benchmark Components

MedHEval consists of the following key components:

  • 📊 Diverse Medical VQA Datasets and Fine-Grained Metrics
    Includes both close-ended (yes/no, multiple-choice) and open-ended (free-text, report generation) tasks. The benchmark provides structured metrics for each hallucination category (a scoring sketch follows this list).

  • 🧠 Comprehensive Evaluation on Diverse (Med)-LVLMs
    MedHEval supports a broad range of models, including:

    • Generalist models (e.g., LLaVA, MiniGPT-4)
    • Medical-domain models (e.g., LLaVA-Med, LLM-CXR, CheXagent)
  • 🛠️ Evaluation of Hallucination Mitigation Strategies
    Benchmarked techniques span diverse mitigation methods that target either visual bias or LLM (language) bias.
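
For intuition, below is a minimal Python sketch of how close-ended answers could be scored per hallucination category. The record schema, the yes/no normalization heuristic, and the category names are all illustrative assumptions, not the benchmark's actual format.

from collections import defaultdict

def normalize(answer: str) -> str:
    # Illustrative heuristic: map free-form output onto a yes/no label.
    a = answer.strip().lower()
    return "yes" if a.startswith("yes") else "no" if a.startswith("no") else a

def score_close_ended(records):
    # records: dicts with hypothetical 'category', 'gold', 'prediction' keys.
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += normalize(r["prediction"]) == normalize(r["gold"])
    return {c: correct[c] / total[c] for c in total}

records = [
    {"category": "visual", "gold": "yes", "prediction": "Yes, a lesion is present."},
    {"category": "knowledge", "gold": "no", "prediction": "No."},
    {"category": "context", "gold": "yes", "prediction": "No evidence of that."},
]
print(score_close_ended(records))  # {'visual': 1.0, 'knowledge': 1.0, 'context': 0.0}

Reporting accuracy per category in this way is what lets the benchmark attribute errors to a specific hallucination type rather than a single aggregate score.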


Getting Started

1. Clone the Repository

git clone https://github.com/Aofei-Chang/MedHEval.git
cd MedHEval

2. Install Dependencies

MedHEval supports a modular and flexible setup. You can selectively evaluate any Med-LVLM and any hallucination type.

Baseline Models Setup

Please refer to the links below for the environment setup corresponding to the model you wish to evaluate. Some baseline models, such as LLM-CXR, may involve complex installation steps. In such cases, we recommend setting up the environment according to the official instructions and downloading the required model checkpoints. Once the environment is ready, you can run inference using our implementation provided under ./code/baselines/(Med)-LVLMs/.

Each Med-LVLM has specific requirements. Please follow the official or customized setup instructions accordingly, including environment configuration and model checkpoint preparation.

  • LLaVA-Med
  • LLaVA-Med-1.5 -- We provide our customized environment setup and implementation, which includes modifications to the transformers package for hallucination mitigation baselines. Please refer to the corresponding folder code/baselines/(Med)-LVLMs/llava-med-1.5 for detailed instructions.
  • MiniGPT-4
  • LLM-CXR
  • CheXagent (Note: used via Hugging Face Transformers) -- This model is loaded directly through the transformers package, since its official repo provides no model or training code. No special environment setup is needed beyond installing torch and transformers; we reuse the LLaVA-Med-1.5 environment. A loading sketch follows this list.
  • RadFM
  • XrayGPT (Note: MiniGPT-4 environment) -- This model shares the same structure and environment setup as MiniGPT-4.
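
Because CheXagent is consumed directly through stock transformers, loading follows the standard remote-code pattern. A minimal sketch; the StanfordAIMI/CheXagent-8b checkpoint ID and the USER/ASSISTANT prompt template are assumptions taken from the model's Hub card, not something this repo pins down.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_id = "StanfordAIMI/CheXagent-8b"  # assumed Hub checkpoint ID
device, dtype = "cuda", torch.float16

# trust_remote_code is required: the model code lives on the Hub,
# not inside the transformers package itself.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
generation_config = GenerationConfig.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=dtype, trust_remote_code=True
).to(device).eval()

images = [Image.open("example_cxr.png")]  # placeholder chest X-ray
prompt = "Is there evidence of pleural effusion?"
inputs = processor(
    images=images, text=f" USER: <s>{prompt} ASSISTANT: <s>", return_tensors="pt"
).to(device=device, dtype=dtype)
output = model.generate(**inputs, generation_config=generation_config)[0]
print(processor.tokenizer.decode(output, skip_special_tokens=True))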

Each model’s folder under code/baselines/(Med)-LVLMs/ contains:

  • Inference scripts
  • Config files
  • Notes for modified packages (e.g., transformers)

Evaluation Modules

Note: the evaluation setup is separate from the (Med)-LVLMs and does not require installing the full environment of any specific model. See code/evaluation for environment setup and instructions on running each evaluation. A toy sketch of the report-based scoring follows.
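
To give a feel for what the report-based visual hallucination evaluation measures, here is a toy Python sketch: it extracts finding labels from a generated and a reference report, then computes precision, recall, and the set of hallucinated findings. The keyword extractor and finding vocabulary are stand-ins; the actual pipeline under code/evaluation/report_eval may rely on a learned labeler instead.

FINDINGS = {"cardiomegaly", "edema", "consolidation", "atelectasis", "pleural effusion"}

def extract_findings(report: str) -> set:
    # Toy extractor: flag a finding if its name appears verbatim (no negation handling).
    text = report.lower()
    return {f for f in FINDINGS if f in text}

def report_scores(generated: str, reference: str) -> dict:
    gen, ref = extract_findings(generated), extract_findings(reference)
    tp = len(gen & ref)
    precision = tp / len(gen) if gen else 1.0  # no claims -> no false claims
    recall = tp / len(ref) if ref else 1.0
    # Hallucinated findings: claimed in the generation, absent from the reference.
    return {"precision": precision, "recall": recall, "hallucinated": sorted(gen - ref)}

print(report_scores(
    "Mild cardiomegaly with a small pleural effusion.",
    "Enlarged heart consistent with cardiomegaly. No effusion.",
))  # {'precision': 0.5, 'recall': 1.0, 'hallucinated': ['pleural effusion']}

Here the generated report hallucinates a pleural effusion that the reference rules out, which the set difference surfaces directly.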


Data Access

The benchmark_data/ folder includes the annotation and split files.
Note: due to license restrictions, the raw image data must be downloaded separately from the original dataset sources.


Citation

If you use MedHEval in your research, please cite:

@article{chang2025medheval,
  title={MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models},
  author={Chang, Aofei and Huang, Le and Bhatia, Parminder and Kass-Hout, Taha and Ma, Fenglong and Xiao, Cao},
  journal={arXiv preprint arXiv:2503.02157},
  year={2025}
}
