Guankun Wang†, Long Bai†, Junyi Wang†, Kun Yuan†, Zhen Li, Tianxu Jiang, Xiting He, Jinlin Wu, Zhen Chen, Hongbin Liu, Nicolas Padoy, Nassir Navab, and Hongliang Ren*
Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model's representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that our EndoChat has great potential to significantly advance training and automation in robotic-assisted surgery.
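The exact multi-scale visual token interaction mechanism is defined in the paper and model code; as a loose, generic illustration of the underlying idea (pooling a visual feature map at several grid scales and concatenating the pooled tokens so the language model sees both coarse and fine context), here is a minimal NumPy sketch. All function and parameter names are hypothetical, not EndoChat's actual API:

```python
import numpy as np

def multi_scale_tokens(feat, scales=(1, 2, 4)):
    """Illustrative multi-scale token pooling (not EndoChat's implementation).

    feat   : (H, W, C) visual feature map from an image encoder.
    scales : grid sizes; scale s splits the map into s x s cells.
    Returns a (sum of s^2 over scales, C) array of pooled tokens.
    """
    H, W, C = feat.shape
    tokens = []
    for s in scales:
        hs, ws = H // s, W // s  # cell size at this scale
        for i in range(s):
            for j in range(s):
                # Average-pool one grid cell into a single token vector.
                cell = feat[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws, :]
                tokens.append(cell.mean(axis=(0, 1)))
    return np.stack(tokens)
```

With the default scales (1, 2, 4) this yields 1 + 4 + 16 = 21 tokens per image, mixing one global token with progressively finer local ones.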
git clone https://github.com/gkw0010/EndoChat
cd EndoChat/
- Basic Setup
conda create -n endochat python=3.10 -y
conda activate endochat
pip install -r requirements.txt
- Install Flash-Attention (Optional)
If you want to use flash-attention to improve computational efficiency, use the following command:
pip install flash-attn==2.5.6 --no-build-isolation
- Install LLaMA2-Accessory as a Python Package
pip install -e .
The Surg-396K dataset can be downloaded through this link.
To fine-tune the Sphinx-Tiny-1k model on the Surg-396K dataset with image size 1024, use the following commands:
cd accessory/
bash exps/finetune/finetune_ens5_13b.sh
To run inference with the fine-tuned models, use the following commands:
cd accessory/
python inference.py
If you find EndoChat useful for your research or development, please cite the following:
@article{wang2025endochat,
title={EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery},
author={Wang, Guankun and Bai, Long and Wang, Junyi and Yuan, Kun and Li, Zhen and Jiang, Tianxu and He, Xiting and Wu, Jinlin and Chen, Zhen and Lei, Zhen and others},
journal={arXiv preprint arXiv:2501.11347},
year={2025}
}