Guankun Wang†, Long Bai†, Junyi Wang†, Kun Yuan†, Zhen Li, Tianxu Jiang, Xiting He, Jinlin Wu, Zhen Chen, Hongbin Liu, Nicolas Padoy, Nassir Navab, and Hongliang Ren*
Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model's representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that our EndoChat has great potential to significantly advance training and automation in robotic-assisted surgery.
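The exact multi-scale visual token interaction mechanism is defined in the paper and model code; as a loose, generic illustration of the underlying idea (pooling a visual feature map at several grid scales and concatenating the pooled tokens so the language model sees both coarse and fine context), here is a minimal NumPy sketch. All function and parameter names are hypothetical, not EndoChat's actual API:

```python
import numpy as np

def multi_scale_tokens(feat, scales=(1, 2, 4)):
    """Illustrative multi-scale token pooling (not EndoChat's implementation).

    feat   : (H, W, C) visual feature map from an image encoder.
    scales : grid sizes; scale s splits the map into s x s cells.
    Returns a (sum of s^2 over scales, C) array of pooled tokens.
    """
    H, W, C = feat.shape
    tokens = []
    for s in scales:
        hs, ws = H // s, W // s  # cell size at this scale
        for i in range(s):
            for j in range(s):
                # Average-pool one grid cell into a single token vector.
                cell = feat[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws, :]
                tokens.append(cell.mean(axis=(0, 1)))
    return np.stack(tokens)
```

With the default scales (1, 2, 4) this yields 1 + 4 + 16 = 21 tokens per image, mixing one global token with progressively finer local ones.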
git clone https://github.com/gkw0010/EndoChat
cd EndoChat/
- Basic Setup
conda create -n endochat python=3.10 -y
conda activate endochat
pip install -r requirements.txt
- Install Flash-Attention (Optional)
If you want to use flash-attention to improve computational efficiency, use the following command:
pip install flash-attn==2.5.6 --no-build-isolation
- Install LLaMA2-Accessory as a Python Package
pip install -e .
The Surg-396K dataset can be downloaded through this link.
To fine-tune the Sphinx-Tiny-1k model on the Surg-396K dataset with image size 1024, use the following commands:
cd accessory/
bash exps/finetune/finetune_ens5_13b.sh
To run inference with the fine-tuned models, use the following commands:
cd accessory/
python inference.py
If you find EndoChat useful for your research or development, please cite the following:
@article{wang2025endochat,
title={EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery},
author={Wang, Guankun and Bai, Long and Wang, Junyi and Yuan, Kun and Li, Zhen and Jiang, Tianxu and He, Xiting and Wu, Jinlin and Chen, Zhen and Lei, Zhen and others},
journal={arXiv preprint arXiv:2501.11347},
year={2025}
}