EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery

Guankun Wang†, Long Bai†, Junyi Wang†, Kun Yuan†, Zhen Li, Tianxu Jiang, Xiting He, Jinlin Wu, Zhen Chen, Hongbin Liu, Nicolas Padoy, Nassir Navab, and Hongliang Ren*

Overview

Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model's representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that our EndoChat has great potential to significantly advance training and automation in robotic-assisted surgery.

Environment Setup (Linux)

Clone this repository and navigate to the EndoChat folder

git clone https://github.com/gkw0010/EndoChat
cd EndoChat/

Install the required packages

1. Basic setup

conda create -n endochat python=3.10 -y
conda activate endochat
pip install -r requirements.txt

2. Install Flash-Attention (Optional)

If you want to use flash-attention to improve computational efficiency, use the following command:

pip install flash-attn==2.5.6 --no-build-isolation

3. Install LLaMA2-Accessory as a Python package

pip install -e .
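After completing these steps, the following minimal sanity-check sketch (not part of the repository) confirms that PyTorch sees a GPU and that the optional flash-attn and the editable-installed package import cleanly; the accessory package name is assumed from the repository's accessory/ directory layout.

# sanity_check.py -- hypothetical helper, not part of the repository.
import importlib.util

import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

# Step 2 (optional): flash-attn.
if importlib.util.find_spec("flash_attn") is not None:
    import flash_attn
    print(f"flash-attn {flash_attn.__version__} installed")
else:
    print("flash-attn not installed (optional)")

# Step 3: editable install; 'accessory' is assumed from the accessory/ directory.
if importlib.util.find_spec("accessory") is not None:
    print("accessory package importable")
else:
    print("accessory package not found; re-run 'pip install -e .' from the repo root")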

Data Download

The Surg-396K dataset can be downloaded through this link.
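Once downloaded, the quick sketch below (assuming the archive is extracted to a local Surg-396K/ directory; the actual folder layout may differ) counts the image and annotation files as a basic integrity check.

from pathlib import Path

# Hypothetical extraction path -- adjust to wherever the dataset was unpacked.
root = Path("Surg-396K")

image_files = [p for p in root.rglob("*") if p.suffix.lower() in {".jpg", ".jpeg", ".png"}]
annotation_files = list(root.rglob("*.json"))

print(f"{len(image_files)} images and {len(annotation_files)} annotation files under {root}")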

Fine-tuning on the Surg-396K Dataset

To fine-tune the Sphinx-Tiny-1k model on the Surg-396K dataset with an image size of 1024, run the following commands:

cd accessory/
bash exps/finetune/finetune_ens5_13b.sh
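Fine-tuning at image size 1024 is GPU-memory-intensive. The small optional sketch below (not part of the repository) lists the visible GPUs and their memory before launching the script.

import torch

# List visible CUDA devices before launching the fine-tuning script.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible; fine-tuning requires GPUs.")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")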

Inference

To run inference with the fine-tuned models, use the following commands:

cd accessory/
python inference.py
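If you would rather launch inference from the repository root (for example, from a notebook or job script), a minimal wrapper sketch using subprocess is shown below; any arguments that inference.py accepts are defined in the repository and would be appended to the command list.

import subprocess
import sys

# Equivalent to `cd accessory/ && python inference.py`, run from the repo root.
result = subprocess.run(
    [sys.executable, "inference.py"],
    cwd="accessory",
    check=True,
)
print(f"inference.py exited with code {result.returncode}")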

Citation

If you find EndoChat useful for your research or development, please cite the following:

@article{wang2025endochat,
  title={EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery},
  author={Wang, Guankun and Bai, Long and Wang, Junyi and Yuan, Kun and Li, Zhen and Jiang, Tianxu and He, Xiting and Wu, Jinlin and Chen, Zhen and Lei, Zhen and others},
  journal={arXiv preprint arXiv:2501.11347},
  year={2025}
}
