CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection
Guankun Wang∗, Han Xiao∗, Huxin Gao, Renrui Zhang, Long Bai, Xiaoxiao Yang, Zhen Li, Hongsheng Li†, Hongliang Ren†
With the advances in surgical robotics, robot-assisted endoscopic submucosal dissection (ESD) enables rapid resection of large lesions, minimizing recurrence rates and improving long-term overall survival. Despite these advantages, ESD is technically challenging and carries high risks of complications, necessitating skilled surgeons and precise instruments. Recent advancements in Large Vision-Language Models (LVLMs) offer promising decision support and predictive planning capabilities for robotic systems, which can augment the accuracy of ESD and reduce procedural risks. However, existing datasets for multi-level fine-grained ESD surgical motion understanding are scarce and lack detailed annotations. In this paper, we design a hierarchical decomposition of ESD motion granularity and introduce a multi-level surgical motion dataset (CoPESD) for training LVLMs as the robotic Co-Pilot of Endoscopic Submucosal Dissection. CoPESD includes 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions, drawn from over 35 hours of ESD videos of both robot-assisted and conventional surgeries. CoPESD enables granular analysis of ESD motions, focusing on the complex task of submucosal dissection. Extensive experiments on LVLMs demonstrate the effectiveness of CoPESD in training LVLMs to predict the next surgical robotic motions. As the first multimodal ESD motion dataset, CoPESD supports advanced research in ESD instruction-following and surgical automation.
- CoPESD is built based on a granular decomposition of surgical motions, providing precise motion definitions for ESD.
- CoPESD is a fine-grained multi-level surgical motion dataset including 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions (an illustrative annotation sketch follows this list).
- We provide the link to download CoPESD.
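To give a concrete sense of how these pieces fit together, the sketch below writes out one record. Everything in it is hypothetical: the file name, field names, box coordinates, and motion phrasing are illustrative assumptions, not the released CoPESD schema; please refer to the downloaded data for the exact format.

```bash
# Purely hypothetical record (illustrative field names and values, not the official CoPESD schema):
# each frame is paired with instrument bounding boxes and multi-level motion descriptions.
cat <<'EOF' > copesd_record_example.json
{
  "image": "case_01/frame_000123.jpg",
  "bboxes": [
    {"label": "dissection_instrument", "box_xyxy": [412, 188, 560, 305]}
  ],
  "motions": [
    "move the instrument tip toward the upper-right submucosal layer",
    "dissect the exposed submucosa along the lesion margin"
  ]
}
EOF
```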
We offer download links for all the data involved in CoPESD, in two categories. The first category is freely downloadable: this data has been used in our previous work and can be downloaded and used by anyone. The second category is available upon request: a request must be submitted to us before downloading. For data security review purposes, we need to identify the applicant's intended use of the data and will decide whether to provide a download link based on that use.
The freely downloadable data is available through this link.
The data will be publicly accessible upon acceptance.
If you wish to access the full CoPESD dataset, please fill out the request form.
- Follow the instructions provided in the LLaMA2-Accessory repository to set up the environment.
- Download the pretrained SPHINX-Tiny-1k models from huggingface and place them in the `sphinx_esd/accessory/data/SPHINX-Tiny` directory.
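If you use the huggingface_hub CLI, a download along the following lines should work. This is a sketch under the assumption that the checkpoint is hosted as `Alpha-VLLM/SPHINX-Tiny-1k`; substitute the repository ID referenced by the LLaMA2-Accessory instructions if it differs.

```bash
# Sketch: fetch SPHINX-Tiny-1k weights into the directory expected by the fine-tuning scripts.
# The repo ID below is an assumption; adjust it to match the LLaMA2-Accessory documentation.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Alpha-VLLM/SPHINX-Tiny-1k \
  --local-dir sphinx_esd/accessory/data/SPHINX-Tiny
```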
To fine-tune Sphinx-ESD-13B with different image sizes, use the following commands:
cd sphinx_esd/accessory
bash exps/finetune_ens1_13b.sh
cd sphinx_esd/accessory
bash exps/finetune_ens5_13b.sh
To run inference and evaluate using the fine-tuned models, use the following commands:
cd sphinx_esd/accessory
bash exps/generate_action.sh
Follow the instructions provided in the LLaVA repository to set up the environment and download the pretrained LLaVA-1.5 models.
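One possible way to fetch the LLaVA-1.5 weights is via the huggingface_hub CLI. The repository IDs below are the public liuhaotian checkpoints, and the target directories are assumptions; make sure they match the paths configured in the fine-tuning scripts.

```bash
# Sketch: download the public LLaVA-1.5 base checkpoints (target directories are illustrative).
huggingface-cli download liuhaotian/llava-v1.5-7b  --local-dir checkpoints/llava-v1.5-7b
huggingface-cli download liuhaotian/llava-v1.5-13b --local-dir checkpoints/llava-v1.5-13b
```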
To fine-tune LLaVA-ESD-7B and LLaVA-ESD-13B models, use the following commands:
cd llava_esd
bash scripts/v1_5/finetune_copesd_7b.sh
cd llava_esd
bash scripts/v1_5/finetune_copesd_13b.sh
To run inference and evaluate using the fine-tuned models, use the following commands:
cd llava_esd
bash scripts/v1_5/eval/eval_copesd_7b.sh
cd llava_esd
bash scripts/v1_5/eval/eval_copesd_13b.sh
We have released the fine-tuned model checkpoints on huggingface. You can download them and perform evaluations directly.
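For example, with the huggingface_hub CLI the checkpoints can be pulled locally before running the evaluation scripts above. The repository ID below is a placeholder; use the one listed on our huggingface page.

```bash
# Sketch: download a released fine-tuned checkpoint for direct evaluation.
# <ORG>/<MODEL> is a placeholder; replace it with the repo ID of the checkpoint you want
# (e.g., a LLaVA-ESD or Sphinx-ESD model listed on our huggingface page).
huggingface-cli download <ORG>/<MODEL> --local-dir checkpoints/copesd-finetuned
```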
If you have any questions, feel free to reach out to [email protected]. Please describe the problem in detail so we can help you better and faster!
If you find CoPESD useful for your research or development, please cite the following:
@article{wang2024copesd,
title={CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection},
author={Wang, Guankun and Xiao, Han and Gao, Huxin and Zhang, Renrui and Bai, Long and Yang, Xiaoxiao and Li, Zhen and Li, Hongsheng and Ren, Hongliang},
journal={arXiv preprint arXiv:2410.07540},
year={2024}
}
The new contributions of our dataset (e.g., the instructions, reference outputs, and model ranking annotations) are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).