We present Robin3D, a state-of-the-art 3D Large Language Model trained on large-scale instruction-following data generated by our novel Robust Instruction Generation (RIG) data engine. To handle the complex data produced by RIG, Robin3D further enhances spatial understanding with a Relation-Augmented Projector and strengthens its object referring and grounding abilities through ID-Feature Bonding.
[2024.09] We release Robin3D [paper][code], a new SOTA 3D LLM for 3D scenes.
Prepare the environment:
```shell
conda create -n robin3d python=3.9.17
conda activate robin3d
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
Download LLM backbone:
- We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.
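For a scripted download, the Hugging Face CLI can fetch the Vicuna-7B v1.5 weights. This is a minimal sketch: the target directory `checkpoints/vicuna-7b-v1.5` is an illustrative choice, not a path required by this repo.

```shell
# Install the Hugging Face CLI, then download the Vicuna-7B v1.5 weights.
# NOTE: the --local-dir path is a hypothetical example; point it wherever
# your configuration expects the LLM backbone to live.
pip install -U "huggingface_hub[cli]"
huggingface-cli download lmsys/vicuna-7b-v1.5 --local-dir checkpoints/vicuna-7b-v1.5
```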
Annotations and extracted features:
Please follow the instructions in Chat-Scene's Preparation.
- Coming soon.
Our paper has disappeared from Google Scholar for reasons unknown to us. We have emailed the Google Scholar team but have not yet received a response.
If you find our work useful in your research, please consider citing:
@misc{kang2025robin3dimproving3dlarge,
title={Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning},
author={Weitai Kang and Haifeng Huang and Yuzhang Shang and Mubarak Shah and Yan Yan},
year={2025},
eprint={2410.00255},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.00255},
}
Stay tuned for our project. 🔥
If you have any questions or suggestions, feel free to drop us an email ([email protected]) or open an issue.
Thanks to the open source of the following projects:
3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer, Grounded-3DLLM, Chat-Scene
Detectors: Mask3D
Representations: Uni3D, DINOv2
3D Models: OpenScene