(SuiT) Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens
Jaihyun Lew* · Soohyuk Jang* · Jaehoon Lee* · Seungryong Yoo · Eunji Kim · Saehyung Lee · Jisoo Mok · Siwon Kim · Sungroh Yoon

🔥 In this work, we propose a novel tokenization pipeline that replaces grid-based tokenization with superpixels, encouraging each token to capture a distinct visual concept. Unlike square image patches, superpixels are formed in varying shapes, sizes, and locations, making direct substitution challenging. To address this, our pipeline first generates pixel-level embeddings and efficiently aggregates them within superpixel clusters, producing superpixel tokens that seamlessly replace patch tokens in ViT.
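Concretely, once every pixel has an embedding and a superpixel assignment, building tokens reduces to a scatter-style pooling over cluster ids. The sketch below uses `torch_scatter` (installed in the setup that follows); the function and variable names are ours, not the repository's API, and mean pooling is just one plausible aggregation:

```python
import torch
from torch_scatter import scatter_mean

def superpixel_tokens(pixel_embeds: torch.Tensor,
                      superpixel_ids: torch.Tensor,
                      num_superpixels: int) -> torch.Tensor:
    """Pool pixel-level embeddings into one token per superpixel.

    pixel_embeds:   (N, D) embeddings for the N = H * W pixels
    superpixel_ids: (N,)   superpixel index of each pixel, in [0, num_superpixels)
    returns:        (num_superpixels, D) superpixel tokens
    """
    # scatter_mean averages all rows of pixel_embeds that share a superpixel id.
    return scatter_mean(pixel_embeds, superpixel_ids, dim=0,
                        dim_size=num_superpixels)

# Toy example: a 4x4 image, 8-dim pixel embeddings, 3 superpixels.
pixel_embeds = torch.randn(16, 8)
superpixel_ids = torch.randint(0, 3, (16,))
tokens = superpixel_tokens(pixel_embeds, superpixel_ids, num_superpixels=3)
print(tokens.shape)  # torch.Size([3, 8])
```

The resulting tokens then stand in for the patch tokens fed to the ViT; unlike a fixed grid, the number of tokens can vary from image to image.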
To set up the environment, run the following commands:

```bash
git clone https://github.com/jangsoohyuk/SuiT.git
cd SuiT
conda create -n suit python=3.10 -y
conda activate suit
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install torch_scatter-2.1.2+pt21cu121-cp310-cp310-linux_x86_64.whl
pip install -r requirements.txt
```

Note that the `torch_scatter` wheel (built for PyTorch 2.1 + CUDA 12.1, Python 3.10) must be downloaded beforehand so that the file is present in the working directory.
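After installation, a quick sanity check (ours, not part of the repository) confirms that the pinned PyTorch build sees the GPU and that `torch_scatter` imports cleanly:

```python
import torch
import torch_scatter

print(torch.__version__)           # expected: 2.1.0
print(torch.cuda.is_available())   # expected: True on a CUDA 12.1 machine
print(torch_scatter.__version__)   # expected: 2.1.2
```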
The dataset directory should be structured as follows:

```
datasets/
└── imagenet-1k/
```
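The tree above only shows the top level. Since the repository builds on DeiT, ImageNet-1k is presumably expected in the standard torchvision `ImageFolder` layout; the expansion below is our assumption, not something the repository specifies:

```
datasets/
└── imagenet-1k/
    ├── train/
    │   ├── n01440764/
    │   └── ...
    └── val/
        ├── n01440764/
        └── ...
```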
- Checkpoint files are saved under `./outputs`.
- Logs are saved under `./logs`.
To train our model, run the bash script corresponding to the desired model size. For example, to train SuiT-Base on ImageNet-1k, run the following command:

```bash
bash scripts/train_base.sh
```
To evaluate a pre-trained model, run the following command:

```bash
bash scripts/eval.sh
```
Pretrained models can be downloaded here.
You can visualize the generated superpixels and self-attention maps using the Jupyter notebook `attention_visualization.ipynb`.
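To give a rough idea of the superpixel-level visualization, per-token attention scores can be painted back onto the pixel grid by indexing with the superpixel map. This is our illustrative sketch, not the notebook's actual code:

```python
import torch

# Hypothetical inputs: one CLS-attention score per superpixel token, and a
# map assigning each pixel of a 4x4 image to one of 3 superpixels.
attn = torch.rand(3)                            # (S,) score per superpixel
superpixel_ids = torch.randint(0, 3, (4 * 4,))  # (N,) superpixel id per pixel
attn_map = attn[superpixel_ids].reshape(4, 4)   # (H, W) heat map to display
```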
This repository is based on the original DeiT repository.
We sincerely thank the authors for their great work.