(SuiT) Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens
Jaihyun Lew* · Soohyuk Jang* · Jaehoon Lee* · Seungryong Yoo · Eunji Kim · Saehyung Lee · Jisoo Mok · Siwon Kim · Sungroh Yoon

🔥 In this work, we propose a novel tokenization pipeline that replaces grid-based tokenization with superpixels, encouraging each token to capture a distinct visual concept. Unlike square image patches, superpixels are formed in varying shapes, sizes, and locations, making direct substitution challenging. To address this, our pipeline first generates pixel-level embeddings and efficiently aggregates them within superpixel clusters, producing superpixel tokens that seamlessly replace patch tokens in ViT.
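Concretely, once every pixel has an embedding and a superpixel assignment, building tokens reduces to a scatter-style pooling over cluster ids. The sketch below uses `torch_scatter` (installed in the setup that follows); the function and variable names are ours, not the repository's API, and mean pooling is just one plausible aggregation:

```python
import torch
from torch_scatter import scatter_mean

def superpixel_tokens(pixel_embeds: torch.Tensor,
                      superpixel_ids: torch.Tensor,
                      num_superpixels: int) -> torch.Tensor:
    """Pool pixel-level embeddings into one token per superpixel.

    pixel_embeds:   (N, D) embeddings for the N = H * W pixels
    superpixel_ids: (N,)   superpixel index of each pixel, in [0, num_superpixels)
    returns:        (num_superpixels, D) superpixel tokens
    """
    # scatter_mean averages all rows of pixel_embeds that share a superpixel id.
    return scatter_mean(pixel_embeds, superpixel_ids, dim=0,
                        dim_size=num_superpixels)

# Toy example: a 4x4 image, 8-dim pixel embeddings, 3 superpixels.
pixel_embeds = torch.randn(16, 8)
superpixel_ids = torch.randint(0, 3, (16,))
tokens = superpixel_tokens(pixel_embeds, superpixel_ids, num_superpixels=3)
print(tokens.shape)  # torch.Size([3, 8])
```

The resulting tokens then stand in for the patch tokens fed to the ViT; unlike a fixed grid, the number of tokens can vary from image to image.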
To set up the environment, run the following commands:

```bash
git clone https://github.com/jangsoohyuk/SuiT.git
cd SuiT
conda create -n suit python=3.10 -y
conda activate suit
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install torch_scatter-2.1.2+pt21cu121-cp310-cp310-linux_x86_64.whl
pip install -r requirements.txt
```

Note that the `torch_scatter` wheel (built for PyTorch 2.1 + CUDA 12.1, Python 3.10) must be downloaded beforehand so that the file is present in the working directory.
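After installation, a quick sanity check (ours, not part of the repository) confirms that the pinned PyTorch build sees the GPU and that `torch_scatter` imports cleanly:

```python
import torch
import torch_scatter

print(torch.__version__)           # expected: 2.1.0
print(torch.cuda.is_available())   # expected: True on a CUDA 12.1 machine
print(torch_scatter.__version__)   # expected: 2.1.2
```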
The dataset directory should be structured as follows:

```
datasets/
└── imagenet-1k/
```
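The tree above only shows the top level. Since the repository builds on DeiT, ImageNet-1k is presumably expected in the standard torchvision `ImageFolder` layout; the expansion below is our assumption, not something the repository specifies:

```
datasets/
└── imagenet-1k/
    ├── train/
    │   ├── n01440764/
    │   └── ...
    └── val/
        ├── n01440764/
        └── ...
```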
- Checkpoint files are saved under `./outputs`.
- Logs are saved under `./logs`.
To train our model, run the bash script corresponding to the desired model size. For example, to train SuiT-Base on ImageNet-1k, run the following command:

```bash
bash scripts/train_base.sh
```
To evaluate a pre-trained model, run the following command:

```bash
bash scripts/eval.sh
```
Pretrained models can be downloaded here.
You can visualize the generated superpixels and self-attention maps using the Jupyter notebook `attention_visualization.ipynb`.
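To give a rough idea of the superpixel-level visualization, per-token attention scores can be painted back onto the pixel grid by indexing with the superpixel map. This is our illustrative sketch, not the notebook's actual code:

```python
import torch

# Hypothetical inputs: one CLS-attention score per superpixel token, and a
# map assigning each pixel of a 4x4 image to one of 3 superpixels.
attn = torch.rand(3)                            # (S,) score per superpixel
superpixel_ids = torch.randint(0, 3, (4 * 4,))  # (N,) superpixel id per pixel
attn_map = attn[superpixel_ids].reshape(4, 4)   # (H, W) heat map to display
```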
This repository is based on the original DeiT repository.
We sincerely thank the authors for their great work.