Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens


Jaihyun Lew* · Soohyuk Jang* · Jaehoon Lee* · Seungryong Yoo · Eunji Kim
Saehyung Lee · Jisoo Mok · Siwon Kim · Sungroh Yoon

(Figure: attention visualization)

🔥 In this work, we propose a novel tokenization pipeline that replaces grid-based tokenization with superpixels, encouraging each token to capture a distinct visual concept. Unlike square image patches, superpixels vary in shape, size, and location, making direct substitution challenging. To address this, our pipeline first generates pixel-level embeddings and efficiently aggregates them within superpixel clusters, producing superpixel tokens that seamlessly replace patch tokens in ViT.
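At its core, the aggregation step described above is a scatter-mean: every pixel embedding is averaged into the token of the superpixel it belongs to. The repository implements this efficiently with torch_scatter; the following is only a minimal pure-Python sketch to illustrate the idea (the function name and toy data are hypothetical, not from the codebase):

```python
from collections import defaultdict

def aggregate_superpixel_tokens(pixel_embeddings, superpixel_ids):
    """Average pixel-level embeddings within each superpixel cluster.

    pixel_embeddings: list of feature vectors, one per pixel
    superpixel_ids:   superpixel assignment for each pixel
    Returns a dict mapping superpixel id -> mean embedding (the token).
    """
    sums = {}
    counts = defaultdict(int)
    for emb, sp in zip(pixel_embeddings, superpixel_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(emb)
        for d, v in enumerate(emb):
            sums[sp][d] += v
        counts[sp] += 1
    # Divide each accumulated sum by the cluster size to get the mean token.
    return {sp: [v / counts[sp] for v in s] for sp, s in sums.items()}

# Toy example: 4 pixels with 2-D embeddings, assigned to 2 superpixels.
embs = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
ids = [0, 0, 1, 1]
tokens = aggregate_superpixel_tokens(embs, ids)
print(tokens)  # {0: [2.0, 0.0], 1: [0.0, 3.0]}
```

Because the averaged tokens have a fixed embedding dimension regardless of each superpixel's shape or pixel count, they can stand in for the usual patch tokens in a ViT.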

Environment Setup

To set up the environment, run the following commands:

git clone https://github.com/jangsoohyuk/SuiT.git
cd SuiT
conda create -n suit python=3.10 -y
conda activate suit
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# install the torch_scatter wheel matching torch 2.1 + CUDA 12.1 (e.g. downloaded from the PyG wheel index)
pip install torch_scatter-2.1.2+pt21cu121-cp310-cp310-linux_x86_64.whl
pip install -r requirements.txt

Structure

The dataset directory should be structured as follows:

    datasets/
    └── imagenet-1k/

  • checkpoint files are saved under ./outputs
  • logs are saved under ./logs

Training

To train our model, run the corresponding bash script based on the model size. For example, to train SuiT-Base on ImageNet-1k, run the following command:

bash scripts/train_base.sh

Evaluation

To evaluate a pre-trained model, run the following command:

bash scripts/eval.sh

Pretrained weights

Pretrained models can be downloaded here.

Attention Map Visualization

You can visualize the generated superpixels and self-attention maps using the Jupyter notebook attention_visualization.ipynb.

Acknowledgment

This repository is based on the original DeiT repository.
We sincerely thank the authors for their great work.
