Implementation of the robust transformer paper, currently under submission to ICLR 2023.
This toolkit requires PyTorch (`torch`) and Ninja (`ninja`, needed to compile the CUDA kernels). The experiments in the paper were conducted with Python 3.6 and PyTorch >= 1.4.0. The toolkit supports Weights & Biases for monitoring jobs; if you use it, also install `wandb`.
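A minimal way to install the dependencies, assuming a pip-based environment (the version pin below reflects only the PyTorch >= 1.4.0 requirement stated above):

```bash
pip install "torch>=1.4.0" ninja wandb
```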
Run `sh getdata.sh` to download the WikiText-103 data.
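Assuming `getdata.sh` follows the standard Transformer-XL data script (an assumption; inspect the script if your layout differs), it leaves plain-text splits in the directory that the `--data` flag below points at:

```
{your data directory}/wikitext-103/
├── train.txt
├── valid.txt
└── test.txt
```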
### Standard Softmax Transformer
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./src/train.py --cuda --data {your data directory}/wikitext-103/ --dataset wt103 --adaptive --n_layer 16 --d_model 128 --n_head 8 --d_head 16 --d_inner 2048 --dropout 0.1 --dropatt 0.0 --optim adam --lr 0.00025 --warmup_step 2000 --max_step 500000 --attn_type 2 --tgt_len 256 --mem_len 0 --eval_tgt_len 256 --batch_size 96 --multi_gpu --use_wandb --project_name 'mgk' --seed 1111 --job_name softmax --work_dir {your directory}
```
### KDE Transformer
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./src/train.py --cuda --data {your data directory}/wikitext-103/ --dataset wt103 --adaptive --n_layer 16 --d_model 128 --n_head 8 --d_head 16 --d_inner 2048 --dropout 0.1 --dropatt 0.0 --optim adam --lr 0.00025 --warmup_step 2000 --max_step 500000 --attn_type 8 --tgt_len 256 --mem_len 0 --eval_tgt_len 256 --batch_size 96 --multi_gpu --use_wandb --project_name 'mgk' --seed 1111 --job_name kde --work_dir {your directory}
```
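For intuition: softmax attention can be read as a Nadaraya-Watson estimator built on a Gaussian kernel density estimate over the keys, which is the view the KDE variant starts from. The sketch below illustrates that reading only; `kde_attention` is a hypothetical name, not the repository's actual `--attn_type 8` implementation:

```python
import torch

def kde_attention(q, k, v, scale):
    """Sketch: attention as a Nadaraya-Watson / KDE estimator.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d) values.
    With Gaussian kernel weights exp(-scale * ||q - k||^2), the
    normalized weights coincide with standard softmax attention
    whenever all keys have equal norm.
    """
    d2 = torch.cdist(q, k) ** 2              # pairwise squared distances, (n_q, n_k)
    w = torch.exp(-d2 * scale)               # unnormalized Gaussian kernel weights
    w = w / w.sum(dim=-1, keepdim=True)      # normalize, as softmax does
    return w @ v                             # kernel-weighted average of values

q, k, v = torch.randn(4, 16), torch.randn(8, 16), torch.randn(8, 16)
out = kde_attention(q, k, v, scale=0.5 / 16 ** 0.5)  # (4, 16)
```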
### Robust Transformer
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./src/train.py --cuda --data {your data directory}/wikitext-103/ --dataset wt103 --adaptive --n_layer 16 --d_model 128 --n_head 8 --d_head 16 --d_inner 2048 --dropout 0.1 --dropatt 0.0 --optim adam --lr 0.00025 --warmup_step 2000 --max_step 500000 --attn_type 9 --tgt_len 256 --mem_len 0 --eval_tgt_len 256 --batch_size 96 --multi_gpu --use_wandb --project_name 'mgk' --seed 1111 --huber_a 0.1 --job_name robust --work_dir {your directory}/files_0.1
```
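Relative to the KDE run, the only changes are `--attn_type 9` and the extra `--huber_a 0.1` flag, which presumably sets the threshold `a` of a Huber-style function used to robustify the density estimate. For reference, this is the textbook Huber function, not necessarily the exact form used in the code:

```python
import torch

def huber(x, a=0.1):
    """Textbook Huber function: quadratic for |x| <= a, linear beyond.

    Down-weighting large residuals linearly rather than quadratically
    is the standard way to make an estimator robust to outliers.
    """
    quad = 0.5 * x ** 2
    lin = a * (x.abs() - 0.5 * a)
    return torch.where(x.abs() <= a, quad, lin)
```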
Train/valid perplexity values displayed during training are not exact in general: the part of the text that does not fit evenly into the batch-size/backpropagation-span grid is discarded. In addition, for models with a limited context size, training-time perplexity is computed by splitting the text into segments that are treated independently, so the model has no context at the beginning of each segment. Evaluation avoids this by using a sliding window. The commands to evaluate a trained model on the test and validation sets are given below.
```bash
CUDA_VISIBLE_DEVICES=0 python ./src/eval_sliding_window.py --cuda --data {your data directory}/wikitext-103/ --dataset wt103 --split 'test' --batch_size 1 --tgt_len 256 --mem_len 0 --clamp_len 256 --work_dir {model_dir}
CUDA_VISIBLE_DEVICES=0 python ./src/eval_sliding_window.py --cuda --data {your data directory}/wikitext-103/ --dataset wt103 --split 'valid' --batch_size 1 --tgt_len 256 --mem_len 0 --clamp_len 256 --work_dir {model_dir}
```
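To make the segment-wise versus sliding-window distinction concrete, here is a minimal, self-contained sketch; `nll_fn` stands in for a call to the trained model and is purely hypothetical:

```python
def segmentwise_nll(nll_fn, tokens, tgt_len):
    """Independent segments: the first tokens of every segment are
    predicted with little or no context, which inflates perplexity."""
    total = 0.0
    for i in range(0, len(tokens) - 1, tgt_len):
        seg = tokens[i : i + tgt_len + 1]
        for j in range(1, len(seg)):
            total += nll_fn(seg[:j], seg[j])  # context truncated at segment start
    return total

def sliding_window_nll(nll_fn, tokens, tgt_len):
    """Sliding window: every prediction sees up to tgt_len tokens of
    context, at the cost of one model call per position."""
    total = 0.0
    for j in range(1, len(tokens)):
        ctx = tokens[max(0, j - tgt_len) : j]
        total += nll_fn(ctx, tokens[j])
    return total

# Toy stand-in for a language model: more context -> lower NLL.
toy_nll = lambda ctx, tok: 1.0 / (1 + len(ctx))
tokens = list(range(1000))
print(segmentwise_nll(toy_nll, tokens, tgt_len=256))     # higher: context resets each segment
print(sliding_window_nll(toy_nll, tokens, tgt_len=256))  # lower: full context everywhere
```

Because each position gets its own forward pass, sliding-window evaluation is far slower than the segment-wise estimate printed during training.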