Took 100 images per class from the CIFAR-10 dataset.
ResNet-18 architecture
Overfit the model
-> Training accuracy: 99%~100%
-> Testing accuracy: 46%
Showed the magnitude of kernel weights at different layers
Visualized the kernels at different layers.
Additional: Visualized the feature maps of a horse image at different layers
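The subsampling step can be sketched as follows, assuming it means a fixed number of images per class. `balanced_subset`, `per_class`, and `seed` are hypothetical names used only for illustration, not taken from the notebooks.

```python
import random
from collections import defaultdict


def balanced_subset(labels, per_class=100, seed=0):
    """Return sorted dataset indices with at most `per_class` items per label."""
    # Group dataset indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Draw the same number of indices from every class (seeded for repeatability).
    rng = random.Random(seed)
    picked = []
    for label in sorted(by_class):
        indices = by_class[label]
        picked.extend(rng.sample(indices, min(per_class, len(indices))))
    return sorted(picked)
```

With torchvision, `labels` could be the `targets` list of the CIFAR-10 dataset object, and the returned indices could be wrapped in `torch.utils.data.Subset`. Such a small training set is one plausible reason for the large gap between training and testing accuracy.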
Go to Assignment 2 folder
This repository contains code and notebooks for different fine-tuning methods of the Vision Transformer (ViT) model on the EuroSAT dataset, along with visualization of attention maps.
- Fine-tunes the last fully connected layer of the ViT model.
- Fine-tunes only the 8th to 11th transformer layers and the fully connected layer.
- Fine-tunes all layers of the ViT model.
- Does not perform fine-tuning on the ViT model.
- Visualization.ipynb
- Visualizes attention maps for different labels and models.
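The four fine-tuning strategies above differ only in which parameters receive gradients. The sketch below expresses that choice as a filter over parameter names. The name layout (`encoder.layer.<i>.` and `classifier.`), the strategy labels, and the reading of "8th to 11th" as zero-based block indices 8 to 11 of a 12-block encoder are assumptions for illustration; the actual notebooks may differ.

```python
import re

# Assumed parameter-name layout for a 12-block ViT encoder plus a linear head.
LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name, strategy):
    """Return True if the parameter should be updated under `strategy`."""
    if strategy == "none":          # no fine-tuning at all
        return False
    if strategy == "all":           # every layer is fine-tuned
        return True
    is_head = param_name.startswith("classifier.")
    if strategy == "last_layer":    # only the final fully connected layer
        return is_head
    if strategy == "layers_8_to_11":
        match = LAYER_RE.search(param_name)
        in_range = match is not None and 8 <= int(match.group(1)) <= 11
        return is_head or in_range
    raise ValueError(f"unknown strategy: {strategy}")


def trainable_names(param_names, strategy):
    """Filter a list of parameter names down to the trainable ones."""
    return [name for name in param_names if is_trainable(name, strategy)]


# Toy name list standing in for a real model's parameter names.
EXAMPLE_NAMES = (
    ["embeddings.patch_embeddings.weight"]
    + [f"encoder.layer.{i}.attention.weight" for i in range(12)]
    + ["classifier.weight", "classifier.bias"]
)
```

In PyTorch, the same predicate could be applied while iterating over `model.named_parameters()` and setting each parameter's `requires_grad` flag accordingly.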
| Model | Train Accuracy | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| Only Last Layer Fine-Tune ViT | 98.70% | 96.48% | 95.96% |
| Bottom Layer Fine-Tune ViT | 99.53% | 97.04% | 97.11% |
| All Layer Fine-Tune ViT | 91.48% | 91.85% | 90.74% |
| Do Not Fine-Tune ViT | 13.55% | 15.22% | 13.92% |
- Achieved a high test accuracy of 95.96% quickly.
- Efficient and competitive performance.
- Outperformed last layer fine-tuning with a validation accuracy of 97.33%.
- Captured complex features effectively.
- Initially lower performance but gradually improved.
- Requires more training time.
- Last layer fine-tuning and no fine-tuning exhibit similar attention maps across all transformer layers in the model. This is because all transformer layers are frozen during last layer fine-tuning, so their attention weights stay the same as in the pretrained model.
- Across all models, the initial layers of the transformer consistently show clearer attention map visualizations. This suggests that these layers focus on capturing low-level and fundamental features in the images, which are essential for understanding the dataset.
- The attention map visualizations in the all layer fine-tuning case are less informative than the others, likely because this strategy needs more training epochs.
Go to Assignment 3 folder
- Learn image feature representations by solving jigsaw puzzles made from shuffled image patches.
- Defined a custom CNN model to handle 32 * 32 images.
- Jigsaw creation
- Resize the given 32 * 32 CIFAR-10 image to 128 * 128.
- Center-crop it to 105 * 105.
- Divide the image into a 3 * 3 grid of patches, where each patch is 32 * 32.
- Pass all nine 32 * 32 patches to the model and predict the permutation index.
- Permutation creation:
- Choose 1000 permutations of the 9 patch positions such that the Hamming distance between the chosen permutations is greater than 0.9.
- Downstream task : Image Classification.
- GradCam Analysis
- Jigsaw SSL and Downstream Task Inference on Test dataset
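The jigsaw construction above can be sketched as plain geometry plus a permutation search. Several details are assumptions: a 105 * 105 crop split 3 * 3 gives 35-pixel cells, so the sketch takes a 32 * 32 patch inside each cell; the distance is measured as the fraction of differing positions; and the permutation set is built by simple rejection sampling. How the source applies its 0.9 threshold is not stated, so `count` and `min_dist` are parameters and the usage below picks smaller illustrative values. All function names are hypothetical.

```python
import random

GRID = 3    # 3 x 3 puzzle
CELL = 35   # 105 / 3: side of one grid cell in the cropped image
PATCH = 32  # side of each patch passed to the model


def patch_boxes(rng=None):
    """Return 9 (left, top, right, bottom) boxes, one 32x32 patch per cell.

    Without an rng the patch sits at the cell's top-left corner; with an rng
    it is shifted randomly inside the cell (an assumed detail).
    """
    slack = CELL - PATCH
    boxes = []
    for row in range(GRID):
        for col in range(GRID):
            dx = rng.randint(0, slack) if rng is not None else 0
            dy = rng.randint(0, slack) if rng is not None else 0
            left, top = col * CELL + dx, row * CELL + dy
            boxes.append((left, top, left + PATCH, top + PATCH))
    return boxes


def hamming(p, q):
    """Fraction of positions at which two permutations differ."""
    return sum(a != b for a, b in zip(p, q)) / len(p)


def select_permutations(count, min_dist, n=GRID * GRID, seed=0, max_tries=100_000):
    """Rejection-sample `count` permutations that are pairwise farther than `min_dist`."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == count:
            break
        candidate = tuple(rng.sample(range(n), n))
        if all(hamming(candidate, p) > min_dist for p in chosen):
            chosen.append(candidate)
    if len(chosen) < count:
        raise RuntimeError("not enough permutations found; relax min_dist")
    return chosen


def make_puzzle(boxes, permutations, rng):
    """Pick a permutation index (the training label) and reorder the boxes."""
    label = rng.randrange(len(permutations))
    shuffled = [boxes[i] for i in permutations[label]]
    return shuffled, label
```

For example, `select_permutations(20, 0.6)` returns 20 permutations that pairwise differ in at least 6 of 9 positions, and `make_puzzle(patch_boxes(), perms, random.Random(0))` yields the shuffled crop boxes together with the permutation index the model should predict.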
- Learn image feature representations by predicting the relative location of image patches.
- Used the ResNet-18 model architecture.
- Resized the given image to 96 * 96 and divided it into 3 * 3 patches.
- Pass the center patch and a neighbouring patch to the model, which predicts the neighbour's position relative to the center patch.
- Downstream task : Image Classification.
- Inference on Test dataset.
- GradCam Analysis
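The relative patch location task above amounts to an 8-way classification over the neighbours of the center patch. A minimal sketch of the pair and label construction follows, assuming 32 * 32 patches (96 / 3) and labels assigned in row-major order with the center cell skipped. The label order and all names here are illustrative assumptions, not taken from the notebooks.

```python
PATCH = 32  # 96 / 3: side of each patch in the resized image

# Grid cells in row-major order; cell 4 is the center patch.
NEIGHBOUR_CELLS = [cell for cell in range(9) if cell != 4]


def cell_box(cell):
    """Return the (left, top, right, bottom) box of a grid cell."""
    row, col = divmod(cell, 3)
    return (col * PATCH, row * PATCH, (col + 1) * PATCH, (row + 1) * PATCH)


def relative_label(cell):
    """Map a neighbour cell index (0-8, excluding 4) to a class label in 0-7."""
    if cell == 4:
        raise ValueError("the center patch has no relative position")
    return NEIGHBOUR_CELLS.index(cell)


def training_pairs():
    """Return (center_box, neighbour_box, label) for all 8 neighbours."""
    center = cell_box(4)
    return [(center, cell_box(c), relative_label(c)) for c in NEIGHBOUR_CELLS]
```

Each tuple gives the crop boxes for the two patches fed to the network and the position class it should predict.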
| Model | Validation Accuracy | Test Accuracy |
|---|---|---|
| Jigsaw SSL | 95% | 95% |
| Jigsaw SSL - Downstream Classification Task | 67% | 67% |
| Relative patch location SSL | 98% | 97% |
| Relative patch location SSL - Downstream Classification Task | 74% | 74% |
Paper: ViViT: A Video Vision Transformer Paper Review
Paper: Universal Domain Adaptation through Self-Supervision
- Visual Entities Empowered Zero-Shot Image-to-Text Generation Transfer Across Domains