Vision Transformer (ViT) on CIFAR-10

This project presents a from-scratch implementation of the Vision Transformer (ViT) architecture in PyTorch, trained on the CIFAR-10 dataset. The primary objective was to achieve the highest possible test accuracy using training and regularization techniques from recent research. Two papers form the backbone of this implementation: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" and "Training data-efficient image transformers & distillation through attention" (DeiT).

Final Results

The model was trained for up to 300 epochs, and the best-performing checkpoint was evaluated on the held-out test set.

Model Configuration	Training Duration	Best Validation Acc.	Final Test Accuracy
ViT (DeiT-Ti Recipe)	252 Epochs*	91.00%	90.9%
ViT (DeiT-Ti Recipe)	300 Epochs	91.00%	90.7%
*The best performing model (90.9% test accuracy) was saved from a run that achieved its peak validation accuracy at 252 epochs.

Below is the confusion matrix from the final evaluation on the 10,000 test images for the best model.

How to Run

This project is designed to be run in a Google Colab environment.

1. Open the q1.ipynb notebook in Google Colab.
2. Ensure the runtime is set to a GPU instance (e.g., T4) via Runtime > Change runtime type.
3. Run all cells from top to bottom. The notebook will automatically:
   - Download and prepare the CIFAR-10 dataset.
   - Build the ViT model and the training harness.
   - Train the model for the specified number of epochs, saving the best checkpoint.
   - Load the best checkpoint and run a final evaluation, printing metrics and generating plots.

Note on Reproducibility:
Training for the full 300 epochs takes ~4 hours and may be interrupted by Colab’s runtime limits.
To facilitate quick evaluation, the pre-trained weights from my best run (best_vit_model.pth) are provided here.
The evaluation section of the q1.ipynb notebook can be run independently after uploading this folder to your Google Drive, allowing you to reproduce the final test results in minutes.

Best Model Configuration

The best results were achieved using a model architecture and training recipe inspired by the DeiT-Ti (Tiny) variant.

Parameter	Value
Architecture	Vision Transformer (Pre-Norm)
Patch Size	4x4
Embedding Dimension	192
Transformer Depth	12 Layers
Attention Heads	3
Optimizer	AdamW
Learning Rate	0.001 (Linearly scaled: `5e-4 * batch_size/512`)
LR Scheduler	OneCycleLR (Warmup + Cosine Annealing)
Training
Epochs	300
Batch Size	1024
Weight Decay	0.05
Label Smoothing	N/A (CrossEntropyLoss)
Regularization
Augmentations	RandAugment, RandomHorizontalFlip, RandomCrop
Batch-Level Augs	Mixup & CutMix (`combine_fn`)
Dropout Rates	MLP: 0.1, Embedding: 0.1, Attention: 0.0
Trainable Parameters	~5M
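
Two values in the table are derived quantities: the learning rate (from the linear scaling rule) and the parameter count. The sketch below recomputes both in plain Python as a sanity check. The function names are hypothetical, and the count assumes details the table does not state: an MLP hidden size of 4x the embedding dimension, biases on every linear layer, a final LayerNorm, and a 10-class linear head.

```python
# Sanity checks for two derived values in the configuration table.
# Assumptions (not stated in the table): MLP hidden size = 4 * dim,
# biases on all linear layers, a final LayerNorm, a 10-class linear head.


def scaled_lr(base_lr: float, batch_size: int, base_batch: int = 512) -> float:
    """Linear scaling rule: learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch


def vit_param_count(image_size=32, patch_size=4, in_chans=3, dim=192,
                    depth=12, mlp_ratio=4, num_classes=10) -> int:
    num_patches = (image_size // patch_size) ** 2          # 8 * 8 = 64 patches
    patch_dim = in_chans * patch_size * patch_size         # 3 * 4 * 4 = 48 values per patch
    embed = patch_dim * dim + dim                          # linear patch projection
    tokens = dim + (num_patches + 1) * dim                 # [CLS] token + positional embeddings
    attn = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)   # QKV projection + output projection
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)   # two linear layers
    norms = 2 * (2 * dim)                                  # two LayerNorms per block
    head = 2 * dim + (dim * num_classes + num_classes)     # final LayerNorm + classifier
    return embed + tokens + depth * (attn + mlp + norms) + head


lr = scaled_lr(5e-4, batch_size=1024)
n_params = vit_param_count()
print(f"learning rate: {lr:g}")
print(f"parameters:    {n_params:,}")
```

Under these assumptions the count comes to 5,362,762, which is consistent with the ~5M reported in the table. The number of attention heads does not change the total, because the heads split the same 192-dimensional projection.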

Methodology & Implementation

1. Architecture

The model is a standard Vision Transformer as described in "An Image is Worth 16x16 Words", with a Pre-Norm configuration (LayerNorm applied before the attention/MLP blocks) for improved training stability.
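
To make the Pre-Norm arrangement concrete, here is a minimal sketch of one encoder block. It is not the notebook's code: the class name `PreNormBlock` is hypothetical, `nn.MultiheadAttention` stands in for a hand-written attention module for brevity, and the 4x MLP ratio and GELU activation are assumptions.

```python
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """One Pre-Norm encoder block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, dim=192, num_heads=3, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Attention dropout is 0.0, matching the configuration table.
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=0.0, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_ratio * dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # LayerNorm is applied to the input of each sub-layer, not to its output,
        # so the residual path carries the un-normalized signal through the stack.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x


block = PreNormBlock().eval()
tokens = torch.randn(2, 65, 192)   # (batch, 64 patches + [CLS], embedding dim)
out = block(tokens)
print(out.shape)
```

In a Post-Norm block the LayerNorm would instead wrap the sum, as in `LN(x + Attn(x))`; the Pre-Norm form leaves an identity path from input to output.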

PatchEmbedding: Images of size (3, 32, 32) are converted into a sequence of 64 flattened patches (4x4), which are then linearly projected into a 192-dimensional embedding space.
CLS Token & Positional Embeddings: A learnable [CLS] token is prepended to the sequence, and learnable positional embeddings are added to provide the model with spatial information.
Transformer Encoder: The core of the model is a stack of 12 standard Transformer Encoder blocks, each containing Multi-Head Self-Attention and an MLP sub-layer.
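
The first two steps in the list above can be sketched as a single module. This is an illustrative version under stated assumptions rather than the notebook's implementation: the class name `TokenEmbedding` is hypothetical, and a strided convolution is used as a compact equivalent of flattening each 4x4 patch and applying a shared linear projection.

```python
import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    """Patch projection + [CLS] token + learnable positional embeddings."""

    def __init__(self, image_size=32, patch_size=4, in_chans=3, dim=192, dropout=0.1):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2   # 64 for 32x32 images
        # kernel_size == stride == patch_size: each output position sees exactly
        # one non-overlapping patch, i.e. a linear projection of the flat patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        self.drop = nn.Dropout(dropout)

    def forward(self, images):
        x = self.proj(images)                            # (B, dim, 8, 8)
        x = x.flatten(2).transpose(1, 2)                 # (B, 64, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # one [CLS] per sample
        x = torch.cat([cls, x], dim=1)                   # (B, 65, dim)
        return self.drop(x + self.pos_embed)


embed = TokenEmbedding().eval()
tokens = embed(torch.randn(2, 3, 32, 32))
print(tokens.shape)
```

The resulting sequence has 65 tokens per image (64 patches plus the [CLS] token), and this is what the stack of encoder blocks consumes.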

Figure: Diagram of the Vision Transformer (ViT) architecture, adapted from Dosovitskiy et al., 2021 ("An Image is Worth 16x16 Words").

2. Training Strategy

The key challenge with ViTs is their data-hungriness. To overcome this on a small dataset like CIFAR-10, a sophisticated training recipe inspired by the DeiT paper was adopted. The core of this strategy is aggressive regularization to prevent overfitting.

Heavy Data Augmentation: The training pipeline uses RandAugment in conjunction with Mixup and CutMix (applied at the batch level via a custom collate_fn). This forces the model to learn robust and generalizable features.
Optimizer & Scheduler: The AdamW optimizer was used with a OneCycleLR scheduler, which automatically handles a learning rate warmup phase followed by a cosine decay. This disciplined LR schedule is crucial for stable and effective training.
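
The optimizer and scheduler setup described above might look roughly like the following. This is a sketch, not the notebook's code: the helper name is hypothetical, `pct_start=0.1` (the warmup fraction) is an assumption, and the tiny model and step counts in the demo are placeholders chosen so the snippet runs instantly.

```python
import torch
import torch.nn as nn


def build_optimizer_and_scheduler(model, epochs, steps_per_epoch, batch_size=1024):
    # Peak learning rate from the linear scaling rule in the configuration table.
    max_lr = 5e-4 * batch_size / 512
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=max_lr,
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        pct_start=0.1,          # assumption: first 10% of steps are warmup
        anneal_strategy="cos",  # cosine-shaped ramp up, then cosine decay
    )
    return optimizer, scheduler


# Demo with placeholder sizes; the real run would use 300 epochs and the
# actual number of batches per epoch.
model = nn.Linear(192, 10)
epochs, steps_per_epoch = 3, 10
optimizer, scheduler = build_optimizer_and_scheduler(model, epochs, steps_per_epoch)

lrs = []
for _ in range(epochs * steps_per_epoch):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()    # a real loop would compute a loss and call backward() first
    scheduler.step()    # OneCycleLR is stepped once per batch, not once per epoch

print(f"start {lrs[0]:.2e}, peak {max(lrs):.2e}, end {lrs[-1]:.2e}")
```

Note that OneCycleLR is designed to be stepped after every batch, so `steps_per_epoch` has to match the length of the training DataLoader.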

Ablation Study: The Indispensable Role of Batch-Level Augmentations

To quantify the impact of batch-level augmentations, I ran the following experiments:

Model Configuration	Training Duration	Best Validation Accuracy	Final Test Accuracy
ViT Baseline (No Mixup/CutMix)	30 Epochs	74.18%	73.29%
ViT + DeiT Recipe (Full)	30 Epochs	63.74%	63.61%
ViT + DeiT Recipe (Full)	252 Epochs*	91.00%	90.9%

*Note: The 30-epoch results reflect early training dynamics; the 252-epoch run represents the final model.

Key Observations

Baseline performance is strong:
A ViT trained for 30 epochs without Mixup/CutMix achieves 73.29% test accuracy, showing that AdamW + OneCycleLR alone produces reasonable results.
Aggressive augmentations initially slow convergence:
With Mixup and CutMix enabled, the 30-epoch run reaches only 63.61% test accuracy, about 10 points below the baseline at the same training length. This counter-intuitive decrease occurs because batch-level augmentations make the training task harder, preventing the model from memorizing the data.
Long-term benefits are significant:
Over extended training (252 epochs), the full DeiT recipe achieves 90.9% test accuracy, highlighting that Mixup and CutMix enable the model to learn robust, generalizable features rather than overfitting to noisy or small datasets.
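
To illustrate why these augmentations make the training task harder, here is a minimal sketch of Mixup and CutMix applied to a batch. It is not the notebook's collate function: the function names are hypothetical, and the Beta parameters (0.8 for Mixup, 1.0 for CutMix) are the DeiT defaults, assumed here rather than taken from this project.

```python
import torch
import torch.nn.functional as F


def mixup(images, labels, num_classes=10, alpha=0.8):
    """Blend each image (and its label) with a randomly chosen partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    onehot = F.one_hot(labels, num_classes).float()
    return mixed, lam * onehot + (1.0 - lam) * onehot[perm]


def cutmix(images, labels, num_classes=10, alpha=1.0):
    """Paste a random rectangle from a partner image; mix labels by area."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    _, _, H, W = images.shape
    cut_h, cut_w = int(H * (1.0 - lam) ** 0.5), int(W * (1.0 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    perm = torch.randperm(images.size(0))
    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm][:, :, y1:y2, x1:x2]
    # Recompute lam from the clipped box so labels match the pasted area.
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)
    onehot = F.one_hot(labels, num_classes).float()
    return mixed, lam * onehot + (1.0 - lam) * onehot[perm]


def soft_cross_entropy(logits, targets):
    """Cross-entropy against soft (probability) targets."""
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()


torch.manual_seed(0)
x = torch.rand(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
mix_x, mix_t = mixup(x, y)
cut_x, cut_t = cutmix(x, y)
loss = soft_cross_entropy(torch.randn(8, 10), mix_t)
print(mix_x.shape, mix_t.shape, cut_t.sum(dim=1))
```

Because every training target is a blend of two classes, the model rarely sees a clean image with a one-hot label during training. That is consistent with the slower early convergence in the 30-epoch comparison.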

Takeaway

Early training may appear worse, but batch-level augmentations are critical for sustained performance.
Aggressive regularization strategies, even if they delay initial convergence, unlock the full potential of ViTs on limited data without requiring large-scale pre-training.

Analysis: Key to High Performance without Pre-training

The final test accuracy of 90.9% demonstrates that Vision Transformers can indeed be trained effectively on smaller datasets if the right strategy is employed. The key takeaways are:

Architecture Matters: An initial consideration was the model size. By choosing a smaller, more efficient DeiT-Ti architecture (192-dim, 12 layers) instead of a larger ViT-Base, the model had fewer parameters (~5M), making it more suitable for the limited size of the CIFAR-10 dataset and reducing the risk of overfitting.
Regularization is Paramount: The success of this project hinges on the aggressive regularization strategy borrowed from DeiT. The combination of RandAugment, Mixup, CutMix, and AdamW's weight decay successfully prevented the model from overfitting, a primary risk for ViTs.
Training Dynamics as a Signal: A key observation was that the training accuracy was consistently lower than the validation accuracy. This counter-intuitive result is a direct consequence of the regularization: augmentation, Mixup/CutMix, and dropout are active only during training, so the training task is artificially difficult while validation runs on clean images. This forces the model to learn robust features that generalize well.

Modern Schedulers are Non-Negotiable: The stability and performance of the training run were heavily reliant on the OneCycleLR scheduler. Without its intelligent management of the learning rate, the model would likely have converged much slower or to a less optimal result.
Peak Performance and Onset of Overfitting: Comparing two long training runs gave an interesting result. The model that achieved a peak validation accuracy of 91.00% (saved from a run that completed 252 epochs) yielded a final test accuracy of 90.9%. A subsequent full 300-epoch run achieved a similar peak but a slightly lower test accuracy of 90.7%. This suggests that generalization peaked at roughly epochs 250-270 and that further training offered no benefit, which is consistent with the onset of minor overfitting, although a 0.2-point gap between two runs may also reflect run-to-run variance. Consequently, the model from the earlier run is reported as the best-performing model.

This project validates that with a thoughtful, modern approach to data augmentation and training dynamics, the performance gap for Vision Transformers on smaller datasets can be significantly closed, reducing the dependency on massive pre-training corpora.

Acknowledgements

This implementation was made possible by studying the following seminal papers and high-quality open-source repositories.

Papers

Dosovitskiy et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
Touvron et al. (2021). Training data-efficient image transformers & distillation through attention (DeiT).
Steiner et al. (2022). How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers.
Beyer et al. (2022). Better plain ViT baselines for ImageNet-1k.

Repositories

Official Google Research ViT: google-research/vision_transformer
PyTorch Image Models (timm) by Ross Wightman: huggingface/pytorch-image-models
Phil Wang's (lucidrains) ViT implementation: lucidrains/vit-pytorch
Jeon's ViT implementation: jeonsworld/ViT-pytorch
