An implementation based on *Transformer Interpretability Beyond Attention Visualization* (Chefer et al., CVPR 2021).
This repository studies the explainability of Vision Transformers (ViTs) trained with different methods:
- Supervised learning on ImageNet (original ViT)
- Self-supervised learning (MAE, DINOv2)
- Vision-language models (CLIP)
Several approaches exist for computing heatmaps that reflect model explainability (XAI):
| Method | Description |
|---|---|
| GradCAM | Primarily designed for CNNs |
| Attention Rollout | Aggregates attention weights across layers |
| Attention Rollout + Gradients | Approximated Layer-wise Relevance Propagation (LRP) for Transformers |
| AttnLRP | Exact LRP implementation for ViT |
This implementation uses the approximated LRP method from Hila Chefer's Transformer-Explainability due to its simplicity and adaptability across different ViT architectures.
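The core idea of the gradient-weighted rollout can be sketched in a few lines: weight each layer's attention map by its gradient, keep the positive part, average over heads, add the identity for the residual connection, row-normalize, and chain the layers by matrix multiplication. Below is a minimal NumPy illustration of that recipe (the function name and array shapes are assumptions, not the repository's actual code):

```python
import numpy as np

def gradient_weighted_rollout(attentions, gradients):
    """Sketch of Chefer-style gradient-weighted attention rollout.

    attentions, gradients: lists of (heads, tokens, tokens) arrays,
    one pair per Transformer block (illustrative shapes).
    Returns relevance of each patch token w.r.t. the [CLS] token.
    """
    num_tokens = attentions[0].shape[-1]
    result = np.eye(num_tokens)
    for attn, grad in zip(attentions, gradients):
        # Weight attention by its gradient, keep the positive part, average heads
        cam = np.clip(grad * attn, 0, None).mean(axis=0)
        # Add identity for the residual connection, then row-normalize
        cam = cam + np.eye(num_tokens)
        cam = cam / cam.sum(axis=-1, keepdims=True)
        # Chain layers by matrix multiplication (rollout)
        result = cam @ result
    # Row 0 is the [CLS] token; columns 1: are the patch tokens
    return result[0, 1:]
```

The resulting per-patch scores are reshaped to the patch grid and upsampled to image resolution to form the heatmap.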
Note: To produce heatmaps, the ViT must have a classification head. For models pretrained without one (MAE, CLIP), linear probing on ImageNet is required first.
| Model | Pretraining | Head | Classes |
|---|---|---|---|
| vit_base_patch16_224 | Supervised (ImageNet) | Linear probe | 1000 |
| dinov2_base_imagenet1k_1layer_lrp | DINOv2 self-supervised | Linear probe | 1000 |
| mae_vit_base_patch16_224 | MAE self-supervised | Linear probe (by me) | 300 |
| clip_vit_base_patch16_224 | CLIP vision-language | Linear probe (by me) | 300 |
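Linear probing here means fitting only a linear classifier on frozen backbone features. A minimal NumPy sketch of softmax-regression probing (purely illustrative; the actual probes were trained on ImageNet features extracted from each backbone, and the function name is an assumption):

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Fit a linear head on frozen backbone features via gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # softmax cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

The backbone stays frozen throughout; only `W` and `b` are learned, which is what makes the resulting classification head usable for relevance propagation without changing the pretrained features.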
The core Transformer-Explainability code was developed by Hila Chefer. My contributions include:
- Adapting the LRP method to different ViT implementations (DINOv2, MAE, CLIP)
- Linear probing on ImageNet subsets to enable heatmap visualization for self-supervised models
- Unified interface for loading and visualizing different model architectures
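A unified interface along these lines can be built as a name-to-builder registry (purely illustrative; `register_model`, `load_model`, and the placeholder builder are hypothetical, not the repository's actual API):

```python
# Hypothetical registry-based loader; the real repository's entry points differ.
MODEL_REGISTRY = {}

def register_model(name):
    """Decorator that maps a model name to its builder function."""
    def wrap(builder):
        MODEL_REGISTRY[name] = builder
        return builder
    return wrap

@register_model("vit_base_patch16_224")
def build_vit():
    # Real code would construct and return the model with its LRP hooks
    return "supervised ViT placeholder"

def load_model(name):
    if name not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model: {name}")
    return MODEL_REGISTRY[name]()
```

Each architecture (DINOv2, MAE, CLIP) registers its own builder, so the CLI can dispatch on `--model` without per-architecture branching.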
Requires Python 3.10.19.

```shell
# Install dependencies
uv sync

# Generate heatmap for an image
uv run python main.py --model <model_name> --image <path_to_image> --method transformer_attribution

# Example
uv run python main.py --model mae_vit_base_patch16_224 --image input_images/catdog.png
```

Available methods:

- `transformer_attribution` (recommended) - Gradient-weighted attention rollout
- `rollout` - Pure attention rollout (not tested)
- `full` - Full LRP propagation to pixels (not tested)
- `last_layer` - Last layer attention only (not tested)
- `last_layer_attn` - Last layer attention without gradients (not tested)
ViT Base Patch16 224 (Supervised):
DINOv2 ViT Base Patch14 224 (Self-supervised + Linear Probe):
MAE ViT Base Patch16 224 (Self-supervised + Linear Probe):
CLIP ViT Base Patch16 224 (Vision-Language + Linear Probe):
- 2025-01-22 - Added CLIP ViT-B/16 support (linear probe on ImageNet first 300 classes)
- 2025-01-20 - Added MAE ViT-B/16 support (linear probe on ImageNet first 300 classes)
- 2025-01-16 - Added DINOv2 ViT-B/14 support. Note that the ViT backbone was pretrained at 518×518 resolution with patch size 14, while the linear probing was done on ImageNet-1K with 224×224 inputs.
- Chefer, H., Gur, S., & Wolf, L. (2021). Transformer Interpretability Beyond Attention Visualization. CVPR 2021.
- Abnar, S., & Zuidema, W. (2020). Quantifying Attention Flow in Transformers. ACL 2020.
- Achtibat, R., et al. (2024). AttnLRP: Attention-aware Layer-wise Relevance Propagation. ICML 2024.
- Jacob Gildenblat's blog post: Exploring Explainability for Vision Transformers
This work builds upon Hila Chefer's Transformer-Explainability repository.



