This repository contains the code for our paper:

**Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers.** Akiyoshi Tomihari and Issei Sato. [arXiv](https://arxiv.org/abs/2502.00213)
## Requirements

The main dependencies are:

- Python 3.10 or higher
- `torch = 2.4.0`

Please refer to the `pyproject.toml` file for more details.
## Setup

To set up and run the project, follow these steps:

```bash
# Configure the project to create virtual environments within the project directory
poetry config virtualenvs.in-project true

# Set the local Python version using pyenv
pyenv local 3.12.6

# Install dependencies and activate the virtual environment
poetry install
poetry shell
```

## Training

To train the models, run the following script:
```bash
bash shell_scripts/train_<task>.sh <dataset_name> <optimizer_name> <model_name> [<lr_scheduler_type>]
```

- `<task>`: Specify `nlp` or `vision`.
- `<dataset_name>`: Name of the dataset (e.g., `rte`, `flowers102`).
- `<optimizer_name>`: Name of the optimizer (e.g., `adam`, `sgd_momentum`).
- `<model_name>`: Name of the model to be trained (e.g., `roberta-base`, `resnet18`).
- `<lr_scheduler_type>` (optional): Learning rate scheduler type, applicable only for NLP tasks. Defaults to `default` (meaning linear) if not provided.

Example:

```bash
bash shell_scripts/train_nlp.sh rte adam roberta-base
```

## Hessian Computation

To calculate the maximum Hessian values for each parameter, run the following script:
```bash
bash shell_scripts/hessian_per_param.sh <dataset_name> <optimizer_name> <model_name> <domain> [<training_mode>]
```

- `<domain>`: Specify `nlp` or `vision`.
- `<training_mode>` (optional): Specify `pretrained` to use a pre-trained model. If omitted, a trained model will be used.

When using a trained model, you need to specify the directory in `results/hessian_per_param/model_dir_dict.json`.
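The maximum Hessian value of a loss is typically estimated, as in PyHessian, by power iteration on Hessian-vector products rather than by forming the Hessian explicitly. The following is a minimal NumPy sketch of that idea for a quadratic loss whose Hessian is known; it is illustrative only, and the function names are not from this repository.

```python
import numpy as np

def hvp(A, v):
    # Hessian-vector product. For the quadratic loss f(x) = 0.5 * x^T A x,
    # the Hessian is exactly A, so Hv = A @ v. In practice (e.g. PyHessian),
    # Hv is computed with automatic differentiation, without materializing A.
    return A @ v

def max_hessian_eigenvalue(A, iters=200, seed=0):
    # Power iteration: repeatedly apply the Hessian to a random unit vector;
    # the iterate converges to the top eigenvector, and the Rayleigh
    # quotient v^T H v gives the maximum eigenvalue.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(A, v)
        v = w / np.linalg.norm(w)
    return float(v @ hvp(A, v))

A = np.diag([1.0, 3.0, 10.0])   # toy Hessian with known top eigenvalue 10
print(max_hessian_eigenvalue(A))
```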
Example:

```bash
bash shell_scripts/hessian_per_param.sh rte adam roberta-base pretrained
```

## Acknowledgements

We use the following resources and libraries:
- Base code structure: lp-ft_ntk
- Libraries for NLP tasks: Hugging Face Transformers
- Calculation of Hessian: PyHessian
- Signum optimizer: Signum
## Citation

```bibtex
@misc{tomihari2025understandingadamoutperformssgd,
  title={Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers},
  author={Akiyoshi Tomihari and Issei Sato},
  year={2025},
  eprint={2502.00213},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.00213},
}
```