
Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers

This repository contains the code for our paper:

Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers. Akiyoshi Tomihari and Issei Sato. arXiv:2502.00213 (https://arxiv.org/abs/2502.00213)

Dependencies

The main dependencies are:

  • Python 3.10 or higher
  • torch 2.4.0

Please refer to the pyproject.toml file for more details.

Setup

To set up and run the project, follow these steps:

# Configure the project to create virtual environments within the project directory
poetry config virtualenvs.in-project true

# Set the local python version using pyenv
pyenv local 3.12.6

# Install dependencies and activate the virtual environment
poetry install
poetry shell
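
To confirm that the environment matches the dependencies listed above (Python 3.10 or higher, torch 2.4.0), a quick check such as the following can be run; the exact output depends on your installation:

# Check the interpreter and PyTorch versions inside the Poetry environment
poetry run python --version
poetry run python -c "import torch; print(torch.__version__)"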

Training the Models

To train the models, run the following script:

bash shell_scripts/train_<task>.sh <dataset_name> <optimizer_name> <model_name> [<lr_scheduler_type>]
  • <task>: Specify nlp or vision.
  • <dataset_name>: Name of the dataset (e.g., rte, flowers102).
  • <optimizer_name>: Name of the optimizer (e.g., adam, sgd_momentum).
  • <model_name>: Name of the model to be trained (e.g., roberta-base, resnet18).
  • <lr_scheduler_type> (optional): Learning rate scheduler type, used only for NLP tasks. If omitted, it defaults to default, which corresponds to a linear schedule.

Example

bash shell_scripts/train_nlp.sh rte adam roberta-base
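
The following invocations are illustrative sketches based on the argument descriptions above; the supported combinations are ultimately determined by the shell scripts themselves:

# Vision task with SGD with momentum (example names taken from the argument list above)
bash shell_scripts/train_vision.sh flowers102 sgd_momentum resnet18

# NLP task with an explicit learning rate scheduler type; "cosine" is an assumed placeholder,
# use whichever scheduler types the NLP training script actually accepts
bash shell_scripts/train_nlp.sh rte adam roberta-base cosine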

Calculation of Hessian per Parameter

To calculate the maximum Hessian values for each parameter, run the following script:

bash shell_scripts/hessian_per_param.sh <dataset_name> <optimizer_name> <model_name> <domain> [<training_mode>]
  • <domain>: Specify nlp or vision.
  • <training_mode> (optional): Specify pretrained to use a pre-trained model. If omitted, the trained model is used.

The <dataset_name>, <optimizer_name>, and <model_name> arguments are the same as for the training script.

When using a trained model, you need to specify the directory of the trained model in "results/hessian_per_param/model_dir_dict.json"; a rough sketch of one possible format is shown below.
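
The exact schema of model_dir_dict.json is not documented here; as a hypothetical sketch, it maps an identifier for a training run to the directory containing the trained model:

# Hypothetical sketch of results/hessian_per_param/model_dir_dict.json; the actual keys and
# paths must match your own training runs and the repository's conventions
cat > results/hessian_per_param/model_dir_dict.json <<'EOF'
{
  "rte_adam_roberta-base": "path/to/trained/model/directory"
}
EOF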

Example

bash shell_scripts/hessian_per_param.sh rte adam roberta-base nlp pretrained
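
To use a trained model instead of the pre-trained one, the same command would be run without the pretrained flag, assuming the corresponding directory has been registered in model_dir_dict.json as described above:

bash shell_scripts/hessian_per_param.sh rte adam roberta-base nlp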

Acknowledgments

We use the following resources and libraries:

Citation

@misc{tomihari2025understandingadamoutperformssgd,
      title={Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers},
      author={Akiyoshi Tomihari and Issei Sato},
      year={2025},
      eprint={2502.00213},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.00213},
}
