Memory-efficient LLM training with adaptive non-uniform rank allocation for gradient low-rank projection.
AdaptiveAdamW extends GaLore by dynamically distributing a global rank budget across layers based on gradient spectra. While GaLore applies uniform rank allocation, our method recognizes that different layers exhibit varying degrees of low-rank structure and allocates ranks accordingly using a greedy explained-variance metric.
The core insight is that gradient matrices across transformer layers have heterogeneous singular value distributions. Some layers display "top-heavy" spectra where few singular values capture most variance, while others maintain flatter spectra requiring higher ranks. Uniform allocation underfits layers with flat spectra and over-allocates to inherently low-rank layers.
Adaptive Rank Allocation: Every T iterations, we perform full SVD on accumulated gradients and execute a greedy allocation procedure. Starting from rank 1 per layer, we iteratively assign each rank increment to the layer with the largest explained variance ratio:
score(l) = (s_{r_l+1})^2 / ||G_l||_F^2
This prioritizes layers where additional dimensions capture the most significant remaining gradient information, normalized by total spectral energy.
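The greedy procedure can be sketched as follows. This is an illustrative reimplementation, not the repo's `adaptive_projector.py`; the function name and signature are assumptions.

```python
import torch

def allocate_ranks(grads, total_budget):
    """Greedily split a global rank budget across layers by explained variance.

    grads: list of per-layer gradient matrices; total_budget: sum of ranks.
    """
    # Singular values (descending) and total spectral energy ||G_l||_F^2 per layer.
    svals = [torch.linalg.svdvals(g) for g in grads]
    energy = [(g ** 2).sum() for g in grads]
    ranks = [1] * len(grads)  # start from rank 1 per layer, as in the text
    # Give each remaining rank increment to the layer whose next singular
    # value explains the largest fraction of that layer's gradient energy.
    for _ in range(total_budget - len(grads)):
        scores = [
            (svals[l][ranks[l]] ** 2 / energy[l]).item()
            if ranks[l] < svals[l].numel() else -1.0  # layer already at full rank
            for l in range(len(grads))
        ]
        best = max(range(len(grads)), key=lambda l: scores[l])
        ranks[best] += 1
    return ranks
```

A layer with a flat spectrum (all singular values similar) keeps winning increments, while a top-heavy layer stops receiving ranks once its dominant directions are covered.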
Momentum Transformation: When projection subspaces change due to rank reallocation, we preserve momentum statistics via the change-of-basis matrix M = P_new^T P_old, applying m_new = M m_old to the first moment and v_new = (M ⊙ M) v_old to the second moment, where the square is taken elementwise so that v remains non-negative.
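A minimal sketch of this momentum carry-over, assuming projected moments of shape `(rank, n)` for projectors `P` of shape `(d, rank)`; the function name is illustrative:

```python
import torch

def transform_momentum(P_old, P_new, m_old, v_old):
    """Map Adam moments from the old projection subspace to the new one."""
    M = P_new.T @ P_old      # (r_new, r_old) change-of-basis matrix
    m_new = M @ m_old        # first moment transforms linearly
    v_new = (M ** 2) @ v_old # second moment uses M squared elementwise (stays >= 0)
    return m_new, v_new
```

If the new subspace contains the old one, the old momentum is carried over exactly; components of the old subspace orthogonal to the new one are dropped.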
Norm-Growth Limiting: For training stability, we incorporate a norm-growth limiter to cap relative increases in normalized gradient norm.
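One way such a limiter can work is to rescale the gradient whenever its norm grows by more than a fixed ratio between steps. This is a hedged sketch; `max_ratio` and the exact normalization are assumptions, not the repo's implementation:

```python
import torch

def limit_norm_growth(grad, prev_norm, max_ratio=1.1):
    """Cap the step-to-step growth of the gradient norm.

    prev_norm: norm from the previous step (None on the first step).
    Returns the (possibly rescaled) gradient and its capped norm.
    """
    norm = grad.norm()
    if prev_norm is not None and norm > max_ratio * prev_norm:
        # Rescale so the norm grows by at most max_ratio per step.
        grad = grad * (max_ratio * prev_norm / norm)
        norm = max_ratio * prev_norm
    return grad, norm
```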
```bash
git clone https://github.com/helcig/AdaptiveAdamW.git
cd AdaptiveAdamW
pip install -r requirements.txt
```

Run AdaptiveAdamW with the explained-variance metric:

```bash
# Set rank (e.g., 4, 8, 256)
./experiments/c4_pretraining/scripts/run_llama_pretraining_adaptive.sh explained_variance 8
```

Run baseline GaLore:

```bash
./experiments/c4_pretraining/scripts/run_llama_pretraining_galore.sh 8
```

Key hyperparameters (configurable in the scripts):

- `--rank`: Total rank budget per layer (default varies by experiment)
- `--update_proj_gap`: SVD recomputation frequency (default: 200)
- `--galore_scale`: Gradient scaling factor (default: 0.25)
- `--rank_metric`: Allocation metric (`explained_variance`, `singular_value`, `conditioning`, `gap`, `combined`, `uniform_variance`)
Run AdaptiveAdamW across all GLUE tasks:
```bash
./experiments/glue_finetuning/scripts/run_glue_no_trainer_adaptive.sh 4
```

Run baseline GaLore:

```bash
./experiments/glue_finetuning/scripts/run_glue_no_trainer_galore.sh 4
```

| Optimizer | Script | Description |
|---|---|---|
| `adaptive` | `run_*_adaptive.sh` | AdaptiveAdamW with greedy rank allocation |
| `galore` | `run_*_galore.sh` | Baseline GaLore with uniform rank |
| `galore_kfac` | `run_*_kfac.sh` | GaLore with KFAC preconditioning |
| `rsadamw` | `run_*_rsadamw.sh` | Random Subspace AdamW |
| `gha` | `run_*_gha.sh` | Generalized Hebbian Algorithm |
Pass the metric name as the first argument to adaptive scripts:
- `explained_variance`: Relative variance contribution per layer (recommended)
- `singular_value`: Raw squared singular values
- `conditioning`: Ratio to the dominant singular value
- `gap`: Ratio between consecutive singular values
- `combined`: Weighted combination of variance and conditioning
- `uniform_variance`: Allocate to the layer with the lowest explained variance
| Method | Rank 4 | Rank 8 | Rank 256 |
|---|---|---|---|
| GaLore | 66.66 | 56.49 | 29.75 |
| AdaptiveAdamW | 55.75 | 46.96 | 27.27 |
| Improvement | 16.4% | 16.9% | 8.3% |
Adaptive rank allocation provides the largest gains in low-rank regimes where budget pressure is highest. At Rank 4 and 8, the method achieves 16.4% and 16.9% perplexity improvements respectively. As rank increases, differences narrow since sufficient budget reduces allocation pressure.
| Metric | Rank 4 | Rank 8 | Rank 256 |
|---|---|---|---|
| Explained Variance | 55.75 | 46.96 | 27.27 |
| Singular Value | 58.42 | 52.31 | 29.64 |
| Conditioning | 60.21 | 50.05 | 27.33 |
| Uniform Variance | 56.30 | 52.00 | 48.16 |
| Combined | 59.76 | 50.72 | 27.39 |
| Gap | 66.83 | 66.10 | 37.08 |
Explained variance performs best across rank regimes because normalizing by total spectral energy makes layer scores comparable regardless of raw gradient magnitude, so no layer dominates the allocation simply by having larger gradients.
├── optimizer_torch/
│ ├── adaptive_adamw.py # AdaptiveAdamW optimizer
│ ├── adaptive_projector.py # Projection and rank allocation
│ ├── galore_kfac.py # KFAC preconditioning variant
│ ├── kfac_projector.py # KFAC projector
│ ├── gha_adamw.py # Generalized Hebbian Algorithm optimizer
│ ├── gha_projector.py # GHA projector
│ ├── rsadamw.py # Random Subspace AdamW
│ └── rs_projector.py # Random subspace projector
├── experiments/
│ ├── c4_pretraining/
│ │ ├── configs/ # Model configs (llama_130m, 350m, 1b)
│ │ ├── scripts/ # Training scripts
│ │ ├── pept_utils/ # Training utilities, dataloader, model
│ │ └── run_llama_pretraining.py
│ └── glue_finetuning/
│ ├── scripts/ # Fine-tuning scripts
│ └── run_glue_no_trainer_HF.py
├── requirements.txt
└── README.md
```bibtex
@article{adaptiveadamw2025,
  title={Adaptability in Optimizer State Compression},
  author={Helcig, Michael and Kaiser, Tobias and Rubes, Jan},
  year={2025}
}
```

- Michael Helcig (ETH Zurich) - mhelcig@ethz.ch
- Tobias Kaiser (ETH Zurich) - tokaiser@ethz.ch
- Jan Rubes (ETH Zurich) - jrubes@ethz.ch
MIT License

