
AdaptiveAdamW

Memory-efficient LLM training with adaptive non-uniform rank allocation for gradient low-rank projection.

Overview

AdaptiveAdamW extends GaLore by dynamically distributing a global rank budget across layers based on gradient spectra. While GaLore applies uniform rank allocation, our method recognizes that different layers exhibit varying degrees of low-rank structure and allocates ranks accordingly using a greedy explained-variance metric.

The core insight is that gradient matrices across transformer layers have heterogeneous singular value distributions. Some layers display "top-heavy" spectra where few singular values capture most variance, while others maintain flatter spectra requiring higher ranks. Uniform allocation underfits layers with flat spectra and over-allocates to inherently low-rank layers.

Method

Adaptive Rank Allocation: Every T iterations, we perform a full SVD on each layer's accumulated gradient matrix and run a greedy allocation procedure. Starting from rank 1 per layer, we iteratively assign each rank increment to the layer l with the largest explained-variance score:

score(l) = s_{r_l + 1}^2 / ||G_l||_F^2

where s_{r_l + 1} is the next unassigned singular value of layer l's gradient matrix G_l and r_l is its currently assigned rank. This prioritizes layers where an additional dimension captures the most significant remaining gradient information, normalized by the layer's total spectral energy.
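The greedy loop can be sketched in a few lines of NumPy. This is a minimal illustration of the procedure described above; the function and variable names are ours, not the repository's API:

```python
import numpy as np

def greedy_rank_allocation(grads, total_budget):
    """Distribute a global rank budget across layers greedily.

    grads: list of 2-D gradient matrices, one per layer.
    total_budget: total number of rank units to assign (>= number of layers).
    Returns a list of per-layer ranks summing to total_budget.
    """
    # Singular values (descending) and total spectral energy per layer.
    spectra = [np.linalg.svd(G, compute_uv=False) for G in grads]
    energy = [np.sum(s ** 2) for s in spectra]

    # Start every layer at rank 1, then hand out the remaining budget.
    ranks = [1] * len(grads)
    for _ in range(total_budget - len(grads)):
        # Score each layer by the normalized variance its next singular
        # value would add; exhausted layers score -inf.
        scores = [
            s[r] ** 2 / e if r < len(s) else -np.inf
            for s, r, e in zip(spectra, ranks, energy)
        ]
        ranks[int(np.argmax(scores))] += 1
    return ranks
```

With one top-heavy layer and one flat-spectrum layer, the extra budget flows to the flat layer, as the method intends: `greedy_rank_allocation([np.diag([10.0, 0.1, 0.1]), np.diag([5.0, 5.0, 5.0])], 4)` returns `[1, 3]`.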

Momentum Transformation: When projection subspaces change due to rank reallocation, we preserve momentum statistics via the transformation matrix M = P_new^T · P_old, applying m_new = M · m_old to the first moment and v_new = (M ∘ M) · v_old to the second moment, where M ∘ M denotes the element-wise square of M.
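The transformation can be sketched as below. This assumes column-orthonormal projectors and uses illustrative names; using the element-wise square of M for the second moment is an approximation, since v stores element-wise statistics that do not transform exactly under a change of basis:

```python
import numpy as np

def transform_moments(p_old, p_new, m_old, v_old):
    """Carry Adam moment buffers into a new projection subspace.

    p_old: (d, r_old) and p_new: (d, r_new) projectors with orthonormal
    columns; m_old, v_old: (r_old, k) projected first/second moments.
    """
    # Change-of-basis matrix between the two subspaces.
    M = p_new.T @ p_old          # shape (r_new, r_old)
    m_new = M @ m_old            # first moment transforms linearly
    v_new = (M * M) @ v_old      # second moment via element-wise square of M
    return m_new, v_new
```

When the subspace is unchanged (p_new equals p_old), M is the identity and both buffers pass through untouched.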

Norm-Growth Limiting: For training stability, we incorporate a norm-growth limiter to cap relative increases in normalized gradient norm.
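A minimal sketch of such a limiter, assuming a simple relative-growth cap between consecutive steps (the threshold value here is illustrative, not the repository's default):

```python
import numpy as np

def limit_norm_growth(update, prev_norm, max_ratio=1.01):
    """Cap the relative growth of the update norm between steps.

    If the current norm exceeds max_ratio * prev_norm, rescale the update
    so the ratio is exactly max_ratio. Returns (update, new_norm), where
    new_norm is carried into the next step as prev_norm.
    """
    cur_norm = np.linalg.norm(update)
    if prev_norm is not None and cur_norm > max_ratio * prev_norm:
        update = update * (max_ratio * prev_norm / cur_norm)
        cur_norm = max_ratio * prev_norm
    return update, cur_norm
```

Passing `prev_norm=None` on the first step lets the initial update through unclipped while still recording its norm.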

Installation

git clone https://github.com/helcig/AdaptiveAdamW.git
cd AdaptiveAdamW
pip install -r requirements.txt

Usage

LLaMA Pretraining on C4

Run AdaptiveAdamW with explained variance metric:

# Set rank (e.g., 4, 8, 256)
./experiments/c4_pretraining/scripts/run_llama_pretraining_adaptive.sh explained_variance 8

Run baseline GaLore:

./experiments/c4_pretraining/scripts/run_llama_pretraining_galore.sh 8

Key hyperparameters (configurable in scripts):

  • --rank: Per-layer rank budget; the global budget scales with the number of projected layers (default varies by experiment)
  • --update_proj_gap: SVD recomputation frequency (default: 200)
  • --galore_scale: Gradient scaling factor (default: 0.25)
  • --rank_metric: Allocation metric (explained_variance, singular_value, conditioning, gap, combined, uniform_variance)

GLUE Fine-tuning

Run AdaptiveAdamW across all GLUE tasks:

./experiments/glue_finetuning/scripts/run_glue_no_trainer_adaptive.sh 4

Run baseline GaLore:

./experiments/glue_finetuning/scripts/run_glue_no_trainer_galore.sh 4

Available Optimizers

| Optimizer | Script | Description |
| --- | --- | --- |
| adaptive | run_*_adaptive.sh | AdaptiveAdamW with greedy rank allocation |
| galore | run_*_galore.sh | Baseline GaLore with uniform rank |
| galore_kfac | run_*_kfac.sh | GaLore with KFAC preconditioning |
| rsadamw | run_*_rsadamw.sh | Random subspace drift |
| gha | run_*_gha.sh | Generalized Hebbian Algorithm |

Allocation Metrics

Pass the metric name as the first argument to adaptive scripts:

  • explained_variance: Relative variance contribution per layer (recommended)
  • singular_value: Raw squared singular values
  • conditioning: Ratio to dominant singular value
  • gap: Ratio between consecutive singular values
  • combined: Weighted combination of variance and conditioning
  • uniform_variance: Allocate to lowest explained variance layer

Results

LLaMA 130M Pretraining on C4 (Perplexity ↓)

| Method | Rank 4 | Rank 8 | Rank 256 |
| --- | --- | --- | --- |
| GaLore | 66.66 | 56.49 | 29.75 |
| AdaptiveAdamW | 55.75 | 46.96 | 27.27 |
| Improvement | 16.4% | 16.9% | 8.3% |

Adaptive rank allocation provides the largest gains in low-rank regimes where budget pressure is highest. At Rank 4 and 8, the method achieves 16.4% and 16.9% perplexity improvements respectively. As rank increases, differences narrow since sufficient budget reduces allocation pressure.
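The improvement figures are the relative perplexity reduction over the GaLore baseline and can be reproduced directly from the table:

```python
# Perplexities from the results table above.
galore = {4: 66.66, 8: 56.49, 256: 29.75}
adaptive = {4: 55.75, 8: 46.96, 256: 27.27}

# Relative reduction of AdaptiveAdamW vs. GaLore, in percent.
improvement = {
    rank: round((galore[rank] - adaptive[rank]) / galore[rank] * 100, 1)
    for rank in galore
}
print(improvement)  # {4: 16.4, 8: 16.9, 256: 8.3}
```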

(Figures: training-loss and evaluation-perplexity curves over pretraining; see the plots in the repository.)

Allocation Metric Comparison (LLaMA 130M)

| Metric | Rank 4 | Rank 8 | Rank 256 |
| --- | --- | --- | --- |
| Explained Variance | 55.75 | 46.96 | 27.27 |
| Singular Value | 58.42 | 52.31 | 29.64 |
| Conditioning | 60.21 | 50.05 | 27.33 |
| Uniform Variance | 56.30 | 52.00 | 48.16 |
| Combined | 59.76 | 50.72 | 27.39 |
| GAP | 66.83 | 66.10 | 37.08 |

Explained variance performs best across rank regimes: normalizing each score by the layer's total spectral energy makes layers with very different gradient magnitudes directly comparable, so no layer dominates simply by having larger gradients.

Project Structure

├── optimizer_torch/
│   ├── adaptive_adamw.py      # AdaptiveAdamW optimizer
│   ├── adaptive_projector.py  # Projection and rank allocation
│   ├── galore_kfac.py         # KFAC preconditioning variant
│   ├── kfac_projector.py      # KFAC projector
│   ├── gha_adamw.py           # Generalized Hebbian Algorithm optimizer
│   ├── gha_projector.py       # GHA projector
│   ├── rsadamw.py             # Random Subspace AdamW
│   └── rs_projector.py        # Random subspace projector
├── experiments/
│   ├── c4_pretraining/
│   │   ├── configs/           # Model configs (llama_130m, 350m, 1b)
│   │   ├── scripts/           # Training scripts
│   │   ├── pept_utils/        # Training utilities, dataloader, model
│   │   └── run_llama_pretraining.py
│   └── glue_finetuning/
│       ├── scripts/           # Fine-tuning scripts
│       └── run_glue_no_trainer_HF.py
├── requirements.txt
└── README.md

Citation

@article{adaptiveadamw2025,
  title={Adaptability in Optimizer State Compression},
  author={Helcig, Michael and Kaiser, Tobias and Rubes, Jan},
  year={2025}
}

License

MIT License
