Memory-efficient LLM training with adaptive non-uniform rank allocation for gradient low-rank projection.
AdaptiveAdamW extends GaLore by dynamically distributing a global rank budget across layers based on gradient spectra. While GaLore applies uniform rank allocation, our method recognizes that different layers exhibit varying degrees of low-rank structure and allocates ranks accordingly using a greedy explained-variance metric.
The core insight is that gradient matrices across transformer layers have heterogeneous singular value distributions. Some layers display "top-heavy" spectra where few singular values capture most variance, while others maintain flatter spectra requiring higher ranks. Uniform allocation underfits layers with flat spectra and over-allocates to inherently low-rank layers.
Adaptive Rank Allocation: Every T iterations, we perform full SVD on accumulated gradients and execute a greedy allocation procedure. Starting from rank 1 per layer, we iteratively assign each rank increment to the layer with the largest explained variance ratio:
score(l) = (s_{r_l+1})^2 / ||G_l||_F^2
This prioritizes layers where additional dimensions capture the most significant remaining gradient information, normalized by total spectral energy.
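The greedy procedure can be sketched as follows. This is an illustrative reimplementation, not the repo's `adaptive_projector.py`; the function name and signature are assumptions.

```python
import torch

def allocate_ranks(grads, total_budget):
    """Greedily split a global rank budget across layers by explained variance.

    grads: list of per-layer gradient matrices; total_budget: sum of ranks.
    """
    # Singular values (descending) and total spectral energy ||G_l||_F^2 per layer.
    svals = [torch.linalg.svdvals(g) for g in grads]
    energy = [(g ** 2).sum() for g in grads]
    ranks = [1] * len(grads)  # start from rank 1 per layer, as in the text
    # Give each remaining rank increment to the layer whose next singular
    # value explains the largest fraction of that layer's gradient energy.
    for _ in range(total_budget - len(grads)):
        scores = [
            (svals[l][ranks[l]] ** 2 / energy[l]).item()
            if ranks[l] < svals[l].numel() else -1.0  # layer already at full rank
            for l in range(len(grads))
        ]
        best = max(range(len(grads)), key=lambda l: scores[l])
        ranks[best] += 1
    return ranks
```

A layer with a flat spectrum (all singular values similar) keeps winning increments, while a top-heavy layer stops receiving ranks once its dominant directions are covered.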
Momentum Transformation: When projection subspaces change due to rank reallocation, we preserve momentum statistics via the change-of-basis matrix M = P_new^T P_old, applying m_new = M m_old to the first moment and v_new = (M ⊙ M) v_old to the second moment, where the square is taken elementwise so that v remains non-negative.
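A minimal sketch of this momentum carry-over, assuming projected moments of shape `(rank, n)` for projectors `P` of shape `(d, rank)`; the function name is illustrative:

```python
import torch

def transform_momentum(P_old, P_new, m_old, v_old):
    """Map Adam moments from the old projection subspace to the new one."""
    M = P_new.T @ P_old      # (r_new, r_old) change-of-basis matrix
    m_new = M @ m_old        # first moment transforms linearly
    v_new = (M ** 2) @ v_old # second moment uses M squared elementwise (stays >= 0)
    return m_new, v_new
```

If the new subspace contains the old one, the old momentum is carried over exactly; components of the old subspace orthogonal to the new one are dropped.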
Norm-Growth Limiting: For training stability, we incorporate a norm-growth limiter to cap relative increases in normalized gradient norm.
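One way such a limiter can work is to rescale the gradient whenever its norm grows by more than a fixed ratio between steps. This is a hedged sketch; `max_ratio` and the exact normalization are assumptions, not the repo's implementation:

```python
import torch

def limit_norm_growth(grad, prev_norm, max_ratio=1.1):
    """Cap the step-to-step growth of the gradient norm.

    prev_norm: norm from the previous step (None on the first step).
    Returns the (possibly rescaled) gradient and its capped norm.
    """
    norm = grad.norm()
    if prev_norm is not None and norm > max_ratio * prev_norm:
        # Rescale so the norm grows by at most max_ratio per step.
        grad = grad * (max_ratio * prev_norm / norm)
        norm = max_ratio * prev_norm
    return grad, norm
```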
```bash
git clone https://github.com/helcig/AdaptiveAdamW.git
cd AdaptiveAdamW
pip install -r requirements.txt
```

Run AdaptiveAdamW with the explained-variance metric:

```bash
# Set rank (e.g., 4, 8, 256)
./experiments/c4_pretraining/scripts/run_llama_pretraining_adaptive.sh explained_variance 8
```

Run baseline GaLore:

```bash
./experiments/c4_pretraining/scripts/run_llama_pretraining_galore.sh 8
```

Key hyperparameters (configurable in the scripts):

- `--rank`: Total rank budget per layer (default varies by experiment)
- `--update_proj_gap`: SVD recomputation frequency (default: 200)
- `--galore_scale`: Gradient scaling factor (default: 0.25)
- `--rank_metric`: Allocation metric (`explained_variance`, `singular_value`, `conditioning`, `gap`, `combined`, `uniform_variance`)
Run AdaptiveAdamW across all GLUE tasks:
```bash
./experiments/glue_finetuning/scripts/run_glue_no_trainer_adaptive.sh 4
```

Run baseline GaLore:

```bash
./experiments/glue_finetuning/scripts/run_glue_no_trainer_galore.sh 4
```

| Optimizer | Script | Description |
|---|---|---|
| `adaptive` | `run_*_adaptive.sh` | AdaptiveAdamW with greedy rank allocation |
| `galore` | `run_*_galore.sh` | Baseline GaLore with uniform rank |
| `galore_kfac` | `run_*_kfac.sh` | GaLore with KFAC preconditioning |
| `rsadamw` | `run_*_rsadamw.sh` | Random Subspace AdamW |
| `gha` | `run_*_gha.sh` | Generalized Hebbian Algorithm |
Pass the metric name as the first argument to adaptive scripts:
- `explained_variance`: Relative variance contribution per layer (recommended)
- `singular_value`: Raw squared singular values
- `conditioning`: Ratio to the dominant singular value
- `gap`: Ratio between consecutive singular values
- `combined`: Weighted combination of variance and conditioning
- `uniform_variance`: Allocate to the layer with the lowest explained variance
| Method | Rank 4 | Rank 8 | Rank 256 |
|---|---|---|---|
| GaLore | 66.66 | 56.49 | 29.75 |
| AdaptiveAdamW | 55.75 | 46.96 | 27.27 |
| Improvement | 16.4% | 16.9% | 8.3% |
Adaptive rank allocation provides the largest gains in low-rank regimes where budget pressure is highest. At Rank 4 and 8, the method achieves 16.4% and 16.9% perplexity improvements respectively. As rank increases, differences narrow since sufficient budget reduces allocation pressure.
| Metric | Rank 4 | Rank 8 | Rank 256 |
|---|---|---|---|
| Explained Variance | 55.75 | 46.96 | 27.27 |
| Singular Value | 58.42 | 52.31 | 29.64 |
| Conditioning | 60.21 | 50.05 | 27.33 |
| Uniform Variance | 56.30 | 52.00 | 48.16 |
| Combined | 59.76 | 50.72 | 27.39 |
| Gap | 66.83 | 66.10 | 37.08 |
Explained variance performs best across rank regimes because normalizing by total spectral energy makes layer scores comparable regardless of raw gradient magnitude, so no layer dominates the allocation simply by having larger gradients.
├── optimizer_torch/
│ ├── adaptive_adamw.py # AdaptiveAdamW optimizer
│ ├── adaptive_projector.py # Projection and rank allocation
│ ├── galore_kfac.py # KFAC preconditioning variant
│ ├── kfac_projector.py # KFAC projector
│ ├── gha_adamw.py # Generalized Hebbian Algorithm optimizer
│ ├── gha_projector.py # GHA projector
│ ├── rsadamw.py # Random Subspace AdamW
│ └── rs_projector.py # Random subspace projector
├── experiments/
│ ├── c4_pretraining/
│ │ ├── configs/ # Model configs (llama_130m, 350m, 1b)
│ │ ├── scripts/ # Training scripts
│ │ ├── pept_utils/ # Training utilities, dataloader, model
│ │ └── run_llama_pretraining.py
│ └── glue_finetuning/
│ ├── scripts/ # Fine-tuning scripts
│ └── run_glue_no_trainer_HF.py
├── requirements.txt
└── README.md
```bibtex
@article{adaptiveadamw2025,
  title={Adaptability in Optimizer State Compression},
  author={Helcig, Michael and Kaiser, Tobias and Rubes, Jan},
  year={2025}
}
```

- Michael Helcig (ETH Zurich) - mhelcig@ethz.ch
- Tobias Kaiser (ETH Zurich) - tokaiser@ethz.ch
- Jan Rubes (ETH Zurich) - jrubes@ethz.ch
MIT License

