Status: Reference Document
Date: 2026-02-07
Authors: Jason Kempf, Claude (Anthropic)

Historical note (2026-02-22): This document captures a pre-MASS training-era synthesis. Any eta = 1/L language is historical and superseded. Active LR control is MASS (eta_step = min(eta_ceiling, eta_sps, eta_weyl)). Canonical LR history: docs/research/lr_derivation_analysis.md. Integration log: docs/research/deep_research_integration_2026_02.md.
There are no "knobs" in LLMs. Temperature, top-p, top-k are noise injection mechanisms that obscure the deterministic geometric structure. The model's weights define a fixed high-dimensional landscape. With greedy decoding (temp=0), every input traces exactly one path through that landscape.
Every training hyperparameter is either:
- Derived from the spectral structure of weight matrices
- Derived from dtype machine precision (float32 constants)
- Removed because the geometric optimizer makes it unnecessary
This document maps each traditional hyperparameter to its geometric replacement.
| # | Hyperparameter | Industry Standard | Geometric Replacement | Formula | Status |
|---|---|---|---|---|---|
| | Optimizer | | | | |
| 1 | Learning Rate | 1e-4 (grid search) | MASS step-size controller | η_step = min(η_ceiling, η_sps, η_weyl) | Implemented |
| 2 | Adam Epsilon | 1e-8 (never questioned) | Spectral noise floor | max(sigma_k^2, sqrt(eps) * sigma_max^2) | Implemented |
| 3 | Adam/Momentum | 0.9 / 0.999 | Cayley-Stiefel retraction | Orthogonality constraint on NB-LoRA factors | Superseded: ScaledGD replaced by Cayley-Stiefel |
| 4 | Weight Decay | 0.01 (uniform) | Condition-aware scaling | sigma_k / sigma_max | Implemented |
| 5 | Gradient Clipping | clip=1.0 | REMOVED | Cayley-Stiefel retraction + MASS budget monitoring prevent explosion | Removed |
| | Training Loop | | | | |
| 6 | Warmup | 5-10% of steps | REMOVED | Geometric LR stable from step 0 | Removed |
| 7 | LR Schedule | Cosine decay | OPTIONAL | Condition ratio is static; cosine marginal | Optional |
| 8 | Batch Size | "As big as fits" | Gradient noise scale | B_crit = Var(g) / \|\|E[g]\|\|^2 | Implemented |
| 9 | Early Stopping | Val loss patience | Geometric convergence | Loss < sqrt(eps) OR spectral budget > 0.9 | Implemented |
| | LoRA | | | | |
| 10 | Scale | alpha/rank = 2.0 | Spectral bound per-layer | sigma_k(W) / \|\|BA\|\|_spectral | Implemented |
| 11 | Rank | 8 (arbitrary) | Null-space capacity | tail_dims = full_rank - floor(shannon_effective_rank) | Implemented |
| 12 | Target Modules | q_proj + v_proj | Spectral decay analysis | Layers where tail_dims > 0 | Implemented |
| 13 | Dropout | 0.1 (arbitrary) | Two spectral ratios | redundancy * adapter_fraction | Implemented |
| 14 | Weight Init | Random A, zeros B | Spectral normalized | \|\|BA\|\|_spectral = sigma_k from step 0 | Implemented |
| | Architecture | | | | |
| 15 | Residual Scaling | alpha = 1 | Spectral ratio per-layer | sigma_max(x) / sigma_max(f(x)) | Implemented |
All formulas reference constants derived from IEEE 754 float32:
| Constant | Symbol | Value | Derivation | Used By |
|---|---|---|---|---|
| Machine epsilon | eps | 2^-23 = 1.19e-7 | Smallest float32 ULP at 1.0 | All formulas |
| Significance threshold | sqrt(eps) | ~3.45e-4 | SVD noise floor | Noise floor, sigma_k |
| LR minimum | eps | 1.19e-7 | Can't represent smaller changes | LR bounds |
| LR maximum | 1/sqrt(eps) | ~2896 | Numerical stability ceiling | LR bounds |
| Eigengap threshold | - | sigma_k/sigma_{k+1} > 2.0 | Meaningful spectral structure | Scale bound refinement |
| Residual alpha range | - | [sqrt(eps), 1/sqrt(eps)] | Precision-derived | Residual scaling clamp |
Code: core/domain/training/hyperparameter_validation.py:45-46 (_EPS, _SQRT_EPS)
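A minimal sketch of these constants, assuming only NumPy; the names mirror the _EPS and _SQRT_EPS identifiers referenced above.

```python
import numpy as np

_EPS = float(np.finfo(np.float32).eps)   # 2**-23 ~ 1.19e-7: machine epsilon, LR minimum
_SQRT_EPS = float(np.sqrt(_EPS))         # ~3.45e-4: SVD significance threshold / noise floor
_LR_MAX = 1.0 / _SQRT_EPS                # ~2896: numerical stability ceiling for LR bounds
```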
Industry: 1e-4 or 3e-4, chosen by grid search or "what worked last time."
Historical geometric path: η = 1/L where L = λ_max(Hessian) is measured
via power iteration on Hessian-vector products.
Why superseded: This path is brittle under stochastic nonsmooth training and is no longer the active controller in ModelCypher.
Active replacement: MASS (Weyl ceiling + SPS + Weyl displacement bound).
See docs/research/lr_derivation_analysis.md for ablations and derivation
history.
Previous approaches (superseded): Per-layer LR = σ_k/σ_max (condition ratio), 1/σ_max (inverse spectral norm), σ_k/σ_max² — all guesses based on weight spectral norms, which are not the Lipschitz constant of the loss gradient. Single-batch L measurement with 2× drift re-measurement threshold — noise dressed as adaptation.
Code: scripts/validate_geometric_training.py (measure_lipschitz_constant), hessian_estimator.py:280 (top_eigenvalue)
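The sketch below illustrates only the eta_step = min(eta_ceiling, eta_sps, eta_weyl) composition. The individual terms are assumptions for illustration (a Polyak-style eta_sps = loss / ||grad||^2, a Weyl-style displacement budget divided by the gradient norm, and a precision-derived ceiling); the actual MASS terms are defined in docs/research/lr_derivation_analysis.md.

```python
# Illustrative sketch of the MASS min-rule; the eta_sps and eta_weyl forms below
# are assumptions, not the ModelCypher implementation.
import numpy as np

def mass_step_size(loss: float, grad: np.ndarray, spectral_budget: float,
                   eta_ceiling: float = 1.0 / np.sqrt(np.finfo(np.float32).eps)) -> float:
    eps = float(np.finfo(np.float32).eps)
    grad_norm_sq = float(np.sum(grad * grad)) + eps
    eta_sps = loss / grad_norm_sq                          # assumed Polyak-style term
    eta_weyl = spectral_budget / np.sqrt(grad_norm_sq)     # assumed Weyl displacement bound
    return float(min(eta_ceiling, eta_sps, eta_weyl))
```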
Industry: 1e-8, the default from Kingma & Ba (2014). Never questioned.
Geometric: eps = max(sigma_k^2, sqrt(eps_mach) * sigma_max^2)
Two floors, take the larger:
- sigma_k^2: the noise floor of the weight's eigenspectrum
- sqrt(eps_mach) * sigma_max^2: the numerical precision floor
Mathematical Basis: Epsilon prevents division by zero in the Adam denominator (1 / (sqrt(v) + eps)). It must be large enough for stability but small enough to preserve gradient signal. Both floors come from the weight's spectral structure.
Code: geometric_optimizer.py:129 (compute_geometric_epsilon)
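A minimal sketch of this rule, assuming sigma_k is the smallest singular value above the sqrt(eps) significance threshold; the function name echoes compute_geometric_epsilon but the signature is assumed.

```python
import numpy as np

def geometric_epsilon(weight: np.ndarray) -> float:
    eps_mach = float(np.finfo(np.float32).eps)           # 2**-23
    sigma = np.linalg.svd(weight, compute_uv=False)      # singular values, descending
    sigma_max = float(sigma[0])
    # Assumption: sigma_k = smallest singular value above the sqrt(eps) * sigma_max noise floor.
    significant = sigma[sigma > np.sqrt(eps_mach) * sigma_max]
    sigma_k = float(significant[-1]) if significant.size else sigma_max
    # Two floors, take the larger.
    return float(max(sigma_k**2, np.sqrt(eps_mach) * sigma_max**2))
```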
Superseded (2026-02-23): For NB-LoRA, Cayley-Stiefel retraction replaced ScaledGD. The Cayley constraint enforces orthogonality on NB-LoRA factors directly, making preconditioning unnecessary (weight space is Euclidean — P = MM^T ≈ I, Fisher degenerate). ScaledGD remains mathematically valid for standard LoRA but is not used in the active training pipeline. See the geometric_optimizer.py docstring.
Industry: beta1=0.9, beta2=0.999, empirically chosen by Kingma & Ba (2014).
Historical geometric path: ScaledGD (Tong, Ma, Chi — JMLR 2021). For factored low-rank problems X = AB, preconditioning each factor's gradient by the pseudoinverse of the other factor achieves condition-number-free convergence. This is Riemannian gradient descent on the rank-r manifold:
grad_A_preconditioned = grad_A @ (B Bᵀ + εI)⁻¹
grad_B_preconditioned = (Aᵀ A + εI)⁻¹ @ grad_B
This simultaneously satisfies three proven requirements:
- LoRA+ (Hayou et al., ICML 2024): A and B provably need different learning rates (η_B/η_A = Θ(n)). ScaledGD produces this automatically — the preconditioning by the other factor's spectral structure creates asymmetric effective rates.
- Mu & Klabjan (Dec 2025): Step size must scale as 1/(L × ||adapters||²). ScaledGD satisfies this — as one factor grows, the preconditioner shrinks the effective step for the other.
- Condition-number-free convergence (Tong et al.): No momentum or adaptive methods needed; the preconditioning normalizes the optimization landscape.
The ε regularization in the inverse uses the geometric epsilon max(σ_k², √ε_mach × σ_max²) — the same value computed for numerical stability throughout the pipeline.
Code: scripts/validate_geometric_training.py (apply_scaled_gd)
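A hedged sketch of one ScaledGD step under the document's X = AB convention (A is m x r, B is r x n). It mirrors the preconditioning formulas above and is not the apply_scaled_gd source; in the active pipeline this path is superseded by the Cayley-Stiefel retraction described above.

```python
import numpy as np

def scaled_gd_step(A, B, grad_A, grad_B, lr, eps):
    """One ScaledGD step for X = A @ B; eps is the geometric epsilon."""
    r = B.shape[0]
    precond_A = np.linalg.inv(B @ B.T + eps * np.eye(r))   # (B B^T + eps I)^-1, r x r
    precond_B = np.linalg.inv(A.T @ A + eps * np.eye(r))   # (A^T A + eps I)^-1, r x r
    A_new = A - lr * (grad_A @ precond_A)                   # effective per-factor step adapts
    B_new = B - lr * (precond_B @ grad_B)                   # to the other factor's spectrum
    return A_new, B_new
```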
Industry: 0.01, applied uniformly to all parameters.
Geometric: decay_scale = sigma_k / sigma_max (condition ratio). Poorly-conditioned layers (high kappa = sigma_max / sigma_k) get less decay because their small singular values are already near the noise floor. Decaying them further destroys useful signal.
Code: geometric_optimizer.py:157 (compute_decay_scale)
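A minimal sketch of the condition ratio, again assuming sigma_k is the smallest significant singular value; how the ratio feeds into the per-layer decay is left to compute_decay_scale.

```python
import numpy as np

def decay_scale(weight: np.ndarray) -> float:
    eps_mach = float(np.finfo(np.float32).eps)
    sigma = np.linalg.svd(weight, compute_uv=False)
    significant = sigma[sigma > np.sqrt(eps_mach) * sigma[0]]
    sigma_k = float(significant[-1]) if significant.size else float(sigma[0])
    # Poorly conditioned layers (small ratio) receive proportionally less decay.
    return sigma_k / float(sigma[0])
```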
Industry: clip=1.0, from Pascanu et al. (2013). No theoretical basis for the threshold.
Geometric: REMOVED. Cayley-Stiefel retraction bounds update norms by construction. The spectral budget monitor (check_budget_exhausted) catches any violation and halts training — this is cleaner than clipping because halting respects the bound while clipping is a heuristic correction. Experiments showed 0% clipping events across all layers.
Deep Dive: training_heuristics_analysis.md, Experiment 1
Industry: Linear warmup for 5-10% of total steps.
Geometric: REMOVED. Geometric LR is bounded by spectral structure from step 0. The condition ratio sigma_k/sigma_max is computed once from base weights and never changes. There is no "cold start" problem because the step size is not learned - it's measured. Ma & Yarats (AAAI 2021) showed warmup compensates for Adam's initial update magnitude starting at exactly alpha. The geometric optimizer has no such initialization artifact.
Deep Dive: training_heuristics_analysis.md, Experiment 2
Industry: Cosine decay to 0 (standard in transformer training).
Geometric: OPTIONAL. The per-layer condition ratio is static (computed once from base weights), so there's no implicit schedule. Experiments showed cosine decay gave only a marginal improvement (0.008 in loss) over the flat geometric LR. Defazio et al. (NeurIPS 2024) showed schedules are equivalent to iterate averaging. For the geometric optimizer, the static LR works because the condition ratio already captures the correct scale.
Deep Dive: training_heuristics_analysis.md, Experiment 3
Industry: "As big as fits in memory."
Geometric: B_crit = Var(g) / ||E[g]||^2 (gradient noise scale). Below B_crit: linear speedup. Above B_crit: diminishing returns. The gradient covariance encodes sample redundancy. Low effective rank of gradient covariance means samples are redundant and larger batches are safe.
Status: Partial. Formula proven (McCandlish et al. 2018), not wired to training loop.
Deep Dive: training_heuristics_analysis.md, Experiment 4
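A sketch of the simple gradient-noise-scale estimator from per-example gradients (illustrative only; as noted above, this is not wired into the training loop).

```python
import numpy as np

def critical_batch_size(per_example_grads: np.ndarray) -> float:
    """per_example_grads: (num_examples, num_params) flattened per-example gradients."""
    mean_grad = per_example_grads.mean(axis=0)
    total_variance = float(per_example_grads.var(axis=0).sum())   # tr(Cov(g)) ~ Var(g)
    signal = float(mean_grad @ mean_grad)                         # ||E[g]||^2
    return total_variance / max(signal, float(np.finfo(np.float32).eps))
```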
Industry: "Stop when validation loss hasn't improved for N epochs" (patience).
Geometric: Two criteria, no validation set required:
should_stop = loss_stable OR budget_exhausted
Where:
- loss_stable: loss change below sqrt(eps) (the numerical precision floor)
- budget_exhausted: spectral bound ratio > 0.9 (90% of the geometric budget consumed, i.e. ||BA||_spectral / sigma_k > 0.9)
All thresholds are dtype-derived (sqrt(eps)) or geometry-derived (spectral bound ratio).
Deep Dive: training_heuristics_analysis.md, Phase 2b
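A minimal sketch of the stop rule, assuming the caller supplies the adapter's spectral norm and the layer's sigma_k.

```python
import numpy as np

_SQRT_EPS = float(np.sqrt(np.finfo(np.float32).eps))   # ~3.45e-4

def should_stop(prev_loss: float, curr_loss: float,
                adapter_spectral_norm: float, sigma_k: float) -> bool:
    loss_stable = abs(prev_loss - curr_loss) < _SQRT_EPS          # precision floor reached
    budget_exhausted = (adapter_spectral_norm / sigma_k) > 0.9    # ||BA||_spectral / sigma_k
    return loss_stable or budget_exhausted
```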
Industry: scale = alpha/rank (typically alpha=16, rank=8, scale=2.0).
Geometric: scale <= sigma_k(W) / ||B @ A||_spectral per layer. All 9 tested adapters were 22-2700x over this bound. The standard scale of 2.0 caused catastrophic model degradation (gibberish output, repetitive loops). The geometric scale restored correct reasoning.
Mathematical Basis: By Weyl's inequality, |sigma_i(W') - sigma_i(W)| <= ||scale * Delta||_2. To preserve W's spectral structure, the perturbation must not exceed sigma_k. To prevent singular values from crossing at the structural boundary, the Weyl no-crossing condition ||E||_2 < gap_k / 2 applies, giving scale <= gap_k / (2 * ||Delta||_2).
Code: lora_safety_service.py (compute_geometric_scale, apply_lora_geometric)
Deep Dive: lora_spectral_scale_bound.md (empirics), lora_spectral_theory.md (3 theorems)
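A sketch of the per-layer bound, assuming sigma_k is the smallest significant singular value of the base weight; compute_geometric_scale may refine this further with the eigengap rule above.

```python
import numpy as np

def geometric_lora_scale(W: np.ndarray, B: np.ndarray, A: np.ndarray) -> float:
    eps_mach = float(np.finfo(np.float32).eps)
    sigma = np.linalg.svd(W, compute_uv=False)
    significant = sigma[sigma > np.sqrt(eps_mach) * sigma[0]]
    sigma_k = float(significant[-1]) if significant.size else float(sigma[0])
    delta_spectral = float(np.linalg.svd(B @ A, compute_uv=False)[0])   # ||B @ A||_spectral
    return sigma_k / max(delta_spectral, eps_mach)                      # scale upper bound
```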
Industry: 8 or 16, chosen arbitrarily.
Geometric: rank = tail_dims = full_rank - floor(shannon_effective_rank) per layer. The Shannon effective rank captures structural spectral utilization, while precision rank (max(m,n) * eps * sigma_max) is a secondary numerical diagnostic. The tail dimensions are the null-space capacity where LoRA can add information without interfering with the base model's learned structure. Standard rank-8 is typically under-parameterized by geometry.
Per-layer adaptive rank: each layer gets its own rank based on its spectral structure. Layers with more null space get higher rank.
Code: geometric_lora.py:251 (compute_per_layer_ranks), :230 (compute_geometric_rank)
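A sketch of the per-layer rank rule, using the Shannon effective rank exp(H(sigma^2)) (Roy & Vetterli 2007); the signature is assumed.

```python
import numpy as np

def geometric_rank(W: np.ndarray) -> int:
    eps_mach = float(np.finfo(np.float32).eps)
    sigma = np.linalg.svd(W, compute_uv=False)
    p = sigma**2 / np.sum(sigma**2)                                  # spectral energy distribution
    shannon_eff_rank = float(np.exp(-np.sum(p * np.log(p + eps_mach))))
    full_rank = min(W.shape)
    return max(full_rank - int(np.floor(shannon_eff_rank)), 0)      # tail_dims: null-space capacity
```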
Industry: q_proj + v_proj (convention from Hu et al. 2021).
Geometric: Target layers where tail_dims > 0 (non-zero null-space capacity). Spectral decay analysis of LFM2-350M attention:
| Projection | sigma_k | Decay Ratio | Scale Bound |
|---|---|---|---|
| v_proj | 0.46 | 10x | ~0.5 |
| k_proj | 0.30 | 42x | ~0.3 |
| q_proj | 0.005 | 2,810x | ~0.002 |
| o_proj | 0.003 | 2,508x | ~0.002 |
v_proj/k_proj have 100x more room for perturbation than q_proj/o_proj. The standard practice of targeting q_proj + v_proj is geometrically inconsistent.
Code: geometric_lora.py:213 (select_target_modules)
Deep Dive: lora_projection_targeting.md
Industry: 0.1, arbitrary.
Geometric: Product of two spectral ratios, no arbitrary constants:
dropout = redundancy * adapter_fraction
Where:
- redundancy = 1 - shannon_eff_rank / full_rank (spectral concentration: 0 = flat spectrum, 1 = single dominant SV)
- adapter_fraction = rank / full_rank (how much of the weight's space LoRA occupies)
- shannon_eff_rank = exp(H(sigma^2)) (Roy & Vetterli 2007)
Self-calibrating: layers with more null-space capacity get both higher rank AND higher dropout. The two ratios multiply to give values that scale with the geometry. NB-LoRA (Cayley transform) uses 0.0 because its spectral norm bound is strictly tighter than dropout's nuclear norm regularization.
Validated on 7 real models across 4 architectures. Dropout ranges from 0.001 (small rank) to 0.11 (large rank layers).
Code: geometric_lora.py:283 (compute_geometric_dropout)
Deep Dive: training_heuristics_analysis.md, Experiment 5
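A sketch of the two-ratio product, reusing the same effective-rank computation (illustrative; compute_geometric_dropout is the implemented version).

```python
import numpy as np

def geometric_dropout(W: np.ndarray, lora_rank: int) -> float:
    eps_mach = float(np.finfo(np.float32).eps)
    sigma = np.linalg.svd(W, compute_uv=False)
    p = sigma**2 / np.sum(sigma**2)
    shannon_eff_rank = float(np.exp(-np.sum(p * np.log(p + eps_mach))))
    full_rank = min(W.shape)
    redundancy = 1.0 - shannon_eff_rank / full_rank       # spectral concentration
    adapter_fraction = lora_rank / full_rank              # share of the space LoRA occupies
    return redundancy * adapter_fraction
```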
Industry: Random A (Gaussian), zeros B (Hu et al. 2021). Product B @ A starts at zero and must "grow into" the budget during training.
Geometric: Spectral normalized initialization:
||A||_spectral = sqrt(sigma_k)
||B||_spectral = sqrt(sigma_k)
||B @ A||_spectral = sigma_k
Uses the full geometric budget from step 0. Each matrix gets sqrt(sigma_k) spectral norm so the product respects the bound immediately.
Code: spectral_init.py:162 (spectral_normalized_lora_init)
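A sketch of the initialization, assuming Gaussian draws and the delta = B @ A convention; each factor is rescaled to spectral norm sqrt(sigma_k) so the product starts at the sigma_k budget.

```python
import numpy as np

def spectral_normalized_lora_init(d_out: int, d_in: int, rank: int, sigma_k: float):
    rng = np.random.default_rng(0)
    A = rng.standard_normal((rank, d_in))
    B = rng.standard_normal((d_out, rank))
    # Rescale each factor so ||A||_spectral = ||B||_spectral = sqrt(sigma_k);
    # the product then satisfies ||B @ A||_spectral <= sigma_k from step 0.
    A *= np.sqrt(sigma_k) / np.linalg.svd(A, compute_uv=False)[0]
    B *= np.sqrt(sigma_k) / np.linalg.svd(B, compute_uv=False)[0]
    return A, B
```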
Industry: alpha = 1 (no scaling, standard residual output = x + f(x)).
Geometric: alpha_i = sigma_max(x) / sigma_max(f(x)) per layer. Normalizes so ||alpha * f(x)|| ~ ||x||, making residual contributions comparable across layers. Hook-based (non-invasive, no model modification). Clamped to precision-derived range [sqrt(eps), 1/sqrt(eps)]. Spectral norms computed via fast power iteration (3 iterations).
Code: residual_scaling.py:184 (compute_residual_scale), :39 (spectral_norm_power_iteration)
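A sketch of the hook computation, treating x and f(x) as activation matrices (tokens x hidden). The power iteration and clamping follow the description above; the function signatures are assumptions.

```python
import numpy as np

def spectral_norm_power_iteration(M: np.ndarray, iters: int = 3) -> float:
    v = np.random.default_rng(0).standard_normal(M.shape[1])
    for _ in range(iters):
        v = M.T @ (M @ v)
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(M @ v))                   # estimate of sigma_max(M)

def residual_scale(x: np.ndarray, fx: np.ndarray) -> float:
    eps_mach = float(np.finfo(np.float32).eps)
    alpha = spectral_norm_power_iteration(x) / max(spectral_norm_power_iteration(fx), eps_mach)
    # Clamp to the precision-derived range [sqrt(eps), 1/sqrt(eps)].
    return float(np.clip(alpha, np.sqrt(eps_mach), 1.0 / np.sqrt(eps_mach)))
```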
| Heuristic | Why Removed / Replaced | Evidence |
|---|---|---|
| Gradient Clipping | Cayley-Stiefel retraction bounds updates; budget monitor halts on violation | 0% clip events across all layers in experiments |
| Warmup | MASS is stable from step 0, no warmup needed | No divergence without warmup; convergence immediate |
| Adam / Momentum | Replaced by Cayley-Stiefel retraction (NB-LoRA). Historical: ScaledGD (Tong et al. JMLR 2021) | Orthogonality constraint makes preconditioning unnecessary |
| Per-layer LR heuristics | Replaced by MASS controller + Weyl bounds | σ_k/σ_max, 1/σ_max, σ_k/σ_max² were guesses, not measured controls |
| Heuristic | Formula | What Exists | What's Missing |
|---|---|---|---|
| Weight Decay | sigma_k / sigma_max | compute_decay_scale() in optimizer | Full integration across all training paths |
| Topic | Document | Content |
|---|---|---|
| LoRA scale empirical validation | lora_spectral_scale_bound.md | 9 adapter analysis, GSM8K test case |
| 3 Theorems (Weyl, Weyl no-crossing, Sufficiency) | lora_spectral_theory.md | Full proofs for necessity, no-crossing, sufficiency |
| Rank/target/scale original derivation | lora_geometric_derivation.md | Original derivation (superseded by scale bound) |
| Projection targeting | lora_projection_targeting.md | q_proj vs v_proj spectral analysis |
| All 5 experiments + Phase 2b | training_heuristics_analysis.md | Phase-by-phase results, conclusions |
- Cavazza, J. et al. (2018). Dropout as a Low-Rank Regularizer for Matrix Factorization. AISTATS.
- Davis, C. & Kahan, W.M. (1970). The Rotation of Eigenvectors by a Perturbation. III. SIAM J. Numer. Anal.
- Defazio, A. et al. (2024). The Road Less Scheduled. NeurIPS.
- Golub, G.H. & Van Loan, C.F. (2013). Matrix Computations (4th ed.). Johns Hopkins.
- Hayou, S. et al. (2024). LoRA+: Efficient Low-Rank Adaptation of Large Models. ICML.
- Hu, E.J. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
- Kingma, D.P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
- Ma, J. & Yarats, D. (2021). On the Adequacy of Untuned Warmup for Adaptive Optimization. AAAI.
- McCandlish, S. et al. (2018). An Empirical Model of Large-Batch Training. arXiv:1812.06162.
- Miyato, T. et al. (2018). Spectral Normalization for Generative Adversarial Networks. ICLR.
- Mu, T. & Klabjan, D. (2025). Convergence Analysis of LoRA Fine-Tuning. arXiv (Dec 2025).
- Nesterov, Y. (2004). Introductory Lectures on Convex Optimization. Springer.
- Pascanu, R. et al. (2013). On the difficulty of training recurrent neural networks. ICML.
- Roy, O. & Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. EUSIPCO.
- Tong, T., Ma, C. & Chi, Y. (2021). Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent. JMLR.
- Tran, H. et al. (2025). Spectral Perturbation Bounds Under Eigengap Conditions. arXiv:2510.25670.
- Wang, X. et al. (2025). NB-LoRA: Norm-Bounded Low-Rank Adaptation. arXiv:2501.19050.
- Weyl, H. (1912). Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen. Math. Ann.
All code paths relative to src/modelcypher/. All formulas verified against source.