
Geometric Hyperparameter Rosetta Stone [EMPIRICAL]

Status: Reference Document
Date: 2026-02-07
Authors: Jason Kempf, Claude (Anthropic)

Historical note (2026-02-22): This document captures a pre-MASS training-era synthesis. Any eta = 1/L language is historical and superseded. Active LR control is MASS (eta_step = min(eta_ceiling, eta_sps, eta_weyl)). Canonical LR history: docs/research/lr_derivation_analysis.md. Integration log: docs/research/deep_research_integration_2026_02.md.


Thesis

There are no "knobs" in LLMs. Temperature, top-p, and top-k are noise-injection mechanisms that obscure the deterministic geometric structure. The model's weights define a fixed high-dimensional landscape. With greedy decoding (temp=0), every input traces exactly one path through that landscape.

Every training hyperparameter is either:

  1. Derived from the spectral structure of weight matrices
  2. Derived from dtype machine precision (float32 constants)
  3. Removed because the geometric optimizer makes it unnecessary

This document maps each traditional hyperparameter to its geometric replacement.


Master Reference Table

Optimizer

| # | Hyperparameter | Industry Standard | Geometric Replacement | Formula | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Learning Rate | 1e-4 (grid search) | MASS step-size controller | eta_step = min(eta_ceiling, eta_sps, eta_weyl) | Implemented |
| 2 | Adam Epsilon | 1e-8 (never questioned) | Spectral noise floor | max(sigma_k^2, sqrt(eps) * sigma_max^2) | Implemented |
| 3 | Adam/Momentum | 0.9 / 0.999 | Cayley-Stiefel retraction | Orthogonality constraint on NB-LoRA factors | Superseded (ScaledGD replaced by Cayley-Stiefel) |
| 4 | Weight Decay | 0.01 (uniform) | Condition-aware scaling | sigma_k / sigma_max | Implemented |
| 5 | Gradient Clipping | clip=1.0 | REMOVED | Cayley-Stiefel retraction + MASS budget monitoring prevent explosion | Removed |

Training Loop

| # | Hyperparameter | Industry Standard | Geometric Replacement | Formula | Status |
| --- | --- | --- | --- | --- | --- |
| 6 | Warmup | 5-10% of steps | REMOVED | Geometric LR stable from step 0 | Removed |
| 7 | LR Schedule | Cosine decay | OPTIONAL | Condition ratio is static; cosine gain marginal | Optional |
| 8 | Batch Size | "As big as fits" | Gradient noise scale | B_crit = Var(g) / \|\|E[g]\|\|^2 | Partial |
| 9 | Early Stopping | Val loss patience | Geometric convergence | Loss delta < sqrt(eps) OR spectral budget > 0.9 | Implemented |

LoRA

| # | Hyperparameter | Industry Standard | Geometric Replacement | Formula | Status |
| --- | --- | --- | --- | --- | --- |
| 10 | Scale | alpha/rank = 2.0 | Spectral bound per layer | sigma_k(W) / \|\|BA\|\|_spectral | Implemented |
| 11 | Rank | 8 (arbitrary) | Null-space capacity | tail_dims = full_rank - floor(shannon_effective_rank) | Implemented |
| 12 | Target Modules | q_proj + v_proj | Spectral decay analysis | Layers where tail_dims > 0 | Implemented |
| 13 | Dropout | 0.1 (arbitrary) | Two spectral ratios | redundancy * adapter_fraction | Implemented |
| 14 | Weight Init | Random A, zeros B | Spectral normalized | \|\|BA\|\|_spectral = sigma_k from step 0 | Implemented |

Architecture

| # | Hyperparameter | Industry Standard | Geometric Replacement | Formula | Status |
| --- | --- | --- | --- | --- | --- |
| 15 | Residual Scaling | alpha = 1 | Spectral ratio per layer | sigma_max(x) / sigma_max(f(x)) | Implemented |

Constants Derived from Machine Precision [PROVEN]

All formulas reference constants derived from IEEE 754 float32:

| Constant | Symbol | Value | Derivation | Used By |
| --- | --- | --- | --- | --- |
| Machine epsilon | eps | 2^-23 = 1.19e-7 | Smallest float32 ULP at 1.0 | All formulas |
| Significance threshold | sqrt(eps) | ~3.45e-4 | SVD noise floor | Noise floor, sigma_k |
| LR minimum | eps | 1.19e-7 | Cannot represent smaller changes | LR bounds |
| LR maximum | 1/sqrt(eps) | ~2896 | Numerical stability ceiling | LR bounds |
| Eigengap threshold | - | sigma_k/sigma_{k+1} > 2.0 | Meaningful spectral structure | Scale bound refinement |
| Residual alpha range | - | [sqrt(eps), 1/sqrt(eps)] | Precision-derived | Residual scaling clamp |

Code: core/domain/training/hyperparameter_validation.py:45-46 (_EPS, _SQRT_EPS)


Detailed Derivations

1. Learning Rate (Historical Path Superseded by MASS) [DISPROVEN]

Industry: 1e-4 or 3e-4, chosen by grid search or "what worked last time."

Historical geometric path: η = 1/L where L = λ_max(Hessian) is measured via power iteration on Hessian-vector products.

Why superseded: This path is brittle under stochastic nonsmooth training and is no longer the active controller in ModelCypher.

Active replacement: MASS (Weyl ceiling + SPS + Weyl displacement bound).

See docs/research/lr_derivation_analysis.md for ablations and derivation history.
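The controller's min-of-bounds structure can be sketched in a few lines. This is an illustration of the rule quoted above, not the ModelCypher implementation; the SPS term is assumed to be the standard stochastic Polyak step, and all names here are hypothetical:

```python
def sps_step(loss: float, loss_floor: float, grad_sq_norm: float,
             eps: float = 2.0 ** -23) -> float:
    # Stochastic Polyak step: (f(x) - f*) / ||grad f(x)||^2,
    # guarded by machine epsilon against a vanishing gradient norm.
    return (loss - loss_floor) / max(grad_sq_norm, eps)

def mass_step(eta_ceiling: float, eta_sps: float, eta_weyl: float) -> float:
    # MASS takes the most conservative of the three bounds each step.
    return min(eta_ceiling, eta_sps, eta_weyl)
```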

Previous approaches (superseded): per-layer LR = σ_k/σ_max (condition ratio), 1/σ_max (inverse spectral norm), and σ_k/σ_max² — all guesses based on weight spectral norms, which are not the Lipschitz constant of the loss gradient. Single-batch L measurement with a 2× drift re-measurement threshold was noise dressed as adaptation.

Code: scripts/validate_geometric_training.py (measure_lipschitz_constant), hessian_estimator.py:280 (top_eigenvalue)


2. Adam Epsilon [EMPIRICAL]

Industry: 1e-8, the default from Kingma & Ba (2014). Never questioned.

Geometric: eps = max(sigma_k^2, sqrt(eps_mach) * sigma_max^2)

Two floors, take the larger:

  • sigma_k^2: the noise floor of the weight's eigenspectrum
  • sqrt(eps_mach) * sigma_max^2: the numerical precision floor

Mathematical Basis: Epsilon prevents division by zero in the Adam denominator (1 / (sqrt(v) + eps)). It must be large enough for stability but small enough to preserve gradient signal. Both floors come from the weight's spectral structure.

Code: geometric_optimizer.py:129 (compute_geometric_epsilon)
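A minimal sketch of the two-floor rule — the function mirrors compute_geometric_epsilon in geometric_optimizer.py, but this standalone version is illustrative:

```python
import math

def geometric_epsilon(sigma_k: float, sigma_max: float,
                      eps_mach: float = 2.0 ** -23) -> float:
    spectral_floor = sigma_k ** 2                           # eigenspectrum noise floor
    precision_floor = math.sqrt(eps_mach) * sigma_max ** 2  # float32 precision floor
    return max(spectral_floor, precision_floor)
```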


3. Adam / Momentum (Beta1/Beta2) -> ScaledGD (Superseded)

Superseded (2026-02-23): For NB-LoRA, Cayley-Stiefel retraction replaced ScaledGD. The Cayley constraint enforces orthogonality on NB-LoRA factors directly, making preconditioning unnecessary (weight space is Euclidean — P = MM^T ≈ I, Fisher degenerate). ScaledGD remains mathematically valid for standard LoRA but is not used in the active training pipeline. See geometric_optimizer.py docstring.

Industry: beta1=0.9, beta2=0.999, empirically chosen by Kingma & Ba (2014).

Historical geometric path: ScaledGD (Tong, Ma, Chi — JMLR 2021). For factored low-rank problems X = AB, preconditioning each factor's gradient by the pseudoinverse of the other factor achieves condition-number-free convergence. This is Riemannian gradient descent on the rank-r manifold:

grad_A_preconditioned = grad_A @ (B Bᵀ + εI)⁻¹
grad_B_preconditioned = (Aᵀ A + εI)⁻¹ @ grad_B

This simultaneously satisfies three proven requirements:

  • LoRA+ (Hayou et al., ICML 2024): A and B provably need different learning rates (η_B/η_A = Θ(n)). ScaledGD produces this automatically — the preconditioning by the other factor's spectral structure creates asymmetric effective rates.
  • Mu & Klabjan (Dec 2025): Step size must scale as 1/(L × ||adapters||²). ScaledGD satisfies this — as one factor grows, the preconditioner shrinks the effective step for the other.
  • Condition-number-free convergence (Tong et al.): No momentum or adaptive methods needed; the preconditioning normalizes the optimization landscape.

The ε regularization in the inverse uses the geometric epsilon max(σ_k², √ε_mach × σ_max²) — the same value computed for numerical stability throughout the pipeline.

Code: scripts/validate_geometric_training.py (apply_scaled_gd)
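The two preconditioned gradients can be sketched directly from the formulas above. A minimal NumPy illustration (hypothetical function name; the real logic lives in apply_scaled_gd):

```python
import numpy as np

def scaled_gd_step(A, B, grad_A, grad_B, lr, eps):
    """One ScaledGD update for X = A @ B with A (m, r) and B (r, n).

    Each factor's gradient is preconditioned by the regularized Gram
    matrix of the other factor, equalizing the effective step across
    singular directions (Tong, Ma & Chi, JMLR 2021).
    """
    r = A.shape[1]
    pre_A = grad_A @ np.linalg.inv(B @ B.T + eps * np.eye(r))
    pre_B = np.linalg.inv(A.T @ A + eps * np.eye(r)) @ grad_B
    return A - lr * pre_A, B - lr * pre_B
```

As one factor grows, the other's preconditioner shrinks its effective step, which is how the LoRA+ asymmetric-rate requirement is met without a second learning rate.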


4. Weight Decay [EMPIRICAL]

Industry: 0.01, applied uniformly to all parameters.

Geometric: decay_scale = sigma_k / sigma_max (condition ratio). Poorly-conditioned layers (high kappa = sigma_max / sigma_k) get less decay because their small singular values are already near the noise floor. Decaying them further destroys useful signal.

Code: geometric_optimizer.py:157 (compute_decay_scale)
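A sketch of the scaling rule. The real compute_decay_scale signature may differ; `base_decay` here is an illustrative parameter:

```python
def decay_scale(sigma_k: float, sigma_max: float, base_decay: float = 0.01) -> float:
    # Ill-conditioned layers (sigma_k near the noise floor) get less decay,
    # protecting small singular values that still carry signal.
    return base_decay * (sigma_k / sigma_max)
```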


5. Gradient Clipping [EMPIRICAL]

Industry: clip=1.0, from Pascanu et al. (2013). No theoretical basis for the threshold.

Geometric: REMOVED. Cayley-Stiefel retraction bounds update norms by construction. The spectral budget monitor (check_budget_exhausted) catches any violation and halts training — this is cleaner than clipping because halting respects the bound while clipping is a heuristic correction. Experiments showed 0% clipping events across all layers.

Deep Dive: training_heuristics_analysis.md, Experiment 1


6. Warmup [EMPIRICAL]

Industry: Linear warmup for 5-10% of total steps.

Geometric: REMOVED. Geometric LR is bounded by spectral structure from step 0. The condition ratio sigma_k/sigma_max is computed once from base weights and never changes. There is no "cold start" problem because the step size is not learned: it is measured. Ma & Yarats (AAAI 2021) showed warmup compensates for Adam's initial update magnitude starting at exactly alpha. The geometric optimizer has no such initialization artifact.

Deep Dive: training_heuristics_analysis.md, Experiment 2


7. LR Schedule [EMPIRICAL]

Industry: Cosine decay to 0 (standard in transformer training).

Geometric: OPTIONAL. The per-layer condition ratio is static (computed once from base weights), so there's no implicit schedule. Experiments showed cosine decay gave only marginal improvement (0.008 loss) over flat geometric LR. Defazio et al. (NeurIPS 2024) showed schedules are equivalent to iterate averaging. For the geometric optimizer, the static LR works because the condition ratio already captures the correct scale.

Deep Dive: training_heuristics_analysis.md, Experiment 3


8. Batch Size [CONJECTURAL]

Industry: "As big as fits in memory."

Geometric: B_crit = Var(g) / ||E[g]||^2 (gradient noise scale). Below B_crit: linear speedup. Above B_crit: diminishing returns. The gradient covariance encodes sample redundancy. Low effective rank of gradient covariance means samples are redundant and larger batches are safe.

Status: Partial. Formula proven (McCandlish et al. 2018), not wired to training loop.

Deep Dive: training_heuristics_analysis.md, Experiment 4
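The gradient noise scale can be estimated from per-sample gradients. A simplified sketch of the McCandlish et al. (2018) statistic, taking the trace of the gradient covariance as the scalar "Var(g)" (function name hypothetical):

```python
import numpy as np

def critical_batch_size(per_sample_grads: np.ndarray) -> float:
    """B_crit ~= tr(Cov(g)) / ||E[g]||^2 for gradients of shape (n, d)."""
    mean_grad = per_sample_grads.mean(axis=0)
    # Sum of per-coordinate variances = trace of the gradient covariance.
    trace_cov = per_sample_grads.var(axis=0, ddof=1).sum()
    return float(trace_cov / (mean_grad @ mean_grad))
```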


9. Early Stopping [EMPIRICAL]

Industry: "Stop when validation loss hasn't improved for N epochs" (patience).

Geometric: Two criteria, no validation set required:

should_stop = loss_stable OR budget_exhausted

Where:

  • loss_stable: Loss change below sqrt(eps) (numerical precision floor)
  • budget_exhausted: Spectral bound ratio > 0.9 (90% of geometric budget consumed, i.e. ||BA||_spectral / sigma_k > 0.9)

All thresholds are dtype-derived (sqrt(eps)) or geometry-derived (spectral bound ratio).

Deep Dive: training_heuristics_analysis.md, Phase 2b
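The stopping rule combines the two criteria above. A minimal sketch with a hypothetical signature; thresholds are exactly as stated in the text:

```python
import math

_SQRT_EPS = math.sqrt(2.0 ** -23)  # float32 precision floor

def should_stop(loss_delta: float, ba_spectral_norm: float, sigma_k: float) -> bool:
    loss_stable = abs(loss_delta) < _SQRT_EPS            # below representable change
    budget_exhausted = ba_spectral_norm / sigma_k > 0.9  # 90% of spectral budget used
    return loss_stable or budget_exhausted
```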


10. LoRA Scale [VALIDATED]

Industry: scale = alpha/rank (typically alpha=16, rank=8, scale=2.0).

Geometric: scale <= sigma_k(W) / ||B @ A||_spectral per layer. All 9 tested adapters were 22-2700x over this bound. The standard scale of 2.0 caused catastrophic model degradation (gibberish output, repetitive loops). The geometric scale restored correct reasoning.

Mathematical Basis: By Weyl's inequality, |sigma_i(W') - sigma_i(W)| <= ||scale * Delta||_2. To preserve W's spectral structure, the perturbation must not exceed sigma_k. To prevent singular-value crossing at the structural boundary, the Weyl no-crossing condition requires ||E||_2 < gap_k / 2, giving scale <= gap_k / (2 * ||Delta||_2).

Code: lora_safety_service.py (compute_geometric_scale, apply_lora_geometric)

Deep Dive: lora_spectral_scale_bound.md (empirics), lora_spectral_theory.md (3 theorems)
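The per-layer bound can be checked directly with an SVD. A sketch (hypothetical standalone version of compute_geometric_scale; k is 1-indexed as in the text):

```python
import numpy as np

def geometric_lora_scale(W: np.ndarray, B: np.ndarray, A: np.ndarray, k: int) -> float:
    """Largest scale such that ||scale * B @ A||_2 <= sigma_k(W) (Weyl bound)."""
    sigma_k = np.linalg.svd(W, compute_uv=False)[k - 1]   # k-th singular value of W
    delta_norm = np.linalg.svd(B @ A, compute_uv=False)[0]  # spectral norm of update
    return float(sigma_k / delta_norm)
```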


11. LoRA Rank [EMPIRICAL]

Industry: 8 or 16, chosen arbitrarily.

Geometric: rank = tail_dims = full_rank - floor(shannon_effective_rank) per layer. The Shannon effective rank captures structural spectral utilization, while precision rank (max(m,n) * eps * sigma_max) is a secondary numerical diagnostic. The tail dimensions are the null-space capacity where LoRA can add information without interfering with the base model's learned structure. Standard rank-8 is typically under-parameterized by geometry.

Per-layer adaptive rank: each layer gets its own rank based on its spectral structure. Layers with more null space get higher rank.

Code: geometric_lora.py:251 (compute_per_layer_ranks), :230 (compute_geometric_rank)
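The rank formula can be sketched from the Shannon effective rank definition (Roy & Vetterli 2007). Hypothetical standalone versions of the routines referenced above:

```python
import numpy as np

def shannon_effective_rank(s: np.ndarray) -> float:
    """exp(H(p)) where p_i = sigma_i^2 / sum_j sigma_j^2."""
    p = s.astype(float) ** 2
    p = p / p.sum()
    p = p[p > 0]  # 0 * log(0) := 0
    return float(np.exp(-(p * np.log(p)).sum()))

def geometric_rank(W: np.ndarray) -> int:
    """tail_dims = full_rank - floor(shannon_effective_rank)."""
    s = np.linalg.svd(W, compute_uv=False)
    return min(W.shape) - int(np.floor(shannon_effective_rank(s)))
```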


12. LoRA Target Modules [EMPIRICAL]

Industry: q_proj + v_proj (convention from Hu et al. 2021).

Geometric: Target layers where tail_dims > 0 (non-zero null-space capacity). Spectral decay analysis of LFM2-350M attention:

| Projection | sigma_k | Decay Ratio | Scale Bound |
| --- | --- | --- | --- |
| v_proj | 0.46 | 10x | ~0.5 |
| k_proj | 0.30 | 42x | ~0.3 |
| q_proj | 0.005 | 2,810x | ~0.002 |
| o_proj | 0.003 | 2,508x | ~0.002 |

v_proj/k_proj have 100x more room for perturbation than q_proj/o_proj. The standard practice of targeting q_proj + v_proj is geometrically inconsistent: it pairs the most perturbable projection (v_proj) with one of the least (q_proj).

Code: geometric_lora.py:213 (select_target_modules)

Deep Dive: lora_projection_targeting.md


13. LoRA Dropout [VALIDATED]

Industry: 0.1, arbitrary.

Geometric: Product of two spectral ratios, no arbitrary constants:

dropout = redundancy * adapter_fraction

Where:

  • redundancy = 1 - shannon_eff_rank / full_rank (spectral concentration, 0 = flat spectrum, 1 = single dominant SV)
  • adapter_fraction = rank / full_rank (how much of the weight's space LoRA occupies)
  • shannon_eff_rank = exp(H(sigma^2)) (Roy & Vetterli 2007)

Self-calibrating: layers with more null-space capacity get both higher rank AND higher dropout. The two ratios multiply to give values that scale with the geometry. NB-LoRA (Cayley transform) uses 0.0 because its spectral norm bound is strictly tighter than dropout's nuclear norm regularization.

Validated on 7 real models across 4 architectures. Dropout ranges from 0.001 (small rank) to 0.11 (large rank layers).

Code: geometric_lora.py:283 (compute_geometric_dropout)

Deep Dive: training_heuristics_analysis.md, Experiment 5
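The product of the two ratios is a one-liner. A sketch with a hypothetical signature:

```python
def geometric_dropout(shannon_eff_rank: float, full_rank: int, lora_rank: int) -> float:
    redundancy = 1.0 - shannon_eff_rank / full_rank  # spectral concentration
    adapter_fraction = lora_rank / full_rank         # share of the space LoRA occupies
    return redundancy * adapter_fraction
```

For example, a 1024-dim layer with effective rank 512 and LoRA rank 64 gives 0.5 * 0.0625 ~= 0.031, inside the validated 0.001-0.11 range.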


14. LoRA Weight Initialization [EMPIRICAL]

Industry: Random A (Gaussian), zeros B (Hu et al. 2021). Product B @ A starts at zero and must "grow into" the budget during training.

Geometric: Spectral normalized initialization:

||A||_spectral = sqrt(sigma_k)
||B||_spectral = sqrt(sigma_k)
||B @ A||_spectral = sigma_k

Uses the full geometric budget from step 0. Each matrix gets sqrt(sigma_k) spectral norm so the product respects the bound immediately.

Code: spectral_init.py:162 (spectral_normalized_lora_init)
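A sketch of the initialization (hypothetical standalone version of spectral_normalized_lora_init). Rescaling each factor to spectral norm sqrt(sigma_k) guarantees ||B @ A||_2 <= sigma_k by submultiplicativity; exact equality additionally requires the factors' top singular vectors to align:

```python
import numpy as np

def spectral_normalized_init(out_dim: int, in_dim: int, rank: int,
                             sigma_k: float, seed: int = 0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((rank, in_dim))
    B = rng.standard_normal((out_dim, rank))
    # Rescale each factor so its top singular value is sqrt(sigma_k).
    A *= np.sqrt(sigma_k) / np.linalg.svd(A, compute_uv=False)[0]
    B *= np.sqrt(sigma_k) / np.linalg.svd(B, compute_uv=False)[0]
    return A, B
```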


15. Residual Connection Scaling [EMPIRICAL]

Industry: alpha = 1 (no scaling, standard residual output = x + f(x)).

Geometric: alpha_i = sigma_max(x) / sigma_max(f(x)) per layer. Normalizes so ||alpha * f(x)|| ~ ||x||, making residual contributions comparable across layers. Hook-based (non-invasive, no model modification). Clamped to precision-derived range [sqrt(eps), 1/sqrt(eps)]. Spectral norms computed via fast power iteration (3 iterations).

Code: residual_scaling.py:184 (compute_residual_scale), :39 (spectral_norm_power_iteration)
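A sketch of the two routines (hypothetical standalone versions of spectral_norm_power_iteration and compute_residual_scale, assuming activations are flattened to 2-D arrays):

```python
import numpy as np

def spectral_norm_power_iter(M: np.ndarray, n_iter: int = 3) -> float:
    """Estimate sigma_max(M) via power iteration on M^T M."""
    v = np.random.default_rng(0).standard_normal(M.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = M.T @ (M @ v)       # one multiply by M^T M
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(M @ v))

_SQRT_EPS = 2.0 ** -11.5  # sqrt(float32 machine epsilon)

def residual_scale(x: np.ndarray, fx: np.ndarray) -> float:
    """alpha = sigma_max(x) / sigma_max(f(x)), clamped to the precision range."""
    alpha = spectral_norm_power_iter(x) / spectral_norm_power_iter(fx)
    return float(np.clip(alpha, _SQRT_EPS, 1.0 / _SQRT_EPS))
```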


What Was Removed and Why

| Heuristic | Why Removed / Replaced | Evidence |
| --- | --- | --- |
| Gradient Clipping | Cayley-Stiefel retraction bounds updates; budget monitor halts on violation | 0% clip events across all layers in experiments |
| Warmup | MASS is stable from step 0, no warmup needed | No divergence without warmup; immediate convergence |
| Adam / Momentum | Replaced by Cayley-Stiefel retraction (NB-LoRA); historical: ScaledGD (Tong et al., JMLR 2021) | Orthogonality constraint makes preconditioning unnecessary |
| Per-layer LR heuristics | Replaced by MASS controller + Weyl bounds | sigma_k/sigma_max, 1/sigma_max, sigma_k/sigma_max^2 were guesses, not measured controls |

Partially Implemented

| Heuristic | Formula | What Exists | What's Missing |
| --- | --- | --- | --- |
| Weight Decay | sigma_k / sigma_max | compute_decay_scale() in optimizer | Full integration across all training paths |

Cross-Reference Map

| Topic | Document | Content |
| --- | --- | --- |
| LoRA scale empirical validation | lora_spectral_scale_bound.md | 9-adapter analysis, GSM8K test case |
| 3 theorems (Weyl, Weyl no-crossing, Sufficiency) | lora_spectral_theory.md | Full proofs of necessity, no-crossing, sufficiency |
| Rank/target/scale original derivation | lora_geometric_derivation.md | Original derivation (superseded by scale bound) |
| Projection targeting | lora_projection_targeting.md | q_proj vs v_proj spectral analysis |
| All 5 experiments + Phase 2b | training_heuristics_analysis.md | Phase-by-phase results and conclusions |

References

  1. Cavazza, J. et al. (2018). Dropout as a Low-Rank Regularizer for Matrix Factorization. AISTATS.
  2. Davis, C. & Kahan, W.M. (1970). The Rotation of Eigenvectors by a Perturbation. III. SIAM J. Numer. Anal.
  3. Defazio, A. et al. (2024). The Road Less Scheduled. NeurIPS.
  4. Golub, G.H. & Van Loan, C.F. (2013). Matrix Computations (4th ed.). Johns Hopkins.
  5. Hayou, S. et al. (2024). LoRA+: Efficient Low-Rank Adaptation of Large Models. ICML.
  6. Hu, E.J. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
  7. Kingma, D.P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
  8. Ma, J. & Yarats, D. (2021). On the Adequacy of Untuned Warmup for Adaptive Optimization. AAAI.
  9. McCandlish, S. et al. (2018). An Empirical Model of Large-Batch Training. arXiv:1812.06162.
  10. Miyato, T. et al. (2018). Spectral Normalization for Generative Adversarial Networks. ICLR.
  11. Mu, T. & Klabjan, D. (2025). Convergence Analysis of LoRA Fine-Tuning. arXiv (Dec 2025).
  12. Nesterov, Y. (2004). Introductory Lectures on Convex Optimization. Springer.
  13. Pascanu, R. et al. (2013). On the difficulty of training recurrent neural networks. ICML.
  14. Roy, O. & Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. EUSIPCO.
  15. Tong, T., Ma, C. & Chi, Y. (2021). Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent. JMLR.
  16. Tran, H. et al. (2025). Spectral Perturbation Bounds Under Eigengap Conditions. arXiv:2510.25670.
  17. Wang, X. et al. (2025). NB-LoRA: Norm-Bounded Low-Rank Adaptation. arXiv:2501.19050.
  18. Weyl, H. (1912). Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen. Math. Ann.

All code paths relative to src/modelcypher/. All formulas verified against source.