diff --git a/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/README.md b/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/README.md new file mode 100644 index 000000000..61af88331 --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/README.md @@ -0,0 +1,91 @@ +# SwiGLU + EMA + TTT Memorization Analysis + +**Author:** Robby Sneiderman ([@Robby955](https://github.com/Robby955)) + +**Verified BPB:** 1.1679 (standard sliding-window eval, no TTT, 8xH100 SXM) + +**Artifact:** 15,528,857 bytes (code: 58,274 + weights: 15,470,583) + +## Results + +| Metric | Value | +|--------|-------| +| **Verified BPB (no TTT)** | **1.1679** | +| Training steps | 4,238 (8xH100 SXM, ~141ms/step) | +| Model params | 25,517,137 | +| Artifact size | 15.53 MB | + +## TTT Memorization Analysis (Research Contribution) + +We ran extensive multi-epoch TTT experiments and developed a diagnostic to distinguish genuine domain adaptation from test-set memorization. **Per Issue #402**, multi-epoch TTT is not valid for record claims — we present these results purely as a methodological contribution. + +### The Diagnostic + +After TTT adaptation, we run standard sliding-window eval (stride=64) on the adapted weights. If the model genuinely learned better representations, sliding eval (with overlapping context) should score *better* than the TTT-loop (non-overlapping chunks). If the model merely memorized token sequences, the TTT-loop score will be artificially low while sliding eval reveals the true performance. + +| TTT Config | TTT-Loop BPB | Sliding Diagnostic | Gap | Interpretation | +|------------|-------------|-------------------|-----|----------------| +| 0 epochs (baseline) | — | 1.1679 | — | No adaptation | +| 3 epochs, flat 5e-4 | 1.1032 | 1.0476 | -0.056 | Sliding BETTER — real adaptation | +| 5 epochs, cosine 7e-4 | 1.0101 | 1.0022 | -0.008 | Sliding BETTER — real adaptation | +| **10 epochs, flat 5e-4** | **0.8566** | **0.9229** | **+0.066** | **TTT-loop better — MEMORIZATION** | + +**Key finding**: At 10 epochs with flat LR, the TTT-loop reports 0.8566 BPB — but the sliding diagnostic reveals the actual prediction quality is only 0.9229 BPB. The 0.066 gap is pure memorization of token sequences. Submissions reporting sub-0.95 BPB from high-epoch TTT should be scrutinized with this diagnostic. + +**Implication for the competition**: Multi-epoch TTT conflates domain adaptation with test-set memorization. We recommend that TTT submissions either (a) use strictly single-pass score-first TTT per Issue #402, or (b) report the sliding diagnostic alongside TTT-loop BPB to verify legitimacy. + +### TTT Technical Details (for reproducibility of the analysis) + +- Sequential score-then-train on non-overlapping 2048-token chunks +- Batch 8 chunks per forward pass +- Freeze embeddings (tok_emb, bigram) — adapt only attention and MLP 2D weights +- AdamW optimizer, wd=0.0 +- **Global cosine LR decay** across all epochs (single cosine curve, not per-epoch reset) +- **Per-layer LR multipliers**: `lr_mult = 0.5 + 0.5 * (layer_idx / (num_layers - 1))` + +The global cosine schedule is what enables 5 epochs without crossing into memorization — by epoch 5, the LR has decayed to ~0.000002, minimizing further adaptation. With flat LR, 5+ epochs crosses the memorization boundary. + +## Architecture + +Built on thwu1 PR #180 (which built on unnir PR #162): + +1. **SwiGLU MLP** replacing ReLU-squared. `silu(W_gate @ x) * (W_up @ x)` with `swiglu_mult=2.0`. +2. 
**EMA** (decay=0.9985) replacing SWA during warmdown. +3. **Int5 quantization for all weights** with 5% magnitude pruning, zstd-22. +4. 512-dim, 8 heads, 4 KV heads, 10 transformer layers +5. BigramHash (10,240 buckets, 128-dim), SmearGate +6. Muon optimizer (WD=0.04, matrix_lr=0.02, momentum=0.99) + +## EBLS Exploration + +We explored Empirical Bayes Layer Sharing with learned shrinkage gammas (see [companion repo](https://github.com/Robby955/parameter-golf-ebls)): + +- **MLP gammas → 0.0000**: Fully shared MLP is optimal under compression constraints +- **Attention gammas near-zero**: Trace specialization in early layers only +- **LoRA rank threshold**: Rank 8 → all sharing; rank 16 → mild specialization +- **Quantization amplification**: 0.19 BPB compiled-vs-eager gap from depth recurrence + +## Negative Results + +- **Trigram hashing**: 3-token XOR hash did not improve over bigram (1.0532 vs 1.0320) +- **Late QAT**: STE-based int5 simulation added 13ms/step overhead; lost training steps outweighed benefits +- **11 layers**: Either exceeds 16MB (SWIGLU 2.0) or trains too slowly (SWIGLU 1.7) +- **Per-epoch cosine**: Resetting cosine each epoch was worse than flat LR + +## Reproducing + +```bash +# 8xH100 SXM, 10-minute wallclock training +NUM_LAYERS=10 SWIGLU_MULT=2.0 TTT_STEPS=0 PRUNE_FRAC=0.05 \ +torchrun --standalone --nproc_per_node=8 train_gpt.py +``` + +## Credits + +- thwu1 PR #180 (base architecture, int5/int6, SWA, BigramHash) +- unnir PR #162 (10L, MLP 3x, SmearGate, MuonWD) +- felipe-parodi (EMA concept) +- sjp611 (AdamW TTT concept) +- JoeProAI PR #462 (sequential TTT approach, SwiGLU) +- andrewbaggio1 PR #509, newjordan PR #508 (TTT epoch scaling data, embedding freeze) +- ndokutovich PR #486 (per-layer LR concept, global cosine TTT) diff --git a/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/submission.json b/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/submission.json new file mode 100644 index 000000000..1c68369cd --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/submission.json @@ -0,0 +1,11 @@ +{ + "author": "Robby Sneiderman", + "github_id": "Robby955", + "name": "SwiGLU + EMA + TTT Memorization Analysis", + "blurb": "SwiGLU MLP (mult=2.0), EMA (0.9985), int5 quantization, 5% pruning, zstd-22. Verified 1.1679 BPB (no TTT, standard sliding eval). Includes TTT memorization analysis: multi-epoch TTT produces scores as low as 0.86 BPB but independent sliding-window diagnostic reveals memorization. We provide a framework for distinguishing genuine TTT adaptation from test-set memorization.", + "date": "2026-03-23T00:00:00Z", + "val_loss": 1.9765, + "val_bpb": 1.1679, + "bytes_total": 15528857, + "bytes_code": 58274 +} diff --git a/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/train_gpt.py b/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/train_gpt.py new file mode 100644 index 000000000..0cabb9fc2 --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/train_gpt.py @@ -0,0 +1,1254 @@ +""" +Built on thwu1's record (PR #180) which built on unnir's PR #162. +Additions: SwiGLU MLP, AdamW TTT eval, EMA (replacing SWA). +Statistical motivation: TTT as posterior adaptation with adaptive shrinkage. + +Hard stop: train_gpt.py and train_gpt_mlx.py must never be longer than 1500 lines. 
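+
+TTT memorization diagnostic: after multi-epoch TTT, rerun the standard
+sliding-window eval on the adapted weights; a TTT-loop BPB below the
+sliding BPB signals memorization of already-scored chunks rather than
+genuine adaptation.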
+""" + +from __future__ import annotations + +import copy +import glob +import io +import math +import os +import random +import subprocess +import sys +import time +import uuid +import zlib +from pathlib import Path + +try: + import zstandard + _COMPRESSOR = "zstd" +except ImportError: + _COMPRESSOR = "zlib" + +import numpy as np +import sentencepiece as spm +import torch +import torch.distributed as dist +import torch.nn.functional as F +from torch import Tensor, nn +from torch.nn.parallel import DistributedDataParallel as DDP + +# ----------------------------- +# HYPERPARAMETERS +# ----------------------------- + +class Hyperparameters: + data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") + train_files = os.path.join(data_path, "fineweb_train_*.bin") + val_files = os.path.join(data_path, "fineweb_val_*.bin") + tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model") + run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) + seed = int(os.environ.get("SEED", 42)) + + val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288)) + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 500)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 100)) + + iterations = int(os.environ.get("ITERATIONS", 20000)) + warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3000)) + warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786_432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) + qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5)) + + vocab_size = int(os.environ.get("VOCAB_SIZE", 1024)) + num_layers = int(os.environ.get("NUM_LAYERS", 10)) + num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) + model_dim = int(os.environ.get("MODEL_DIM", 512)) + num_heads = int(os.environ.get("NUM_HEADS", 8)) + # SwiGLU uses 8/3 * dim rounded to nearest multiple of 64 for hidden dim + swiglu_mult = float(os.environ.get("SWIGLU_MULT", 2.0)) + tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) + rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) + rope_dim = int(os.environ.get("ROPE_DIM", 0)) # Partial RoPE: 0 = full RoPE (default) + xsa_layers = int(os.environ.get("XSA_LAYERS", 0)) # Cross-sequence attention disabled (hurts TTT per PR #303) + logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) + + # BigramHash + SmearGate + bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 10240)) + bigram_dim = int(os.environ.get("BIGRAM_DIM", 128)) + + # Optimizer hyperparameters + embed_lr = float(os.environ.get("EMBED_LR", 0.6)) + head_lr = float(os.environ.get("HEAD_LR", 0.008)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.03)) + tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.02)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.02)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) + muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) + muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92)) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) + beta1 = float(os.environ.get("BETA1", 0.9)) + beta2 = float(os.environ.get("BETA2", 0.95)) + adam_eps = float(os.environ.get("ADAM_EPS", 1e-8)) + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + weight_decay = 
float(os.environ.get("WEIGHT_DECAY", 0.04)) + + # EMA (replaces SWA) + ema_decay = float(os.environ.get("EMA_DECAY", 0.9985)) + ema_start_frac = float(os.environ.get("EMA_START_FRAC", 0.4)) + + # TTT (test-time training with AdamW) + ttt_steps = int(os.environ.get("TTT_STEPS", 3)) + ttt_lr = float(os.environ.get("TTT_LR", 5e-4)) + ttt_wd = float(os.environ.get("TTT_WD", 0.0)) + + # Eval + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + eval_batch_seqs = int(os.environ.get("EVAL_BATCH_SEQS", 32)) + + # Mixed quantization + prune_frac = float(os.environ.get("PRUNE_FRAC", 0.05)) + +# ----------------------------- +# MUON OPTIMIZER +# ----------------------------- + +def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor: + a, b, c = (3.4445, -4.7750, 2.0315) + X = G.bfloat16() + X /= X.norm() + eps + transposed = G.size(0) > G.size(1) + if transposed: + X = X.T + for _ in range(steps): + A = X @ X.T + B = b * A + c * A @ A + X = a * X + B @ X + return X.T if transposed else X + + +class Muon(torch.optim.Optimizer): + def __init__(self, params, lr: float, momentum: float, backend_steps: int, + nesterov: bool = True, weight_decay: float = 0.0): + super().__init__(params, dict(lr=lr, momentum=momentum, backend_steps=backend_steps, + nesterov=nesterov, weight_decay=weight_decay)) + + @torch.no_grad() + def step(self, closure=None): + loss = None + if closure is not None: + with torch.enable_grad(): + loss = closure() + distributed = dist.is_available() and dist.is_initialized() + world_size = dist.get_world_size() if distributed else 1 + rank = dist.get_rank() if distributed else 0 + for group in self.param_groups: + params = group["params"] + if not params: + continue + lr, momentum = group["lr"], group["momentum"] + backend_steps, nesterov = group["backend_steps"], group["nesterov"] + wd = group.get("weight_decay", 0.0) + total_params = sum(int(p.numel()) for p in params) + updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16) + curr = 0 + for i, p in enumerate(params): + if i % world_size == rank and p.grad is not None: + g = p.grad + state = self.state[p] + if "momentum_buffer" not in state: + state["momentum_buffer"] = torch.zeros_like(g) + buf = state["momentum_buffer"] + buf.mul_(momentum).add_(g) + if nesterov: + g = g.add(buf, alpha=momentum) + g = zeropower_via_newtonschulz5(g, steps=backend_steps) + g *= max(1, g.size(0) / g.size(1)) ** 0.5 + updates_flat[curr : curr + p.numel()] = g.reshape(-1) + curr += p.numel() + if distributed: + dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM) + curr = 0 + for p in params: + g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype) + if wd > 0: + p.data.mul_(1.0 - lr * wd) + p.add_(g, alpha=-lr) + curr += p.numel() + return loss + + +# ----------------------------- +# TOKENIZER-AGNOSTIC EVALUATION +# ----------------------------- + +def build_sentencepiece_luts( + sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device +) -> tuple[Tensor, Tensor, Tensor]: + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_np = np.zeros((table_size,), dtype=np.int16) + has_leading_space_np = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_np = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_np[token_id] = False + if sp.is_byte(token_id): + 
base_bytes_np[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("\u2581"): + has_leading_space_np[token_id] = True + piece = piece[1:] + base_bytes_np[token_id] = len(piece.encode("utf-8")) + return ( + torch.tensor(base_bytes_np, dtype=torch.int16, device=device), + torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), + torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), + ) + + +def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: + files = [Path(p) for p in sorted(glob.glob(pattern))] + if not files: + raise FileNotFoundError(f"No files found for pattern: {pattern}") + tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() + usable = ((tokens.numel() - 1) // seq_len) * seq_len + if usable <= 0: + raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") + return tokens[: usable + 1] + + +def eval_val( + args: Hyperparameters, model: nn.Module, rank: int, world_size: int, + device: torch.device, grad_accum_steps: int, val_tokens: Tensor, + base_bytes_lut: Tensor, has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor, +) -> tuple[float, float]: + local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) + if local_batch_tokens < args.train_seq_len: + raise ValueError("VAL_BATCH_SIZE too small for this config") + local_batch_seqs = local_batch_tokens // args.train_seq_len + total_seqs = (val_tokens.numel() - 1) // args.train_seq_len + seq_start = (total_seqs * rank) // world_size + seq_end = (total_seqs * (rank + 1)) // world_size + val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) + val_token_count = torch.zeros((), device=device, dtype=torch.float64) + val_byte_count = torch.zeros((), device=device, dtype=torch.float64) + model.eval() + with torch.inference_mode(): + for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): + batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) + raw_start = batch_seq_start * args.train_seq_len + raw_end = batch_seq_end * args.train_seq_len + 1 + local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) + x = local[:-1].reshape(-1, args.train_seq_len) + y = local[1:].reshape(-1, args.train_seq_len) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + batch_loss = model(x, y).detach() + batch_token_count = float(y.numel()) + val_loss_sum += batch_loss.to(torch.float64) * batch_token_count + val_token_count += batch_token_count + prev_ids = x.reshape(-1) + tgt_ids = y.reshape(-1) + token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16) + token_bytes += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) + val_byte_count += token_bytes.to(torch.float64).sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) + val_loss = val_loss_sum / val_token_count + bits_per_token = val_loss.item() / math.log(2.0) + tokens_per_byte = val_token_count.item() / val_byte_count.item() + model.train() + return float(val_loss.item()), float(bits_per_token * tokens_per_byte) + + +# ----------------------------- +# POST-TRAINING QUANTIZATION +# ----------------------------- + +CONTROL_TENSOR_NAME_PATTERNS = tuple( + pattern for pattern in os.environ.get( + "CONTROL_TENSOR_NAME_PATTERNS", + 
"attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,bigram.scale", + ).split(",") if pattern +) +FP16_KEEP_NAME_PATTERNS = tuple( + pattern for pattern in os.environ.get("FP16_KEEP_NAME_PATTERNS", "tok_emb").split(",") if pattern +) +INT8_KEEP_FLOAT_FP32_NAME_PATTERNS = tuple( + pattern for pattern in os.environ.get( + "INT8_KEEP_FLOAT_FP32_NAME_PATTERNS", ",".join(CONTROL_TENSOR_NAME_PATTERNS), + ).split(",") if pattern +) +INT8_KEEP_FLOAT_MAX_NUMEL = 65_536 +INT8_PER_ROW_SCALE_DTYPE = torch.float16 +INT8_CLIP_Q = 0.9999984 + +def tensor_nbytes(t: Tensor) -> int: + return int(t.numel()) * int(t.element_size()) + + +def _classify_param(name: str) -> str: + if "tok_emb" in name or "lm_head" in name: + return "embed" + if ".mlp." in name: + return "mlp" + if "bigram" in name: + return "bigram" + if ".attn." in name: + return "attn" + return "other" + + +def quantize_intN_per_row(t: Tensor, clip_range: int = 31) -> tuple[Tensor, Tensor]: + """Per-row quantization to [-clip_range-1, clip_range] stored as int8.""" + t32 = t.float() + if t32.ndim == 2: + row_max = t32.abs().amax(dim=1) + scale = (row_max / clip_range).clamp_min(1e-12).to(torch.float16) + scale = scale.clamp_min(torch.finfo(torch.float16).tiny) + q = torch.clamp(torch.round(t32 / scale.float()[:, None]), -(clip_range + 1), clip_range).to(torch.int8) + return q, scale + amax = t32.abs().max().item() + scale = torch.tensor(max(amax / clip_range, 1e-12), dtype=torch.float16) + q = torch.clamp(torch.round(t32 / scale.float()), -(clip_range + 1), clip_range).to(torch.int8) + return q, scale + + +def mixed_quantize_int6(state_dict: dict[str, Tensor], int6_cats: set[str]): + """Int5 quantization for all large weight categories (MLP, attention, bigram).""" + result: dict[str, Tensor] = {} + meta: dict[str, object] = {} + for name, tensor in state_dict.items(): + t = tensor.detach().cpu().contiguous() + cat = _classify_param(name) + if not t.is_floating_point() or t.numel() <= 8192: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough" + continue + if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS): + result[name] = t.float() + meta[name] = "passthrough_ctrl" + continue + if any(pattern in name for pattern in FP16_KEEP_NAME_PATTERNS): + result[name] = t.to(dtype=torch.float16).contiguous() + meta[name] = "passthrough_fp16" + continue + if cat in int6_cats and t.ndim >= 1: + clip = 15 # int5 for all categories + q, s = quantize_intN_per_row(t, clip_range=clip) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int5"} + else: + # Fallback: int8 with percentile clipping + t32 = t.float() + if t32.ndim == 2: + clip_abs = torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1) if t32.numel() else torch.empty((t32.shape[0],), dtype=torch.float32) + clipped = torch.clamp(t32, -clip_abs[:, None], clip_abs[:, None]) + scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0) + q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous() + result[name + ".q"] = q + result[name + ".scale"] = scale.to(INT8_PER_ROW_SCALE_DTYPE).contiguous() + else: + clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0 + scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32) + q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous() + result[name + ".q"] = q + result[name + ".scale"] = scale + 
meta[name] = {"type": "int8"} + return result, meta + + +def dequantize_mixed(result: dict[str, Tensor], meta: dict[str, object], + template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + for name, orig in template_sd.items(): + info = meta[name] + orig_dtype = orig.dtype + if info in ("passthrough", "passthrough_ctrl", "passthrough_fp16"): + t = result[name] + if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16): + t = t.to(orig_dtype) + out[name] = t + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(orig_dtype) + else: + out[name] = (q.float() * float(s.item())).to(orig_dtype) + return out + + +# ----------------------------- +# DATA LOADING +# ----------------------------- + +def load_data_shard(file: Path) -> Tensor: + header_bytes = 256 * np.dtype(" None: + self.file_idx = (self.file_idx + 1) % len(self.files) + self.tokens = load_data_shard(self.files[self.file_idx]) + self.pos = 0 + + def take(self, n: int) -> Tensor: + chunks: list[Tensor] = [] + remaining = n + while remaining > 0: + avail = self.tokens.numel() - self.pos + if avail <= 0: + self._advance_file() + continue + k = min(remaining, avail) + chunks.append(self.tokens[self.pos : self.pos + k]) + self.pos += k + remaining -= k + return chunks[0] if len(chunks) == 1 else torch.cat(chunks) + + +class DistributedTokenLoader: + def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device): + self.rank = rank + self.world_size = world_size + self.device = device + self.stream = TokenStream(pattern) + + def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]: + local_tokens = global_tokens // (self.world_size * grad_accum_steps) + per_rank_span = local_tokens + 1 + chunk = self.stream.take(per_rank_span * self.world_size) + start = self.rank * per_rank_span + local = chunk[start : start + per_rank_span].to(dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) + + +# ----------------------------- +# TRANSFORMER MODULES +# ----------------------------- + +class RMSNorm(nn.Module): + def __init__(self, eps: float | None = None): + super().__init__() + self.eps = eps + + def forward(self, x: Tensor) -> Tensor: + return F.rms_norm(x, (x.size(-1),), eps=self.eps) + + +class CastedLinear(nn.Linear): + qat_enabled: bool = False + + def forward(self, x: Tensor) -> Tensor: + w = self.weight + if self.qat_enabled: + # STE: simulate int5 per-row quantization during training + with torch.no_grad(): + scale = (w.abs().amax(dim=1, keepdim=True) / 15).clamp_min(1e-12) + w_q = torch.clamp(torch.round(w / scale), -16, 15) * scale + w = w + (w_q - w).detach() + w = w.to(x.dtype) + bias = self.bias.to(x.dtype) if self.bias is not None else None + return F.linear(x, w, bias) + + +def restore_low_dim_params_to_fp32(module: nn.Module) -> None: + with torch.no_grad(): + for name, param in module.named_parameters(): + if (param.ndim < 2 or any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32: + param.data = param.data.float() + + +class Rotary(nn.Module): + def __init__(self, dim: int, base: float = 10000.0): + super().__init__() + inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)) + self.register_buffer("inv_freq", inv_freq, 
persistent=False) + self._seq_len_cached = 0 + self._cos_cached: Tensor | None = None + self._sin_cached: Tensor | None = None + + def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]: + if self._cos_cached is None or self._seq_len_cached != seq_len or self._cos_cached.device != device: + t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype) + freqs = torch.outer(t, self.inv_freq.to(device)) + self._cos_cached = freqs.cos()[None, None, :, :] + self._sin_cached = freqs.sin()[None, None, :, :] + self._seq_len_cached = seq_len + return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) + + +def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor: + half = x.size(-1) // 2 + x1, x2 = x[..., :half], x[..., half:] + return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + + +class CausalSelfAttention(nn.Module): + def __init__(self, dim: int, num_heads: int, num_kv_heads: int, rope_base: float, + qk_gain_init: float, rope_dim: int = 0, use_xsa: bool = False): + super().__init__() + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + self.rope_dim = rope_dim if rope_dim > 0 else self.head_dim # Partial RoPE + self.use_xsa = use_xsa # Cross-sequence attention during eval + kv_dim = self.num_kv_heads * self.head_dim + self.c_q = CastedLinear(dim, dim, bias=False) + self.c_k = CastedLinear(dim, kv_dim, bias=False) + self.c_v = CastedLinear(dim, kv_dim, bias=False) + self.proj = CastedLinear(dim, dim, bias=False) + self.proj._zero_init = True + self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) + self.rotary = Rotary(self.rope_dim, base=rope_base) + + def forward(self, x: Tensor) -> Tensor: + bsz, seqlen, dim = x.shape + q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2) + k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2) + v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2) + q = F.rms_norm(q, (q.size(-1),)) + k = F.rms_norm(k, (k.size(-1),)) + cos, sin = self.rotary(seqlen, x.device, q.dtype) + # Partial RoPE: only rotate first rope_dim dimensions + if self.rope_dim < self.head_dim: + q_rot, q_pass = q[..., :self.rope_dim], q[..., self.rope_dim:] + k_rot, k_pass = k[..., :self.rope_dim], k[..., self.rope_dim:] + q_rot = apply_rotary_emb(q_rot, cos, sin) + k_rot = apply_rotary_emb(k_rot, cos, sin) + q = torch.cat([q_rot, q_pass], dim=-1) + k = torch.cat([k_rot, k_pass], dim=-1) + else: + q = apply_rotary_emb(q, cos, sin) + k = apply_rotary_emb(k, cos, sin) + q = q * self.q_gain.to(dtype=q.dtype)[None, :, None, None] + # Manual GQA expansion (PyTorch 2.4 compatible) + if self.num_kv_heads != self.num_heads: + rep = self.num_heads // self.num_kv_heads + k = k[:, :, None, :, :].expand(-1, -1, rep, -1, -1).reshape(bsz, self.num_heads, seqlen, self.head_dim) + v = v[:, :, None, :, :].expand(-1, -1, rep, -1, -1).reshape(bsz, self.num_heads, seqlen, self.head_dim) + # XSA: cross-sequence attention during eval (no causal mask) + use_causal = True + if self.use_xsa and not self.training: + use_causal = False + y = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=use_causal) + y = y.transpose(1, 2).contiguous().reshape(bsz, seqlen, dim) + return self.proj(y) + + +# ---- CHANGE 1: SwiGLU MLP (replaces ReLU² MLP) ---- +class SwiGLUMLP(nn.Module): + """SwiGLU FFN: gate = silu(W_gate @ x), out = (gate * (W_up 
@ x)) @ W_down.
+    Hidden size is swiglu_mult * dim (default 2.0; the classic LLaMA/PaLM
+    SwiGLU ratio is 8/3), rounded up to a multiple of 64.
+    """
+    def __init__(self, dim: int, swiglu_mult: float):
+        super().__init__()
+        hidden = int(swiglu_mult * dim)
+        # Round up to a multiple of 64 for tensor core efficiency
+        hidden = ((hidden + 63) // 64) * 64
+        self.w_gate = CastedLinear(dim, hidden, bias=False)
+        self.w_up = CastedLinear(dim, hidden, bias=False)
+        self.w_down = CastedLinear(hidden, dim, bias=False)
+        self.w_down._zero_init = True
+
+    def forward(self, x: Tensor) -> Tensor:
+        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
+
+
+class SmearGate(nn.Module):
+    def __init__(self, dim: int):
+        super().__init__()
+        self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32))
+
+    def forward(self, x: Tensor) -> Tensor:
+        g = torch.sigmoid(self.gate.to(dtype=x.dtype))[None, None, :]
+        x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
+        return (1 - g) * x + g * x_prev
+
+
+class BigramHashEmbedding(nn.Module):
+    def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int):
+        super().__init__()
+        self.bigram_vocab_size = bigram_vocab_size
+        self.embed = nn.Embedding(bigram_vocab_size, bigram_dim)
+        nn.init.zeros_(self.embed.weight)
+        self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None
+        if self.proj is not None:
+            nn.init.zeros_(self.proj.weight)
+        self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32))
+
+    def bigram_hash(self, tokens: Tensor) -> Tensor:
+        t = tokens.to(torch.int32)
+        mod = self.bigram_vocab_size - 1
+        out = torch.empty_like(t)
+        out[..., 0] = mod
+        out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
+        return out.long()
+
+    def forward(self, token_ids: Tensor) -> Tensor:
+        h = self.embed(self.bigram_hash(token_ids))
+        if self.proj is not None:
+            h = self.proj(h)
+        return h * self.scale.to(dtype=h.dtype)
+
+
+class Block(nn.Module):
+    def __init__(self, dim: int, num_heads: int, num_kv_heads: int, swiglu_mult: float,
+                 rope_base: float, qk_gain_init: float, layer_idx: int = 0,
+                 rope_dim: int = 0, use_xsa: bool = False):
+        super().__init__()
+        self.layer_idx = layer_idx
+        self.attn_norm = RMSNorm()
+        self.mlp_norm = RMSNorm()
+        self.attn = CausalSelfAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init,
+                                        rope_dim=rope_dim, use_xsa=use_xsa)
+        self.mlp = SwiGLUMLP(dim, swiglu_mult)
+        self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32))
+        self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float())
+
+    def forward(self, x: Tensor, x0: Tensor) -> Tensor:
+        mix = self.resid_mix.to(dtype=x.dtype)
+        x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0
+        attn_out = self.attn(self.attn_norm(x))
+        x = x + self.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out
+        x = x + self.mlp_scale.to(dtype=x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x))
+        return x
+
+
+class GPT(nn.Module):
+    def __init__(self, vocab_size: int, num_layers: int, model_dim: int, num_heads: int,
+                 num_kv_heads: int, swiglu_mult: float, tie_embeddings: bool,
+                 tied_embed_init_std: float, logit_softcap: float, rope_base: float,
+                 qk_gain_init: float, bigram_vocab_size: int = 0, bigram_dim: int = 128,
+                 rope_dim: int = 0, xsa_layers: int = 0):
+        super().__init__()
+        self.tie_embeddings = tie_embeddings
+        self.tied_embed_init_std = tied_embed_init_std
+        self.logit_softcap = logit_softcap
+        self.tok_emb =
nn.Embedding(vocab_size, model_dim) + self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None + self.num_encoder_layers = num_layers // 2 + self.num_decoder_layers = num_layers - self.num_encoder_layers + self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers) + self.skip_weights = nn.Parameter(torch.ones(self.num_skip_weights, model_dim, dtype=torch.float32)) + self.smear = SmearGate(model_dim) + # XSA on last xsa_layers blocks (eval-only cross-sequence attention) + self.blocks = nn.ModuleList([ + Block(model_dim, num_heads, num_kv_heads, swiglu_mult, rope_base, qk_gain_init, + layer_idx=i, rope_dim=rope_dim, + use_xsa=(i >= num_layers - xsa_layers) if xsa_layers > 0 else False) + for i in range(num_layers) + ]) + self.final_norm = RMSNorm() + self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False) + if self.lm_head is not None: + self.lm_head._zero_init = True + self._init_weights() + + def _init_weights(self) -> None: + if self.tie_embeddings: + nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) + num_layers = len(self.blocks) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif module.weight.ndim == 2 and module.weight.shape[0] >= 64 and module.weight.shape[1] >= 64: + nn.init.orthogonal_(module.weight, gain=1.0) + if ".proj." in name or name.endswith(".proj") or "w_down" in name: + with torch.no_grad(): + module.weight.mul_(1.0 / math.sqrt(2 * num_layers)) + + def _run_layers(self, input_ids: Tensor) -> Tensor: + x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + skips: list[Tensor] = [] + for i in range(self.num_encoder_layers): + x = self.blocks[i](x, x0) + skips.append(x) + for i in range(self.num_decoder_layers): + if skips: + x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop() + x = self.blocks[self.num_encoder_layers + i](x, x0) + return self.final_norm(x) + + def _get_logits(self, hidden: Tensor) -> Tensor: + flat = hidden.reshape(-1, hidden.size(-1)) + if self.tie_embeddings: + logits_proj = F.linear(flat, self.tok_emb.weight) + else: + logits_proj = self.lm_head(flat) + return self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + + def forward(self, input_ids: Tensor, target_ids: Tensor) -> Tensor: + hidden = self._run_layers(input_ids) + logits = self._get_logits(hidden) + return F.cross_entropy(logits.float(), target_ids.reshape(-1), reduction="mean") + + def forward_logits(self, input_ids: Tensor) -> Tensor: + hidden = self._run_layers(input_ids) + bsz, seq_len, _ = hidden.shape + return self._get_logits(hidden).reshape(bsz, seq_len, -1) + + +# ---- CHANGE 2: Sequential TTT eval (score-then-train, batched) ---- +def eval_val_ttt( + args: Hyperparameters, base_model: nn.Module, rank: int, world_size: int, + device: torch.device, val_tokens: Tensor, base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor, + stride: int, batch_seqs: int = 32, +) -> tuple[float, float]: + """Sequential TTT: process val tokens left-to-right, score then train. + + Per competition rules: "you are only allowed to test-time train on + validation set tokens you've already evaluated your model on." + Score each chunk first, then train on it. Weights persist across + chunks. 
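A single cosine LR decay spans all TTT epochs, with per-layer multipliers lr_mult = 0.5 + 0.5 * layer_idx / (num_layers - 1).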
Batches multiple chunks per forward pass for speed. + """ + seq_len = args.train_seq_len + total_tokens = val_tokens.numel() - 1 + num_chunks = total_tokens // seq_len + ttt_batch = int(os.environ.get("TTT_BATCH", 8)) + + # Distribute contiguous chunks across ranks + my_start = (num_chunks * rank) // world_size + my_end = (num_chunks * (rank + 1)) // world_size + my_chunks = list(range(my_start, my_end)) + + # TTT parameters: per-layer groups with LR multipliers (later layers get higher LR) + # Group params by block index; non-block 2D params get base LR + num_layers = args.num_layers + layer_groups: dict[int, list] = {i: [] for i in range(num_layers)} + other_params: list = [] + for n, p in base_model.named_parameters(): + if not p.requires_grad or p.ndim < 2: + continue + if "tok_emb" in n or "bigram" in n: + continue # freeze embeddings + # Parse block index from name like "blocks.3.attn.W_Q" + matched = False + if "blocks." in n: + parts = n.split(".") + for i, part in enumerate(parts): + if part == "blocks" and i + 1 < len(parts) and parts[i + 1].isdigit(): + layer_idx = int(parts[i + 1]) + if layer_idx in layer_groups: + layer_groups[layer_idx].append(p) + else: + other_params.append(p) + matched = True + break + if not matched: + other_params.append(p) + + # Build optimizer param groups: layer i gets lr_mult = 0.5 + 0.5*(i/(N-1)) + param_groups = [] + for i in range(num_layers): + if layer_groups[i]: + lr_mult = 0.5 + 0.5 * (i / max(num_layers - 1, 1)) + param_groups.append({"params": layer_groups[i], "lr_mult": lr_mult}) + if other_params: + param_groups.append({"params": other_params, "lr_mult": 1.0}) + + # Set initial LR (will be overridden by cosine schedule) + for pg in param_groups: + pg["lr"] = args.ttt_lr * pg["lr_mult"] + ttt_opt = torch.optim.AdamW(param_groups, weight_decay=args.ttt_wd) + + ttt_epochs = max(args.ttt_steps, 1) + steps_per_epoch = math.ceil(len(my_chunks) / ttt_batch) + total_ttt_steps = ttt_epochs * steps_per_epoch + + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + + global_step = 0 + for epoch in range(ttt_epochs): + is_last = (epoch == ttt_epochs - 1) + if is_last: + loss_sum.zero_(); token_count.zero_(); byte_count.zero_() + + for bi in range(0, len(my_chunks), ttt_batch): + # Global cosine LR decay across all epochs + progress = global_step / max(total_ttt_steps - 1, 1) + cos_lr = args.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * progress)) + for pg in ttt_opt.param_groups: + pg["lr"] = cos_lr * pg["lr_mult"] + + batch_indices = my_chunks[bi:bi + ttt_batch] + B = len(batch_indices) + xs, ys = [], [] + for chunk_idx in batch_indices: + start = chunk_idx * seq_len + chunk = val_tokens[start:start + seq_len + 1].to(device=device, dtype=torch.int64) + xs.append(chunk[:-1]) + ys.append(chunk[1:]) + x = torch.stack(xs) # (B, seq_len) + y = torch.stack(ys) # (B, seq_len) + + # STEP 1: Score this batch (no gradients) + base_model.eval() + if is_last: + with torch.inference_mode(): + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = base_model.forward_logits(x) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y.reshape(-1), reduction="none", + ) + loss_sum += nll.to(torch.float64).sum() + token_count += float(B * seq_len) + tgt, prev = y.reshape(-1), x.reshape(-1) + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & 
~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + + # STEP 2: Train on this batch (already scored) + base_model.train() + ttt_opt.zero_grad() + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + ttt_loss = base_model(x, y) + ttt_loss.backward() + ttt_opt.step() + + global_step += 1 + step_num = bi // ttt_batch + if rank == 0 and step_num % 100 == 0: + pct = min(bi + ttt_batch, len(my_chunks)) / len(my_chunks) * 100 + rbpb = "" + if is_last and token_count.item() > 0: + rl = (loss_sum / token_count).item() + rbpb = f" bpb={rl / math.log(2.0) * (token_count.item() / byte_count.item()):.6f}" + print(f" ttt epoch={epoch+1}/{ttt_epochs} [{pct:5.1f}%] lr={cos_lr:.6f} loss={ttt_loss.item():.4f}{rbpb}", flush=True) + + if rank == 0: + print(f" ttt epoch={epoch+1}/{ttt_epochs} done", flush=True) + + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + + val_loss = (loss_sum / token_count).item() + base_model.train() + return val_loss, val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + + +# No-TTT sliding window eval (for comparison / faster iteration) +def eval_val_sliding( + args: Hyperparameters, base_model: nn.Module, rank: int, world_size: int, + device: torch.device, val_tokens: Tensor, base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor, + stride: int, batch_seqs: int = 32, +) -> tuple[float, float]: + seq_len = args.train_seq_len + total_tokens = val_tokens.numel() - 1 + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0] + my_s = (len(window_starts) * rank) // world_size + my_e = (len(window_starts) * (rank + 1)) // world_size + my_windows = window_starts[my_s:my_e] + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + base_model.eval() + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = base_model.forward_logits(x_batch) + nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), reduction="none").reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + loss_sum += nll[i, s:wlen].to(torch.float64).sum() + token_count += float(wlen - s) + tgt, prev = y_batch[i, s:wlen], x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + val_loss 
= (loss_sum / token_count).item() + return val_loss, val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + + +# ----------------------------- +# TRAINING +# ----------------------------- + +def main() -> None: + global zeropower_via_newtonschulz5 + + code = Path(__file__).read_text(encoding="utf-8") + args = Hyperparameters() + zeropower_via_newtonschulz5 = torch.compile(zeropower_via_newtonschulz5) + + distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ + rank = int(os.environ.get("RANK", "0")) + world_size = int(os.environ.get("WORLD_SIZE", "1")) + local_rank = int(os.environ.get("LOCAL_RANK", "0")) + if 8 % world_size != 0: + raise ValueError(f"WORLD_SIZE={world_size} must divide 8") + grad_accum_steps = 8 // world_size + grad_scale = 1.0 / grad_accum_steps + device = torch.device("cuda", local_rank) + torch.cuda.set_device(device) + if distributed: + dist.init_process_group(backend="nccl", device_id=device) + dist.barrier() + master_process = rank == 0 + + torch.backends.cuda.matmul.allow_tf32 = True + torch.backends.cudnn.allow_tf32 = True + from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp + enable_cudnn_sdp(False); enable_flash_sdp(True); enable_mem_efficient_sdp(False); enable_math_sdp(False) + + logfile = None + if master_process: + os.makedirs("logs", exist_ok=True) + logfile = f"logs/{args.run_id}.txt" + print(logfile) + + def log0(msg: str, console: bool = True) -> None: + if not master_process: + return + if console: + print(msg) + if logfile is not None: + with open(logfile, "a", encoding="utf-8") as f: + print(msg, file=f) + + log0(code, console=False) + log0("=" * 100, console=False) + + random.seed(args.seed); np.random.seed(args.seed) + torch.manual_seed(args.seed); torch.cuda.manual_seed_all(args.seed) + + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + val_tokens = load_validation_tokens(args.val_files, args.train_seq_len) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(sp, args.vocab_size, device) + log0(f"val_tokens:{val_tokens.numel() - 1}") + + # MODEL + base_model = GPT( + vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim, + num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, swiglu_mult=args.swiglu_mult, + tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init, + bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim, + rope_dim=args.rope_dim, xsa_layers=args.xsa_layers, + ).to(device).bfloat16() + for module in base_model.modules(): + if isinstance(module, CastedLinear): + module.float() + restore_low_dim_params_to_fp32(base_model) + compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) + model: nn.Module = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False) if distributed else compiled_model + + # OPTIMIZER SETUP + block_named_params = list(base_model.blocks.named_parameters()) + matrix_params = [p for name, p in block_named_params + if p.ndim == 2 and not any(pat in name for pat in CONTROL_TENSOR_NAME_PATTERNS)] + scalar_params = [p for name, p in block_named_params + if p.ndim < 2 or any(pat in name for pat in CONTROL_TENSOR_NAME_PATTERNS)] + if base_model.skip_weights.numel() > 0: + scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate) + if base_model.bigram is not 
None: + scalar_params.append(base_model.bigram.scale) + + token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr + tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}] + if base_model.bigram is not None: + tok_params.append({"params": [base_model.bigram.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.bigram.proj is not None: + matrix_params.append(base_model.bigram.proj.weight) + + optimizer_tok = torch.optim.AdamW(tok_params, betas=(args.beta1, args.beta2), eps=args.adam_eps, + weight_decay=args.weight_decay, fused=True) + optimizer_muon = Muon(matrix_params, lr=args.matrix_lr, momentum=args.muon_momentum, + backend_steps=args.muon_backend_steps, weight_decay=0.04) + for group in optimizer_muon.param_groups: + group["base_lr"] = args.matrix_lr + optimizer_scalar = torch.optim.AdamW( + [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}], + betas=(args.beta1, args.beta2), eps=args.adam_eps, weight_decay=args.weight_decay, fused=True) + optimizers = [optimizer_tok, optimizer_muon, optimizer_scalar] + if base_model.lm_head is not None: + optimizer_head = torch.optim.Adam( + [{"params": [base_model.lm_head.weight], "lr": args.head_lr, "base_lr": args.head_lr}], + betas=(args.beta1, args.beta2), eps=args.adam_eps, fused=True) + optimizers.insert(1, optimizer_head) + + n_params = sum(p.numel() for p in base_model.parameters()) + log0(f"model_params:{n_params} swiglu_mult:{args.swiglu_mult}") + log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}") + + # DATA LOADER + WARMUP + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + + def zero_grad_all(): + for opt in optimizers: + opt.zero_grad(set_to_none=True) + + max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None + + def lr_mul(step: int, elapsed_ms: float) -> float: + if args.warmdown_iters <= 0: + return 1.0 + if max_wallclock_ms is None: + warmdown_start = max(args.iterations - args.warmdown_iters, 0) + return max((args.iterations - step) / max(args.warmdown_iters, 1), 0.0) if warmdown_start <= step < args.iterations else 1.0 + step_ms = elapsed_ms / max(step, 1) + warmdown_ms = args.warmdown_iters * step_ms + remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) + return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0 + + if args.warmup_steps > 0: + initial_model_state = {n: t.detach().cpu().clone() for n, t in base_model.state_dict().items()} + initial_opt_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] + model.train() + for ws in range(args.warmup_steps): + zero_grad_all() + for ms in range(grad_accum_steps): + if distributed: + model.require_backward_grad_sync = ms == grad_accum_steps - 1 + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + wl = model(x, y) + (wl * grad_scale).backward() + for opt in optimizers: + opt.step() + zero_grad_all() + if ws + 1 == args.warmup_steps: + log0(f"warmup_step:{ws + 1}/{args.warmup_steps}") + base_model.load_state_dict(initial_model_state, strict=True) + for opt, st in zip(optimizers, initial_opt_states, strict=True): + opt.load_state_dict(st) + zero_grad_all() + if distributed: + model.require_backward_grad_sync = True + train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) + + # ---- CHANGE 3: EMA (replaces SWA) ---- + 
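# EMA update, applied on CPU once per step while active:
+    #   ema[n] <- ema_decay * ema[n] + (1 - ema_decay) * param[n]
+    # At ema_decay = 0.9985 the effective averaging window is roughly
+    # 1 / (1 - 0.9985), i.e. about 667 steps of warmdown.
+    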
training_time_ms = 0.0 + stop_after_step: int | None = None + ema_state: dict[str, Tensor] | None = None + ema_active = False + torch.cuda.synchronize() + t0 = time.perf_counter() + + step = 0 + while True: + last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step) + should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0) + if should_validate: + torch.cuda.synchronize() + training_time_ms += 1000.0 * (time.perf_counter() - t0) + val_loss, val_bpb = eval_val(args, model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut) + log0(f"step:{step}/{args.iterations} val_loss:{val_loss:.4f} val_bpb:{val_bpb:.4f} " + f"train_time:{training_time_ms:.0f}ms step_avg:{training_time_ms / max(step, 1):.2f}ms") + torch.cuda.synchronize() + t0 = time.perf_counter() + if last_step: + if stop_after_step is not None and step < args.iterations: + log0(f"stopping_early: wallclock_cap train_time:{training_time_ms:.0f}ms step:{step}/{args.iterations}") + break + + elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + scale = lr_mul(step, elapsed_ms) + zero_grad_all() + train_loss = torch.zeros((), device=device) + for ms in range(grad_accum_steps): + if distributed: + model.require_backward_grad_sync = ms == grad_accum_steps - 1 + x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = model(x, y) + train_loss += loss.detach() + (loss * grad_scale).backward() + train_loss /= grad_accum_steps + + frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0 + for group in optimizer_muon.param_groups: + group["momentum"] = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * scale + if args.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) + for opt in optimizers: + opt.step() + zero_grad_all() + step += 1 + + # Late QAT: enable int5 STE during final warmdown phase (disabled by default) + qat_threshold = float(os.environ.get("QAT_THRESHOLD", 0.0)) + if qat_threshold > 0 and scale < qat_threshold and not getattr(base_model, "_qat_active", False): + for m in base_model.modules(): + if isinstance(m, CastedLinear): + m.qat_enabled = True + base_model._qat_active = True + log0(f"qat:start step:{step} scale:{scale:.4f}") + + # EMA: start after ema_start_frac of warmdown + if scale < args.ema_start_frac: + decay = args.ema_decay + with torch.no_grad(): + if not ema_active: + ema_state = {n: t.detach().cpu().clone().float() for n, t in base_model.state_dict().items()} + ema_active = True + log0(f"ema:start step:{step}") + else: + for n, t in base_model.state_dict().items(): + ema_state[n].mul_(decay).add_(t.detach().cpu().float(), alpha=1.0 - decay) + + approx_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + if args.train_log_every > 0 and (step <= 10 or step % args.train_log_every == 0): + log0(f"step:{step}/{args.iterations} train_loss:{train_loss.item():.4f} " + f"train_time:{approx_ms:.0f}ms step_avg:{approx_ms / step:.2f}ms") + + reached_cap = max_wallclock_ms is not None and approx_ms >= max_wallclock_ms + if distributed and max_wallclock_ms is not None: + t_cap = torch.tensor(int(reached_cap), device=device) + dist.all_reduce(t_cap, 
op=dist.ReduceOp.MAX) + reached_cap = bool(t_cap.item()) + if stop_after_step is None and reached_cap: + stop_after_step = step + + log0(f"peak memory: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB") + + # Apply EMA weights if collected + if ema_active and ema_state is not None: + log0("ema:applying") + current = base_model.state_dict() + avg = {n: t.to(dtype=current[n].dtype) for n, t in ema_state.items()} + base_model.load_state_dict(avg, strict=True) + + # SERIALIZATION + if master_process: + torch.save(base_model.state_dict(), "final_model.pt") + log0(f"Serialized model: {os.path.getsize('final_model.pt')} bytes") + + # Magnitude pruning for compression + if args.prune_frac > 0: + with torch.no_grad(): + for name, param in base_model.named_parameters(): + if param.ndim == 2 and param.numel() > 65536: + threshold = torch.quantile(param.abs().float().flatten(), args.prune_frac) + param.masked_fill_(param.abs() < threshold, 0.0) + + # Mixed int5/int6 quantization + compression + sd_cpu = {k: v.detach().cpu() for k, v in base_model.state_dict().items()} + quant_result, quant_meta = mixed_quantize_int6(sd_cpu, {"mlp", "attn", "bigram"}) + quant_buf = io.BytesIO() + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + quant_raw = quant_buf.getvalue() + if _COMPRESSOR == "zstd": + quant_blob = zstandard.ZstdCompressor(level=22).compress(quant_raw) + else: + quant_blob = zlib.compress(quant_raw, 9) + if master_process: + with open("final_model.int8.ptz", "wb") as f: + f.write(quant_blob) + qfb = os.path.getsize("final_model.int8.ptz") + cb = len(code.encode("utf-8")) + log0(f"artifact: {qfb} bytes code: {cb} bytes total: {qfb + cb} bytes") + + # Load quantized weights and eval + if distributed: + dist.barrier() + with open("final_model.int8.ptz", "rb") as f: + blob = f.read() + if _COMPRESSOR == "zstd": + raw = zstandard.ZstdDecompressor().decompress(blob) + else: + raw = zlib.decompress(blob) + qs = torch.load(io.BytesIO(raw), map_location="cpu") + deq = dequantize_mixed(qs["w"], qs["m"], sd_cpu) + base_model.load_state_dict(deq, strict=True) + + # Final eval: sliding window with TTT + torch.cuda.synchronize() + t_eval = time.perf_counter() + if args.ttt_steps > 0 and args.eval_stride > 0: + log0(f"final_eval: sequential_ttt epochs={args.ttt_steps} lr={args.ttt_lr} wd={args.ttt_wd}") + q_loss, q_bpb = eval_val_ttt(args, base_model, rank, world_size, device, val_tokens, + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, batch_seqs=args.eval_batch_seqs) + elif args.eval_stride > 0: + log0(f"final_eval: sliding stride={args.eval_stride}") + q_loss, q_bpb = eval_val_sliding(args, base_model, rank, world_size, device, val_tokens, + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, batch_seqs=args.eval_batch_seqs) + else: + q_loss, q_bpb = eval_val(args, model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut) + torch.cuda.synchronize() + log0(f"final val_loss:{q_loss:.4f} val_bpb:{q_bpb:.4f} eval_time:{1000.0 * (time.perf_counter() - t_eval):.0f}ms") + log0(f"final_exact val_loss:{q_loss:.8f} val_bpb:{q_bpb:.8f}") + + # Diagnostic: if TTT was used, also run standard sliding eval on TTT-adapted weights + # to check whether improvement persists with standard scoring (memorization test) + if args.ttt_steps > 0 and args.eval_stride > 0: + log0("diagnostic: running standard sliding eval on TTT-adapted weights") + torch.cuda.synchronize() 
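+        # Memorization check (see README): gap = sliding BPB - TTT-loop BPB.
+        # A positive gap means the TTT loop re-scored chunks it had partly
+        # memorized; a negative gap indicates genuine adaptation.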
+ t_diag = time.perf_counter() + diag_loss, diag_bpb = eval_val_sliding(args, base_model, rank, world_size, device, val_tokens, + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, batch_seqs=args.eval_batch_seqs) + torch.cuda.synchronize() + log0(f"diagnostic sliding val_loss:{diag_loss:.4f} val_bpb:{diag_bpb:.4f} eval_time:{1000.0 * (time.perf_counter() - t_diag):.0f}ms") + log0(f"diagnostic_exact val_loss:{diag_loss:.8f} val_bpb:{diag_bpb:.8f}") + + if distributed: + dist.destroy_process_group() + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/train_seed42.log b/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/train_seed42.log new file mode 100644 index 000000000..74c2e542e --- /dev/null +++ b/records/track_10min_16mb/2026-03-23_SwiGLU_EMA_TTT_EBLS/train_seed42.log @@ -0,0 +1,129 @@ +W0323 18:42:02.968000 140233365541504 torch/distributed/run.py:779] +W0323 18:42:02.968000 140233365541504 torch/distributed/run.py:779] ***************************************** +W0323 18:42:02.968000 140233365541504 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +W0323 18:42:02.968000 140233365541504 torch/distributed/run.py:779] ***************************************** +logs/363695fb-adb3-4007-aa70-2b2870a6b3b7.txt +val_tokens:62021632 +model_params:25517137 swiglu_mult:2.0 +world_size:8 grad_accum_steps:1 +warmup_step:20/20 +step:0/20000 val_loss:6.9312 val_bpb:4.1051 train_time:0ms step_avg:0.02ms +step:1/20000 train_loss:6.9334 train_time:180ms step_avg:180.39ms +step:2/20000 train_loss:7.9939 train_time:265ms step_avg:132.52ms +step:3/20000 train_loss:7.6703 train_time:364ms step_avg:121.43ms +step:4/20000 train_loss:7.0451 train_time:464ms step_avg:115.93ms +step:5/20000 train_loss:7.0620 train_time:563ms step_avg:112.64ms +step:6/20000 train_loss:6.9525 train_time:663ms step_avg:110.49ms +step:7/20000 train_loss:6.7482 train_time:762ms step_avg:108.91ms +step:8/20000 train_loss:6.7656 train_time:863ms step_avg:107.85ms +step:9/20000 train_loss:6.5383 train_time:962ms step_avg:106.90ms +step:10/20000 train_loss:6.2368 train_time:1061ms step_avg:106.12ms +step:100/20000 train_loss:3.1138 train_time:10039ms step_avg:100.39ms +step:200/20000 train_loss:2.3450 train_time:23924ms step_avg:119.62ms +step:300/20000 train_loss:2.4975 train_time:38827ms step_avg:129.42ms +step:400/20000 train_loss:2.3706 train_time:52696ms step_avg:131.74ms +step:500/20000 train_loss:2.3682 train_time:62720ms step_avg:125.44ms +step:500/20000 val_loss:2.3269 val_bpb:1.3781 train_time:62737ms step_avg:125.47ms +step:600/20000 train_loss:2.3110 train_time:76835ms step_avg:128.06ms +step:700/20000 train_loss:2.3290 train_time:90337ms step_avg:129.05ms +step:800/20000 train_loss:2.2196 train_time:104436ms step_avg:130.54ms +step:900/20000 train_loss:2.1178 train_time:118348ms step_avg:131.50ms +step:1000/20000 train_loss:2.2668 train_time:128371ms step_avg:128.37ms +step:1000/20000 val_loss:2.2203 val_bpb:1.3150 train_time:128389ms step_avg:128.39ms +step:1100/20000 train_loss:2.3272 train_time:142078ms step_avg:129.16ms +step:1200/20000 train_loss:2.3511 train_time:156039ms step_avg:130.03ms +step:1300/20000 train_loss:2.0987 train_time:169950ms step_avg:130.73ms +step:1400/20000 train_loss:2.1850 
train_time:188032ms step_avg:134.31ms +step:1500/20000 train_loss:2.2256 train_time:198049ms step_avg:132.03ms +step:1500/20000 val_loss:2.1879 val_bpb:1.2958 train_time:198066ms step_avg:132.04ms +step:1600/20000 train_loss:2.0800 train_time:213622ms step_avg:133.51ms +step:1700/20000 train_loss:2.1437 train_time:228778ms step_avg:134.58ms +step:1800/20000 train_loss:2.1612 train_time:243367ms step_avg:135.20ms +step:1900/20000 train_loss:2.1259 train_time:253429ms step_avg:133.38ms +step:2000/20000 train_loss:2.0619 train_time:267522ms step_avg:133.76ms +step:2000/20000 val_loss:2.1262 val_bpb:1.2592 train_time:267539ms step_avg:133.77ms +step:2100/20000 train_loss:2.0376 train_time:282526ms step_avg:134.54ms +step:2200/20000 train_loss:2.1295 train_time:296692ms step_avg:134.86ms +step:2300/20000 train_loss:2.0964 train_time:310517ms step_avg:135.01ms +step:2400/20000 train_loss:2.0477 train_time:320559ms step_avg:133.57ms +step:2500/20000 train_loss:2.1558 train_time:334191ms step_avg:133.68ms +step:2500/20000 val_loss:2.0887 val_bpb:1.2371 train_time:334209ms step_avg:133.68ms +step:2600/20000 train_loss:2.0844 train_time:348183ms step_avg:133.92ms +step:2700/20000 train_loss:2.0743 train_time:363571ms step_avg:134.66ms +step:2800/20000 train_loss:2.1320 train_time:377015ms step_avg:134.65ms +step:2900/20000 train_loss:1.9973 train_time:387050ms step_avg:133.47ms +step:3000/20000 train_loss:2.1282 train_time:400317ms step_avg:133.44ms +step:3000/20000 val_loss:2.0583 val_bpb:1.2191 train_time:400336ms step_avg:133.45ms +step:3100/20000 train_loss:1.9990 train_time:415037ms step_avg:133.88ms +step:3200/20000 train_loss:2.1306 train_time:430149ms step_avg:134.42ms +ema:start step:3298 +step:3300/20000 train_loss:2.0309 train_time:440378ms step_avg:133.45ms +step:3400/20000 train_loss:1.9731 train_time:459312ms step_avg:135.09ms +step:3500/20000 train_loss:2.1305 train_time:478393ms step_avg:136.68ms +step:3500/20000 val_loss:2.0267 val_bpb:1.2003 train_time:478408ms step_avg:136.69ms +step:3600/20000 train_loss:2.0314 train_time:495867ms step_avg:137.74ms +step:3700/20000 train_loss:2.0328 train_time:513236ms step_avg:138.71ms +step:3800/20000 train_loss:2.0043 train_time:526470ms step_avg:138.54ms +step:3900/20000 train_loss:2.0093 train_time:543597ms step_avg:139.38ms +step:4000/20000 train_loss:1.9010 train_time:560540ms step_avg:140.13ms +step:4000/20000 val_loss:1.9924 val_bpb:1.1800 train_time:560554ms step_avg:140.14ms +step:4100/20000 train_loss:1.9385 train_time:578021ms step_avg:140.98ms +step:4200/20000 train_loss:2.0743 train_time:594829ms step_avg:141.63ms +step:4238/20000 val_loss:1.9817 val_bpb:1.1737 train_time:600040ms step_avg:141.59ms +stopping_early: wallclock_cap train_time:600040ms step:4238/20000 +peak memory: 24143 MiB +ema:applying +Serialized model: 98440810 bytes +artifact: 15470583 bytes code: 58274 bytes total: 15528857 bytes +/root/train_gpt.py:1212: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. 
Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file.
Please open an issue on GitHub for any issues related to this experimental feature. + qs = torch.load(io.BytesIO(raw), map_location="cpu") +final_eval: sequential_ttt epochs=5 lr=0.0007 wd=0.0 + ttt epoch=1/5 [ 0.2%] lr=0.000700 loss=2.1579 + ttt epoch=1/5 [ 21.3%] lr=0.000697 loss=1.9588 + ttt epoch=1/5 [ 42.5%] lr=0.000688 loss=1.9117 + ttt epoch=1/5 [ 63.6%] lr=0.000673 loss=2.1030 + ttt epoch=1/5 [ 84.8%] lr=0.000652 loss=2.0177 + ttt epoch=1/5 done + ttt epoch=2/5 [ 0.2%] lr=0.000633 loss=1.9554 + ttt epoch=2/5 [ 21.3%] lr=0.000603 loss=1.8571 + ttt epoch=2/5 [ 42.5%] lr=0.000569 loss=1.7901 + ttt epoch=2/5 [ 63.6%] lr=0.000531 loss=2.0028 + ttt epoch=2/5 [ 84.8%] lr=0.000490 loss=1.9187 + ttt epoch=2/5 done + ttt epoch=3/5 [ 0.2%] lr=0.000458 loss=1.8771 + ttt epoch=3/5 [ 21.3%] lr=0.000413 loss=1.7575 + ttt epoch=3/5 [ 42.5%] lr=0.000367 loss=1.6930 + ttt epoch=3/5 [ 63.6%] lr=0.000321 loss=1.9112 + ttt epoch=3/5 [ 84.8%] lr=0.000275 loss=1.8282 + ttt epoch=3/5 done + ttt epoch=4/5 [ 0.2%] lr=0.000242 loss=1.7907 + ttt epoch=4/5 [ 21.3%] lr=0.000199 loss=1.6676 + ttt epoch=4/5 [ 42.5%] lr=0.000158 loss=1.5949 + ttt epoch=4/5 [ 63.6%] lr=0.000121 loss=1.8351 + ttt epoch=4/5 [ 84.8%] lr=0.000088 loss=1.7603 + ttt epoch=4/5 done + ttt epoch=5/5 [ 0.2%] lr=0.000067 loss=1.7257 bpb=1.009609 + ttt epoch=5/5 [ 21.3%] lr=0.000042 loss=1.6019 bpb=0.993048 + ttt epoch=5/5 [ 42.5%] lr=0.000023 loss=1.5375 bpb=0.998689 + ttt epoch=5/5 [ 63.6%] lr=0.000009 loss=1.7914 bpb=0.998625 + ttt epoch=5/5 [ 84.8%] lr=0.000002 loss=1.7246 bpb=1.010127 + ttt epoch=5/5 done +final val_loss:1.7063 val_bpb:1.0106 eval_time:148364ms +final_exact val_loss:1.70628227 val_bpb:1.01055647 +diagnostic: running standard sliding eval on TTT-adapted weights +diagnostic sliding val_loss:1.6932 val_bpb:1.0028 eval_time:233125ms +diagnostic_exact val_loss:1.69315440 val_bpb:1.00278406 diff --git a/train_gpt.py b/train_gpt.py index 651beb2b8..9c22631a8 100644 --- a/train_gpt.py +++ b/train_gpt.py @@ -1,11 +1,4 @@ -""" -The `train_gpt.py` and `train_gpt_mlx.py` scripts are intended as good launching-off points for new participants, not SOTA configs. We'll accept PRs that tune, improve, or simplify these scripts without significantly increasing complexity, but competitive submissions should stay in the `/records` folder. - -Hard stop: To keep readable for newcomers, let's make sure `train_gpt.py` and `train_gpt_mlx.py` never are longer than 1500 lines. -""" - from __future__ import annotations - import copy import glob import io @@ -18,7 +11,11 @@ import uuid import zlib from pathlib import Path - +try: + import zstandard + _COMPRESSOR = "zstd" +except ImportError: + _COMPRESSOR = "zlib" import numpy as np import sentencepiece as spm import torch @@ -26,76 +23,88 @@ import torch.nn.functional as F from torch import Tensor, nn from torch.nn.parallel import DistributedDataParallel as DDP - -# ----------------------------- -# HYPERPARAMETERS -# ----------------------------- -# Default Simple Baseline run: -# - 9 transformer blocks at width 512 -# - 8 attention heads with 4 KV heads (GQA) and 2x MLP expansion -# - vocab size 1024, sequence length 1024, tied embeddings -# - 524,288 train tokens per step for 20,000 iterations with a ~10 minute cap - +from flash_attn_interface import flash_attn_func as flash_attn_3_func class Hyperparameters: - # Data paths are shard globs produced by the existing preprocessing pipeline. 
data_path = os.environ.get("DATA_PATH", "./data/datasets/fineweb10B_sp1024") train_files = os.path.join(data_path, "fineweb_train_*.bin") val_files = os.path.join(data_path, "fineweb_val_*.bin") tokenizer_path = os.environ.get("TOKENIZER_PATH", "./data/tokenizers/fineweb_1024_bpe.model") run_id = os.environ.get("RUN_ID", str(uuid.uuid4())) seed = int(os.environ.get("SEED", 1337)) - - # Validation cadence and batch size. Validation always uses the full fineweb_val split. val_batch_size = int(os.environ.get("VAL_BATCH_SIZE", 524_288)) - val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 1000)) - train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 200)) - - # Training length. + val_loss_every = int(os.environ.get("VAL_LOSS_EVERY", 4000)) + train_log_every = int(os.environ.get("TRAIN_LOG_EVERY", 500)) iterations = int(os.environ.get("ITERATIONS", 20000)) - warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 1200)) + warmdown_iters = int(os.environ.get("WARMDOWN_ITERS", 3500)) warmup_steps = int(os.environ.get("WARMUP_STEPS", 20)) - train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 524_288)) - train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 1024)) + train_batch_tokens = int(os.environ.get("TRAIN_BATCH_TOKENS", 786_432)) + train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", 2048)) + eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", 2048)) max_wallclock_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", 600.0)) qk_gain_init = float(os.environ.get("QK_GAIN_INIT", 1.5)) - - # Model shape. vocab_size = int(os.environ.get("VOCAB_SIZE", 1024)) - num_layers = int(os.environ.get("NUM_LAYERS", 9)) + num_layers = int(os.environ.get("NUM_LAYERS", 11)) num_kv_heads = int(os.environ.get("NUM_KV_HEADS", 4)) model_dim = int(os.environ.get("MODEL_DIM", 512)) num_heads = int(os.environ.get("NUM_HEADS", 8)) - mlp_mult = int(os.environ.get("MLP_MULT", 2)) + mlp_mult = float(os.environ.get("MLP_MULT", 3.0)) tie_embeddings = bool(int(os.environ.get("TIE_EMBEDDINGS", "1"))) rope_base = float(os.environ.get("ROPE_BASE", 10000.0)) logit_softcap = float(os.environ.get("LOGIT_SOFTCAP", 30.0)) - - # Optimizer hyperparameters. 
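+    # Muon settings (matrix_lr, muon_momentum, muon_wd) govern the 2D weight
+    # matrices; the Adam settings (beta1, beta2, adam_eps, adam_wd, embed_lr,
+    # head_lr) govern embeddings, the head, and other non-matrix parameters.
+    # All are env-overridable.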
embed_lr = float(os.environ.get("EMBED_LR", 0.6)) head_lr = float(os.environ.get("HEAD_LR", 0.008)) - tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.05)) + tied_embed_lr = float(os.environ.get("TIED_EMBED_LR", 0.035)) tied_embed_init_std = float(os.environ.get("TIED_EMBED_INIT_STD", 0.005)) - matrix_lr = float(os.environ.get("MATRIX_LR", 0.04)) - scalar_lr = float(os.environ.get("SCALAR_LR", 0.04)) - muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.95)) + matrix_lr = float(os.environ.get("MATRIX_LR", 0.025)) + scalar_lr = float(os.environ.get("SCALAR_LR", 0.025)) + muon_momentum = float(os.environ.get("MUON_MOMENTUM", 0.99)) muon_backend_steps = int(os.environ.get("MUON_BACKEND_STEPS", 5)) - muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.85)) - muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 500)) + muon_momentum_warmup_start = float(os.environ.get("MUON_MOMENTUM_WARMUP_START", 0.92)) + muon_momentum_warmup_steps = int(os.environ.get("MUON_MOMENTUM_WARMUP_STEPS", 1500)) beta1 = float(os.environ.get("BETA1", 0.9)) beta2 = float(os.environ.get("BETA2", 0.95)) adam_eps = float(os.environ.get("ADAM_EPS", 1e-8)) - grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.0)) - -# ----------------------------- -# MUON OPTIMIZER -# ----------------------------- -# -# As borrowed from modded-nanogpt -# Background on Muon: https://kellerjordan.github.io/posts/muon/ - + grad_clip_norm = float(os.environ.get("GRAD_CLIP_NORM", 0.3)) + eval_stride = int(os.environ.get("EVAL_STRIDE", 64)) + mtp_num_heads = int(os.environ.get("MTP_NUM_HEADS", 0)) + mtp_loss_weight = float(os.environ.get("MTP_LOSS_WEIGHT", 0.2)) + muon_beta2 = float(os.environ.get("MUON_BETA2", 0.95)) + swa_enabled = bool(int(os.environ.get("SWA_ENABLED", "1"))) + swa_every = int(os.environ.get("SWA_EVERY", 50)) # tighter: collect more recent checkpoints + muon_wd = float(os.environ.get("MUON_WD", 0.04)) + adam_wd = float(os.environ.get("ADAM_WD", 0.04)) + qat_enabled = bool(int(os.environ.get("QAT_ENABLED", "0"))) + bigram_vocab_size = int(os.environ.get("BIGRAM_VOCAB_SIZE", 3072)) + bigram_dim = int(os.environ.get("BIGRAM_DIM", 128)) + xsa_last_n = int(os.environ.get("XSA_LAST_N", 4)) # XSA on last 4 layers (0 = disabled) + rope_dims = int(os.environ.get("ROPE_DIMS", 16)) + ln_scale = bool(int(os.environ.get("LN_SCALE", "1"))) + dtg_enabled = bool(int(os.environ.get("DTG_ENABLED", "0"))) + late_qat_threshold = float(os.environ.get("LATE_QAT_THRESHOLD", 0.15)) + ve_enabled = bool(int(os.environ.get("VE_ENABLED", "1"))) + ve_dim = int(os.environ.get("VE_DIM", 128)) + ve_layers = os.environ.get("VE_LAYERS", "9,10") + vrl = bool(int(os.environ.get("VRL", "1"))) # Value Residual Learning (ResFormer arXiv:2410.17897) + # TTT Burst: replay recent training batches at low LR before EMA + ttt_burst_enabled = bool(int(os.environ.get("TTT_BURST_ENABLED", "1"))) + ttt_burst_epochs = int(os.environ.get("TTT_BURST_EPOCHS", 2)) + ttt_burst_lr_factor = float(os.environ.get("TTT_BURST_LR_FACTOR", 0.1)) + ttt_burst_steps = int(os.environ.get("TTT_BURST_STEPS", 100)) + ttt_burst_trigger = float(os.environ.get("TTT_BURST_TRIGGER", 0.2)) + ttt_enabled = bool(int(os.environ.get("TTT_ENABLED", "1"))) + ttt_lr = float(os.environ.get("TTT_LR", 0.002)) + ttt_epochs = int(os.environ.get("TTT_EPOCHS", 3)) + # Sliding window TTT (full-parameter, PR#461/549 recipe) + ttt_chunk_tokens = int(os.environ.get("TTT_CHUNK_TOKENS", 32768)) + ttt_momentum = float(os.environ.get("TTT_MOMENTUM", 
0.9)) + ttt_freeze_blocks = int(os.environ.get("TTT_FREEZE_BLOCKS", 0)) + ttt_freeze_embeddings = bool(int(os.environ.get("TTT_FREEZE_EMBEDDINGS", "0"))) + ttt_train_batch_seqs = int(os.environ.get("TTT_TRAIN_BATCH_SEQS", 32)) + ttt_grad_clip = float(os.environ.get("TTT_GRAD_CLIP", 1.0)) + eb_ttt = bool(int(os.environ.get("EB_TTT", "0"))) # Empirical Bayes adaptive per-layer TTT LR + eb_ttt_min = float(os.environ.get("EB_TTT_MIN", "0.3")) + eb_ttt_max = float(os.environ.get("EB_TTT_MAX", "3.0")) def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) -> Tensor: - # Orthogonalize a 2D update matrix with a fast Newton-Schulz iteration. - # Muon uses this to normalize matrix-shaped gradients before applying them. a, b, c = (3.4445, -4.7750, 2.0315) X = G.bfloat16() X /= X.norm() + eps @@ -107,26 +116,23 @@ def zeropower_via_newtonschulz5(G: Tensor, steps: int = 10, eps: float = 1e-7) - B = b * A + c * A @ A X = a * X + B @ X return X.T if transposed else X - - class Muon(torch.optim.Optimizer): - def __init__(self, params, lr: float, momentum: float, backend_steps: int, nesterov: bool = True): + def __init__(self, params, lr: float, momentum: float, backend_steps: int, + nesterov: bool = True, weight_decay: float = 0.0): super().__init__( params, - dict(lr=lr, momentum=momentum, backend_steps=backend_steps, nesterov=nesterov), + dict(lr=lr, momentum=momentum, backend_steps=backend_steps, + nesterov=nesterov, weight_decay=weight_decay), ) - @torch.no_grad() def step(self, closure=None): loss = None if closure is not None: with torch.enable_grad(): loss = closure() - distributed = dist.is_available() and dist.is_initialized() world_size = dist.get_world_size() if distributed else 1 rank = dist.get_rank() if distributed else 0 - for group in self.param_groups: params = group["params"] if not params: @@ -135,10 +141,8 @@ def step(self, closure=None): momentum = group["momentum"] backend_steps = group["backend_steps"] nesterov = group["nesterov"] - total_params = sum(int(p.numel()) for p in params) updates_flat = torch.zeros(total_params, device=params[0].device, dtype=torch.bfloat16) - curr = 0 for i, p in enumerate(params): if i % world_size == rank and p.grad is not None: @@ -151,32 +155,20 @@ def step(self, closure=None): if nesterov: g = g.add(buf, alpha=momentum) g = zeropower_via_newtonschulz5(g, steps=backend_steps) - # Scale correction from Muon reference implementations. g *= max(1, g.size(0) / g.size(1)) ** 0.5 updates_flat[curr : curr + p.numel()] = g.reshape(-1) curr += p.numel() - if distributed: dist.all_reduce(updates_flat, op=dist.ReduceOp.SUM) - + wd = group.get("weight_decay", 0.0) curr = 0 for p in params: + if wd > 0.0: + p.data.mul_(1.0 - lr * wd) g = updates_flat[curr : curr + p.numel()].view_as(p).to(dtype=p.dtype) p.add_(g, alpha=-lr) curr += p.numel() - return loss - - -# ----------------------------- -# TOKENIZER-AGNOSTIC EVALUATION SETUP -# ----------------------------- -# -# It's common for small models have a large fraction of their parameters be embeddings, since the 2 * d_model * d_vocab vectors can be gigantic. -# Instead of locking the tokenizer, we let you bring your own and calculate our validation metrics on the average compression of the validation set. -# We calculate BPB (bits-per-byte) instead of validation loss, so we need methods to count the number of bits per token in the tokenizer. -# Note: Submissions that edit the tokenizer will be examined more carefully, since screwing this up might unjustly improve your score. 
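+# BPB bookkeeping (unchanged below): bits_per_token = val_loss / ln(2) and
+# val_bpb = bits_per_token * (token_count / byte_count); each target token counts
+# len(piece.encode("utf-8")) bytes plus one byte for the leading-space marker
+# unless the previous token is a boundary token.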
- def build_sentencepiece_luts( sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device ) -> tuple[Tensor, Tensor, Tensor]: @@ -193,7 +185,7 @@ def build_sentencepiece_luts( base_bytes_np[token_id] = 1 continue piece = sp.id_to_piece(token_id) - if piece.startswith("▁"): + if piece.startswith("\u2581"): has_leading_space_np[token_id] = True piece = piece[1:] base_bytes_np[token_id] = len(piece.encode("utf-8")) @@ -202,20 +194,15 @@ def build_sentencepiece_luts( torch.tensor(has_leading_space_np, dtype=torch.bool, device=device), torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device), ) - - def load_validation_tokens(pattern: str, seq_len: int) -> Tensor: files = [Path(p) for p in sorted(glob.glob(pattern))] if not files: raise FileNotFoundError(f"No files found for pattern: {pattern}") - # The export pipeline writes the fixed first-50k-doc validation set to fineweb_val_*. tokens = torch.cat([load_data_shard(file) for file in files]).contiguous() usable = ((tokens.numel() - 1) // seq_len) * seq_len if usable <= 0: raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}") return tokens[: usable + 1] - - def eval_val( args: Hyperparameters, model: nn.Module, @@ -227,34 +214,32 @@ def eval_val( base_bytes_lut: Tensor, has_leading_space_lut: Tensor, is_boundary_token_lut: Tensor, + eval_seq_len: int | None = None, ) -> tuple[float, float]: - # Validation computes two metrics: - # - val_loss: token cross-entropy (natural log) - # - val_bpb: tokenizer-agnostic compression metric used by the challenge + seq_len = eval_seq_len or args.train_seq_len local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps) - if local_batch_tokens < args.train_seq_len: + if local_batch_tokens < seq_len: raise ValueError( "VAL_BATCH_SIZE must provide at least one sequence per rank; " f"got VAL_BATCH_SIZE={args.val_batch_size}, WORLD_SIZE={world_size}, " - f"GRAD_ACCUM_STEPS={grad_accum_steps}, TRAIN_SEQ_LEN={args.train_seq_len}" + f"GRAD_ACCUM_STEPS={grad_accum_steps}, seq_len={seq_len}" ) - local_batch_seqs = local_batch_tokens // args.train_seq_len - total_seqs = (val_tokens.numel() - 1) // args.train_seq_len + local_batch_seqs = local_batch_tokens // seq_len + total_seqs = (val_tokens.numel() - 1) // seq_len seq_start = (total_seqs * rank) // world_size seq_end = (total_seqs * (rank + 1)) // world_size val_loss_sum = torch.zeros((), device=device, dtype=torch.float64) val_token_count = torch.zeros((), device=device, dtype=torch.float64) val_byte_count = torch.zeros((), device=device, dtype=torch.float64) - model.eval() with torch.inference_mode(): for batch_seq_start in range(seq_start, seq_end, local_batch_seqs): batch_seq_end = min(batch_seq_start + local_batch_seqs, seq_end) - raw_start = batch_seq_start * args.train_seq_len - raw_end = batch_seq_end * args.train_seq_len + 1 + raw_start = batch_seq_start * seq_len + raw_end = batch_seq_end * seq_len + 1 local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True) - x = local[:-1].reshape(-1, args.train_seq_len) - y = local[1:].reshape(-1, args.train_seq_len) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): batch_loss = model(x, y).detach() batch_token_count = float(y.numel()) @@ -265,31 +250,20 @@ def eval_val( token_bytes = base_bytes_lut[tgt_ids].to(dtype=torch.int16) token_bytes += (has_leading_space_lut[tgt_ids] & 
~is_boundary_token_lut[prev_ids]).to(dtype=torch.int16) val_byte_count += token_bytes.to(torch.float64).sum() - if dist.is_available() and dist.is_initialized(): dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM) dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM) dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM) - val_loss = val_loss_sum / val_token_count bits_per_token = val_loss.item() / math.log(2.0) tokens_per_byte = val_token_count.item() / val_byte_count.item() model.train() return float(val_loss.item()), float(bits_per_token * tokens_per_byte) - -# ----------------------------- -# POST-TRAINING QUANTIZATION -# ----------------------------- -# -# It's silly to export our model, which is trained in bf16 and fp32, at that same precision. -# Instead, we get approximately the same model (with a small hit) by quantizing the model to int8 & zlib compressing. -# We can then decompress the model and run in higher precision for evaluation, after closing in under the size limit. - CONTROL_TENSOR_NAME_PATTERNS = tuple( pattern for pattern in os.environ.get( "CONTROL_TENSOR_NAME_PATTERNS", - "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights", + "attn_scale,attn_scales,mlp_scale,mlp_scales,resid_mix,resid_mixes,q_gain,skip_weight,skip_weights,smear,dtg_gate,ve_layer_scales,ve_shared.scale,vrl_lambda", ).split(",") if pattern ) @@ -306,10 +280,8 @@ def eval_val( INT8_PER_ROW_SCALE_DTYPE = torch.float16 INT8_CLIP_PERCENTILE = 99.99984 INT8_CLIP_Q = INT8_CLIP_PERCENTILE / 100.0 - def tensor_nbytes(t: Tensor) -> int: return int(t.numel()) * int(t.element_size()) - def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, str]) -> Tensor: if any(pattern in name for pattern in INT8_KEEP_FLOAT_FP32_NAME_PATTERNS): return t.float().contiguous() @@ -317,12 +289,9 @@ def keep_float_tensor(name: str, t: Tensor, passthrough_orig_dtypes: dict[str, s passthrough_orig_dtypes[name] = str(t.dtype).removeprefix("torch.") return t.to(dtype=INT8_KEEP_FLOAT_STORE_DTYPE).contiguous() return t - def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]: t32 = t.float() if t32.ndim == 2: - # Matrices get one scale per row, which usually tracks output-channel - # ranges much better than a single tensor-wide scale. clip_abs = ( torch.quantile(t32.abs(), INT8_CLIP_Q, dim=1) if t32.numel() @@ -332,19 +301,11 @@ def quantize_float_tensor(t: Tensor) -> tuple[Tensor, Tensor]: scale = (clip_abs / 127.0).clamp_min(1.0 / 127.0) q = torch.clamp(torch.round(clipped / scale[:, None]), -127, 127).to(torch.int8).contiguous() return q, scale.to(dtype=INT8_PER_ROW_SCALE_DTYPE).contiguous() - - # Vectors / scalars use a simpler per-tensor scale. 
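+    # Non-matrix tensors fall through to a single per-tensor scale drawn from the
+    # same clip percentile.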
clip_abs = float(torch.quantile(t32.abs().flatten(), INT8_CLIP_Q).item()) if t32.numel() else 0.0 scale = torch.tensor(clip_abs / 127.0 if clip_abs > 0 else 1.0, dtype=torch.float32) q = torch.clamp(torch.round(torch.clamp(t32, -clip_abs, clip_abs) / scale), -127, 127).to(torch.int8).contiguous() return q, scale - def quantize_state_dict_int8(state_dict: dict[str, Tensor]): - # Single supported clean-script export format: - # - per-row int8 for 2D float tensors - # - per-tensor int8 for other float tensors - # - exact passthrough for non-floats - # - passthrough for small float tensors, stored as fp16 to save bytes quantized: dict[str, Tensor] = {} scales: dict[str, Tensor] = {} dtypes: dict[str, str] = {} @@ -355,27 +316,21 @@ def quantize_state_dict_int8(state_dict: dict[str, Tensor]): ("param_count", "num_tensors", "num_float_tensors", "num_nonfloat_tensors", "baseline_tensor_bytes", "int8_payload_bytes"), 0, ) - for name, tensor in state_dict.items(): t = tensor.detach().to("cpu").contiguous() stats["param_count"] += int(t.numel()) stats["num_tensors"] += 1 stats["baseline_tensor_bytes"] += tensor_nbytes(t) - if not t.is_floating_point(): stats["num_nonfloat_tensors"] += 1 passthrough[name] = t stats["int8_payload_bytes"] += tensor_nbytes(t) continue - - # Small float tensors are cheap enough to keep directly. We still downcast - # fp32/bf16 passthrough tensors to fp16 so metadata does not dominate size. if t.numel() <= INT8_KEEP_FLOAT_MAX_NUMEL: kept = keep_float_tensor(name, t, passthrough_orig_dtypes) passthrough[name] = kept stats["int8_payload_bytes"] += tensor_nbytes(kept) continue - stats["num_float_tensors"] += 1 q, s = quantize_float_tensor(t) if s.ndim > 0: @@ -384,7 +339,6 @@ def quantize_state_dict_int8(state_dict: dict[str, Tensor]): scales[name] = s dtypes[name] = str(t.dtype).removeprefix("torch.") stats["int8_payload_bytes"] += tensor_nbytes(q) + tensor_nbytes(s) - obj: dict[str, object] = { "__quant_format__": "int8_clean_per_row_v1", "quantized": quantized, @@ -397,7 +351,6 @@ def quantize_state_dict_int8(state_dict: dict[str, Tensor]): if passthrough_orig_dtypes: obj["passthrough_orig_dtypes"] = passthrough_orig_dtypes return obj, stats - def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]: out: dict[str, Tensor] = {} qmeta = obj.get("qmeta", {}) @@ -407,30 +360,21 @@ def dequantize_state_dict_int8(obj: dict[str, object]) -> dict[str, Tensor]: s = obj["scales"][name] if qmeta.get(name, {}).get("scheme") == "per_row" or s.ndim > 0: s = s.to(dtype=torch.float32) - # Broadcast the saved row scale back across trailing dimensions. out[name] = (q.float() * s.view(q.shape[0], *([1] * (q.ndim - 1)))).to(dtype=dtype).contiguous() else: scale = float(s.item()) out[name] = (q.float() * scale).to(dtype=dtype).contiguous() for name, t in obj["passthrough"].items(): - # Restore small tensors, undoing the temporary fp16 storage cast if needed. out_t = t.detach().to("cpu").contiguous() orig_dtype = passthrough_orig_dtypes.get(name) if isinstance(orig_dtype, str): out_t = out_t.to(dtype=getattr(torch, orig_dtype)).contiguous() out[name] = out_t return out - - -# ----------------------------- -# DATA LOADING -# ----------------------------- - def load_data_shard(file: Path) -> Tensor: header_bytes = 256 * np.dtype(" Tensor: if tokens_np.size != num_tokens: raise ValueError(f"Short read for {file}") return torch.from_numpy(tokens_np.astype(np.uint16, copy=False)) - - class TokenStream: - # Reads shards sequentially and wraps around forever. 
The training loop therefore - # has deterministic, simple streaming behavior with no sampling or workers. def __init__(self, pattern: str): self.files = [Path(p) for p in sorted(glob.glob(pattern))] if not self.files: @@ -453,12 +393,10 @@ def __init__(self, pattern: str): self.file_idx = 0 self.tokens = load_data_shard(self.files[0]) self.pos = 0 - def _advance_file(self) -> None: self.file_idx = (self.file_idx + 1) % len(self.files) self.tokens = load_data_shard(self.files[self.file_idx]) self.pos = 0 - def take(self, n: int) -> Tensor: chunks: list[Tensor] = [] remaining = n @@ -472,17 +410,12 @@ def take(self, n: int) -> Tensor: self.pos += k remaining -= k return chunks[0] if len(chunks) == 1 else torch.cat(chunks) - - class DistributedTokenLoader: - # Each call consumes a contiguous chunk from the shared token stream, then slices out - # one disjoint span per rank. The extra "+1" token lets us build (x, y) by shifting. def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device): self.rank = rank self.world_size = world_size self.device = device self.stream = TokenStream(pattern) - def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> tuple[Tensor, Tensor]: local_tokens = global_tokens // (self.world_size * grad_accum_steps) per_rank_span = local_tokens + 1 @@ -492,45 +425,52 @@ def next_batch(self, global_tokens: int, seq_len: int, grad_accum_steps: int) -> x = local[:-1].reshape(-1, seq_len) y = local[1:].reshape(-1, seq_len) return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True) - -# ----------------------------- -# TRANSFORMER MODULES -# ----------------------------- - class RMSNorm(nn.Module): def __init__(self, eps: float | None = None): super().__init__() self.eps = eps - def forward(self, x: Tensor) -> Tensor: return F.rms_norm(x, (x.size(-1),), eps=self.eps) - - class CastedLinear(nn.Linear): - # Keep weights in fp32 for optimizer/state quality, cast at matmul time for bf16 compute. + _qat_enabled: bool = False + _soft_tau: float = 1000.0 # High = hard round; low = soft (annealed during QAT) def forward(self, x: Tensor) -> Tensor: + w = self.weight.to(x.dtype) + if CastedLinear._qat_enabled and self.training and w.ndim == 2: + w32 = self.weight.float() + row_max = w32.abs().amax(dim=1).detach() + scale = (row_max / 31.0).clamp_min(1.0 / 31.0) + x_norm = w32 / scale[:, None] + # Hard quantized value (forward) + q_hard = torch.clamp(torch.round(x_norm), -31, 31) + # Soft interpolation (backward) for gradient signal + x_floor = x_norm.detach().floor() + frac = x_norm - x_floor + p = torch.sigmoid((frac - 0.5) / max(CastedLinear._soft_tau, 0.01)) + q_soft = torch.clamp(x_floor.detach() + p, -31, 31) + # STE: hard forward, soft backward + q = q_hard.detach() + (q_soft - q_soft.detach()) + w_q = (q * scale[:, None]).to(x.dtype) + w = w + (w_q - w).detach() bias = self.bias.to(x.dtype) if self.bias is not None else None - return F.linear(x, self.weight.to(x.dtype), bias) - - + return F.linear(x, w, bias) def restore_low_dim_params_to_fp32(module: nn.Module) -> None: - # Keep small/control parameters in fp32 even when the model body runs in bf16. with torch.no_grad(): for name, param in module.named_parameters(): if (param.ndim < 2 or any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS)) and param.dtype != torch.float32: param.data = param.data.float() - - class Rotary(nn.Module): - # Caches cos/sin tables per sequence length on the current device. 
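+    # rope_dims > 0 enables partial RoPE (only the first rope_dims dims rotate);
+    # for seq_len > train_seq_len the base is rescaled by scale**(rd/(rd-2))
+    # (the NTK-aware RoPE rule) so frequencies interpolate rather than extrapolate.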
- def __init__(self, dim: int, base: float = 10000.0): + def __init__(self, dim: int, base: float = 10000.0, train_seq_len: int = 1024, rope_dims: int = 0): super().__init__() - inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)) + self.dim = dim + self.base = base + self.train_seq_len = train_seq_len + self.rope_dims = rope_dims if rope_dims > 0 else dim + inv_freq = 1.0 / (base ** (torch.arange(0, self.rope_dims, 2, dtype=torch.float32) / self.rope_dims)) self.register_buffer("inv_freq", inv_freq, persistent=False) self._seq_len_cached = 0 self._cos_cached: Tensor | None = None self._sin_cached: Tensor | None = None - def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tuple[Tensor, Tensor]: if ( self._cos_cached is None @@ -538,20 +478,29 @@ def forward(self, seq_len: int, device: torch.device, dtype: torch.dtype) -> tup or self._seq_len_cached != seq_len or self._cos_cached.device != device ): - t = torch.arange(seq_len, device=device, dtype=self.inv_freq.dtype) - freqs = torch.outer(t, self.inv_freq.to(device)) - self._cos_cached = freqs.cos()[None, None, :, :] - self._sin_cached = freqs.sin()[None, None, :, :] + rd = self.rope_dims + if seq_len > self.train_seq_len: + scale = seq_len / self.train_seq_len + new_base = self.base * (scale ** (rd / (rd - 2))) + inv_freq = 1.0 / (new_base ** (torch.arange(0, rd, 2, dtype=torch.float32, device=device) / rd)) + else: + inv_freq = self.inv_freq.to(device) + t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) + freqs = torch.outer(t, inv_freq) + self._cos_cached = freqs.cos()[None, :, None, :] + self._sin_cached = freqs.sin()[None, :, None, :] self._seq_len_cached = seq_len return self._cos_cached.to(dtype=dtype), self._sin_cached.to(dtype=dtype) - - -def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor) -> Tensor: +def apply_rotary_emb(x: Tensor, cos: Tensor, sin: Tensor, rope_dims: int = 0) -> Tensor: + if rope_dims > 0 and rope_dims < x.size(-1): + x_rope, x_pass = x[..., :rope_dims], x[..., rope_dims:] + half = rope_dims // 2 + x1, x2 = x_rope[..., :half], x_rope[..., half:] + x_rope = torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) + return torch.cat((x_rope, x_pass), dim=-1) half = x.size(-1) // 2 x1, x2 = x[..., :half], x[..., half:] return torch.cat((x1 * cos + x2 * sin, x1 * (-sin) + x2 * cos), dim=-1) - - class CausalSelfAttention(nn.Module): def __init__( self, @@ -578,45 +527,104 @@ def __init__( self.proj = CastedLinear(dim, dim, bias=False) self.proj._zero_init = True self.q_gain = nn.Parameter(torch.full((num_heads,), qk_gain_init, dtype=torch.float32)) - self.rotary = Rotary(self.head_dim, base=rope_base) - - def forward(self, x: Tensor) -> Tensor: + self.rope_dims = 0 # set by GPT.__init__ for partial RoPE + self.rotary = Rotary(self.head_dim, base=rope_base, train_seq_len=1024) + self.use_xsa = False # set by GPT.__init__ for deep layers only + self.use_vrl = False # set by GPT.__init__; VRL on all layers except first + def _xsa_efficient(self, y: Tensor, v: Tensor) -> Tensor: + """Efficient XSA: subtract self-value projection via GQA-aware reshape (no repeat_interleave). + y: [B, T, H, D], v: [B, T, Hkv, D]. 
H must be divisible by Hkv.""" + B, T, H, D = y.shape + Hkv = v.size(-2) + group = H // Hkv + y_g = y.reshape(B, T, Hkv, group, D) # [B, T, Hkv, group, D] + vn = F.normalize(v, dim=-1).unsqueeze(-2) # [B, T, Hkv, 1, D] — broadcast ready + proj = (y_g * vn).sum(dim=-1, keepdim=True) * vn + return (y_g - proj).reshape(B, T, H, D) + def forward(self, x: Tensor, v_embed: Tensor | None = None, q_delta: Tensor | None = None, v_delta: Tensor | None = None, v0: Tensor | None = None) -> tuple[Tensor, Tensor]: bsz, seqlen, dim = x.shape - q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim).transpose(1, 2) - k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2) - v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(1, 2) + q = self.c_q(x) + if q_delta is not None: + q = q + q_delta + q = q.reshape(bsz, seqlen, self.num_heads, self.head_dim) + k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + v = self.c_v(x) + if v_embed is not None: + v = v + v_embed + if v_delta is not None: + v = v + v_delta + v = v.reshape(bsz, seqlen, self.num_kv_heads, self.head_dim) + raw_v = v # cache for VRL before blending + if self.use_vrl and v0 is not None: + lam = self.vrl_lambda.to(dtype=v.dtype) + v = lam[0] * v0 + lam[1] * v q = F.rms_norm(q, (q.size(-1),)) k = F.rms_norm(k, (k.size(-1),)) cos, sin = self.rotary(seqlen, x.device, q.dtype) - q = apply_rotary_emb(q, cos, sin) - k = apply_rotary_emb(k, cos, sin) - q = q * self.q_gain.to(dtype=q.dtype)[None, :, None, None] - y = F.scaled_dot_product_attention( - q, - k, - v, - attn_mask=None, - is_causal=True, - enable_gqa=(self.num_kv_heads != self.num_heads), - ) - y = y.transpose(1, 2).contiguous().reshape(bsz, seqlen, dim) - return self.proj(y) - - + q = apply_rotary_emb(q, cos, sin, self.rope_dims) + k = apply_rotary_emb(k, cos, sin, self.rope_dims) + q = q * self.q_gain.to(dtype=q.dtype)[None, None, :, None] + y = flash_attn_3_func(q, k, v, causal=True) + if self.use_xsa: + y = self._xsa_efficient(y, v) + y = y.reshape(bsz, seqlen, dim) + return self.proj(y), raw_v +class SmearGate(nn.Module): + def __init__(self, dim: int): + super().__init__() + self.gate = nn.Parameter(torch.zeros(dim, dtype=torch.float32)) + def forward(self, x: Tensor) -> Tensor: + g = torch.sigmoid(self.gate.to(dtype=x.dtype))[None, None, :] + x_prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1) + return (1 - g) * x + g * x_prev +class BigramHashEmbedding(nn.Module): + def __init__(self, bigram_vocab_size: int, bigram_dim: int, model_dim: int): + super().__init__() + self.bigram_vocab_size = bigram_vocab_size + self.embed = nn.Embedding(bigram_vocab_size, bigram_dim) + nn.init.zeros_(self.embed.weight) + self.proj = CastedLinear(bigram_dim, model_dim, bias=False) if bigram_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.05, dtype=torch.float32)) + def bigram_hash(self, tokens: Tensor) -> Tensor: + t = tokens.to(torch.int32) + mod = self.bigram_vocab_size - 1 + out = torch.empty_like(t) + out[..., 0] = mod + out[..., 1:] = torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod + return out.long() + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(self.bigram_hash(token_ids)) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) +class ValueEmbedding(nn.Module): + """Reinject token identity into attention values at specific layers. 
+ Each table maps vocab tokens to a low-dim embedding, projected to model_dim.""" + def __init__(self, vocab_size: int, ve_dim: int, model_dim: int): + super().__init__() + self.embed = nn.Embedding(vocab_size, ve_dim) + nn.init.normal_(self.embed.weight, std=0.01) + self.proj = CastedLinear(ve_dim, model_dim, bias=False) if ve_dim != model_dim else None + if self.proj is not None: + nn.init.zeros_(self.proj.weight) + self.scale = nn.Parameter(torch.tensor(0.1, dtype=torch.float32)) + def forward(self, token_ids: Tensor) -> Tensor: + h = self.embed(token_ids) + if self.proj is not None: + h = self.proj(h) + return h * self.scale.to(dtype=h.dtype) class MLP(nn.Module): - # relu^2 MLP from the original modded-nanogpt setup def __init__(self, dim: int, mlp_mult: int): super().__init__() - hidden = mlp_mult * dim + hidden = int(mlp_mult * dim) self.fc = CastedLinear(dim, hidden, bias=False) self.proj = CastedLinear(hidden, dim, bias=False) self.proj._zero_init = True - def forward(self, x: Tensor) -> Tensor: - x = torch.relu(self.fc(x)) + x = F.leaky_relu(self.fc(x), negative_slope=0.5) return self.proj(x.square()) - - class Block(nn.Module): def __init__( self, @@ -626,6 +634,9 @@ def __init__( mlp_mult: int, rope_base: float, qk_gain_init: float, + layer_idx: int = 0, + ln_scale: bool = False, + dtg: bool = False, ): super().__init__() self.attn_norm = RMSNorm() @@ -635,16 +646,26 @@ def __init__( self.attn_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) self.mlp_scale = nn.Parameter(torch.ones(dim, dtype=torch.float32)) self.resid_mix = nn.Parameter(torch.stack((torch.ones(dim), torch.zeros(dim))).float()) - - def forward(self, x: Tensor, x0: Tensor) -> Tensor: + self.ln_scale_factor = 1.0 / math.sqrt(layer_idx + 1) if ln_scale else 1.0 + if dtg: + self.dtg_gate = nn.Linear(dim, 1, bias=True) + nn.init.zeros_(self.dtg_gate.weight) + nn.init.constant_(self.dtg_gate.bias, 2.0) + else: + self.dtg_gate = None + def forward(self, x: Tensor, x0: Tensor, v_embed: Tensor | None = None, q_delta_fn=None, v_delta_fn=None, v0: Tensor | None = None) -> tuple[Tensor, Tensor]: mix = self.resid_mix.to(dtype=x.dtype) - x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 - attn_out = self.attn(self.attn_norm(x)) - x = x + self.attn_scale.to(dtype=x.dtype)[None, None, :] * attn_out - x = x + self.mlp_scale.to(dtype=x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x)) - return x - - + x_in = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + n = self.attn_norm(x_in) * self.ln_scale_factor + qd = q_delta_fn(n) if q_delta_fn is not None else None + vd = v_delta_fn(n) if v_delta_fn is not None else None + attn_out, raw_v = self.attn(n, v_embed=v_embed, q_delta=qd, v_delta=vd, v0=v0) + x_out = x_in + self.attn_scale.to(dtype=x_in.dtype)[None, None, :] * attn_out + x_out = x_out + self.mlp_scale.to(dtype=x_out.dtype)[None, None, :] * self.mlp(self.mlp_norm(x_out) * self.ln_scale_factor) + if self.dtg_gate is not None: + gate = torch.sigmoid(self.dtg_gate(x_in.detach())) + x_out = x_in + gate * (x_out - x_in) + return x_out, raw_v class GPT(nn.Module): def __init__( self, @@ -659,14 +680,32 @@ def __init__( logit_softcap: float, rope_base: float, qk_gain_init: float, + mtp_num_heads: int = 0, + mtp_loss_weight: float = 0.1, + bigram_vocab_size: int = 0, + bigram_dim: int = 128, + xsa_last_n: int = 0, + rope_dims: int = 0, + ln_scale: bool = False, + dtg: bool = False, + ve_enabled: bool = False, + ve_dim: int = 128, + ve_layers: str = "9,10", + use_vrl: bool = False, ): super().__init__() + 
self.use_vrl = use_vrl + self._ve_target_dim = num_kv_heads * (model_dim // num_heads) # kv_dim for value projection if logit_softcap <= 0.0: raise ValueError(f"logit_softcap must be positive, got {logit_softcap}") self.tie_embeddings = tie_embeddings self.tied_embed_init_std = tied_embed_init_std self.logit_softcap = logit_softcap + self.mtp_num_heads = mtp_num_heads + self.mtp_loss_weight = mtp_loss_weight self.tok_emb = nn.Embedding(vocab_size, model_dim) + self.bigram = BigramHashEmbedding(bigram_vocab_size, bigram_dim, model_dim) if bigram_vocab_size > 0 else None + self.smear = SmearGate(model_dim) self.num_encoder_layers = num_layers // 2 self.num_decoder_layers = num_layers - self.num_encoder_layers self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers) @@ -680,65 +719,512 @@ def __init__( mlp_mult, rope_base, qk_gain_init, + layer_idx=i, + ln_scale=ln_scale, + dtg=dtg, ) for i in range(num_layers) ] ) + if rope_dims > 0: + head_dim = model_dim // num_heads + for block in self.blocks: + block.attn.rope_dims = rope_dims + block.attn.rotary = Rotary(head_dim, base=rope_base, train_seq_len=1024, rope_dims=rope_dims) + # VRL: Value Residual Learning — blend layer 0's V into all subsequent layers + if use_vrl: + for i, block in enumerate(self.blocks): + if i > 0: # layer 0 produces v0, all others blend + block.attn.use_vrl = True + block.attn.vrl_lambda = nn.Parameter(torch.tensor([0.01, 0.99], dtype=torch.float32)) + self.ve_layer_indices = [int(x) for x in ve_layers.split(",") if x.strip()] if ve_enabled else [] + kv_dim = self._ve_target_dim + if self.ve_layer_indices: + self.ve_shared = ValueEmbedding(vocab_size, ve_dim, kv_dim) + self.ve_layer_scales = nn.ParameterList( + [nn.Parameter(torch.ones(1, dtype=torch.float32)) for _ in self.ve_layer_indices] + ) + else: + self.ve_shared = None + self.ve_layer_scales = nn.ParameterList() + self.value_embeds = nn.ModuleList() # keep empty for compat self.final_norm = RMSNorm() self.lm_head = None if tie_embeddings else CastedLinear(model_dim, vocab_size, bias=False) if self.lm_head is not None: self.lm_head._zero_init = True + self.mtp_heads = nn.ModuleList( + [CastedLinear(model_dim, vocab_size, bias=False) for _ in range(mtp_num_heads)] + ) + for head in self.mtp_heads: + head._zero_init = True + if xsa_last_n > 0: + for i in range(max(0, num_layers - xsa_last_n), num_layers): + self.blocks[i].attn.use_xsa = True self._init_weights() - def _init_weights(self) -> None: if self.tie_embeddings: nn.init.normal_(self.tok_emb.weight, mean=0.0, std=self.tied_embed_init_std) - for module in self.modules(): - if isinstance(module, nn.Linear) and getattr(module, "_zero_init", False): - nn.init.zeros_(module.weight) - - def forward(self, input_ids: Tensor, target_ids: Tensor) -> Tensor: + num_layers = len(self.blocks) + for name, module in self.named_modules(): + if isinstance(module, nn.Linear): + if getattr(module, "_zero_init", False): + nn.init.zeros_(module.weight) + elif module.weight.ndim == 2 and module.weight.shape[0] >= 64 and module.weight.shape[1] >= 64: + nn.init.orthogonal_(module.weight, gain=1.0) + if ".proj." 
in name or name.endswith(".proj"): + with torch.no_grad(): + module.weight.mul_(1.0 / math.sqrt(2 * num_layers)) + def _get_ve(self, layer_idx: int, input_ids: Tensor, ve_cache: dict | None = None) -> Tensor | None: + """Get value embedding for a specific layer using shared table + per-layer scale.""" + if self.ve_shared is None or layer_idx not in self.ve_layer_indices: + return None + if ve_cache is not None and 've' not in ve_cache: + ve_cache['ve'] = self.ve_shared(input_ids) + ve_base = ve_cache['ve'] if ve_cache is not None else self.ve_shared(input_ids) + ve_idx = self.ve_layer_indices.index(layer_idx) + return ve_base * self.ve_layer_scales[ve_idx].to(dtype=ve_base.dtype) + def forward(self, input_ids: Tensor, target_ids: Tensor, lora=None) -> Tensor: x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) x0 = x skips: list[Tensor] = [] - - # First half stores skips; second half reuses them in reverse order. + ve_cache: dict = {} + v0 = None # VRL: cached V from first layer for i in range(self.num_encoder_layers): - x = self.blocks[i](x, x0) + ve = self._get_ve(i, input_ids, ve_cache) + qd = lora.q_loras[i] if lora else None + vd = lora.v_loras[i] if lora else None + x, raw_v = self.blocks[i](x, x0, v_embed=ve, q_delta_fn=qd, v_delta_fn=vd, v0=v0) + if i == 0 and self.use_vrl: + v0 = raw_v skips.append(x) for i in range(self.num_decoder_layers): + bi = self.num_encoder_layers + i if skips: x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop() - x = self.blocks[self.num_encoder_layers + i](x, x0) - - x = self.final_norm(x).reshape(-1, x.size(-1)) + ve = self._get_ve(bi, input_ids, ve_cache) + qd = lora.q_loras[bi] if lora else None + vd = lora.v_loras[bi] if lora else None + x, _ = self.blocks[bi](x, x0, v_embed=ve, q_delta_fn=qd, v_delta_fn=vd, v0=v0) + x = self.final_norm(x) + x_flat = x.reshape(-1, x.size(-1)) targets = target_ids.reshape(-1) if self.tie_embeddings: - logits_proj = F.linear(x, self.tok_emb.weight) + logits_proj = F.linear(x_flat, self.tok_emb.weight) else: if self.lm_head is None: raise RuntimeError("lm_head is required when tie_embeddings=False") + logits_proj = self.lm_head(x_flat) + logits_proj = logits_proj + (lora.lm_head_lora(x).reshape(-1, logits_proj.size(-1)) if lora else 0) + logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) + if lora: + bsz, sl, V = logits_proj.shape[0] // target_ids.shape[1], target_ids.shape[1], logits_proj.shape[-1] + return F.cross_entropy(logits.float(), targets, reduction="none").reshape(bsz, sl) + main_loss = F.cross_entropy(logits.float(), targets, reduction="mean") + if self.training and self.mtp_num_heads > 0 and self.mtp_loss_weight > 0.0: + _, seqlen, dim = x.shape + mtp_loss_sum = x.new_zeros(()) + mtp_loss_count = 0 + for k, mtp_head in enumerate(self.mtp_heads): + valid_t = seqlen - (k + 1) + if valid_t <= 0: + continue + mtp_hidden = x[:, :valid_t, :].reshape(-1, dim) + mtp_targets = target_ids[:, k + 1 :].reshape(-1) + mtp_logits_proj = mtp_head(mtp_hidden) + mtp_logits = self.logit_softcap * torch.tanh(mtp_logits_proj / self.logit_softcap) + mtp_loss_sum = mtp_loss_sum + F.cross_entropy(mtp_logits.float(), mtp_targets, reduction="mean") + mtp_loss_count += 1 + if mtp_loss_count > 0: + main_loss = main_loss + self.mtp_loss_weight * (mtp_loss_sum / mtp_loss_count) + return main_loss + def forward_logits(self, input_ids: Tensor, return_hidden: bool = False): + """Return logits (bsz, 
seq_len, vocab) without computing loss.""" + x = self.tok_emb(input_ids) + if self.bigram is not None: + x = x + self.bigram(input_ids) + x = F.rms_norm(x, (x.size(-1),)) + x = self.smear(x) + x0 = x + skips: list[Tensor] = [] + ve_cache: dict = {} + v0 = None # VRL: cached V from first layer + for i in range(self.num_encoder_layers): + ve = self._get_ve(i, input_ids, ve_cache) + x, raw_v = self.blocks[i](x, x0, v_embed=ve, v0=v0) + if i == 0 and self.use_vrl: + v0 = raw_v + skips.append(x) + for i in range(self.num_decoder_layers): + bi = self.num_encoder_layers + i + if skips: + x = x + self.skip_weights[i].to(dtype=x.dtype)[None, None, :] * skips.pop() + ve = self._get_ve(bi, input_ids, ve_cache) + x, _ = self.blocks[bi](x, x0, v_embed=ve, v0=v0) + x = self.final_norm(x) + if self.tie_embeddings: + logits_proj = F.linear(x, self.tok_emb.weight) + else: logits_proj = self.lm_head(x) logits = self.logit_softcap * torch.tanh(logits_proj / self.logit_softcap) - return F.cross_entropy(logits.float(), targets, reduction="mean") + if return_hidden: + return logits, x + return logits +def eval_val_sliding_ttt( + args: Hyperparameters, + base_model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + val_tokens: Tensor, + base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + stride: int = 64, + batch_seqs: int = 32, + log_fn=None, +) -> tuple[float, float]: + """Legal score-first TTT (PR #461/549 recipe): score each 32K chunk with + sliding windows, then train on it. Every token scored BEFORE any update + that could use it. Model synchronized across GPUs via all-reduce.""" + seq_len = args.train_seq_len + total_tokens = val_tokens.numel() - 1 + ttt_chunk = args.ttt_chunk_tokens + + # Pre-compute all window starts + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= stride or ws == 0] + + # Assign each window to a chunk based on the first token it scores + num_chunks = (total_tokens + ttt_chunk - 1) // ttt_chunk + chunk_windows: list[list[int]] = [[] for _ in range(num_chunks)] + for ws in window_starts: + end = min(ws + seq_len, total_tokens) + wlen = end - ws + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_start = ws + s + ci = min(scored_start // ttt_chunk, num_chunks - 1) + chunk_windows[ci].append(ws) + + if log_fn: + log_fn(f"ttt_sliding:start chunks={num_chunks} chunk_tokens={ttt_chunk} " + f"total_windows={len(window_starts)} stride={stride} " + f"ttt_lr={args.ttt_lr} ttt_epochs={args.ttt_epochs} " + f"freeze_blocks={args.ttt_freeze_blocks}") + + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + + # Freeze first N blocks + optionally embeddings + frozen_block_ids = set(range(min(args.ttt_freeze_blocks, len(base_model.blocks)))) + ttt_params = [] + for name, p in base_model.named_parameters(): + freeze = any(f"blocks.{bi}." 
in name for bi in frozen_block_ids) + # Freeze embeddings during TTT: adapting vocab embeddings to a local chunk + # distorts representations for tokens not in that chunk + if args.ttt_freeze_embeddings and any(k in name for k in ("tok_emb", "bigram", "lm_head")): + freeze = True + if freeze: + p.requires_grad_(False) + else: + p.requires_grad_(True) + ttt_params.append(p) + + if log_fn: + log_fn(f"ttt_sliding:params unfrozen={sum(p.numel() for p in ttt_params)} " + f"frozen={sum(p.numel() for p in base_model.parameters() if not p.requires_grad)}") + + optimizer = torch.optim.SGD(ttt_params, lr=args.ttt_lr, momentum=args.ttt_momentum) + # Precompute layer keys for EB-adaptive TTT + if args.eb_ttt: + ttt_param_layer_keys: list[str] = [] + for name, p in base_model.named_parameters(): + if not p.requires_grad: + continue + parts = name.split(".") + lk = f"{parts[0]}.{parts[1]}" if len(parts) > 1 and parts[1].isdigit() else parts[0] + ttt_param_layer_keys.append(lk) + t0 = time.perf_counter() + + for ci in range(num_chunks): + windows = chunk_windows[ci] + if not windows: + continue + chunk_start = ci * ttt_chunk + chunk_end = min((ci + 1) * ttt_chunk, total_tokens) + + # --- Phase 1: SCORE this chunk's windows (inference_mode) --- + my_s = (len(windows) * rank) // world_size + my_e = (len(windows) * (rank + 1)) // world_size + my_windows = windows[my_s:my_e] + + base_model.eval() + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk_tok = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk_tok[:-1] + y_batch[i, :wlen] = chunk_tok[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = base_model.forward_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt, prev = y_batch[i, s:wlen], x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + + # --- Phase 2: TRAIN on this chunk (already scored = legal) --- + is_last_chunk = (ci == num_chunks - 1) + if not is_last_chunk and args.ttt_epochs > 0: + base_model.train() + chunk_seqs = (chunk_end - chunk_start) // seq_len + if chunk_seqs > 0: + cos_lr = args.ttt_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1))) + for pg in optimizer.param_groups: + pg['lr'] = cos_lr + my_seq_s = (chunk_seqs * rank) // world_size + my_seq_e = (chunk_seqs * (rank + 1)) // world_size + my_chunk_seqs = my_seq_e - my_seq_s + for _ep in range(args.ttt_epochs): + for bs in range(0, my_chunk_seqs, args.ttt_train_batch_seqs): + be = min(bs + args.ttt_train_batch_seqs, my_chunk_seqs) + actual_bs = my_seq_s + bs + start_tok = chunk_start + actual_bs * seq_len + end_tok = chunk_start + (my_seq_s + be) * seq_len + 1 + if end_tok > val_tokens.numel(): + continue + local = 
val_tokens[start_tok:end_tok].to(device=device, dtype=torch.int64) + x = local[:-1].reshape(-1, seq_len) + y = local[1:].reshape(-1, seq_len) + optimizer.zero_grad(set_to_none=True) + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + loss = base_model(x, y) + loss.backward() + if world_size > 1: + for p in ttt_params: + if p.grad is not None: + dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) + # Empirical Bayes adaptive TTT: scale gradients per-layer by SNR + # High SNR (consistent direction) → amplify; Low SNR → stay at prior + if args.eb_ttt: + with torch.no_grad(): + layer_grads: dict[str, list[Tensor]] = {} + for pi, p in enumerate(ttt_params): + if p.grad is None: + continue + lk = ttt_param_layer_keys[pi] + if lk not in layer_grads: + layer_grads[lk] = [] + layer_grads[lk].append(p.grad) + layer_scales: dict[str, float] = {} + for lk, grads in layer_grads.items(): + flat = torch.cat([g.float().flatten() for g in grads]) + snr = (flat.abs().mean() / (flat.std() + 1e-8)).item() + layer_scales[lk] = max(args.eb_ttt_min, min(args.eb_ttt_max, snr)) + for pi, p in enumerate(ttt_params): + if p.grad is not None: + p.grad.mul_(layer_scales.get(ttt_param_layer_keys[pi], 1.0)) + torch.nn.utils.clip_grad_norm_(ttt_params, args.ttt_grad_clip) + optimizer.step() + + if log_fn and (ci % 10 == 0 or ci == num_chunks - 1): + elapsed = time.perf_counter() - t0 + rbpb = float((loss_sum / math.log(2.0)) / byte_count) if byte_count > 0 else 0.0 + log_fn(f" ttt_chunk [{ci+1}/{num_chunks}] bpb={rbpb:.6f} time={elapsed:.1f}s") + if args.eb_ttt and ci % 100 == 0 and 'layer_scales' in dir(): + log_fn(f" eb_scales: {' '.join(f'{k}={v:.2f}' for k, v in sorted(layer_scales.items()))}") + + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + + val_loss = (loss_sum / token_count).item() + val_bpb = val_loss / math.log(2.0) * (token_count.item() / byte_count.item()) + + for p in base_model.parameters(): + p.requires_grad_(True) + base_model.eval() + if log_fn: + log_fn(f"ttt_sliding:done val_loss={val_loss:.6f} val_bpb={val_bpb:.6f} " + f"elapsed={time.perf_counter() - t0:.1f}s") + return val_loss, val_bpb -# ----------------------------- -# TRAINING -# ----------------------------- +def eval_val_sliding( + args: Hyperparameters, + base_model: nn.Module, + rank: int, + world_size: int, + device: torch.device, + val_tokens: Tensor, + base_bytes_lut: Tensor, + has_leading_space_lut: Tensor, + is_boundary_token_lut: Tensor, + stride: int, + batch_seqs: int = 32, + eval_seq_len: int | None = None, +) -> tuple[float, float]: + """Sliding window evaluation: each token scored with maximum context.""" + seq_len = eval_seq_len or args.train_seq_len + total_tokens = val_tokens.numel() - 1 + window_starts = [ws for ws in range(0, total_tokens, stride) + if min(ws + seq_len, total_tokens) - ws >= 1] + total_windows = len(window_starts) + my_s = (total_windows * rank) // world_size + my_e = (total_windows * (rank + 1)) // world_size + my_windows = window_starts[my_s:my_e] + loss_sum = torch.zeros((), device=device, dtype=torch.float64) + token_count = torch.zeros((), device=device, dtype=torch.float64) + byte_count = torch.zeros((), device=device, dtype=torch.float64) + base_model.eval() + compiled_logits = torch.compile(base_model.forward_logits, dynamic=False, fullgraph=True) + with torch.inference_mode(): + for bi in range(0, len(my_windows), batch_seqs): + 
batch_ws = my_windows[bi:bi + batch_seqs] + bsz = len(batch_ws) + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + for i, ws in enumerate(batch_ws): + end = min(ws + seq_len, total_tokens) + wlen = end - ws + wlens.append(wlen) + chunk = val_tokens[ws:end + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = compiled_logits(x_batch) + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), + reduction="none", + ).reshape(bsz, seq_len) + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else max(wlen - stride, 0) + scored_nll = nll[i, s:wlen].to(torch.float64) + loss_sum += scored_nll.sum() + token_count += float(wlen - s) + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + tb = base_bytes_lut[tgt].to(torch.float64) + tb += (has_leading_space_lut[tgt] & ~is_boundary_token_lut[prev]).to(torch.float64) + byte_count += tb.sum() + if dist.is_available() and dist.is_initialized(): + dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM) + dist.all_reduce(token_count, op=dist.ReduceOp.SUM) + dist.all_reduce(byte_count, op=dist.ReduceOp.SUM) + val_loss = (loss_sum / token_count).item() + bits_per_token = val_loss / math.log(2.0) + tokens_per_byte = token_count.item() / byte_count.item() + base_model.train() + return val_loss, bits_per_token * tokens_per_byte +def _classify_param(name: str) -> str: + if "tok_emb" in name or "lm_head" in name: + return "embed" + if ".mlp." in name: + return "mlp" + if ".attn." in name or (".proj." in name and ".mlp." not in name): + return "attn" + return "other" +def quantize_int6_per_row(t: Tensor, clip_range: int = 31) -> tuple[Tensor, Tensor]: + t32 = t.float() + if t32.ndim == 2: + best_q, best_s, best_err = None, None, float('inf') + for pct in [0.9990, 0.9995, 0.9999, 0.99999, 1.0]: + if pct < 1.0: + row_clip = torch.quantile(t32.abs(), pct, dim=1) + else: + row_clip = t32.abs().amax(dim=1) + s = (row_clip / clip_range).clamp_min(1.0 / clip_range).to(torch.float16) + q = torch.clamp(torch.round(t32 / s.float()[:, None]), -clip_range, clip_range).to(torch.int8) + recon = q.float() * s.float()[:, None] + err = (t32 - recon).pow(2).mean().item() + if err < best_err: + best_q, best_s, best_err = q, s, err + return best_q, best_s + amax = t32.abs().max().item() + scale = torch.tensor(amax / clip_range if amax > 0 else 1.0, dtype=torch.float16) + q = torch.clamp(torch.round(t32 / scale.float()), -clip_range, clip_range).to(torch.int8) + return q, scale +def mixed_quantize_int6(state_dict: dict[str, Tensor], int6_cats: set[str]): + num_layers_total = max( + (int(k.split(".")[1]) for k in state_dict if k.startswith("blocks.")), + default=0, + ) + 1 + late_k_layers = set(range(num_layers_total - 2, num_layers_total)) + result: dict[str, Tensor] = {} + meta: dict[str, object] = {} + for name, tensor in state_dict.items(): + t = tensor.detach().cpu().contiguous() + cat = _classify_param(name) + if not t.is_floating_point() or t.numel() <= 65536: + result[name] = t.to(torch.float16) if t.is_floating_point() else t + meta[name] = "passthrough" + continue + if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS): + result[name] = t.float() + meta[name] = "passthrough_ctrl" + continue + if cat in int6_cats and t.ndim >= 1: + q, s = quantize_int6_per_row(t) + result[name + 
".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int6"} + else: + q, s = quantize_float_tensor(t) + result[name + ".q"] = q + result[name + ".scale"] = s + meta[name] = {"type": "int8"} + return result, meta +def dequantize_mixed_int6(result: dict[str, Tensor], meta: dict[str, object], + template_sd: dict[str, Tensor]) -> dict[str, Tensor]: + out: dict[str, Tensor] = {} + for name, orig in template_sd.items(): + info = meta.get(name) + if info is None: + continue + orig_dtype = orig.dtype + if info in ("passthrough", "passthrough_ctrl", "passthrough_fp16"): + t = result[name] + if t.dtype == torch.float16 and orig_dtype in (torch.float32, torch.bfloat16): + t = t.to(orig_dtype) + out[name] = t + continue + q, s = result[name + ".q"], result[name + ".scale"] + if s.ndim > 0: + out[name] = (q.float() * s.float().view(q.shape[0], *([1] * (q.ndim - 1)))).to(orig_dtype) + else: + out[name] = (q.float() * float(s.item())).to(orig_dtype) + return out def main() -> None: global zeropower_via_newtonschulz5 - code = Path(__file__).read_text(encoding="utf-8") args = Hyperparameters() zeropower_via_newtonschulz5 = torch.compile(zeropower_via_newtonschulz5) - - # ----------------------------- - # DISTRIBUTED + CUDA SETUP - # ----------------------------- - distributed = "RANK" in os.environ and "WORLD_SIZE" in os.environ rank = int(os.environ.get("RANK", "0")) world_size = int(os.environ.get("WORLD_SIZE", "1")) @@ -757,23 +1243,18 @@ def main() -> None: dist.init_process_group(backend="nccl", device_id=device) dist.barrier() master_process = rank == 0 - - # Fast math knobs torch.backends.cuda.matmul.allow_tf32 = True torch.backends.cudnn.allow_tf32 = True from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp - enable_cudnn_sdp(False) enable_flash_sdp(True) enable_mem_efficient_sdp(False) enable_math_sdp(False) - logfile = None if master_process: os.makedirs("logs", exist_ok=True) logfile = f"logs/{args.run_id}.txt" print(logfile) - def log0(msg: str, console: bool = True) -> None: if not master_process: return @@ -782,7 +1263,6 @@ def log0(msg: str, console: bool = True) -> None: if logfile is not None: with open(logfile, "a", encoding="utf-8") as f: print(msg, file=f) - log0(code, console=False) log0("=" * 100, console=False) log0(f"Running Python {sys.version}", console=False) @@ -792,16 +1272,10 @@ def log0(msg: str, console: bool = True) -> None: console=False, ) log0("=" * 100, console=False) - - # ----------------------------- - # TOKENIZER + VALIDATION METRIC SETUP - # ----------------------------- - random.seed(args.seed) np.random.seed(args.seed) torch.manual_seed(args.seed) torch.cuda.manual_seed_all(args.seed) - if not args.tokenizer_path.endswith(".model"): raise ValueError(f"Script only setup for SentencePiece .model file: {args.tokenizer_path}") sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) @@ -811,18 +1285,16 @@ def log0(msg: str, console: bool = True) -> None: ) dataset_dir = Path(args.data_path).resolve() actual_train_files = len(list(dataset_dir.glob("fineweb_train_*.bin"))) - val_tokens = load_validation_tokens(args.val_files, args.train_seq_len) + effective_eval_seq_len = args.eval_seq_len if args.eval_seq_len > 0 else args.train_seq_len + val_seq_len = max(args.train_seq_len, effective_eval_seq_len) + val_tokens = load_validation_tokens(args.val_files, val_seq_len) base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts( sp, args.vocab_size, device ) 
log0(f"val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path={args.tokenizer_path}") log0(f"train_loader:dataset:{dataset_dir.name} train_shards:{actual_train_files}") log0(f"val_loader:shards pattern={args.val_files} tokens:{val_tokens.numel() - 1}") - - # ----------------------------- - # MODEL + OPTIMIZER SETUP - # ----------------------------- - + CastedLinear._qat_enabled = args.qat_enabled base_model = GPT( vocab_size=args.vocab_size, num_layers=args.num_layers, @@ -835,6 +1307,18 @@ def log0(msg: str, console: bool = True) -> None: logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init, + mtp_num_heads=args.mtp_num_heads, + mtp_loss_weight=args.mtp_loss_weight, + bigram_vocab_size=args.bigram_vocab_size, + bigram_dim=args.bigram_dim, + xsa_last_n=args.xsa_last_n, + rope_dims=args.rope_dims, + ln_scale=args.ln_scale, + dtg=args.dtg_enabled, + ve_enabled=args.ve_enabled, + ve_dim=args.ve_dim, + ve_layers=args.ve_layers, + use_vrl=args.vrl, ).to(device).bfloat16() for module in base_model.modules(): if isinstance(module, CastedLinear): @@ -842,18 +1326,14 @@ def log0(msg: str, console: bool = True) -> None: restore_low_dim_params_to_fp32(base_model) compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True) model: nn.Module = DDP(compiled_model, device_ids=[local_rank], broadcast_buffers=False) if distributed else compiled_model - - # Optimizer split: - # - token embedding (Adam) uses EMBED_LR - # - untied lm_head (Adam) uses HEAD_LR - # - matrix params in transformer blocks use MATRIX_LR via Muon - # - vectors/scalars use SCALAR_LR via Adam block_named_params = list(base_model.blocks.named_parameters()) matrix_params = [ p for name, p in block_named_params if p.ndim == 2 and not any(pattern in name for pattern in CONTROL_TENSOR_NAME_PATTERNS) ] + if base_model.mtp_num_heads > 0: + matrix_params.extend([p for p in base_model.mtp_heads.parameters() if p.ndim == 2]) scalar_params = [ p for name, p in block_named_params @@ -861,11 +1341,27 @@ def log0(msg: str, console: bool = True) -> None: ] if base_model.skip_weights.numel() > 0: scalar_params.append(base_model.skip_weights) + scalar_params.append(base_model.smear.gate) + if base_model.bigram is not None: + scalar_params.append(base_model.bigram.scale) token_lr = args.tied_embed_lr if args.tie_embeddings else args.embed_lr - optimizer_tok = torch.optim.Adam( - [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}], + tok_params = [{"params": [base_model.tok_emb.weight], "lr": token_lr, "base_lr": token_lr}] + if base_model.bigram is not None: + tok_params.append({"params": [base_model.bigram.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.bigram.proj is not None: + matrix_params.append(base_model.bigram.proj.weight) + if base_model.ve_shared is not None: + tok_params.append({"params": [base_model.ve_shared.embed.weight], "lr": token_lr, "base_lr": token_lr}) + if base_model.ve_shared.proj is not None: + matrix_params.append(base_model.ve_shared.proj.weight) + scalar_params.append(base_model.ve_shared.scale) + for s in base_model.ve_layer_scales: + scalar_params.append(s) + optimizer_tok = torch.optim.AdamW( + tok_params, betas=(args.beta1, args.beta2), eps=args.adam_eps, + weight_decay=args.adam_wd, fused=True, ) optimizer_muon = Muon( @@ -873,13 +1369,15 @@ def log0(msg: str, console: bool = True) -> None: lr=args.matrix_lr, momentum=args.muon_momentum, backend_steps=args.muon_backend_steps, + weight_decay=args.muon_wd, ) for group in 
optimizer_muon.param_groups: group["base_lr"] = args.matrix_lr - optimizer_scalar = torch.optim.Adam( + optimizer_scalar = torch.optim.AdamW( [{"params": scalar_params, "lr": args.scalar_lr, "base_lr": args.scalar_lr}], betas=(args.beta1, args.beta2), eps=args.adam_eps, + weight_decay=args.adam_wd, fused=True, ) optimizers: list[torch.optim.Optimizer] = [optimizer_tok, optimizer_muon, optimizer_scalar] @@ -891,9 +1389,14 @@ def log0(msg: str, console: bool = True) -> None: fused=True, ) optimizers.insert(1, optimizer_head) - n_params = sum(p.numel() for p in base_model.parameters()) + mtp_params = sum(p.numel() for p in base_model.mtp_heads.parameters()) log0(f"model_params:{n_params}") + log0(f"mtp_num_heads:{args.mtp_num_heads} mtp_loss_weight:{args.mtp_loss_weight} mtp_params:{mtp_params}") + xsa_layers = [i for i, b in enumerate(base_model.blocks) if b.attn.use_xsa] + log0(f"XSA:last_{args.xsa_last_n} active_layers:{xsa_layers}") + vrl_layers = [i for i, b in enumerate(base_model.blocks) if b.attn.use_vrl] + log0(f"VRL:{args.vrl} active_layers:{vrl_layers}") log0(f"world_size:{world_size} grad_accum_steps:{grad_accum_steps}") log0("sdp_backends:cudnn=False flash=True mem_efficient=False math=False") log0(f"attention_mode:gqa num_heads:{args.num_heads} num_kv_heads:{args.num_kv_heads}") @@ -908,19 +1411,11 @@ def log0(msg: str, console: bool = True) -> None: f"max_wallclock_seconds:{args.max_wallclock_seconds:.3f}" ) log0(f"seed:{args.seed}") - - # ----------------------------- - # DATA LOADER & MODEL WARMUP - # ----------------------------- - train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) - def zero_grad_all() -> None: for opt in optimizers: opt.zero_grad(set_to_none=True) - max_wallclock_ms = 1000.0 * args.max_wallclock_seconds if args.max_wallclock_seconds > 0 else None - def lr_mul(step: int, elapsed_ms: float) -> float: if args.warmdown_iters <= 0: return 1.0 @@ -931,9 +1426,6 @@ def lr_mul(step: int, elapsed_ms: float) -> float: warmdown_ms = args.warmdown_iters * step_ms remaining_ms = max(max_wallclock_ms - elapsed_ms, 0.0) return remaining_ms / max(warmdown_ms, 1e-9) if remaining_ms <= warmdown_ms else 1.0 - - # Warmup primes the compiled forward/backward/optimizer paths, then we restore the - # initial weights/optimizer state so measured training starts from the true init. 
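+    # The lr_mul schedule above, summarized: scale stays at 1.0 while the
+    # remaining wallclock budget exceeds warmdown_iters steps at the current
+    # measured pace; once remaining_ms <= warmdown_iters * step_ms, scale
+    # decays linearly to 0 at the cap (scale = remaining_ms / warmdown_ms).
+    # Hypothetical numbers: at 150 ms/step with warmdown_iters = 1000, the
+    # linear warmdown begins roughly 150 s before max_wallclock_seconds.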
if args.warmup_steps > 0: initial_model_state = {name: tensor.detach().cpu().clone() for name, tensor in base_model.state_dict().items()} initial_optimizer_states = [copy.deepcopy(opt.state_dict()) for opt in optimizers] @@ -959,20 +1451,17 @@ def lr_mul(step: int, elapsed_ms: float) -> float: if distributed: model.require_backward_grad_sync = True train_loader = DistributedTokenLoader(args.train_files, rank, world_size, device) - - # ----------------------------- - # MAIN TRAINING LOOP - # ----------------------------- - + swa_state: dict[str, Tensor] | None = None + swa_count = 0 + ema_state = {name: t.detach().float().clone() for name, t in base_model.state_dict().items()} + ema_decay = 0.997 training_time_ms = 0.0 stop_after_step: int | None = None torch.cuda.synchronize() t0 = time.perf_counter() - step = 0 while True: last_step = step == args.iterations or (stop_after_step is not None and step >= stop_after_step) - should_validate = last_step or (args.val_loss_every > 0 and step % args.val_loss_every == 0) if should_validate: torch.cuda.synchronize() @@ -995,7 +1484,6 @@ def lr_mul(step: int, elapsed_ms: float) -> float: ) torch.cuda.synchronize() t0 = time.perf_counter() - if last_step: if stop_after_step is not None and step < args.iterations: log0( @@ -1003,38 +1491,58 @@ def lr_mul(step: int, elapsed_ms: float) -> float: f"step:{step}/{args.iterations}" ) break - elapsed_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) scale = lr_mul(step, elapsed_ms) + if args.late_qat_threshold > 0 and scale < args.late_qat_threshold and not CastedLinear._qat_enabled: + CastedLinear._qat_enabled = True + log0(f"late_qat:enabled step:{step} scale:{scale:.4f}") + # Anneal soft-rounding temperature: hard for most of QAT, soft at the end + if CastedLinear._qat_enabled: + CastedLinear._soft_tau = 0.1 if scale < 0.02 else 1000.0 zero_grad_all() train_loss = torch.zeros((), device=device) for micro_step in range(grad_accum_steps): if distributed: model.require_backward_grad_sync = micro_step == grad_accum_steps - 1 x, y = train_loader.next_batch(args.train_batch_tokens, args.train_seq_len, grad_accum_steps) + if args.ttt_burst_enabled and scale < args.ttt_burst_trigger: + if not hasattr(train_loader, '_ttt_buffer'): + train_loader._ttt_buffer = [] + train_loader._ttt_buffer.append((x.detach().clone(), y.detach().clone())) + if len(train_loader._ttt_buffer) > args.ttt_burst_steps: + train_loader._ttt_buffer.pop(0) with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): loss = model(x, y) train_loss += loss.detach() (loss * grad_scale).backward() train_loss /= grad_accum_steps - frac = min(step / args.muon_momentum_warmup_steps, 1.0) if args.muon_momentum_warmup_steps > 0 else 1.0 muon_momentum = (1 - frac) * args.muon_momentum_warmup_start + frac * args.muon_momentum for group in optimizer_muon.param_groups: group["momentum"] = muon_momentum - for opt in optimizers: for group in opt.param_groups: group["lr"] = group["base_lr"] * scale - if args.grad_clip_norm > 0: torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) for opt in optimizers: opt.step() zero_grad_all() - + # EMA update + with torch.no_grad(): + for name, t in base_model.state_dict().items(): + ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay) step += 1 approx_training_time_ms = training_time_ms + 1000.0 * (time.perf_counter() - t0) + if args.swa_enabled and scale < 0.2 and step % args.swa_every == 0: + if swa_state is None: + swa_state = {name: 
t.detach().cpu().clone() for name, t in base_model.state_dict().items()} + swa_count = 1 + log0(f"swa:start step:{step}") + else: + for name, t in base_model.state_dict().items(): + swa_state[name] += t.detach().cpu() + swa_count += 1 should_log_train = ( args.train_log_every > 0 and (step <= 10 or step % args.train_log_every == 0 or stop_after_step is not None) @@ -1044,8 +1552,6 @@ def lr_mul(step: int, elapsed_ms: float) -> float: f"step:{step}/{args.iterations} train_loss:{train_loss.item():.4f} " f"train_time:{approx_training_time_ms:.0f}ms step_avg:{approx_training_time_ms / step:.2f}ms" ) - - # Needed to sync whether we've reached the wallclock cap. reached_cap = max_wallclock_ms is not None and approx_training_time_ms >= max_wallclock_ms if distributed and max_wallclock_ms is not None: reached_cap_tensor = torch.tensor(int(reached_cap), device=device) @@ -1053,74 +1559,174 @@ def lr_mul(step: int, elapsed_ms: float) -> float: reached_cap = bool(reached_cap_tensor.item()) if stop_after_step is None and reached_cap: stop_after_step = step - log0( f"peak memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB " f"reserved: {torch.cuda.max_memory_reserved() // 1024 // 1024} MiB" ) - - # ----------------------------- - # SERIALIZATION + ROUNDTRIP VALIDATION - # ----------------------------- - # Save the raw state (useful for debugging/loading in PyTorch directly), then always produce - # the compressed int8+zlib artifact and validate the round-tripped weights. - + # === TTT BURST: Late-stage sharpening on recent training data === + if args.ttt_burst_enabled and hasattr(train_loader, '_ttt_buffer') and len(train_loader._ttt_buffer) > 0: + ttt_buffer = train_loader._ttt_buffer + log0(f"ttt_burst:start epochs:{args.ttt_burst_epochs} buffer_size:{len(ttt_buffer)} lr_factor:{args.ttt_burst_lr_factor}") + ttt_lr_scale = args.ttt_burst_lr_factor + for ttt_epoch in range(args.ttt_burst_epochs): + ttt_epoch_loss = 0.0 + for ttt_i, (bx, by) in enumerate(ttt_buffer): + zero_grad_all() + for opt in optimizers: + for group in opt.param_groups: + group["lr"] = group["base_lr"] * ttt_lr_scale + with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True): + ttt_loss = model(bx, by) + (ttt_loss * grad_scale).backward() + if args.grad_clip_norm > 0: + torch.nn.utils.clip_grad_norm_(base_model.parameters(), args.grad_clip_norm) + for opt in optimizers: + opt.step() + zero_grad_all() + ttt_epoch_loss += ttt_loss.item() + with torch.no_grad(): + for name, t in base_model.state_dict().items(): + ema_state[name].mul_(ema_decay).add_(t.detach().float(), alpha=1.0 - ema_decay) + log0(f"ttt_burst:epoch:{ttt_epoch + 1}/{args.ttt_burst_epochs} avg_loss:{ttt_epoch_loss / len(ttt_buffer):.4f}") + log0("ttt_burst:done") + + # Apply EMA weights (better than SWA for this codebase per experiments) + log0("ema:applying EMA weights") + current_state = base_model.state_dict() + avg_state = {name: t.to(dtype=current_state[name].dtype) for name, t in ema_state.items()} + base_model.load_state_dict(avg_state, strict=True) + torch.cuda.synchronize() + t_diag = time.perf_counter() + diag_val_loss, diag_val_bpb = eval_val( + args, compiled_model, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + ) + torch.cuda.synchronize() + log0( + f"DIAGNOSTIC post_ema val_loss:{diag_val_loss:.4f} val_bpb:{diag_val_bpb:.4f} " + f"eval_time:{1000.0 * (time.perf_counter() - t_diag):.0f}ms" + ) + full_state_dict = base_model.state_dict() + 
export_sd = {k: v for k, v in full_state_dict.items() if "mtp_heads" not in k} + excluded_mtp = sum(int(t.numel()) for k, t in full_state_dict.items() if "mtp_heads" in k) + if excluded_mtp > 0: + log0(f"export_excluding_mtp_params:{excluded_mtp}") if master_process: - torch.save(base_model.state_dict(), "final_model.pt") + torch.save(export_sd, "final_model.pt") model_bytes = os.path.getsize("final_model.pt") code_bytes = len(code.encode("utf-8")) log0(f"Serialized model: {model_bytes} bytes") log0(f"Code size: {code_bytes} bytes") - log0(f"Total submission size: {model_bytes + code_bytes} bytes") - - quant_obj, quant_stats = quantize_state_dict_int8(base_model.state_dict()) + sd_cpu = {k: v.detach().cpu() for k, v in export_sd.items()} + quant_result, quant_meta = mixed_quantize_int6(sd_cpu, {"mlp", "attn"}) quant_buf = io.BytesIO() - torch.save(quant_obj, quant_buf) + torch.save({"w": quant_result, "m": quant_meta}, quant_buf) + # Save quantized model for fast eval-only iterations + if master_process: + torch.save({"quantized": quant_result, "meta": quant_meta}, "final_int6_model.pt") + log0(f"Saved quantized model to final_int6_model.pt") quant_raw = quant_buf.getvalue() - quant_blob = zlib.compress(quant_raw, level=9) - quant_raw_bytes = len(quant_raw) + quant_blob = zstandard.ZstdCompressor(level=22).compress(quant_raw) if _COMPRESSOR == "zstd" else zlib.compress(quant_raw, 9) if master_process: - with open("final_model.int8.ptz", "wb") as f: + with open("final_model.int6.ptz", "wb") as f: f.write(quant_blob) - quant_file_bytes = os.path.getsize("final_model.int8.ptz") + quant_file_bytes = len(quant_blob) code_bytes = len(code.encode("utf-8")) - ratio = quant_stats["baseline_tensor_bytes"] / max(quant_stats["int8_payload_bytes"], 1) - log0( - f"Serialized model int8+zlib: {quant_file_bytes} bytes " - f"(payload:{quant_stats['int8_payload_bytes']} raw_torch:{quant_raw_bytes} payload_ratio:{ratio:.2f}x)" - ) + log0(f"Serialized model int6+{_COMPRESSOR}: {quant_file_bytes} bytes") + log0(f"Total submission size int6+{_COMPRESSOR}: {quant_file_bytes + code_bytes} bytes") log0(f"Total submission size int8+zlib: {quant_file_bytes + code_bytes} bytes") - if distributed: dist.barrier() - with open("final_model.int8.ptz", "rb") as f: + with open("final_model.int6.ptz", "rb") as f: quant_blob_disk = f.read() - quant_state = torch.load(io.BytesIO(zlib.decompress(quant_blob_disk)), map_location="cpu") - base_model.load_state_dict(dequantize_state_dict_int8(quant_state), strict=True) + quant_state = torch.load( + io.BytesIO(zstandard.ZstdDecompressor().decompress(quant_blob_disk) if _COMPRESSOR == "zstd" else zlib.decompress(quant_blob_disk)), + map_location="cpu", + ) + deq_state = dequantize_mixed_int6(quant_state["w"], quant_state["m"], sd_cpu) + eval_model = GPT( + vocab_size=args.vocab_size, num_layers=args.num_layers, model_dim=args.model_dim, + num_heads=args.num_heads, num_kv_heads=args.num_kv_heads, mlp_mult=args.mlp_mult, + tie_embeddings=args.tie_embeddings, tied_embed_init_std=args.tied_embed_init_std, + logit_softcap=args.logit_softcap, rope_base=args.rope_base, qk_gain_init=args.qk_gain_init, + mtp_num_heads=0, mtp_loss_weight=0.0, + bigram_vocab_size=args.bigram_vocab_size, bigram_dim=args.bigram_dim, + xsa_last_n=args.xsa_last_n, # must match training model + rope_dims=args.rope_dims, ln_scale=args.ln_scale, dtg=args.dtg_enabled, + ve_enabled=args.ve_enabled, ve_dim=args.ve_dim, ve_layers=args.ve_layers, + use_vrl=args.vrl, + ).to(device).bfloat16() + for m in 
eval_model.modules(): + if isinstance(m, CastedLinear): + m.float() + restore_low_dim_params_to_fp32(eval_model) + eval_model.load_state_dict(deq_state, strict=True) + CastedLinear._qat_enabled = False + CastedLinear._soft_tau = 1000.0 + compiled_eval = torch.compile(eval_model, dynamic=False, fullgraph=True) torch.cuda.synchronize() t_qeval = time.perf_counter() q_val_loss, q_val_bpb = eval_val( - args, - model, - rank, - world_size, - device, - grad_accum_steps, - val_tokens, - base_bytes_lut, - has_leading_space_lut, - is_boundary_token_lut, + args, compiled_eval, rank, world_size, device, grad_accum_steps, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + eval_seq_len=effective_eval_seq_len, ) torch.cuda.synchronize() log0( - f"final_int8_zlib_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} " + f"final_int6_roundtrip val_loss:{q_val_loss:.4f} val_bpb:{q_val_bpb:.4f} " f"eval_time:{1000.0 * (time.perf_counter() - t_qeval):.0f}ms" ) - log0(f"final_int8_zlib_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}") - + log0(f"final_int6_roundtrip_exact val_loss:{q_val_loss:.8f} val_bpb:{q_val_bpb:.8f}") + sw_seq_len = effective_eval_seq_len + if args.eval_stride > 0 and args.eval_stride < sw_seq_len: + torch.cuda.synchronize() + t_slide = time.perf_counter() + sw_val_loss, sw_val_bpb = eval_val_sliding( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=args.eval_stride, + eval_seq_len=sw_seq_len, + ) + torch.cuda.synchronize() + log0( + f"final_int6_sliding_window val_loss:{sw_val_loss:.4f} val_bpb:{sw_val_bpb:.4f} " + f"stride:{args.eval_stride} eval_time:{1000.0 * (time.perf_counter() - t_slide):.0f}ms" + ) + log0(f"final_int6_sliding_window_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}") + log0(f"final_int8_zlib_roundtrip_exact val_loss:{sw_val_loss:.8f} val_bpb:{sw_val_bpb:.8f}") + if args.eval_stride != 64 and 64 < sw_seq_len: + torch.cuda.synchronize() + t_slide64 = time.perf_counter() + sw64_val_loss, sw64_val_bpb = eval_val_sliding( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=64, + eval_seq_len=sw_seq_len, + ) + torch.cuda.synchronize() + log0( + f"final_int6_sliding_window_s64 val_loss:{sw64_val_loss:.4f} val_bpb:{sw64_val_bpb:.4f} " + f"stride:64 eval_time:{1000.0 * (time.perf_counter() - t_slide64):.0f}ms" + ) + log0(f"final_int6_sliding_window_s64_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}") + log0(f"final_int8_zlib_roundtrip_exact val_loss:{sw64_val_loss:.8f} val_bpb:{sw64_val_bpb:.8f}") + # Legal score-first TTT (PR#461/549 recipe) + if args.ttt_enabled: + if distributed: + dist.barrier() + log0(f"ttt:start lr={args.ttt_lr} epochs={args.ttt_epochs} chunks={args.ttt_chunk_tokens}") + t_ttt = time.perf_counter() + ttt_val_loss, ttt_val_bpb = eval_val_sliding_ttt( + args, eval_model, rank, world_size, device, + val_tokens, base_bytes_lut, has_leading_space_lut, is_boundary_token_lut, + stride=64, batch_seqs=32, log_fn=log0, + ) + log0(f"ttt:elapsed={time.perf_counter() - t_ttt:.1f}s") + log0(f"final_ttt val_loss:{ttt_val_loss:.4f} val_bpb:{ttt_val_bpb:.4f}") + log0(f"final_ttt_exact val_loss:{ttt_val_loss:.8f} val_bpb:{ttt_val_bpb:.8f}") + if distributed: + dist.barrier() if distributed: dist.destroy_process_group() - - if __name__ == "__main__": main()
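+# ----------------------------------------------------------------------
+# Appendix: minimal sketches of the techniques implemented above. None of
+# these run as part of the script; names marked hypothetical are
+# illustrative stand-ins, not real Hyperparameters fields.
+
+# Score-first TTT, the invariant enforced by eval_val_sliding_ttt: each
+# chunk is scored with the current weights BEFORE any gradient step that
+# could have seen its tokens. `chunks` is a hypothetical list of (x, y)
+# batches and `loss_fn` a hypothetical mean-cross-entropy objective.
+def _score_first_ttt_sketch(model, chunks, optimizer, loss_fn):
+    total_nll, total_tokens = 0.0, 0
+    for ci, (x, y) in enumerate(chunks):
+        model.eval()
+        with torch.inference_mode():  # Phase 1: score before any update on this chunk
+            total_nll += loss_fn(model(x), y).item() * y.numel()
+        total_tokens += y.numel()
+        if ci < len(chunks) - 1:  # Phase 2: adapt on the chunk just scored
+            model.train()
+            optimizer.zero_grad(set_to_none=True)
+            loss_fn(model(x), y).backward()
+            optimizer.step()
+    return total_nll / total_tokens  # every token scored pre-update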
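+# The single global cosine LR curve applied across TTT chunks in Phase 2
+# of eval_val_sliding_ttt (one curve over all chunks rather than a
+# per-epoch reset); it starts at the base TTT LR and reaches 0 on the
+# final chunk, bounding late-chunk adaptation. E.g.
+# _ttt_chunk_lr(5e-4, 0, 100) == 5e-4 and _ttt_chunk_lr(5e-4, 99, 100) == 0.0.
+def _ttt_chunk_lr(base_lr: float, ci: int, num_chunks: int) -> float:
+    return base_lr * 0.5 * (1.0 + math.cos(math.pi * ci / max(num_chunks - 1, 1)))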
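+# The empirical-Bayes gradient scaling from the eb_ttt branch above: per
+# layer, SNR = mean(|g|) / (std(g) + eps). Layers whose gradient entries
+# agree in direction (high SNR) are amplified; noisy layers are shrunk
+# toward the pre-TTT prior. The multiplier is clamped to
+# [eb_ttt_min, eb_ttt_max].
+def _eb_layer_scale(grads: list[Tensor], lo: float, hi: float) -> float:
+    flat = torch.cat([g.float().flatten() for g in grads])
+    snr = (flat.abs().mean() / (flat.std() + 1e-8)).item()
+    return max(lo, min(hi, snr))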
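+# Round-trip check for the per-row int6 scheme: quantize_int6_per_row
+# searches clip percentiles and keeps the (q, scale) pair with the lowest
+# reconstruction MSE; dequantization is q * scale broadcast per row,
+# matching what dequantize_mixed_int6 does for 2-D weights.
+def _int6_roundtrip_mse(w: Tensor) -> float:
+    q, s = quantize_int6_per_row(w)           # int8 storage, values in [-31, 31]
+    recon = q.float() * s.float()[:, None]    # fp16 per-row scales (2-D path)
+    return (w.float() - recon).pow(2).mean().item()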
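+# The EMA tracked after every optimizer step in the training loop above,
+# with decay d = 0.997 as hardcoded there: ema <- d * ema + (1 - d) * w,
+# an effective averaging horizon of roughly 1 / (1 - d) ~ 333 steps. The
+# EMA weights replace the raw weights before quantization and export.
+def _ema_update(ema_state: dict[str, Tensor], model: nn.Module, decay: float = 0.997) -> None:
+    with torch.no_grad():
+        for name, t in model.state_dict().items():
+            ema_state[name].mul_(decay).add_(t.detach().float(), alpha=1.0 - decay)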