records/track_10min_16mb/2026-03-30_V18_FusedTritonOp/README.md (82 additions, 0 deletions)
# Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt (val_bpb 1.1126)

**val_bpb: 1.1126** (3-seed mean, std 0.0003) | **~15.98 MB** | 8xH100 SXM, 600s train, ~120s eval

Built on [PR #1089](https://github.com/openai/parameter-golf/pull/1089) by @mikeapedia. Fused Triton MLP architecture from [PR #1072](https://github.com/openai/parameter-golf/pull/1072) by @vimeto, forward-only fusion insight from [PR #1105](https://github.com/openai/parameter-golf/pull/1105) by @abaybektursun.

## Results (8xH100 SXM, SWA applied, no TTT)

| Seed | Sliding BPB | val_loss (nats) | Artifact |
|------|-------------|-----------------|----------|
| 1337 | **1.1126** | 1.87857 | 15,981,856 |
| 42 | **1.1123** | 1.87803 | 15,984,349 |
| 999 | **1.1129** | 1.87900 | 15,985,912 |
| **Mean +/- Std** | **1.1126 +/- 0.0003** | **1.87853** | |

vs merged leaderboard SOTA ([PR #549](https://github.com/openai/parameter-golf/pull/549), 1.1194 BPB, 1.89002 nats): **-0.01149 nats** (-0.0068 BPB). Note: among still-open PRs, #1089 (1.1091) scores better than this run, and #1105 (1.1138) also beats the merged SOTA.

## What's New vs PR #1089

### 1. GPTQ Reserve Optimization
Reduced the GPTQ calibration reserve from 14s to 9s. Calibration consistently completes in ~8.4s across all runs, so the 14s default wastes over 5 seconds of training budget; the smaller reserve recovers ~55 extra training steps at ~105ms/step.

### 2. Forward-Only Fused Triton MLP Kernel Architecture
Designed a `torch.library.triton_op`-based fused kernel for `matmul + LeakyReLU(0.3) + square` with standard PyTorch backward (cuBLAS matmuls + elementwise ops). This architecture addresses two known issues:
- PR #1072's `torch.autograd.Function` crashes `torch.compile(fullgraph=True)` due to FakeTensor data pointer access
- PR #1105 showed Triton backward forces eager mode (2.7x slower)

Our solution: `triton_op` + `wrap_triton` for compile-safe forward, `register_autograd` with standard ops for backward. The kernel code is included but **hard-disabled** — it produces NaN on PyTorch 2.9 due to a TTIR analysis bug. The scored runs use the standard MLP path. This is included as experimental code for future work.

### 3. Centralized Activation Parameters
All `negative_slope` references unified via `_NEGATIVE_SLOPE = 0.3` constant with derived `_SLOPE_SQ = _NEGATIVE_SLOPE ** 2`.

## Architecture (from PR #1089)

- 11L, 512d, 8H/4KV (GQA), MLP 3.5x LeakyReLU(0.3)^2
- Turbo-Muon optimizer (AOL preconditioning + Polar Express coefficients + row_col normalization, 4 Newton-Schulz iterations)
- EngramLite hash embeddings (bigram + trigram, 2 heads, 8192 buckets)
- Parameter Banking (3D bank tensors for batched Newton-Schulz via torch.bmm)
- U-Net sigmoid-gated skip connections + ValueEmbedding (layers 9-10)
- SmearGate, Partial RoPE(16), LN Scale
- SWA (threshold=0.2, every 50 steps, 14 snapshots) + EMA(0.997) fallback
- Mixed-precision GPTQ: int5 base + selective int6/int7 promotion by Hessian sensitivity
- Brotli-11 + byte-shuffle compression
- F.scaled_dot_product_attention (auto-selects FA3 backend)
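To illustrate the Parameter Banking bullet above: stacking same-shaped weight matrices into one 3D bank lets a single `torch.bmm`-based Newton-Schulz pass orthogonalize every slot at once. This sketch uses the widely published Muon quintic coefficients; the PR uses Polar Express coefficients, which differ, and the function name is illustrative.

```python
import torch

# Batched Newton-Schulz orthogonalization over a 3D bank of weight
# matrices, shape (num_banked, m, n). Coefficients (a, b, c) are the
# standard Muon quintic, NOT the Polar Express schedule used in the PR.
def banked_newton_schulz(bank: torch.Tensor, iters: int = 4) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize each slot by its Frobenius norm so singular values start < 1.
    X = bank / (bank.norm(dim=(1, 2), keepdim=True) + 1e-7)
    transposed = X.shape[1] > X.shape[2]
    if transposed:                      # keep the small dimension first
        X = X.mT
    for _ in range(iters):
        A = torch.bmm(X, X.mT)          # one batched matmul for all slots
        B = b * A + c * torch.bmm(A, A)
        X = a * X + torch.bmm(B, X)
    return X.mT if transposed else X
```

The point of banking is that the iteration cost is amortized: four `bmm` calls per step cover every banked matrix, instead of launching separate Newton-Schulz loops per parameter.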
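The byte-shuffle step in the compression bullet above groups the i-th byte of every element together, so that (for example) fp16 exponent bytes form long similar runs that Brotli compresses well. A minimal round-trip sketch (helper names are illustrative, and the real pipeline feeds the shuffled bytes to Brotli-11):

```python
import numpy as np

def byte_shuffle(a: np.ndarray) -> bytes:
    # Rows of the transposed view are "byte planes": row i holds byte i
    # of every element, which compressors handle much better than
    # interleaved element bytes.
    return a.view(np.uint8).reshape(-1, a.itemsize).T.tobytes()

def byte_unshuffle(buf: bytes, dtype, n: int) -> np.ndarray:
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(buf, np.uint8).reshape(itemsize, n)
    # Transpose back to element-major order and reinterpret.
    return np.ascontiguousarray(planes.T).view(dtype).reshape(-1)
```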

## Timing

| Phase | Time |
|-------|------|
| Training (~5,668 steps @ 104ms) | 591s |
| GPTQ calibration + quantization | 9s (reserved) |
| Sliding window eval (stride=64) | ~120s |
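A quick arithmetic check on the table above, using only numbers stated in this README:

```python
# Pure stepping time vs reported wall time, and total vs the 600s budget.
steps, step_ms = 5_668, 104
stepping_s = steps * step_ms / 1000   # ~589.5s of pure optimizer steps
assert stepping_s <= 591              # fits inside the reported 591s
assert 591 + 9 <= 600                 # training + GPTQ reserve within budget
```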

## Reproduction

```bash
# Use official template: runpod/parameter-golf:latest (PyTorch 2.9.1+cu128)
# Or any 8xH100 SXM pod with PyTorch >= 2.6

pip install brotli sentencepiece
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

GPTQ_RESERVE_MS=9000 SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Rule Compliance

- [x] Standard F.cross_entropy scoring (softmax, sum=1)
- [x] No eval-time training data access
- [x] Artifact < 16,000,000 bytes (all 3 seeds)
- [x] Training < 600s, eval < 600s
- [x] Causal sliding-window evaluation on full validation split (stride=64)
- [x] 3-seed verification: delta = -0.01149 nats vs SOTA (> 0.005 threshold)
- [x] No n-gram caching, no external downloads during eval

## Credits

- **Turbo-Muon + EngramLite + ParamBanking**: [PR #1089](https://github.com/openai/parameter-golf/pull/1089) by @mikeapedia
- **Fused Triton MLP kernel design**: [PR #1072](https://github.com/openai/parameter-golf/pull/1072) by @vimeto
- **Forward-only fusion insight**: [PR #1105](https://github.com/openai/parameter-golf/pull/1105) by @abaybektursun
- **Base scaffold**: [PR #549](https://github.com/openai/parameter-golf/pull/549) by @abaybektursun
torch>=2.6.0
brotli
sentencepiece
#!/bin/bash
# V17 RunPod Setup — PR #1089 (TurboMuon) + PR #1072 (Fused Triton Kernel)
# USAGE:
# bash runpod_setup.sh # Setup (PyTorch upgrade, deps)
# bash runpod_setup.sh run # Run training
set -e

if [ "$1" = "run" ]; then
    # ---- RUN MODE ----
    echo "=== V17 FusedTurboMuon ==="
    echo "Config: SEED=${SEED:-1337} GPTQ_RESERVE_MS=${GPTQ_RESERVE_MS:-9000}"
    echo "Starting in 3s..."
    sleep 3
    GPTQ_RESERVE_MS=${GPTQ_RESERVE_MS:-9000} \
    SEED=${SEED:-1337} \
    torchrun --standalone --nproc_per_node=8 train_gpt.py
    exit 0
fi

# ---- SETUP MODE ----
echo "============================================="
echo " V17 FUSED TURBOMUON — POD SETUP"
echo "============================================="

# 1. Check CUDA driver
DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
echo "CUDA Driver: $DRIVER"

# 2. Check current PyTorch
CURRENT_PT=$(python3 -c "import torch; print(torch.__version__)" 2>/dev/null || echo "none")
echo "Current PyTorch: $CURRENT_PT"

# 3. Install deps (brotli required for compression, sentencepiece for tokenizer)
pip install brotli sentencepiece 2>&1 | tail -2

# 4. Install FA3 for SDPA backend acceleration (30-second wheel install)
python3 -c "from flash_attn_interface import flash_attn_func" 2>/dev/null && echo "FA3: already installed" || {
    echo "Installing FA3 pre-built wheel..."
    pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291 2>&1 | tail -3
    python3 -c "from flash_attn_interface import flash_attn_func; print('FA3: OK')" 2>/dev/null || echo "FA3: not available (SDPA will use FA2 or math backend)"
}

# 5. Symlink data if needed
[ -L data ] || [ -d data ] || ln -sf /workspace/data data
[ -d data/datasets/fineweb10B_sp1024 ] && echo "Data: OK" || echo "WARNING: Data not found at data/datasets/fineweb10B_sp1024"
[ -f data/tokenizers/fineweb_1024_bpe.model ] && echo "Tokenizer: OK" || echo "WARNING: Tokenizer not found"

# 6. Check Triton
python3 -c "
import torch
print(f'PyTorch {torch.__version__}, CUDA {torch.version.cuda}')
try:
    import triton
    from triton.tools.tensor_descriptor import TensorDescriptor
    print(f'Triton {triton.__version__} + TensorDescriptor: OK → Fused MLP kernel ENABLED')
except Exception as e:
    print(f'Triton not available: {e} → Standard MLP path (slower)')
"

echo ""
echo "============================================="
echo " SETUP COMPLETE"
echo "============================================="
echo ""
echo " V17 Stack:"
echo " Turbo-Muon + EngramLite + Parameter Banking (PR #1089)"
echo " Fused Triton MLP kernel (PR #1072, if Triton available)"
echo " Mixed-precision GPTQ int5/int6/int7 + Brotli compression"
echo " GPTQ reserve optimized to 9s (from 14s default)"
echo ""
echo "RUN COMMANDS:"
echo ""
echo " # Single seed test:"
echo " SEED=1337 bash runpod_setup.sh run"
echo ""
echo " # 3-seed submission:"
echo " for S in 1337 42 999; do"
echo " SEED=\$S bash runpod_setup.sh run | tee run_seed\$S.log"
echo " done"
echo "============================================="
{
"name": "Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Optimization",
"val_bpb": 1.1126,
"bytes_total": 15985912,
"blurb": "PR #1089 stack (Turbo-Muon, EngramLite, Parameter Banking, mixed GPTQ, brotli) with GPTQ reserve optimization (14s to 9s, +55 training steps). Includes experimental fused Triton MLP kernel architecture (disabled, pending PT2.11 compat). 3-seed mean: 1.1126 (std 0.0003). Built on PR #1089.",
"author": "Bortlesboat",
"github_id": "Bortlesboat",
"date": "2026-03-30"
}