records/track_10min_16mb/2026-03-30_V18_FusedTritonOp/README.md (82 additions, 0 deletions)
# Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Opt (val_bpb 1.1126)

**val_bpb: 1.1126** (3-seed mean, std 0.0003) | **~15.98 MB** | 8xH100 SXM, 600s train, ~120s eval

Built on [PR #1089](https://github.com/openai/parameter-golf/pull/1089) by @mikeapedia. Fused Triton MLP architecture from [PR #1072](https://github.com/openai/parameter-golf/pull/1072) by @vimeto, forward-only fusion insight from [PR #1105](https://github.com/openai/parameter-golf/pull/1105) by @abaybektursun.

## Results (8xH100 SXM, SWA applied, no TTT)

| Seed | Sliding BPB | val_loss (nats) | Artifact |
|------|-------------|-----------------|----------|
| 1337 | **1.1126** | 1.87857 | 15,981,856 |
| 42 | **1.1123** | 1.87803 | 15,984,349 |
| 999 | **1.1129** | 1.87900 | 15,985,912 |
| **Mean +/- Std** | **1.1126 +/- 0.0003** | **1.87853** | |

vs merged leaderboard SOTA ([PR #549](https://github.com/openai/parameter-golf/pull/549), 1.1194 BPB, 1.89002 nats): **-0.01149 nats** (-0.0068 BPB). Note: among still-open PRs, #1089 (1.1091) scores better than this run, and #1105 (1.1138) also beats the merged SOTA.

## What's New vs PR #1089

### 1. GPTQ Reserve Optimization
Reduced the GPTQ calibration reserve from 14s to 9s. Calibration consistently completes in ~8.4s across all runs, so the 14s default wastes over 5 seconds of training budget; the smaller reserve recovers ~55 extra training steps at ~105ms/step.

### 2. Forward-Only Fused Triton MLP Kernel Architecture
Designed a `torch.library.triton_op`-based fused kernel for `matmul + LeakyReLU(0.3) + square` with standard PyTorch backward (cuBLAS matmuls + elementwise ops). This architecture addresses two known issues:
- PR #1072's `torch.autograd.Function` crashes `torch.compile(fullgraph=True)` due to FakeTensor data pointer access
- PR #1105 showed Triton backward forces eager mode (2.7x slower)

Our solution: `triton_op` + `wrap_triton` for compile-safe forward, `register_autograd` with standard ops for backward. The kernel code is included but **hard-disabled** — it produces NaN on PyTorch 2.9 due to a TTIR analysis bug. The scored runs use the standard MLP path. This is included as experimental code for future work.

### 3. Centralized Activation Parameters
All `negative_slope` references unified via `_NEGATIVE_SLOPE = 0.3` constant with derived `_SLOPE_SQ = _NEGATIVE_SLOPE ** 2`.

## Architecture (from PR #1089)

- 11L, 512d, 8H/4KV (GQA), MLP 3.5x LeakyReLU(0.3)^2
- Turbo-Muon optimizer (AOL preconditioning + Polar Express coefficients + row_col normalization, 4 Newton-Schulz iterations)
- EngramLite hash embeddings (bigram + trigram, 2 heads, 8192 buckets)
- Parameter Banking (3D bank tensors for batched Newton-Schulz via torch.bmm)
- U-Net sigmoid-gated skip connections + ValueEmbedding (layers 9-10)
- SmearGate, Partial RoPE(16), LN Scale
- SWA (threshold=0.2, every 50 steps, 14 snapshots) + EMA(0.997) fallback
- Mixed-precision GPTQ: int5 base + selective int6/int7 promotion by Hessian sensitivity
- Brotli-11 + byte-shuffle compression
- F.scaled_dot_product_attention (auto-selects FA3 backend)
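To illustrate the Parameter Banking bullet above: stacking same-shaped weight matrices into one 3D bank lets a single `torch.bmm`-based Newton-Schulz pass orthogonalize every slot at once. This sketch uses the widely published Muon quintic coefficients; the PR uses Polar Express coefficients, which differ, and the function name is illustrative.

```python
import torch

# Batched Newton-Schulz orthogonalization over a 3D bank of weight
# matrices, shape (num_banked, m, n). Coefficients (a, b, c) are the
# standard Muon quintic, NOT the Polar Express schedule used in the PR.
def banked_newton_schulz(bank: torch.Tensor, iters: int = 4) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315
    # Normalize each slot by its Frobenius norm so singular values start < 1.
    X = bank / (bank.norm(dim=(1, 2), keepdim=True) + 1e-7)
    transposed = X.shape[1] > X.shape[2]
    if transposed:                      # keep the small dimension first
        X = X.mT
    for _ in range(iters):
        A = torch.bmm(X, X.mT)          # one batched matmul for all slots
        B = b * A + c * torch.bmm(A, A)
        X = a * X + torch.bmm(B, X)
    return X.mT if transposed else X
```

The point of banking is that the iteration cost is amortized: four `bmm` calls per step cover every banked matrix, instead of launching separate Newton-Schulz loops per parameter.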
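The byte-shuffle step in the compression bullet above groups the i-th byte of every element together, so that (for example) fp16 exponent bytes form long similar runs that Brotli compresses well. A minimal round-trip sketch (helper names are illustrative, and the real pipeline feeds the shuffled bytes to Brotli-11):

```python
import numpy as np

def byte_shuffle(a: np.ndarray) -> bytes:
    # Rows of the transposed view are "byte planes": row i holds byte i
    # of every element, which compressors handle much better than
    # interleaved element bytes.
    return a.view(np.uint8).reshape(-1, a.itemsize).T.tobytes()

def byte_unshuffle(buf: bytes, dtype, n: int) -> np.ndarray:
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(buf, np.uint8).reshape(itemsize, n)
    # Transpose back to element-major order and reinterpret.
    return np.ascontiguousarray(planes.T).view(dtype).reshape(-1)
```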

## Timing

| Phase | Time |
|-------|------|
| Training (~5,668 steps @ 104ms) | 591s |
| GPTQ calibration + quantization | 9s (reserved) |
| Sliding window eval (stride=64) | ~120s |
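A quick arithmetic check on the table above, using only numbers stated in this README:

```python
# Pure stepping time vs reported wall time, and total vs the 600s budget.
steps, step_ms = 5_668, 104
stepping_s = steps * step_ms / 1000   # ~589.5s of pure optimizer steps
assert stepping_s <= 591              # fits inside the reported 591s
assert 591 + 9 <= 600                 # training + GPTQ reserve within budget
```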

## Reproduction

```bash
# Use official template: runpod/parameter-golf:latest (PyTorch 2.9.1+cu128)
# Or any 8xH100 SXM pod with PyTorch >= 2.6

pip install brotli sentencepiece
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

GPTQ_RESERVE_MS=9000 SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Rule Compliance

- [x] Standard F.cross_entropy scoring (softmax, sum=1)
- [x] No eval-time training data access
- [x] Artifact < 16,000,000 bytes (all 3 seeds)
- [x] Training < 600s, eval < 600s
- [x] Causal sliding-window evaluation on full validation split (stride=64)
- [x] 3-seed verification: delta = -0.01149 nats vs SOTA (> 0.005 threshold)
- [x] No n-gram caching, no external downloads during eval

## Credits

- **Turbo-Muon + EngramLite + ParamBanking**: [PR #1089](https://github.com/openai/parameter-golf/pull/1089) by @mikeapedia
- **Fused Triton MLP kernel design**: [PR #1072](https://github.com/openai/parameter-golf/pull/1072) by @vimeto
- **Forward-only fusion insight**: [PR #1105](https://github.com/openai/parameter-golf/pull/1105) by @abaybektursun
- **Base scaffold**: [PR #549](https://github.com/openai/parameter-golf/pull/549) by @abaybektursun
torch>=2.6.0
brotli
sentencepiece
#!/bin/bash
# V17 RunPod Setup — PR #1089 (TurboMuon) + PR #1072 (Fused Triton Kernel)
# USAGE:
# bash runpod_setup.sh # Setup (PyTorch upgrade, deps)
# bash runpod_setup.sh run # Run training
set -e

if [ "$1" = "run" ]; then
    # ---- RUN MODE ----
    echo "=== V17 FusedTurboMuon ==="
    echo "Config: SEED=${SEED:-1337} GPTQ_RESERVE_MS=${GPTQ_RESERVE_MS:-9000}"
    echo "Starting in 3s..."
    sleep 3
    GPTQ_RESERVE_MS=${GPTQ_RESERVE_MS:-9000} \
    SEED=${SEED:-1337} \
    torchrun --standalone --nproc_per_node=8 train_gpt.py
    exit 0
fi

# ---- SETUP MODE ----
echo "============================================="
echo " V17 FUSED TURBOMUON — POD SETUP"
echo "============================================="

# 1. Check CUDA driver
DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
echo "CUDA Driver: $DRIVER"

# 2. Check current PyTorch
CURRENT_PT=$(python3 -c "import torch; print(torch.__version__)" 2>/dev/null || echo "none")
echo "Current PyTorch: $CURRENT_PT"

# 3. Install deps (brotli required for compression, sentencepiece for tokenizer)
pip install brotli sentencepiece 2>&1 | tail -2

# 4. Install FA3 for SDPA backend acceleration (30-second wheel install)
python3 -c "from flash_attn_interface import flash_attn_func" 2>/dev/null && echo "FA3: already installed" || {
    echo "Installing FA3 pre-built wheel..."
    pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291 2>&1 | tail -3
    python3 -c "from flash_attn_interface import flash_attn_func; print('FA3: OK')" 2>/dev/null || echo "FA3: not available (SDPA will use FA2 or math backend)"
}

# 5. Symlink data if needed
[ -L data ] || [ -d data ] || ln -sf /workspace/data data
[ -d data/datasets/fineweb10B_sp1024 ] && echo "Data: OK" || echo "WARNING: Data not found at data/datasets/fineweb10B_sp1024"
[ -f data/tokenizers/fineweb_1024_bpe.model ] && echo "Tokenizer: OK" || echo "WARNING: Tokenizer not found"

# 6. Check Triton
python3 -c "
import torch
print(f'PyTorch {torch.__version__}, CUDA {torch.version.cuda}')
try:
    import triton
    from triton.tools.tensor_descriptor import TensorDescriptor
    print(f'Triton {triton.__version__} + TensorDescriptor: OK → Fused MLP kernel ENABLED')
except Exception as e:
    print(f'Triton not available: {e} → Standard MLP path (slower)')
"

echo ""
echo "============================================="
echo " SETUP COMPLETE"
echo "============================================="
echo ""
echo " V17 Stack:"
echo " Turbo-Muon + EngramLite + Parameter Banking (PR #1089)"
echo " Fused Triton MLP kernel (PR #1072, if Triton available)"
echo " Mixed-precision GPTQ int5/int6/int7 + Brotli compression"
echo " GPTQ reserve optimized to 9s (from 14s default)"
echo ""
echo "RUN COMMANDS:"
echo ""
echo " # Single seed test:"
echo " SEED=1337 bash runpod_setup.sh run"
echo ""
echo " # 3-seed submission:"
echo " for S in 1337 42 999; do"
echo " SEED=\$S bash runpod_setup.sh run | tee run_seed\$S.log"
echo " done"
echo "============================================="
{
"name": "Turbo-Muon + EngramLite + ParamBanking + GPTQ Reserve Optimization",
"val_bpb": 1.1126,
"bytes_total": 15985912,
"blurb": "PR #1089 stack (Turbo-Muon, EngramLite, Parameter Banking, mixed GPTQ, brotli) with GPTQ reserve optimization (14s to 9s, +55 training steps). Includes experimental fused Triton MLP kernel architecture (disabled, pending PT2.11 compat). 3-seed mean: 1.1126 (std 0.0003). Built on PR #1089.",
"author": "Bortlesboat",
"github_id": "Bortlesboat",
"date": "2026-03-30"
}