
Non-record: Full GPTQ + XSA-4 + Score-First TTT (3-seed mean 1.1198) #734

Open
Robby955 wants to merge 1 commit into openai:main from Robby955:submission/2026-03-25_GPTQ_XSA4_TTT_1.1198

Conversation

@Robby955

Results (8xH100 SXM, PyTorch 2.9.1+cu128)

| Seed | Steps | ms/step | Post-TTT BPB | Artifact (bytes) |
|------|-------|---------|--------------|------------------|
| 1337 | 6,461 | 86.67   | 1.1193       | 15,899,061       |
| 42   | 6,457 | 86.73   | 1.1196       | 15,954,941       |
| 2025 | 6,457 | 86.74   | 1.1205       | 15,907,769       |

Mean: 1.1198 | Std: 0.0006
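The reported mean and std can be reproduced from the per-seed BPB values above; the std matches a sample (n−1 denominator) standard deviation:

```python
import statistics

# Post-TTT BPB for the three seeds (1337, 42, 2025), from the table above
bpb = [1.1193, 1.1196, 1.1205]

mean = statistics.fmean(bpb)   # 1.1198
std = statistics.stdev(bpb)    # sample std (ddof=1), ~0.0006

print(f"Mean: {mean:.4f} | Std: {std:.4f}")
```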

Compliance

  • Training: 560s training + 40s GPTQ calibration = 600s total (within 10-min budget)
  • GPTQ calibration: Uses training data, runs within training time budget (NOT during eval)
  • Eval: ~82s sliding window + ~236s score-first TTT = ~318s (within 10-min eval limit)
  • No training data accessed during evaluation
  • TTT: Score-first protocol — each chunk is scored under inference_mode() before the model adapts on it; no chunk is ever re-scored
  • Artifact: All seeds under 16,000,000 bytes
  • Script: 1,499 lines
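A minimal sketch of the score-first ordering claimed above (function names are illustrative stand-ins, not the submission's actual API): each chunk is scored with weights frozen, and only then does the model take adaptation steps on it, so no chunk is ever evaluated after the model has seen it.

```python
# Score-first test-time-training loop: evaluate each chunk BEFORE
# adapting on it, and never re-score a chunk afterwards.
def score_first_ttt(chunks, score, adapt):
    """score(chunk) -> loss under current (frozen) weights;
    adapt(chunk) -> in-place weight update (e.g. a few AdamW steps)."""
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))  # evaluate first, gradients disabled
        adapt(chunk)                 # only then update on this chunk
    return sum(losses) / len(losses)

# Demo with stand-in callables that just record the call order:
calls = []
avg = score_first_ttt(
    [1, 2, 3],
    score=lambda c: (calls.append(("score", c)), float(c))[1],
    adapt=lambda c: calls.append(("adapt", c)),
)
```

In the real protocol `score` would run under `torch.inference_mode()` and `adapt` would perform the AdamW steps on the unfrozen blocks; the structural guarantee is simply that `score(chunk)` always precedes `adapt(chunk)`.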

Key Techniques

  1. Full Hessian GPTQ — 256-batch calibration, Cholesky error compensation, act-order, block_size=128
  2. XSA on last 4 layers — cross-sequence attention for extended context
  3. SWA/EMA 50/50 blend — EMA(0.997) + tight SWA (every 50 steps during warmdown)
  4. Score-first TTT — AdamW(lr=1e-4), 3 epochs, freeze first 9/11 blocks, 128K-token chunks
  5. LZMA compression — better ratio than zstd for int6 weights
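Point 5 can be illustrated with the standard-library `lzma` module. This is an illustrative packing sketch, not the submission's actual artifact format (the int6 codes here are fake, and a real pipeline would bit-pack them rather than store one per byte):

```python
import lzma

# Fake int6 weight codes in 0..63, stored one per byte for simplicity.
codes = [i % 64 for i in range(10_000)]
raw = bytes(codes)

# LZMA at max preset; the PR reports a better ratio than zstd on int6 weights.
packed = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
restored = lzma.decompress(packed)

assert restored == raw
print(len(raw), "->", len(packed), "bytes")
```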

Architecture

11L, 512d, 8H/4KV GQA, LeakyReLU(0.5)² MLP 3x, BigramHash(3072×128), VE128, Partial RoPE 16/64, LN Scale, U-Net skips, SmearGate.
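The summary line above can be unpacked into a config sketch (field names are hypothetical; only the numbers come from the PR):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int = 11        # 11L
    d_model: int = 512        # 512d
    n_heads: int = 8          # 8H query heads
    n_kv_heads: int = 4       # 4KV heads -> GQA, 2 query heads per KV head
    mlp_ratio: int = 3        # MLP hidden = 3x d_model
    rope_dims: int = 16       # partial RoPE on 16 of the 64 dims per head
    bigram_hash_size: int = 3072
    bigram_hash_dim: int = 128  # BigramHash(3072x128)

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads  # 512 / 8 = 64

cfg = ModelConfig()
assert cfg.head_dim == 64
assert cfg.n_heads % cfg.n_kv_heads == 0
```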

Credits

🤖 Generated with Claude Code


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
