
Record*: val_bpb=0.978 BPB — Goldfish ML Autonomous Research (100ep Cosine *leaky* TTT)#517

Closed
lukacf wants to merge 1 commit into openai:main from lukacf:main

Conversation


@lukacf lukacf commented Mar 23, 2026

Edit: Record* = Yes, but uses suspect leaky TTT, so not a clean solution.

Summary

Technique

CosineAnnealingLR applied to the AdamW TTT optimizer, decaying the learning rate from 0.001 to 0.00001 over 100 epochs. This prevents the position-specific overfitting that limits constant-LR TTT to ~30 epochs. Three lines of code.
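The change above can be sketched as follows. This is a minimal stand-in, not the PR's actual `train_gpt.py`; the model, learning rates, and epoch count follow the numbers stated above, and the inner TTT update is elided:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(16, 16)  # stand-in for the TTT-adapted model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# The 3-line change: cosine-anneal the TTT learning rate
# from 1e-3 down to 1e-5 over the 100 TTT epochs.
sched = CosineAnnealingLR(opt, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... one test-time-training pass would go here ...
    opt.step()
    sched.step()
```

With `T_max=100`, the learning rate reaches `eta_min` exactly at the final epoch, so late TTT updates become very small instead of continuing to overfit position-specific statistics.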

Methodology

Experiments were run autonomously by an AI coding agent using Goldfish ML for experiment orchestration and provenance tracking. The agent executed the full research loop — hypothesis, implementation, launch, monitoring, analysis, iteration — without human intervention on the training code.

Seven experiments were completed in ~2 hours wall-clock time, progressing from baseline replication (1.085 BPB) through the cosine LR discovery (1.018) to the final result (0.978). Dead ends (weight decay, BigramHash scaling, Value Residual) were documented automatically. Full experiment lineage and compressed timeline in the submission README.

Files

  • train_gpt.py — standalone training script
  • train.log — full log (seed 1337)
  • submission.json — 3-seed results
  • README.md — detailed write-up with experiment timeline and provenance

3-seed mean: 0.9789 BPB (sliding window stride=64)
Best seed: 0.9779 (seed 7)
Std: 0.0015
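For reference, the sliding-window evaluation mentioned above (stride 64) can be sketched like this. `logprob_fn` is a hypothetical hook returning per-token natural-log probabilities for a token sequence; the repo's actual eval code may differ, and this returns bits per token (dividing total bits by total bytes instead would give BPB):

```python
import math

def sliding_window_bits_per_token(logprob_fn, tokens, window=256, stride=64):
    """Score `stride` new tokens per window, each conditioned on up to
    `window - stride` preceding tokens of context."""
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + stride]
        ctx_start = max(0, start - (window - stride))
        logps = logprob_fn(tokens[ctx_start:start + len(chunk)])
        total_nll -= sum(logps[-len(chunk):])  # nats for the new tokens only
        n_scored += len(chunk)
    return total_nll / n_scored / math.log(2)  # nats/token -> bits/token
```

The stride trades eval cost against context: smaller strides give every scored token more conditioning context but require more forward passes.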

Key innovation: Autonomous ML research methodology.
AI coding agent discovered cosine LR scaling for TTT in a single
2-hour session — 7 experiments from hypothesis to record.

Technical: CosineAnnealingLR over 100 TTT epochs (3-line change).
Architecture: PR openai#398/openai#442 base (11L, int6+zstd, 15.51MB).

lukacf commented Mar 23, 2026

Training time is 10 min, but I missed the 10 min eval time limit, which I now see this violates by a factor of 2. So it goes into the "non-leaderboard" bucket.

lolrazh added a commit to lolrazh/parameter-golf that referenced this pull request Mar 23, 2026
Remove obsolete experiment scripts, profiling tools, old run scripts,
and stale research docs. The project now builds on PR openai#512 (PROTEUS)
and PR openai#517 (Goldfish ML) as control scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lukacf commented Mar 23, 2026

I should add that the way TTT is used here is clearly suspect with respect to eval tokens leaking into training. Educational, but not a clean solution.

@lukacf lukacf changed the title Record: 0.978 BPB — Goldfish ML Autonomous Research (100ep Cosine TTT) Record*: val_bpb=0.978 BPB — Goldfish ML Autonomous Research (100ep Cosine *leaky* TTT) Mar 23, 2026

lukacf commented Mar 23, 2026

Closing PR (TTT not valid)

@lukacf lukacf closed this Mar 23, 2026