Conversation
11L/512d U-Net + legal score-first 5-gram eval interpolation. Inspired by @deanbrr's n-gram cache technique (PR openai#659).

3-seed results:
- seed 1337: 1.0451 (15.63MB)
- seed 42: 1.0471 (15.59MB)
- seed 2045: 1.0460 (15.64MB)
- mean: 1.0461

Run:

```
SEED=2045 MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5 \
XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536 ROPE_DIMS=24 \
NGRAM_EVAL_ORDER=5 NGRAM_EVAL_ALPHA=0.20 \
torchrun --nproc_per_node=8 train_gpt.py
```

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I quite like the 5-gram EvalCache idea and encourage you to resubmit a valid run with it, but this submission has no submission.json or train logs, so I can't validate it, and it looks like it does GPTQ at eval time with training data, which is disallowed. However, if you removed the GPTQ and submitted a proper .json and training logs, I think the 5-gram EvalCache idea is at least directionally legal; I'd have to see the final code on a valid submission to make sure, though. Closing for now, but please do resubmit!

A concern I have with the current caching setup is that it sort of assumes we know what the next token is and only calculates the p_blended for that token -- in a real generation setting, we don't know what the correct token should be, so this wouldn't hold. That being said, afaict this would only give an efficiency improvement (I think it spares you from having to calculate it over the whole vocab?), so as long as an updated version still fits within the eval time, I think the score should be quite similar. Will think about this more if you resubmit a similar run.
I can't reopen this and just upload my times, so now I have an older PR? Why did you close this, dude? It should have been just a comment. You could have asked to see the logs before you closed this PR and lost my time submissions.
Results
Progression
Key Addition
Legal, score-first hashed 5-gram interpolation during sliding-window eval. Fixed-weight linear mixing (alpha = 0.20), no target-aware gating. The cache is built from already-scored tokens only, making it strictly backward-looking.
Inspired by and credited to @deanbrr (PR #659) for the n-gram eval cache concept.
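A minimal sketch of the score-first blending described above. The cache structure and function names here are illustrative assumptions, not the PR's actual code; the real implementation (hashing, sliding-window handling) lives in train_gpt.py.

```python
from collections import defaultdict

ORDER = 5      # 5-gram: a 4-token context predicts the next token
ALPHA = 0.20   # fixed interpolation weight (NGRAM_EVAL_ALPHA)

# counts[context][token] -> occurrences; populated only from tokens
# that have already been scored, so the cache is strictly backward-looking
counts = defaultdict(lambda: defaultdict(int))

def update_cache(tokens):
    """Insert already-scored tokens into the 5-gram cache."""
    for i in range(ORDER - 1, len(tokens)):
        ctx = tuple(tokens[i - ORDER + 1 : i])
        counts[ctx][tokens[i]] += 1

def blended_prob(p_model, ctx, token):
    """Fixed-weight linear mix of model and empirical 5-gram probability."""
    bucket = counts.get(tuple(ctx))
    if not bucket:
        return p_model  # no cache evidence: fall back to the model alone
    p_ngram = bucket[token] / sum(bucket.values())
    return (1 - ALPHA) * p_model + ALPHA * p_ngram
```

Because the cache only ever sees tokens that were already scored, blending cannot leak target information; this is the property the reviewer flags as "directionally legal".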
Architecture
11L/512d U-Net, 26.93M params. LeakyReLU² (slope 0.5), XSA last 4, BigramHash 1536. GPTQ int6+zstd, late QAT.
Reproduce
8xH100 SXM, 600s training + ~190s eval.