Ablation: remove tandem curriculum entirely #1693
Closed
Hypothesis
The tandem curriculum (lines 712-715) was introduced to help the model learn single-foil patterns before tandem complexity. However, the current implementation is bugged (detects ALL samples as tandem, effectively zeroing gradients for 10 epochs). The fix from PR #1674 was essentially neutral on val_loss (+0.0006) — suggesting the curriculum itself may not be beneficial.
This ablation removes the curriculum entirely. If the curriculum is not helping (or is slightly harmful due to the epoch-10 shock when tandem samples are suddenly reintroduced), removing it should improve training efficiency and potentially val_loss by giving the model full data from epoch 0.
Key insight: The current bugged curriculum zeros ALL gradients for 10 epochs. The fact that the model trains well despite this suggests those 10 epochs are largely wasted. Removing the curriculum gives the model 10 more productive training epochs.
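To make the hypothesized failure mode concrete, here is a minimal sketch of a tandem-curriculum gate with the reported bug. The actual `train.py` block (lines 712-715) is not reproduced in this report, so the names `is_tandem_buggy`, `curriculum_loss`, and `TANDEM_WARMUP_EPOCHS` are illustrative assumptions, not the real code:

```python
TANDEM_WARMUP_EPOCHS = 10  # warm-up length stated in the report

def is_tandem_buggy(sample_meta: dict) -> bool:
    # The reported bug: the detector flags EVERY sample as tandem, so the
    # mask below zeroes the whole batch loss for the first 10 epochs.
    return True

def curriculum_loss(per_sample_loss: list, sample_meta: list, epoch: int) -> float:
    """Average loss over non-tandem samples during warm-up, all samples after."""
    if epoch < TANDEM_WARMUP_EPOCHS:
        kept = [l for l, m in zip(per_sample_loss, sample_meta)
                if not is_tandem_buggy(m)]
        # With the buggy detector, `kept` is always empty, so the loss
        # (and therefore the gradient) contributed during warm-up is zero.
        return sum(kept) / max(len(kept), 1)
    return sum(per_sample_loss) / len(per_sample_loss)
```

Under this sketch, every batch in epochs 0-9 yields a zero loss, which is the "10 wasted epochs" argument above; removing the gate makes every epoch productive.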
Instructions
In `train.py`, delete the tandem curriculum block entirely (lines 712-715). Delete these 4 lines:
No other changes. Run with `--wandb_group noam-r23-remove-curriculum`.
Baseline
Results
W&B run: q5qmcsj6
Baseline: val_loss=0.8326 | in_dist=17.94 | ood_cond=13.98 | ood_re=27.54 | tandem=36.73
Delta: -0.0002 vs baseline (essentially neutral)
Peak memory: 18.2 GB
What happened:
Essentially neutral result. val/loss 0.8324 vs 0.8326 is indistinguishable from noise. Per-split: ood_cond surf_p improved (-0.58 Pa), tandem is slightly worse (+1.17 Pa), in_dist and ood_re are approximately equal.
This confirms the hypothesis that the tandem curriculum (in either its buggy or fixed form) has minimal effect on training outcomes. The model learns equally well from epoch 0 with full tandem data — the "wasted" 10 warm-up epochs don't meaningfully hurt (or help) final performance. The curriculum adds code complexity for no benefit and can be cleanly removed.
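The "essentially neutral" call above can be made mechanical. A minimal sketch of the comparison, where the noise threshold is an assumption (the report does not state one; in practice it should come from seed-to-seed variance):

```python
def classify_delta(run_val_loss: float, baseline_val_loss: float,
                   noise_threshold: float = 0.001) -> str:
    """Label an ablation result relative to the baseline val_loss.

    The 0.001 threshold is an illustrative assumption, not a value from
    the report.
    """
    delta = run_val_loss - baseline_val_loss
    if abs(delta) < noise_threshold:
        return "neutral"
    return "improved" if delta < 0 else "regressed"

# This run: val_loss 0.8324 vs baseline 0.8326 (delta -0.0002)
```

Applied to this run's numbers, `classify_delta(0.8324, 0.8326)` lands in the neutral band, matching the conclusion drawn here.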
Note: the vis-pipeline Fourier PE fix was also applied here to prevent a crash during visualization.
Suggested follow-ups: