Proposed experiment procedure:
- pick a list of `perturb_step: [0, 78, 391, 1955, 9775]`
- for each step, do 20 (?TODO) runs of `perturb` with the smallest `perturb_scale` that has non-zero excess loss (i.e. around `1e-12`). Take the 95th (?TODO) percentile of the excess loss as `logreg_threshold`.
- run `logreg` to find the `perturb_scale` at which excess loss exceeds the `logreg_threshold` 50% of the time.
- plot `perturb_step` (x) vs `perturb_scale` (y) (?TODO issue: this plot doesn't compare different `logreg_threshold` values over time, whereas plotting `perturb_scale` (x) vs excess loss (y) does)
- TODO plot same but with excess loss mod perm
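The procedure above can be sketched end-to-end. Everything here is a stand-in: `run_perturbation` fakes the excess loss of a real perturbed run with log-normal noise, and the logistic regression is a minimal stdlib gradient-ascent fit; only the step list, the 20-run count, the ~`1e-12` smallest scale, and the 95th-percentile threshold come from the notes.

```python
import math
import random

random.seed(0)

def run_perturbation(step, scale):
    """Hypothetical stand-in for one perturbed-training run: returns the
    excess loss after an L2 perturbation of size `scale` at checkpoint
    `step` (modeled here as log-normal run-to-run noise)."""
    return scale * math.exp(random.gauss(0.0, 0.5))

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def percentile(xs, q):
    # rough empirical percentile (no interpolation)
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(q / 100 * len(xs)))]

def logreg_midpoint(ts, ys, lr=1.0, iters=2000):
    """Fit p = sigmoid(a*t + b) by gradient ascent on the log-likelihood
    and return the t at which p = 0.5 (the 50% crossing)."""
    a, b = 0.0, 0.0
    n = len(ts)
    for _ in range(iters):
        ga = gb = 0.0
        for t, y in zip(ts, ys):
            p = sigmoid(a * t + b)
            ga += (y - p) * t
            gb += (y - p)
        a += lr * ga / n
        b += lr * gb / n
    return -b / a

perturb_steps = [0, 78, 391, 1955, 9775]
n_runs = 20                    # (?TODO) in the notes
log_lo, log_hi = -14.0, -8.0   # log10 range of probed perturb_scales

results = {}
for step in perturb_steps:
    # 1) repeated runs at the smallest scale with non-zero excess loss (~1e-12)
    baseline = [run_perturbation(step, 1e-12) for _ in range(n_runs)]
    logreg_threshold = percentile(baseline, 95)   # 95th (?TODO) percentile

    # 2) probe a log-spaced grid of perturb_scales, record threshold exceedance
    ts = [i / 24 for i in range(25)]              # normalized log-scale in [0, 1]
    ys = [1.0 if run_perturbation(step, 10 ** (log_lo + t * (log_hi - log_lo)))
                 > logreg_threshold else 0.0
          for t in ts]

    # 3) logreg: the perturb_scale at which excess loss exceeds the
    #    threshold 50% of the time
    t_mid = logreg_midpoint(ts, ys)
    results[step] = 10 ** (log_lo + t_mid * (log_hi - log_lo))

for step, scale in sorted(results.items()):
    print(step, scale)
```

The final `results` dict is exactly the data for the proposed `perturb_step` (x) vs `perturb_scale` (y) plot.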
List of experiments to run:
Targeting <= 0.15 train CE
Per-layer perturbations:
Training (original butterfly experiment, but controlling perturbation L2, not perturbing norm layers, etc.):
- `--group=reference --lr=0.001 --warmup_ratio=0.02 --weight_decay=0 --training_steps=25000`

Architecture effects, controlling for number of parameters:
- `--group=arch-wideshallow --model_name=resnet8-64 --warmup_ratio=0.02 --training_steps=25000`
- `--group=arch-narrowdeep --model_name=resnet34-16 --warmup_ratio=0.02 --training_steps=40000`

Hparam effects:
- `--group=lr-0.01 --lr=0.01 --training_steps=45000`
- `--group=bs-32 --batch_size=32 --training_steps=85000`
- `--group=bs-512 --batch_size=512 --training_steps=15000`
- `--group=warmup-10x --warmup_ratio=0.2 --training_steps=25000`
- `--group=decay-0.0001 --weight_decay=0.0001 --training_steps=20000`
- `--group=schedule-constant --lr_scheduler=constant --training_steps=35000`
- `--group=opt-adamw --lr=0.003 --optimizer=adamw --training_steps=20000`

Finetuning from various partially trained checkpoints (moved to issue #12):
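The flag groups above can be enumerated programmatically, e.g. for a sweep launcher. This is a dry-run sketch that only prints the commands; `train.py` is a hypothetical entry-point name, not the repo's actual script.

```python
# Dry-run: print one launch command per experiment group.
# `train.py` is a hypothetical entry point; substitute the repo's real
# training script (and e.g. subprocess.run to actually launch).
EXPERIMENT_GROUPS = [
    "--group=reference --lr=0.001 --warmup_ratio=0.02 --weight_decay=0 --training_steps=25000",
    "--group=arch-wideshallow --model_name=resnet8-64 --warmup_ratio=0.02 --training_steps=25000",
    "--group=arch-narrowdeep --model_name=resnet34-16 --warmup_ratio=0.02 --training_steps=40000",
    "--group=lr-0.01 --lr=0.01 --training_steps=45000",
    "--group=bs-32 --batch_size=32 --training_steps=85000",
    "--group=bs-512 --batch_size=512 --training_steps=15000",
    "--group=warmup-10x --warmup_ratio=0.2 --training_steps=25000",
    "--group=decay-0.0001 --weight_decay=0.0001 --training_steps=20000",
    "--group=schedule-constant --lr_scheduler=constant --training_steps=35000",
    "--group=opt-adamw --lr=0.003 --optimizer=adamw --training_steps=20000",
]

commands = [f"python train.py {args}" for args in EXPERIMENT_GROUPS]
for cmd in commands:
    print(cmd)
```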