Problems when reproducing phase1 results #3
Thank you very much for your excellent work. I really appreciate it!
However, I ran into issues when trying to reproduce the phase 1 results.
Using the training settings from the source code, I found that neither the loss nor the gradient norm converges.
After lowering the learning rate and increasing the number of gradient accumulation steps, the gradient norm became more stable, but the loss still did not converge.
After training for 5 epochs, the model captured the basic intent better than the baseline, but its FID and FVD are still noticeably worse than those of the released phase 1 checkpoint.
Could you please confirm whether this kind of loss fluctuation is expected, whether my parameter settings are correct, and what setup you recommend for reproducing your reported results more faithfully?
I would really appreciate your help with this, since it is important for my work. If you need any additional information, please let me know.
My hyperparameters for phase 1 training:
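As a side note on how I judged convergence (this snippet is my own illustration, not from the repository): raw per-step loss is very noisy, so I compared an exponential moving average of the loss at the start and end of training to separate normal step-to-step fluctuation from genuine non-convergence. The loss curve below is synthetic, only to show the method.

```python
import random


def ema(values, alpha=0.02):
    """Exponential moving average of a loss curve; alpha is the smoothing factor."""
    smoothed = []
    avg = values[0]
    for v in values:
        avg = alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed


# Hypothetical noisy-but-decreasing loss curve standing in for a training log.
random.seed(0)
losses = [1.0 * (0.999 ** i) + random.uniform(-0.1, 0.1) for i in range(2000)]

smoothed = ema(losses)
# The smoothed curve reveals the trend that the raw values hide:
# if the EMA at the end is not below the EMA at the start, the loss
# has not meaningfully converged despite the fluctuation.
print(smoothed[0] > smoothed[-1])
```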
```yaml
batch_size: 2
beta1: 0.9
beta2: 0.95
beta3: 0.98
checkpointing_limit: 5
checkpointing_steps: 100
do_validation: false
enable_slicing: true
enable_tiling: true
epsilon: 1.0e-08
gen_fps: 15
gradient_accumulation_steps: 8  # adjusted to 16 in the second run
gradient_checkpointing: true
learning_rate: 1.0e-05  # adjusted to 5.0e-07 in the second run
lora_alpha: 32
lr_num_cycles: 1
lr_power: 1.0
lr_scheduler: constant_with_warmup
lr_warmup_steps: 100
max_grad_norm: 1.0
mixed_precision: bf16
model_name: cogvideox-i2v
model_type: i2v
nccl_timeout: 1800
num_workers: 4
optimizer: adamw
pin_memory: true
rank: 64
report_to: tensorboard
seed: 42
target_modules: ['to_q', 'to_k', 'to_v', 'to_out.0']
tracker_name: finetrainer-cogvideo
train_epochs: 10
train_resolution: (49, 480, 720)
train_steps: 1220
training_type: sft
weight_decay: 0.0001
```
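For reference, here is how I reason about the effective batch size implied by these settings (my own sketch, not code from the repository; `num_gpus` and `dataset_size` are hypothetical placeholders since I have not stated my hardware here):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Samples consumed per optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps_per_epoch(dataset_size, per_device_batch, grad_accum_steps, num_gpus):
    """Optimizer updates in one pass over the data (floor division)."""
    return dataset_size // effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)


# First run: batch_size=2, gradient_accumulation_steps=8, assuming a single GPU.
print(effective_batch_size(2, 8, 1))   # 16 samples per update
# Second run: accumulation raised to 16, doubling the effective batch
# and halving the number of optimizer updates per epoch.
print(effective_batch_size(2, 16, 1))  # 32 samples per update
```

Doubling gradient accumulation therefore also halves the number of updates the warmup and `train_steps: 1220` budget cover, which may matter when comparing the two runs.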