
Problems when reproducing phase1 results #3

@WYS-WHU

Thank you very much for your excellent work. I really appreciate it!

However, I ran into issues when trying to reproduce phase1.

Using the training settings from the source code, I found that neither the loss nor the grad norm converges.

Image

After lowering the learning rate (from 1.0e-05 to 5.0e-07) and increasing the gradient accumulation steps (from 8 to 16), the grad norm became more stable, but the loss still did not converge.
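For reference, here is a rough sketch of how the two runs differ in effective batch size and learning rate, using the values from the hyperparameter list in this issue. It assumes `batch_size` is the per-device batch size and that the effective batch is the product of batch size, accumulation steps, and GPU count. The GPU count is not stated here, so `num_gpus` is a hypothetical parameter defaulting to 1.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Samples contributing to one optimizer step (assumes per-device batch_size)."""
    return batch_size * grad_accum_steps * num_gpus


# First run: batch_size=2, gradient_accumulation_steps=8
first_run = effective_batch_size(2, 8)
# Second run: batch_size=2, gradient_accumulation_steps=16
second_run = effective_batch_size(2, 16)

# Factor by which the learning rate was reduced between the two runs
lr_ratio = 1.0e-05 / 5.0e-07

print(f"effective batch: {first_run} -> {second_run}")
print(f"learning rate reduced by a factor of about {lr_ratio:.0f}")
```

Under these assumptions, the second run doubles the effective batch while cutting the learning rate by roughly 20x, so the two changes together make each update much smaller than in the first run.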

Image

After training for 5 epochs, the model learned the basic intent better than the baseline, but the FID and FVD are still noticeably worse than the released phase1 checkpoint.
Could you please confirm whether this kind of loss fluctuation is expected, whether my parameter settings are correct, and what the recommended setup is to reproduce your reported results more faithfully?
I would really appreciate your help with this, since it is very important for my work. If you need any additional information, please feel free to let me know.
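To help judge whether the fluctuation hides a downward trend, one option is to smooth the logged per-step loss with an exponential moving average before comparing runs. Per-step loss in diffusion training is typically noisy because each batch samples different noise levels, so the raw curve can look flat even when the average is moving. The sketch below is only an illustration: `ema_smooth` is a hypothetical helper, and the input values are synthetic, not taken from these runs.

```python
import random


def ema_smooth(values, alpha=0.98):
    """Exponential moving average; a higher alpha gives heavier smoothing."""
    smoothed = []
    running = None
    for v in values:
        # Seed the average with the first value, then blend in each new one.
        running = v if running is None else alpha * running + (1 - alpha) * v
        smoothed.append(running)
    return smoothed


# Synthetic example: a slow downward trend buried in per-step noise.
random.seed(0)
raw_loss = [0.5 - 0.0002 * step + random.gauss(0, 0.1) for step in range(1000)]
smoothed_loss = ema_smooth(raw_loss)

print(f"raw first/last:      {raw_loss[0]:.3f} / {raw_loss[-1]:.3f}")
print(f"smoothed first/last: {smoothed_loss[0]:.3f} / {smoothed_loss[-1]:.3f}")
```

The same function can be applied to the scalar values exported from TensorBoard for both runs.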

My hyperparameters for phase1 training:

batch_size: 2
beta1: 0.9
beta2: 0.95
beta3: 0.98
checkpointing_limit: 5
checkpointing_steps: 100
do_validation: false
enable_slicing: true
enable_tiling: true
epsilon: 1.0e-08
gen_fps: 15
gradient_accumulation_steps: 8 (adjusted to 16 in the second run)
gradient_checkpointing: true
learning_rate: 1.0e-05 (adjusted to 5.0e-07 in the second run)
lora_alpha: 32
lr_num_cycles: 1
lr_power: 1.0
lr_scheduler: constant_with_warmup
lr_warmup_steps: 100
max_grad_norm: 1.0
mixed_precision: bf16
model_name: cogvideox-i2v
model_type: i2v
nccl_timeout: 1800
num_workers: 4
optimizer: adamw
pin_memory: true
rank: 64
report_to: tensorboard
seed: 42
target_modules: '[''to_q'', ''to_k'', ''to_v'', ''to_out.0'']'
tracker_name: finetrainer-cogvideo
train_epochs: 10
train_resolution: (49, 480, 720)
train_steps: 1220
training_type: sft
weight_decay: 0.0001
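A few quantities can be derived from the values above, which may help when comparing against the authors' setup. This is a sketch under stated assumptions: that `train_steps` counts optimizer steps over all `train_epochs`, and that LoRA updates are scaled by `lora_alpha / rank`, as is common in LoRA implementations. Since `training_type` is `sft`, the LoRA fields may not be used at all.

```python
# Values copied from the hyperparameter list in this issue.
config = {
    "train_steps": 1220,
    "train_epochs": 10,
    "lr_warmup_steps": 100,
    "checkpointing_steps": 100,
    "rank": 64,
    "lora_alpha": 32,
}

# Assumption: train_steps is the total number of optimizer steps for all epochs.
steps_per_epoch = config["train_steps"] / config["train_epochs"]

# Share of training spent in learning-rate warmup.
warmup_fraction = config["lr_warmup_steps"] / config["train_steps"]

# Number of checkpoints written over the whole run.
checkpoints_written = config["train_steps"] // config["checkpointing_steps"]

# Assumption: LoRA scaling follows the common lora_alpha / rank convention.
lora_scale = config["lora_alpha"] / config["rank"]

print(f"steps per epoch:     {steps_per_epoch:.0f}")
print(f"warmup fraction:     {warmup_fraction:.1%}")
print(f"checkpoints written: {checkpoints_written}")
print(f"LoRA scale:          {lora_scale}")
```

Under the first assumption, the 5 epochs mentioned above correspond to about 610 optimizer steps, with warmup covering roughly the first 8% of the full schedule.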
