I have developed a project that replaces the UNet in AnimateDiff with the transformer block from Stable Diffusion 3.5. Given my limited hardware resources, I use only 20 videos as the training set, and to verify the model's ability to overfit, I sampled three videos from the training set as the validation set.
In the training phase, I first initialize the transformer block with the pretrained weights of the SD 3.5 transformer and then freeze it; only the motion module is trained, from scratch.
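For reference, the freeze-and-train pattern I describe looks roughly like the sketch below. This is a minimal illustration with toy `nn.Linear` stand-ins, not my actual model: the names `transformer` and `motion_module` are placeholders for the frozen SD 3.5 spatial block and the temporal module trained from scratch.

```python
import torch
from torch import nn

class ToyVideoModel(nn.Module):
    """Toy stand-in for the real architecture (module names are illustrative)."""
    def __init__(self):
        super().__init__()
        self.transformer = nn.Linear(8, 8)    # stands in for the pretrained SD 3.5 block
        self.motion_module = nn.Linear(8, 8)  # stands in for the motion module

model = ToyVideoModel()

# Freeze the spatial transformer; leave the motion module trainable.
for p in model.transformer.parameters():
    p.requires_grad_(False)

# Hand the optimizer only the parameters that still require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Passing only the trainable parameters to the optimizer (rather than `model.parameters()`) avoids wasting optimizer state on frozen weights.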
The issue is that in the validation phase, the videos generated by the model look like they were pieced together from independent frames, with no temporal coherence.
I also trained the official AnimateDiff on the same training set with the UNet frozen, and the same issue occurred.