Hi, thanks for open-sourcing your work and releasing the codebase! I was wondering whether the `torch.cuda.synchronize()` call ([here](https://github.com/Max-Fu/icrt/blob/fd5f2fb1305f466b55f7aa49cf1b0109fa832adc/icrt/util/engine.py#L58)) serves any purpose. After removing it, I saw a visible improvement in training performance.