Hi, thanks for sharing this wonderful work. Since you use multi-frame, multi-view inputs during the pre-training stage, I would like to know whether you still use temporal multi-frame inputs during the fine-tuning stage.
If you did not use temporal multi-frame inputs in the downstream tasks, does that mean you discard the voxel decoder in the fine-tuning stage and load only the pre-trained voxel encoder?