3D visual grounding and 3D dense captioning are similar in that they both require an understanding of cross-modality relationships. Previous attempts to solve these tasks jointly have not fully exploited these relationships, instead only superficially enhancing one modality with the other. We propose VL3DNet, a novel approach for solving the two tasks jointly. Our method uses a shared vision-language transformer module to enhance the vision and language modalities simultaneously, effectively exploiting inter-modal relations. Compared to previous joint approaches such as D3Net and 3DJCG, our method achieves significant improvements in both visual grounding and dense captioning.
- python>=3.10.6
- cuda>=11.6
In this project, we use the following libraries and tools:
- PyTorch (v1.13.1): a machine learning library for building and training deep learning models.
- PyTorch Lightning (v1.9.0): a high-level wrapper around PyTorch that simplifies model implementation and training.
- Transformers (v4.25.1): a collection of pre-trained language models for NLP tasks.
- Hydra (v1.3.1): a library for managing complex experiments and configurations.
- Optuna (v2.10.1): a library for hyperparameter optimization.
- WandB (v0.13.9): a tool for logging, visualizing, and managing deep learning experiments.
There are different options to run the project. Since we use Hydra, the configurations for the different training/prediction modes can be changed via the CLI.
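As a sketch of what such CLI overrides look like (Hydra's standard `key=value` syntax; the config group and key names below are hypothetical examples, not necessarily the ones defined in this repository's config files):

```shell
# Override individual config values on the command line using Hydra syntax.
# NOTE: "task", "trainer.max_epochs", and "logger" are illustrative keys;
# check the configs/ directory of this project for the actual group names.
python train.py task=grounding trainer.max_epochs=50 logger=wandb
```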
To get the preprocessed SoftGroup data, please contact us. You can then download the data via the provided link. After you get access to the datasets, place them in the `data` folder like this:
```
data
├── scanrefer
│   ├── ScanRefer_filtered.json
│   ├── ScanRefer_filtered_train.json
│   ├── ScanRefer_filtered_train.txt
│   ├── ScanRefer_filtered_val.json
│   └── ScanRefer_filtered_val.txt
└── softgroup
    ├── train
    │   ├── scene0000_00.pth
    │   ├── scene0001_00.pth
    │   ├── ...
    └── val
        ├── scene0000_00.pth
        ├── scene0001_00.pth
        └── ...
```

For training there are several options to choose from, depending on which task you want to train or fine-tune.
To train the visual grounding task, use:

```shell
. scripts/train_grounding.sh
```

To train the dense captioning task, use:

```shell
. scripts/train_captioning.sh
```

To train both tasks jointly, use:

```shell
. scripts/train_vl3dnet.sh
```

For evaluation there are several options to choose from.

To evaluate the visual grounding task, use:

```shell
. scripts/eval_grounding.sh
```

To evaluate the dense captioning task, use:

```shell
. scripts/eval_captioning.sh
```

To evaluate both tasks jointly, use:

```shell
. scripts/eval_vl3dnet.sh
```

To run a hyperparameter search with Optuna, use:

```shell
python train.py +hparams_search=optuna
```

The benchmark for the visual grounding task:
The benchmark for the dense captioning task:
To get the VL3DNet checkpoints, download them via the provided link. After you get access to the checkpoints, place them in the `checkpoints` folder like this:
```
checkpoints
├── best/mode=0-val_loss.ckpt
├── best/mode=0-val_loss.yaml
├── best/mode=1-val_loss.ckpt
├── best/mode=1-val_loss.yaml
├── best/mode=2-val_loss.ckpt
└── best/mode=2-val_loss.yaml
```

Copyright (c) 2023 Yaomengxi Han, Robin Borth



