
Conversation

@allela-roy
Contributor

Issue #, if available:

Description of changes:
Adding nanoVLM sample on Slurm.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@paragao paragao left a comment
please go through the changes and let me know when I can review it again.

## 5. Download the dataset required for the training
The default dataset path will be '/fsx/ubuntu/datasets/nanoVLM/cauldron' and the datasets are ["clevr", "vqav2", "docvqa"].

### (Optional) You can modify this as needed to download the entire dataset by setting the configs to the entry below:
Under the (optional) section you have a `sbatch download_datasets.sbatch`. Is that optional? Or is it just editing the file to download additional datasets that's optional?

Contributor Author

addressed

### (Optional) You can modify this as needed to download the entire dataset by setting the configs to the entry below:

```bash
configs = get_dataset_config_names("HuggingFaceM4/the_cauldron")
```

where should I place this? Can you please add the existing line and show how it should look?

Contributor Author

addressed

## 5. Download the dataset required for the training
The default dataset path will be '/fsx/ubuntu/datasets/nanoVLM/cauldron' and the datasets are ["clevr", "vqav2", "docvqa"].
can you please change the code so it takes a different directory as the base path? Then change the sbatch file to parse that variable and use it. The idea is to let anyone define the base path (e.g. /home/user or /lustre/ubuntu) instead of using a hard-coded choice.

Contributor Author

addressed
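A minimal sketch of what the requested change could look like, assuming the sbatch file reads an optional `BASE_PATH` environment variable (the variable name is illustrative, not from the sample):

```shell
#!/bin/bash
# Sketch: let the user override the base path via BASE_PATH,
# falling back to the original hard-coded default.
BASE_PATH="${BASE_PATH:-/fsx/ubuntu}"
DATASET_DIR="${BASE_PATH}/datasets/nanoVLM/cauldron"
echo "datasets will be stored in ${DATASET_DIR}"
```

With `BASE_PATH=/home/user sbatch download_dataset.sbatch`, the datasets would then land under /home/user instead of the default.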

```bash
#SBATCH --cpus-per-task=48
#SBATCH --partition=p5en

export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token)
```
instead of sourcing that token from a cache directory, add instructions to the README asking the user to export HF_TOKEN, and then use the env variable in your sbatch file.

Contributor Author

addressed
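The suggested pattern might look like the following sketch (the placeholder fallback is only so the snippet is self-contained; in the real sbatch file you would fail fast instead):

```shell
#!/bin/bash
# Sketch: rely on an HF_TOKEN exported by the user rather than
# reading a token file from a hard-coded cache path.
HF_TOKEN="${HF_TOKEN:-hf_placeholder_token}"  # placeholder default, for illustration only
if [ -z "$HF_TOKEN" ]; then
  echo "HF_TOKEN is not set; run 'export HF_TOKEN=...' before sbatch" >&2
  exit 1
fi
echo "HF_TOKEN is set"
```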


```bash
export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token)

cd /fsx/ubuntu/nanoVLM
```
allow users to define a different base path. Example: instead of /fsx/ubuntu I want to use /home/user.

Contributor Author

addressed

```bash
#SBATCH --partition=p5en
#SBATCH --array=0

cd /fsx/ubuntu/nanoVLM
```
another place to allow for a different base path.

Contributor Author

addressed

```bash
sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py
```

Since this demo is just to showcase the workflow, we can also redunce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just using [mmstar,mmmu] with the command below:
typo: redunce should be reduce

Contributor Author

addressed

## 7. Update the dataset path in the config

```bash
sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py
```

please allow for a different base path.

Contributor Author

addressed
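One way the parameterized sed could look, sketched against a stand-in config file (`config_demo.py` and `BASE_PATH` are illustrative names, not files or variables from the repo):

```shell
#!/bin/bash
# Stand-in for nanoVLM/models/config.py, used only for this sketch.
cat > config_demo.py <<'EOF'
train_dataset_path: str = '/old/path'
EOF
# Rewrite the dataset path using a user-chosen base path.
BASE_PATH="${BASE_PATH:-/fsx/ubuntu}"
sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '${BASE_PATH}/datasets/nanoVLM/cauldron'|" config_demo.py
cat config_demo.py
```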

Since this demo is just to showcase the workflow, we can also redunce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just using [mmstar,mmmu] with the command below:

```bash
sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py
```

the path here should be ./awsome-distributed-training/3.test_cases/pytorch/nanoVLM/nanoVLM/models/config.py instead of /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py

Contributor Author

addressed


```bash
cd ..
git clone https://github.com/huggingface/nanoVLM.git
```

make sure you clone a specific commit hash, as the external repo might change without notice and the example could stop working.

Contributor Author

addressed
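The pinning pattern the reviewer asks for is `git clone` followed by `git checkout <commit-hash>`. A self-contained sketch of the idea, using a throwaway local repo so it doesn't depend on the network (in the README it would be the nanoVLM URL and a real commit hash):

```shell
#!/bin/bash
set -e
# Throwaway repo standing in for the cloned nanoVLM checkout.
git init -q pin-demo && cd pin-demo
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "initial"
# Record a specific commit and pin the working tree to it, so later
# upstream changes cannot break the example.
PIN=$(git rev-parse HEAD)
git checkout -q "$PIN"
echo "pinned to $PIN"
```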

@allela-roy
Contributor Author

@paragao @aravneelaws, addressed the above changes and validated test runs

@paragao paragao left a comment

please review the comments. Most of them are non-blocking. As soon as I finish running this example I'll come back and approve. Hopefully you've had time to review the changes.


The default dataset path will be $DATASET_DIR and the datasets are ["clevr", "vqav2", "docvqa"].

### (Optional) You can modify this as needed to download the entire dataset by setting the configs to the entry below in Line 24 in slurm/download_dataset.sbatch file:
the way you put this ### (Optional) here, it seems that the command sbatch download_dataset.sbatch is also optional. Put the Optional information as a markdown blockquote instead.
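As a sketch, the blockquote form the reviewer suggests could look like this in the README (wording illustrative):

```markdown
> **Optional:** to download the entire dataset instead of the three defaults,
> edit the configs entry on line 24 of slurm/download_dataset.sbatch as shown below.
```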


```bash
cd slurm
sbatch download_dataset.sbatch
```

this is a mandatory command. Move it out of this optional section.

Add how long it takes. I know it depends on the internet connection, but it gives the user an idea of the time this step takes. Mine took a total of 253 seconds.

Add an example output that the user can see in their log files:

```
Downloading 1/3: clevr
✓ Saved clevr in 111.5s
Downloading 2/3: vqav2
✓ Saved vqav2 in 100.7s
Downloading 3/3: docvqa
```

## 7. Update the dataset and checkpoint path in the NanoVLM config

```bash
cd ..
```

it's nice to tell the user where they are or where they should land. Example:

Now, let's move back one folder, to where you created the dataset folder.

`cd ..`

You should find yourself in the awsome-distributed-training/3.test_cases/pytorch/nanoVLM folder.

etc....

```bash
export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints
```

explain what this line does. Add an explanation before telling the user to run this command.

```bash
sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" $PWD/nanoVLM/models/config.py
```

and then you can move this line into the same code block as the export CHECKPOINT_DIR.
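Combined as suggested, the export and the sed would form one block. A self-contained sketch using a stand-in config file (`config_demo.py` is illustrative; the real target is nanoVLM/models/config.py):

```shell
#!/bin/bash
# Stand-in config file for this sketch.
cat > config_demo.py <<'EOF'
vlm_checkpoint_path: str = '/old/checkpoints'
EOF
# Export the checkpoint directory and point the config at it in one block.
export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints
sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" config_demo.py
grep "vlm_checkpoint_path" config_demo.py
```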

```bash
#SBATCH --job-name=download_dataset
#SBATCH --output=logs/download_%A.out
#SBATCH --error=logs/download_%A.err
#SBATCH --nodes=2
```

do you need 2 nodes to download datasets?

```bash
#!/bin/bash
#SBATCH --job-name=train_nanoVLM
#SBATCH --output=logs/train_nanoVLM/%A.out
```
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.
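A sketch of the cleaner naming the reviewer suggests, assuming this is not an array job (illustrative sbatch header fragment):

```bash
#SBATCH --job-name=train_nanoVLM
#SBATCH --output=logs/train_nanoVLM/%j.out
#SBATCH --error=logs/train_nanoVLM/%j.err
```

`%j` expands to the plain job ID; `%A`/`%a` are meant for array jobs (array master ID and task index).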

```bash
#!/bin/bash
#SBATCH --job-name=train_nanoVLM
#SBATCH --output=logs/train_nanoVLM/%A.out
#SBATCH --error=logs/train_nanoVLM/%A.err
```
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

```bash
#SBATCH --error=logs/train_nanoVLM/%A.err
#SBATCH --time=01:00:00
#SBATCH --nodes=4
#SBATCH --partition=p5en
```
don't need the partition name. The job will fail if the partition doesn't exist.

```bash
#SBATCH --nodes=4
#SBATCH --partition=p5en

GPUS_PER_NODE=8 # set to 1 for g5.8xlarge
```
should we have a step on the README explaining this? Maybe a command you run to setup this based on the instance you are running on?
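One possible README addition, sketched: detect the GPU count at runtime instead of hand-editing the variable. This falls back to 1 when nvidia-smi is unavailable; treat it as an illustration, not the sample's actual setup step.

```shell
#!/bin/bash
# Sketch: derive GPUS_PER_NODE from the node itself rather than hard-coding it.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPUS_PER_NODE=$(nvidia-smi --list-gpus | wc -l)
else
  GPUS_PER_NODE=1   # e.g. single-GPU instances, or no NVIDIA driver present
fi
echo "GPUS_PER_NODE=${GPUS_PER_NODE}"
```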
