-
Notifications
You must be signed in to change notification settings - Fork 151
Adding nanoVLM sample #864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
paragao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please go through the changes and let me know when I can review it again.
| ## 5. Download the dataset required for the training | ||
| The default dataset path will be '/fsx/ubuntu/datasets/nanoVLM/cauldron' and the datasets are ["clevr", "vqav2", "docvqa"]. | ||
|
|
||
| ### Optional) You can modify this as needed to dowload the entire dataset by setting the configs to the entry below: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under the (optional) section you have a sbatch download_datasets.sbatch . Is that optional? Or optional is just editing the file to download additional datasets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| ### Optional) You can modify this as needed to dowload the entire dataset by setting the configs to the entry below: | ||
|
|
||
| ```bash | ||
| configs = get_dataset_config_names("HuggingFaceM4/the_cauldron") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where should I place this? Can you please add the existing line and how it should look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| ``` | ||
|
|
||
| ## 5. Download the dataset required for the training | ||
| The default dataset path will be '/fsx/ubuntu/datasets/nanoVLM/cauldron' and the datasets are ["clevr", "vqav2", "docvqa"]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please change the code so it take a different directory as the base path? Then change the sbatch file to parse that variable and use it. The idea is to allow anyone to define the base path (ex: /home/user or /lustre/ubuntu) and instead of using a hard coded choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| #SBATCH --cpus-per-task=48 | ||
| #SBATCH --partition=p5en | ||
|
|
||
| export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of sourcing that token from a cache directory, add instructions to the README asking the user to export HF_TOKEN and then you use the env variable on your sbatch file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
|
|
||
| export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token) | ||
|
|
||
| cd /fsx/ubuntu/nanoVLM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow users to define a different base path. Example: instead of /fsx/ubuntu I want to use /home/user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| #SBATCH --partition=p5en | ||
| #SBATCH --array=0 | ||
|
|
||
| cd /fsx/ubuntu/nanoVLM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another place to allow for different base path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py | ||
| ``` | ||
|
|
||
| Since this demo is just to showcase the workflow, we can also redunce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just using [mmstar,mmmu] with the command below: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: redunce should be reduce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| ## 7. Update the dataset path in the config | ||
|
|
||
| ```bash | ||
| sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please allow for diff base path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| Since this demo is just to showcase the workflow, we can also redunce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just using [mmstar,mmmu] with the command below: | ||
|
|
||
| ```bash | ||
| sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the path here should be ./awsome-distributed-training/3.test_cases/pytorch/nanoVLM/nanoVLM/models/config.py instead of /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
|
|
||
| ```bash | ||
| cd .. | ||
| git clone https://github.com/huggingface/nanoVLM.git |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure you clone a specific commit hash as the external repo might change without notice and the example stop working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
|
@paragao @aravneelaws , addressed the above changes and validated test runs |
paragao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, review the comments. Most of them are non-blocking. As soon as I finish running this example I'll come back and approve. Hopefully you've had time to review the changes.
|
|
||
| The default dataset path will be $DATASET_DIR and the datasets are ["clevr", "vqav2", "docvqa"]. | ||
|
|
||
| ### (Optional) You can modify this as needed to dowload the entire dataset by setting the configs to the entry below in Line 24 in slurm/download_dataset.sbatch file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the way you put this ### (Optional) here, it seems that the command sbatch download_dataset.sbatch is also optional. Put the Optional information as a markdown blockquote instead.
|
|
||
| ```bash | ||
| cd slurm | ||
| sbatch download_dataset.sbatch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a mandatory command. Move out of this optional section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add how long it takes. I know it depends on the internet connection, but it gives the user an idea time to run this step. Mine took a total time of. 253 seconds.
Add an example output that the user can see on their log files:
Downloading 1/3: clevr
✓ Saved clevr in 111.5s
Downloading 2/3: vqav2
✓ Saved vqav2 in 100.7s
Downloading 3/3: docvqa
| ## 7. Update the dataset and checkpoint path in the NanoVLM config | ||
|
|
||
| ```bash | ||
| cd .. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's nice to tell the user where they are or they should land. Example:
Now, let's move back one folder where you have created the dataset folder.
cd ..You should find yourself on the awsome-distributed-training/3.test_cases/pytorch/nanoVLM folder.
etc....
| ``` | ||
|
|
||
| ```bash | ||
| export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain what this line do. Add an explanation before telling the user to run this command.
| ``` | ||
|
|
||
| ```bash | ||
| sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" $PWD/nanoVLM/models/config.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then you can move this line to be on the same code block as the export CHECKPOINTS_DIR.
| #SBATCH --job-name=download_dataset | ||
| #SBATCH --output=logs/download_%A.out | ||
| #SBATCH --error=logs/download_%A.err | ||
| #SBATCH --nodes=2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need 2 nodes to download datasets?
| @@ -0,0 +1,57 @@ | |||
| #!/bin/bash | |||
| #SBATCH --job-name=train_nanoVLM | |||
| #SBATCH --output=logs/train_nanoVLM/%A.out | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.
| #!/bin/bash | ||
| #SBATCH --job-name=train_nanoVLM | ||
| #SBATCH --output=logs/train_nanoVLM/%A.out | ||
| #SBATCH --error=logs/train_nanoVLM/%A.err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.
| #SBATCH --error=logs/train_nanoVLM/%A.err | ||
| #SBATCH --time=01:00:00 | ||
| #SBATCH --nodes=4 | ||
| #SBATCH --partition=p5en |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need partition name. Will fail if doesn't exist.
| #SBATCH --nodes=4 | ||
| #SBATCH --partition=p5en | ||
|
|
||
| GPUS_PER_NODE=8 #set to 1 for g5.8xlarge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have a step on the README explaining this? Maybe a command you run to setup this based on the instance you are running on?
Issue #, if available:
Description of changes:
Adding nanoVLM sample on Slurm.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.