
Conversation

@allela-roy
Contributor

Issue #, if available:

Description of changes:
Adding nanoVLM sample on Slurm.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@paragao paragao left a comment
please go through the changes and let me know when I can review it again.

## 5. Download the dataset required for the training
The default dataset path will be '/fsx/ubuntu/datasets/nanoVLM/cauldron' and the datasets are ["clevr", "vqav2", "docvqa"].

### (Optional) You can modify this as needed to download the entire dataset by setting the configs to the entry below:
Under the (optional) section you have a `sbatch download_datasets.sbatch`. Is that optional? Or is it just editing the file to download additional datasets that's optional?

Contributor Author

addressed

### (Optional) You can modify this as needed to download the entire dataset by setting the configs to the entry below:

```bash
configs = get_dataset_config_names("HuggingFaceM4/the_cauldron")
```

where should I place this? Can you please add the existing line and show how it should look?

Contributor Author

addressed

## 5. Download the dataset required for the training
The default dataset path will be '/fsx/ubuntu/datasets/nanoVLM/cauldron' and the datasets are ["clevr", "vqav2", "docvqa"].
can you please change the code so it takes a different directory as the base path? Then change the sbatch file to parse that variable and use it. The idea is to let anyone define the base path (e.g. /home/user or /lustre/ubuntu) instead of using a hard-coded choice.

Contributor Author

addressed
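A minimal sketch of what the requested change could look like, assuming the sbatch file reads an optional `BASE_PATH` environment variable (the variable name is illustrative, not from the sample):

```shell
#!/bin/bash
# Sketch: let the user override the base path via BASE_PATH,
# falling back to the original hard-coded default.
BASE_PATH="${BASE_PATH:-/fsx/ubuntu}"
DATASET_DIR="${BASE_PATH}/datasets/nanoVLM/cauldron"
echo "datasets will be stored in ${DATASET_DIR}"
```

With `BASE_PATH=/home/user sbatch download_dataset.sbatch`, the datasets would then land under /home/user instead of the default.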

```bash
#SBATCH --cpus-per-task=48
#SBATCH --partition=p5en

export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token)
```
instead of sourcing that token from a cache directory, add instructions to the README asking the user to export HF_TOKEN, and then use the env variable in your sbatch file.

Contributor Author

addressed
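The suggested pattern might look like the following sketch (the placeholder fallback is only so the snippet is self-contained; in the real sbatch file you would fail fast instead):

```shell
#!/bin/bash
# Sketch: rely on an HF_TOKEN exported by the user rather than
# reading a token file from a hard-coded cache path.
HF_TOKEN="${HF_TOKEN:-hf_placeholder_token}"  # placeholder default, for illustration only
if [ -z "$HF_TOKEN" ]; then
  echo "HF_TOKEN is not set; run 'export HF_TOKEN=...' before sbatch" >&2
  exit 1
fi
echo "HF_TOKEN is set"
```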


```bash
export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token)

cd /fsx/ubuntu/nanoVLM
```
allow users to define a different base path. Example: instead of /fsx/ubuntu I want to use /home/user.

Contributor Author

addressed

```bash
#SBATCH --partition=p5en
#SBATCH --array=0

cd /fsx/ubuntu/nanoVLM
```
another place to allow for a different base path.

Contributor Author

addressed

```bash
sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py
```

Since this demo is just to showcase the workflow, we can also redunce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just using [mmstar,mmmu] with the command below:
typo: redunce should be reduce

Contributor Author

addressed

## 7. Update the dataset path in the config

```bash
sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py
```

please allow for a different base path.

Contributor Author

addressed
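One way the parameterized sed could look, sketched against a stand-in config file (`config_demo.py` and `BASE_PATH` are illustrative names, not files or variables from the repo):

```shell
#!/bin/bash
# Stand-in for nanoVLM/models/config.py, used only for this sketch.
cat > config_demo.py <<'EOF'
train_dataset_path: str = '/old/path'
EOF
# Rewrite the dataset path using a user-chosen base path.
BASE_PATH="${BASE_PATH:-/fsx/ubuntu}"
sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '${BASE_PATH}/datasets/nanoVLM/cauldron'|" config_demo.py
cat config_demo.py
```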

Since this demo is just to showcase the workflow, we can also redunce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just using [mmstar,mmmu] with the command below:

```bash
sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py
```

the path here should be ./awsome-distributed-training/3.test_cases/pytorch/nanoVLM/nanoVLM/models/config.py instead of /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py

Contributor Author

addressed


```bash
cd ..
git clone https://github.com/huggingface/nanoVLM.git
```

make sure you clone a specific commit hash, as the external repo might change without notice and the example could stop working.

Contributor Author

addressed
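The pinning pattern the reviewer asks for is `git clone` followed by `git checkout <commit-hash>`. A self-contained sketch of the idea, using a throwaway local repo so it doesn't depend on the network (in the README it would be the nanoVLM URL and a real commit hash):

```shell
#!/bin/bash
set -e
# Throwaway repo standing in for the cloned nanoVLM checkout.
git init -q pin-demo && cd pin-demo
git -c user.email=demo@example.com -c user.name=demo commit -q --allow-empty -m "initial"
# Record a specific commit and pin the working tree to it, so later
# upstream changes cannot break the example.
PIN=$(git rev-parse HEAD)
git checkout -q "$PIN"
echo "pinned to $PIN"
```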

@allela-roy
Contributor Author

@paragao @aravneelaws, addressed the above changes and validated test runs

@paragao paragao left a comment

please review the comments. Most of them are non-blocking. As soon as I finish running this example I'll come back and approve. Hopefully you've had time to review the changes.


The default dataset path will be $DATASET_DIR and the datasets are ["clevr", "vqav2", "docvqa"].

### (Optional) You can modify this as needed to download the entire dataset by setting the configs to the entry below in Line 24 in slurm/download_dataset.sbatch file:
the way you put this ### (Optional) here, it seems that the command sbatch download_dataset.sbatch is also optional. Put the Optional information as a markdown blockquote instead.
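As a sketch, the blockquote form the reviewer suggests could look like this in the README (wording illustrative):

```markdown
> **Optional:** to download the entire dataset instead of the three defaults,
> edit the configs entry on line 24 of slurm/download_dataset.sbatch as shown below.
```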


```bash
cd slurm
sbatch download_dataset.sbatch
```

this is a mandatory command. Move it out of this optional section.

Add how long it takes. I know it depends on the internet connection, but it gives the user an idea of the time this step takes. Mine took a total of 253 seconds.

Add an example output that the user can see in their log files:

```
Downloading 1/3: clevr
✓ Saved clevr in 111.5s
Downloading 2/3: vqav2
✓ Saved vqav2 in 100.7s
Downloading 3/3: docvqa
```

## 7. Update the dataset and checkpoint path in the NanoVLM config

```bash
cd ..
```

it's nice to tell the user where they are or where they should land. Example:

Now, let's move back one folder, to where you created the dataset folder.

`cd ..`

You should find yourself in the awsome-distributed-training/3.test_cases/pytorch/nanoVLM folder.

etc....

```bash
export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints
```

explain what this line does. Add an explanation before telling the user to run this command.

```bash
sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" $PWD/nanoVLM/models/config.py
```

and then you can move this line into the same code block as the export CHECKPOINT_DIR.
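Combined as suggested, the export and the sed would form one block. A self-contained sketch using a stand-in config file (`config_demo.py` is illustrative; the real target is nanoVLM/models/config.py):

```shell
#!/bin/bash
# Stand-in config file for this sketch.
cat > config_demo.py <<'EOF'
vlm_checkpoint_path: str = '/old/checkpoints'
EOF
# Export the checkpoint directory and point the config at it in one block.
export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints
sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" config_demo.py
grep "vlm_checkpoint_path" config_demo.py
```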

```bash
#SBATCH --job-name=download_dataset
#SBATCH --output=logs/download_%A.out
#SBATCH --error=logs/download_%A.err
#SBATCH --nodes=2
```

do you need 2 nodes to download datasets?

```bash
#!/bin/bash
#SBATCH --job-name=train_nanoVLM
#SBATCH --output=logs/train_nanoVLM/%A.out
```
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.
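A sketch of the cleaner naming the reviewer suggests, assuming this is not an array job (illustrative sbatch header fragment):

```bash
#SBATCH --job-name=train_nanoVLM
#SBATCH --output=logs/train_nanoVLM/%j.out
#SBATCH --error=logs/train_nanoVLM/%j.err
```

`%j` expands to the plain job ID; `%A`/`%a` are meant for array jobs (array master ID and task index).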

```bash
#!/bin/bash
#SBATCH --job-name=train_nanoVLM
#SBATCH --output=logs/train_nanoVLM/%A.out
#SBATCH --error=logs/train_nanoVLM/%A.err
```
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

```bash
#SBATCH --error=logs/train_nanoVLM/%A.err
#SBATCH --time=01:00:00
#SBATCH --nodes=4
#SBATCH --partition=p5en
```
don't need the partition name. The job will fail if the partition doesn't exist.

```bash
#SBATCH --nodes=4
#SBATCH --partition=p5en

GPUS_PER_NODE=8 # set to 1 for g5.8xlarge
```
should we have a step on the README explaining this? Maybe a command you run to setup this based on the instance you are running on?
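One possible README addition, sketched: detect the GPU count at runtime instead of hand-editing the variable. This falls back to 1 when nvidia-smi is unavailable; treat it as an illustration, not the sample's actual setup step.

```shell
#!/bin/bash
# Sketch: derive GPUS_PER_NODE from the node itself rather than hard-coding it.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPUS_PER_NODE=$(nvidia-smi --list-gpus | wc -l)
else
  GPUS_PER_NODE=1   # e.g. single-GPU instances, or no NVIDIA driver present
fi
echo "GPUS_PER_NODE=${GPUS_PER_NODE}"
```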
