29 commits
- 45ea34d: add new attn (Jul 21, 2025)
- d0c4cba: fix imports (Jul 23, 2025)
- 32c18cc: jul 24 changes (Jul 26, 2025)
- 3fb558c: works but ooms? (sumo43, Jul 26, 2025)
- b508dea: temp remove vision encoder (sumo43, Jul 26, 2025)
- 71839a6: final (sumo43, Jul 26, 2025)
- 7fde526: Merge branch 'dev-updated' into mistral-small (teknium1, Jul 29, 2025)
- 7e50705: add mistral conversion (Jul 31, 2025)
- 8b3fef8: fix checkpoint loading, add finetuning script mirroring qwen (Jul 31, 2025)
- 0d43ccb: cleanup of mistral3 code (Jul 31, 2025)
- d1d7673: pt 1 multimodal sample packing preprocessing func (Aug 1, 2025)
- e3a0056: add packing with images (Aug 4, 2025)
- dc8f6cd: hf preprocess instead of mistral (Aug 5, 2025)
- cba826e: add back vision encoder, make change to preprocess (we now only keep … (Aug 14, 2025)
- ad0e63e: add multimodal packed ds, update to conversion script (Aug 14, 2025)
- c11f1dd: add instructions, set better default configs (Aug 14, 2025)
- 46cb25e: limit for testing preproc multimodal (Aug 14, 2025)
- a9fedf5: update readme (Aug 14, 2025)
- b486534: add interleaved packed ds (Aug 14, 2025)
- 84b2b7c: add interleaved text-image and textonly preprocess script & functiona… (Aug 15, 2025)
- 8c079d7: bugfix: limit keyword in multimodal data prprocess script (Aug 18, 2025)
- 24d6a7c: add conversion script back for mistral (Aug 20, 2025)
- 6968436: fix freqs_cis bug. add more configs (Aug 21, 2025)
- f252e0d: update gitignore (Aug 21, 2025)
- 220d7c8: small changes to scripts (Aug 21, 2025)
- 0a9f59b: remove junk and update gitignore (Aug 21, 2025)
- f1e4890: fix VLM embedding with TP (Aug 21, 2025)
- 2761c2c: temp fix for vision encoder loading in TP context (Aug 25, 2025)
- b8a4c0b: nvidia VLM dataset support (Aug 27, 2025)
14 changes: 14 additions & 0 deletions .gitignore
@@ -42,3 +42,17 @@ Sessionx.vim

# macOS dir files
.DS_Store

# Ignore everything inside scripts/
scripts/*

# Keep Python files in scripts/
!scripts/*.py

# Keep contents of these subdirectories
!scripts/example/
!scripts/example/**
!scripts/generate/
!scripts/generate/**
!scripts/estimate/
!scripts/estimate/**
23 changes: 23 additions & 0 deletions README.md
@@ -158,6 +158,29 @@ srun torchrun --nnodes 2

If your GPU count per node is not 8, adjust `--nproc_per_node` in the `torchrun` command and `#SBATCH --gpus-per-task` in the SBATCH command section.

## (NOUS) Training with sample packing and multimodality

### Training Qwen3-8B with sample packing
To preprocess and pack a text-only chat dataset, run `scripts/preprocess_data.py`:
```
python3 scripts/preprocess_data.py --dataset NousResearch/Hermes-3-Dataset --tokenizer Qwen/Qwen3-8B --chat --pack-to-sequence-length 8000 --split "train[:1000]" --save-to-disk ./dataset
```

Qwen3-8B can be trained using this dataset:
```
CONFIG_FILE="./torchtitan/models/qwen3/train_configs/qwen3_8b_finetuning.toml" ./run_train.sh
```
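Sample packing concatenates many short tokenized examples into fixed-length sequences so no compute is wasted on padding. The sketch below is only an illustration of the idea, not the repo's actual `preprocess_data.py` logic; the function name and boundary format are hypothetical:

```python
def pack_samples(token_lists, max_len):
    """Greedily concatenate tokenized samples into packs of at most max_len
    tokens, recording each sample's (start, end) boundaries so attention can
    later be masked per-document within a pack."""
    packs, boundaries = [], []
    current, cur_bounds, used = [], [], 0
    for tokens in token_lists:
        if len(tokens) > max_len:
            tokens = tokens[:max_len]  # truncate over-long samples
        # start a new pack when the next sample would not fit
        if used + len(tokens) > max_len and current:
            packs.append(current)
            boundaries.append(cur_bounds)
            current, cur_bounds, used = [], [], 0
        cur_bounds.append((used, used + len(tokens)))
        current = current + tokens
        used += len(tokens)
    if current:
        packs.append(current)
        boundaries.append(cur_bounds)
    return packs, boundaries
```

The recorded boundaries are what allow a block-diagonal attention mask, so packed samples do not attend to each other.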

### Training Mistral Small 3.1 with multimodal sample packing
To preprocess and pack a multimodal chat dataset, run `scripts/preprocess_multimodal_data.py`:
```
python3 scripts/preprocess_multimodal_data.py --dataset /home/shared/datasets/cambrian_sample.json --preprocessor mistralai/Mistral-Small-3.1-24B-Instruct-2503 --chat --pack-to-sequence-length 8000 --split "train" --save-to-disk ./multimodal_dataset --limit 1000
```

Mistral Small 3.1 can be trained using this dataset:
```
CONFIG_FILE="./torchtitan/models/mistral3/train_configs/mistral24b_finetuning.toml" ./run_train.sh
```

## Citation

59 changes: 59 additions & 0 deletions scripts/convert.py
@@ -0,0 +1,59 @@
import re

from datasets import load_dataset
from datasets.utils.info_utils import VerificationMode

# Load the vqa_8 split of the NVIDIA Llama-Nemotron VLM dataset, skipping
# checksum verification
ds = load_dataset(
    "nvidia/Llama-Nemotron-VLM-Dataset-v1",
    verification_mode=VerificationMode.NO_CHECKS,
    split="vqa_8",
)


def process_conversation(row):
    image_path = row["image"]
    original_conv = row["conversations"]

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        }
    ]

    # The conversation alternates starting with the human turn, and the
    # <image> placeholder appears only in the first human message
    for i, turn in enumerate(original_conv):
        role = "user" if turn["from"] == "human" else "assistant"
        value = turn["value"]

        if role == "user" and i == 0 and "<image>" in value:
            # Split around <image>, keeping the delimiter so text and image
            # parts are interleaved in their original order
            parts = re.split(r"(\n?<image>\n?)", value)
            content = []
            for part in parts:
                if re.match(r"\n?<image>\n?", part):
                    content.append(
                        {"type": "image", "path": "./ChartQA Dataset/" + image_path}
                    )
                elif part.strip():
                    content.append({"type": "text", "text": part.strip()})
        else:
            content = [{"type": "text", "text": value.strip()}]

        messages.append({"role": role, "content": content})

    return {"conversations": messages}


# Apply the transformation and keep only the new "conversations" column
new_ds = ds.map(process_conversation, remove_columns=ds.column_names)

# Optionally, push to the Hugging Face Hub instead:
# new_ds.push_to_hub("new_dataset_name")
new_ds.save_to_disk("ChartQA_Subset")

# print(new_ds[0])
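The `re.split` pattern in `process_conversation` uses a capturing group, so the `<image>` delimiter itself is kept in the split result; that is what lets the loop interleave image and text parts in order. A small self-contained check of that behavior (the sample `value` string is made up for illustration):

```python
import re

value = "Describe the chart.\n<image>\nWhat trend do you see?"
# Capturing group keeps the matched delimiter in the output list
parts = re.split(r"(\n?<image>\n?)", value)
content = []
for part in parts:
    if re.match(r"\n?<image>\n?", part):
        content.append({"type": "image"})
    elif part.strip():
        content.append({"type": "text", "text": part.strip()})
# content now interleaves text and image entries in their original order
```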