Mismatch between datasize and real size for model flux #305

@wolfgang-desalvador

Description

When calculating the data size for the flux model, there is apparently a mismatch between the dataset size computed by the datasize command and the actual size generated by the datagen command.

Running datasize:

(mlperf) [hpcadmin@ccw-alma-htc-2 ~]$ mlpstorage training datasize --model=flux  -cm 1892 --max-accelerators 10 -g b200 --file --num-client-hosts 1 --hosts ccw-alma-htc-2
Setting attr from max_accelerators to 10
Hosts is: ['ccw-alma-htc-2']
Hosts is: ['ccw-alma-htc-2']
⠹ Validating environment... 0:00:002026-04-02 15:47:00|INFO: Environment validation passed
2026-04-02 15:47:00|STATUS: Benchmark results directory: /tmp/mlperf_storage_results/training/flux/datasize/20260402_154659
⠦ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
2026-04-02 15:47:01|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-02 15:47:01|RESULT: Number of training files: 16291
2026-04-02 15:47:01|RESULT: Number of training subfolders: 0
2026-04-02 15:47:01|RESULT: Total disk space required for training: 9459.42 GB
2026-04-02 15:47:01|WARNING: The number of files required may be excessive for some filesystems. You can use the num_subfolders_train parameter to shard the dataset. To keep near 10,000 files per folder use "1x" subfolders by adding "--param dataset.num_subfolders_train=1"
2026-04-02 15:47:01|RESULT: Run the following command to generate data: 
mlpstorage training datagen --hosts=ccw-alma-htc-2 --model=flux --exec-type=mpi --param dataset.num_files_train=16291 --num-processes=10 --results-dir=/tmp/mlperf_storage_results --data-dir=<INSERT_DATA_DIR>
2026-04-02 15:47:01|WARNING: The parameter for --num-processes is the same as --max-accelerators. Adjust the value according to your system.
2026-04-02 15:47:02|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/training/flux/datasize/20260402_154659/training_20260402_154659_metadata.json
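For reference, the reported totals appear consistent with a dataset-to-memory ratio of 5x the client memory passed via -cm 1892; the 5x factor below is inferred from the numbers in the log, not read from the mlpstorage source:

```python
# Sketch of the datasize arithmetic implied by the log above.
# Assumption: the "dataset size to memory size ratio" rule requires the
# dataset to be at least 5x the client memory (-cm 1892 GB).

client_memory_gb = 1892
num_files = 16291
reported_total_gb = 9459.42

required_gb = 5 * client_memory_gb           # 9460 GB, matching the reported total
per_file_gb = reported_total_gb / num_files  # ~0.58 GB per file implied

print(f"required ~{required_gb} GB, implied file size ~{per_file_gb:.3f} GB")
```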

Then, running the benchmark with the generated data (even in the closed category), the benchmark is allowed to proceed:

mlpstorage training run --model=flux --param dataset.num_files_train=16291 -g b200 -cm 1892 --closed --file --results-dir=/data/mlperf_storage_results -na 8 --data-dir=/nvme/flux --host ccw-alma-htc-2
Setting attr from num_accelerators to 8
Hosts is: ['ccw-alma-htc-2']
Hosts is: ['ccw-alma-htc-2']
⠸ Validating environment... 0:00:002026-04-02 15:24:55|INFO: Environment validation passed
2026-04-02 15:24:55|STATUS: Benchmark results directory: /data/mlperf_storage_results/training/flux/run/20260402_152454
2026-04-02 15:24:55|INFO: Created benchmark run: training_run_flux_20260402_152454
2026-04-02 15:24:55|STATUS: Verifying benchmark run for training_run_flux_20260402_152454
2026-04-02 15:24:55|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-02 15:24:55|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 16291 (Parameter: Overrode Parameters)
2026-04-02 15:24:55|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='training', command='run', model='flux', run_datetime='20260402_152454')])
⠦ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
2026-04-02 15:24:56|STATUS: Running benchmark command:: mpirun -n 8 -host ccw-alma-htc-2:8 --bind-to none --map-by socket /shared/apps/mlperf/bin/dlio_benchmark workload=flux_b200 ++hydra.run.dir=/data/mlperf_storage_results/training/flux/run/20260402_152454 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=16291 ++workload.dataset.data_folder=/nvme/flux --config-dir=/shared/apps/mlperf/lib64/python3.11/site-packages/configs/dlio
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/nvme/flux'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 16291
[OUTPUT]   record_length  = 65536
[OUTPUT]   generate_data  = False
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = False
[OUTPUT]   epochs         = 1
[OUTPUT]   batch_size     = 48
[OUTPUT] 2026-04-02T15:25:01.978936 Running DLIO [Training] with 8 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-04-02T15:25:02.109255 Max steps per epoch: 12218 = 288 * 16291 / 48 / 8 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-02T15:25:05.403497 Starting epoch 1: 12218 steps expected
[OUTPUT] 2026-04-02T15:25:05.403962 Starting block 1

However, DLIO itself warns that the dataset is smaller than the host memory.

On disk, the generated data is indeed not 9459.42 GB, but only about 256 GB.
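The discrepancy can be reproduced from the values DLIO prints (record_length = 65536 bytes, 288 samples per file): datagen writes roughly 307 GB, in the same range as the ~256 GB observed on disk, while the 9459.42 GB reported by datasize would only be reached with a sample of roughly 2 MB. A back-of-the-envelope sketch, using only values from the logs above:

```python
# Back-of-the-envelope check of the size mismatch, using only values
# printed in the logs above (nothing read from the mlpstorage source).

num_files = 16291
samples_per_file = 288        # from the "Max steps per epoch" line
record_length = 65536         # bytes, from the DLIO config dump
reported_total_gb = 9459.42   # from the datasize output

# Size actually written by datagen with a 64 KiB record length.
actual_gb = num_files * samples_per_file * record_length / 1e9  # ~307.5 GB

# Sample size the datasize estimate would need to reach 9459.42 GB:
# ~2.0 MB per sample, ~30x the record_length datagen actually uses.
implied_record = reported_total_gb * 1e9 / (num_files * samples_per_file)

print(f"actual ~{actual_gb:.1f} GB, implied sample ~{implied_record / 1e6:.2f} MB")
```

This suggests datasize and datagen disagree on the per-sample size for flux: the estimate behaves as if each sample were about 2 MB, while the generator uses the 64 KiB record_length shown in the config dump.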
