When calculating the dataset size for flux, there appears to be a mismatch between the number of files calculated by the datasize command and the actual amount of data generated by the datagen command.
Running datasize:
(mlperf) [hpcadmin@ccw-alma-htc-2 ~]$ mlpstorage training datasize --model=flux -cm 1892 --max-accelerators 10 -g b200 --file --num-client-hosts 1 --hosts ccw-alma-htc-2
Setting attr from max_accelerators to 10
Hosts is: ['ccw-alma-htc-2']
Hosts is: ['ccw-alma-htc-2']
⠹ Validating environment... 0:00:002026-04-02 15:47:00|INFO: Environment validation passed
2026-04-02 15:47:00|STATUS: Benchmark results directory: /tmp/mlperf_storage_results/training/flux/datasize/20260402_154659
⠦ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
2026-04-02 15:47:01|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-02 15:47:01|RESULT: Number of training files: 16291
2026-04-02 15:47:01|RESULT: Number of training subfolders: 0
2026-04-02 15:47:01|RESULT: Total disk space required for training: 9459.42 GB
2026-04-02 15:47:01|WARNING: The number of files required may be excessive for some filesystems. You can use the num_subfolders_train parameter to shard the dataset. To keep near 10,000 files per folder use "1x" subfolders by adding "--param dataset.num_subfolders_train=1"
2026-04-02 15:47:01|RESULT: Run the following command to generate data:
mlpstorage training datagen --hosts=ccw-alma-htc-2 --model=flux --exec-type=mpi --param dataset.num_files_train=16291 --num-processes=10 --results-dir=/tmp/mlperf_storage_results --data-dir=<INSERT_DATA_DIR>
2026-04-02 15:47:01|WARNING: The parameter for --num-processes is the same as --max-accelerators. Adjust the value according to your system.
2026-04-02 15:47:02|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/training/flux/datasize/20260402_154659/training_20260402_154659_metadata.json
Then, running the benchmark with the generated data, the run is allowed to proceed even with --closed:
mlpstorage training run --model=flux --param dataset.num_files_train=16291 -g b200 -cm 1892 --closed --file --results-dir=/data/mlperf_storage_results -na 8 --data-dir=/nvme/flux --host ccw-alma-htc-2
Setting attr from num_accelerators to 8
Hosts is: ['ccw-alma-htc-2']
Hosts is: ['ccw-alma-htc-2']
⠸ Validating environment... 0:00:002026-04-02 15:24:55|INFO: Environment validation passed
2026-04-02 15:24:55|STATUS: Benchmark results directory: /data/mlperf_storage_results/training/flux/run/20260402_152454
2026-04-02 15:24:55|INFO: Created benchmark run: training_run_flux_20260402_152454
2026-04-02 15:24:55|STATUS: Verifying benchmark run for training_run_flux_20260402_152454
2026-04-02 15:24:55|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-02 15:24:55|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 16291 (Parameter: Overrode Parameters)
2026-04-02 15:24:55|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='training', command='run', model='flux', run_datetime='20260402_152454')])
⠦ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
2026-04-02 15:24:56|STATUS: Running benchmark command:: mpirun -n 8 -host ccw-alma-htc-2:8 --bind-to none --map-by socket /shared/apps/mlperf/bin/dlio_benchmark workload=flux_b200 ++hydra.run.dir=/data/mlperf_storage_results/training/flux/run/20260402_152454 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=16291 ++workload.dataset.data_folder=/nvme/flux --config-dir=/shared/apps/mlperf/lib64/python3.11/site-packages/configs/dlio
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] storage_type = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT] storage_root = './'
[OUTPUT] storage_options= None
[OUTPUT] data_folder = '/nvme/flux'
[OUTPUT] framework = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT] num_files_train= 16291
[OUTPUT] record_length = 65536
[OUTPUT] generate_data = False
[OUTPUT] do_train = True
[OUTPUT] do_checkpoint = False
[OUTPUT] epochs = 1
[OUTPUT] batch_size = 48
[OUTPUT] 2026-04-02T15:25:01.978936 Running DLIO [Training] with 8 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-04-02T15:25:02.109255 Max steps per epoch: 12218 = 288 * 16291 / 48 / 8 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-02T15:25:05.403497 Starting epoch 1: 12218 steps expected
[OUTPUT] 2026-04-02T15:25:05.403962 Starting block 1
However, DLIO warns that the dataset is too small. On disk, the generated data is in fact only 256 GB, not the 9459.42 GB reported by datasize.
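A quick sanity check using the values from the DLIO config dump above suggests the generated size is roughly consistent with num_files_train × samples-per-file × record_length, and nowhere near the datasize estimate. This assumes each sample occupies exactly record_length bytes on disk, which may not hold exactly for the flux workload:

```python
# Back-of-the-envelope check of the on-disk size implied by the DLIO log.
# Values are taken from the run output above; the assumption that each
# sample is exactly record_length bytes is ours, not from the tool.
num_files_train = 16291
samples_per_file = 288   # from "12218 = 288 * 16291 / 48 / 8" in the log
record_length = 65536    # bytes, from the DLIO config dump

total_bytes = num_files_train * samples_per_file * record_length
print(f"Implied dataset size: {total_bytes / 1e9:.1f} GB")
# → Implied dataset size: 307.5 GB
```

That is in the same ballpark as the ~256 GB observed on disk, but about 30x smaller than the 9459.42 GB that datasize reports, so the two commands appear to disagree on the per-file size for flux.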