Mismatch between datasize and real size for model flux #305

@wolfgang-desalvador

Description

When calculating the data size for the flux model, there is apparently a mismatch between the dataset size computed by the datasize command and the actual size generated by the datagen command.

Running datasize:

(mlperf) [hpcadmin@ccw-alma-htc-2 ~]$ mlpstorage training datasize --model=flux  -cm 1892 --max-accelerators 10 -g b200 --file --num-client-hosts 1 --hosts ccw-alma-htc-2
Setting attr from max_accelerators to 10
Hosts is: ['ccw-alma-htc-2']
Hosts is: ['ccw-alma-htc-2']
⠹ Validating environment... 0:00:002026-04-02 15:47:00|INFO: Environment validation passed
2026-04-02 15:47:00|STATUS: Benchmark results directory: /tmp/mlperf_storage_results/training/flux/datasize/20260402_154659
⠦ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
2026-04-02 15:47:01|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-02 15:47:01|RESULT: Number of training files: 16291
2026-04-02 15:47:01|RESULT: Number of training subfolders: 0
2026-04-02 15:47:01|RESULT: Total disk space required for training: 9459.42 GB
2026-04-02 15:47:01|WARNING: The number of files required may be excessive for some filesystems. You can use the num_subfolders_train parameter to shard the dataset. To keep near 10,000 files per folder use "1x" subfolders by adding "--param dataset.num_subfolders_train=1"
2026-04-02 15:47:01|RESULT: Run the following command to generate data: 
mlpstorage training datagen --hosts=ccw-alma-htc-2 --model=flux --exec-type=mpi --param dataset.num_files_train=16291 --num-processes=10 --results-dir=/tmp/mlperf_storage_results --data-dir=<INSERT_DATA_DIR>
2026-04-02 15:47:01|WARNING: The parameter for --num-processes is the same as --max-accelerators. Adjust the value according to your system.
2026-04-02 15:47:02|STATUS: Writing metadata for benchmark to: /tmp/mlperf_storage_results/training/flux/datasize/20260402_154659/training_20260402_154659_metadata.json
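For reference, the reported totals appear consistent with a dataset-to-memory ratio of 5x the client memory passed via -cm 1892; the 5x factor below is inferred from the numbers in the log, not read from the mlpstorage source:

```python
# Sketch of the datasize arithmetic implied by the log above.
# Assumption: the "dataset size to memory size ratio" rule requires the
# dataset to be at least 5x the client memory (-cm 1892 GB).

client_memory_gb = 1892
num_files = 16291
reported_total_gb = 9459.42

required_gb = 5 * client_memory_gb           # 9460 GB, matching the reported total
per_file_gb = reported_total_gb / num_files  # ~0.58 GB per file implied

print(f"required ~{required_gb} GB, implied file size ~{per_file_gb:.3f} GB")
```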

Then, running the benchmark with the generated data (even in the closed category), the benchmark is allowed to proceed:

mlpstorage training run --model=flux --param dataset.num_files_train=16291 -g b200 -cm 1892 --closed --file --results-dir=/data/mlperf_storage_results -na 8 --data-dir=/nvme/flux --host ccw-alma-htc-2
Setting attr from num_accelerators to 8
Hosts is: ['ccw-alma-htc-2']
Hosts is: ['ccw-alma-htc-2']
⠸ Validating environment... 0:00:002026-04-02 15:24:55|INFO: Environment validation passed
2026-04-02 15:24:55|STATUS: Benchmark results directory: /data/mlperf_storage_results/training/flux/run/20260402_152454
2026-04-02 15:24:55|INFO: Created benchmark run: training_run_flux_20260402_152454
2026-04-02 15:24:55|STATUS: Verifying benchmark run for training_run_flux_20260402_152454
2026-04-02 15:24:55|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-02 15:24:55|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 16291 (Parameter: Overrode Parameters)
2026-04-02 15:24:55|STATUS: Benchmark run qualifies for CLOSED category ([RunID(program='training', command='run', model='flux', run_datetime='20260402_152454')])
⠦ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
2026-04-02 15:24:56|STATUS: Running benchmark command:: mpirun -n 8 -host ccw-alma-htc-2:8 --bind-to none --map-by socket /shared/apps/mlperf/bin/dlio_benchmark workload=flux_b200 ++hydra.run.dir=/data/mlperf_storage_results/training/flux/run/20260402_152454 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=16291 ++workload.dataset.data_folder=/nvme/flux --config-dir=/shared/apps/mlperf/lib64/python3.11/site-packages/configs/dlio
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/nvme/flux'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 16291
[OUTPUT]   record_length  = 65536
[OUTPUT]   generate_data  = False
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = False
[OUTPUT]   epochs         = 1
[OUTPUT]   batch_size     = 48
[OUTPUT] 2026-04-02T15:25:01.978936 Running DLIO [Training] with 8 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-04-02T15:25:02.109255 Max steps per epoch: 12218 = 288 * 16291 / 48 / 8 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-02T15:25:05.403497 Starting epoch 1: 12218 steps expected
[OUTPUT] 2026-04-02T15:25:05.403962 Starting block 1

However, DLIO itself warns that the dataset is smaller than the host memory.

On disk, the generated data is indeed not 9459.42 GB, but only about 256 GB.
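The discrepancy can be reproduced from the values DLIO prints (record_length = 65536 bytes, 288 samples per file): datagen writes roughly 307 GB, in the same range as the ~256 GB observed on disk, while the 9459.42 GB reported by datasize would only be reached with a sample of roughly 2 MB. A back-of-the-envelope sketch, using only values from the logs above:

```python
# Back-of-the-envelope check of the size mismatch, using only values
# printed in the logs above (nothing read from the mlpstorage source).

num_files = 16291
samples_per_file = 288        # from the "Max steps per epoch" line
record_length = 65536         # bytes, from the DLIO config dump
reported_total_gb = 9459.42   # from the datasize output

# Size actually written by datagen with a 64 KiB record length.
actual_gb = num_files * samples_per_file * record_length / 1e9  # ~307.5 GB

# Sample size the datasize estimate would need to reach 9459.42 GB:
# ~2.0 MB per sample, ~30x the record_length datagen actually uses.
implied_record = reported_total_gb * 1e9 / (num_files * samples_per_file)

print(f"actual ~{actual_gb:.1f} GB, implied sample ~{implied_record / 1e6:.2f} MB")
```

This suggests datasize and datagen disagree on the per-sample size for flux: the estimate behaves as if each sample were about 2 MB, while the generator uses the 64 KiB record_length shown in the config dump.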
