
[Bug]: Images can fail to load, somehow causing illegal memory accesses and crashing training #1068

@O-J1

Description


What happened?

During caching of variations for epoch 2, OneTrainer failed to load images that are still accessible and are not corrupted. I confirmed after closing the trainer that this was the case (the images open fine and are valid).

What did you expect would happen?

That images which fail to load would either be skipped or retried, rather than causing an illegal memory access.
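
For illustration, a minimal sketch of the expected behavior, assuming a PIL-style loader like the one that prints the "could not load image" warning below. The retry/skip helper is hypothetical, not OneTrainer or mgds code:

    import time
    from PIL import Image

    def load_image_with_retry(path, retries=3, delay=1.0):
        """Hypothetical helper: retry a flaky load, return None so the caller can skip the sample."""
        for attempt in range(retries):
            try:
                with Image.open(path) as img:
                    img.load()  # force full decoding so errors surface here, not later
                    return img.convert("RGB")
            except OSError as e:
                print(f"could not load image (attempt {attempt + 1}/{retries}): {path}: {e}")
                time.sleep(delay)
        return None  # caller skips this sample instead of caching an invalid tensor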

Relevant log output

could not load image, it might be corrupted: E:/datasets/unsplash-research-dataset-lite-latest/photos\374ymy99Has.jpg
could not load image, it might be corrupted: E:/datasets/unsplash-research-dataset-lite-latest/photos\376N0UZpURk.jpg
caching:   6%|████▏                                                             | 1574/24727 [04:05<1:00:13,  6.41it/s]
epoch:  33%|███████████████████████▎                                              | 1/3 [3:12:46<6:25:33, 11566.89s/it]
Traceback (most recent call last):
  File "C:\repos\OneTrainer\modules\ui\TrainUI.py", line 758, in __training_thread_function
    trainer.train()
  File "C:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 633, in train
    self.data_loader.get_data_set().start_next_epoch()
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\MGDS.py", line 49, in start_next_epoch
    self.loading_pipeline.start_next_epoch()
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\LoadingPipeline.py", line 97, in start_next_epoch
    module.start(self.__current_epoch)
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\DiskCache.py", line 242, in start
    self.__refresh_cache(out_variation)
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\DiskCache.py", line 211, in __refresh_cache
    f.result()
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\DiskCache.py", line 198, in fn
    split_item[name] = self.__clone_for_cache(self._get_previous_item(in_variation, name, in_index))
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
    item = module.get_item(variation, index, item_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\SampleVAEDistribution.py", line 25, in get_item
    distribution = self._get_previous_item(variation, self.in_name, index)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
    item = module.get_item(variation, index, item_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\EncodeVAE.py", line 53, in get_item
    vae_output = self.vae.encode(image.unsqueeze(0))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\models\autoencoders\autoencoder_kl.py", line 278, in encode
    h = self._encode(x)
        ^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\models\autoencoders\autoencoder_kl.py", line 252, in _encode
    enc = self.encoder(x)
          ^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\models\autoencoders\vae.py", line 156, in forward
    sample = self.conv_in(sample)
             ^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\conv.py", line 549, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception in thread Thread-8 (__training_thread_function):
Traceback (most recent call last):
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "C:\repos\OneTrainer\modules\ui\TrainUI.py", line 765, in __training_thread_function
    trainer.end()
  File "C:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 824, in end
    self.model.to(self.temp_device)
  File "C:\repos\OneTrainer\modules\model\StableDiffusionXLModel.py", line 162, in to
    self.vae_to(device)
  File "C:\repos\OneTrainer\modules\model\StableDiffusionXLModel.py", line 131, in vae_to
    self.vae.to(device=device)
  File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\models\modeling_utils.py", line 1424, in to
    return super().to(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1355, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 915, in _apply
    module._apply(fn)
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 915, in _apply
    module._apply(fn)
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 942, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1341, in convert
    return t.to(
           ^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
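
As the error message itself suggests, re-running with CUDA_LAUNCH_BLOCKING=1 should make the reported stack trace point at the actual failing kernel rather than a later API call. A minimal sketch of setting it programmatically; the variable has to be in place before torch initializes CUDA (setting it in the shell before launching OneTrainer works equally well):

    # Sketch: force synchronous CUDA kernel launches for debugging.
    # Must run before any torch import / CUDA initialization.
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    import torch  # imported only after the environment variable is set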

Generate and upload debug_report.log

debug_report.log
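
As a side check, independent of OneTrainer, the two files flagged during caching can be verified offline. A small sketch using PIL, with the paths copied from the log above:

    from pathlib import Path
    from PIL import Image

    # Re-check the files the cache pass flagged as "might be corrupted".
    for name in ("374ymy99Has.jpg", "376N0UZpURk.jpg"):
        path = Path("E:/datasets/unsplash-research-dataset-lite-latest/photos") / name
        try:
            with Image.open(path) as img:
                img.verify()  # cheap integrity check of the encoded data
            with Image.open(path) as img:
                img.load()    # full decode, catches truncated files that verify() misses
            print(f"OK: {path}")
        except OSError as e:
            print(f"FAILED: {path}: {e}")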
