What happened?
During caching of variations for epoch 2, OneTrainer failed to load images that are accessible and not corrupted. I confirmed after closing the trainer that the files can still be opened and are valid.
What did you expect would happen?
I expected the images that failed to load to be skipped or retried, not for the run to crash with an illegal memory access. A rough sketch of the behaviour I had in mind is below.
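For illustration only, this is roughly the skip-or-retry loading I would expect during caching (a minimal sketch, not OneTrainer/mgds code; `load_image_with_retry` and the retry count are hypothetical):

```python
from PIL import Image


def load_image_with_retry(path, retries=2):
    """Try to open and fully decode an image; retry a few times,
    then return None so the caller can skip the sample instead of
    feeding an invalid item into the VAE encode step."""
    for attempt in range(retries + 1):
        try:
            with Image.open(path) as img:
                img.load()  # force a full decode to catch truncated files
                return img.convert("RGB")
        except OSError as e:
            print(f"could not load image (attempt {attempt + 1}): {path}: {e}")
    return None  # caller drops this sample from the current variation
```

The caching loop would then check for `None` and drop that sample from the variation rather than continuing with a bad item.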
Relevant log output
could not load image, it might be corrupted: E:/datasets/unsplash-research-dataset-lite-latest/photos\374ymy99Has.jpg
could not load image, it might be corrupted: E:/datasets/unsplash-research-dataset-lite-latest/photos\376N0UZpURk.jpg
caching: 6%|████▏ | 1574/24727 [04:05<1:00:13, 6.41it/s]
epoch: 33%|███████████████████████▎ | 1/3 [3:12:46<6:25:33, 11566.89s/it]
Traceback (most recent call last):
File "C:\repos\OneTrainer\modules\ui\TrainUI.py", line 758, in __training_thread_function
trainer.train()
File "C:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 633, in train
self.data_loader.get_data_set().start_next_epoch()
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\MGDS.py", line 49, in start_next_epoch
self.loading_pipeline.start_next_epoch()
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\LoadingPipeline.py", line 97, in start_next_epoch
module.start(self.__current_epoch)
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\DiskCache.py", line 242, in start
self.__refresh_cache(out_variation)
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\DiskCache.py", line 211, in __refresh_cache
f.result()
File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
raise self._exception
File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\DiskCache.py", line 198, in fn
split_item[name] = self.__clone_for_cache(self._get_previous_item(in_variation, name, in_index))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
item = module.get_item(variation, index, item_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\SampleVAEDistribution.py", line 25, in get_item
distribution = self._get_previous_item(variation, self.in_name, index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\PipelineModule.py", line 96, in _get_previous_item
item = module.get_item(variation, index, item_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\mgds\src\mgds\pipelineModules\EncodeVAE.py", line 53, in get_item
vae_output = self.vae.encode(image.unsqueeze(0))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\utils\accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\models\autoencoders\autoencoder_kl.py", line 278, in encode
h = self._encode(x)
^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\models\autoencoders\autoencoder_kl.py", line 252, in _encode
enc = self.encoder(x)
^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\models\autoencoders\vae.py", line 156, in forward
sample = self.conv_in(sample)
^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\conv.py", line 554, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\conv.py", line 549, in _conv_forward
return F.conv2d(
^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception in thread Thread-8 (__training_thread_function):
Traceback (most recent call last):
File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\threading.py", line 1075, in _bootstrap_inner
self.run()
File "C:\Users\redacted\AppData\Local\Programs\Python\Python312\Lib\threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "C:\repos\OneTrainer\modules\ui\TrainUI.py", line 765, in __training_thread_function
trainer.end()
File "C:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 824, in end
self.model.to(self.temp_device)
File "C:\repos\OneTrainer\modules\model\StableDiffusionXLModel.py", line 162, in to
self.vae_to(device)
File "C:\repos\OneTrainer\modules\model\StableDiffusionXLModel.py", line 131, in vae_to
self.vae.to(device=device)
File "C:\repos\OneTrainer\venv\src\diffusers\src\diffusers\models\modeling_utils.py", line 1424, in to
return super().to(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1355, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 915, in _apply
module._apply(fn)
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 915, in _apply
module._apply(fn)
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 942, in _apply
param_applied = fn(param)
^^^^^^^^^
File "C:\repos\OneTrainer\venv\Lib\site-packages\torch\nn\modules\module.py", line 1341, in convert
return t.to(
^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.