This is a ready-to-go template for training neural networks with PyTorch using DistributedDataParallel (DDP). The training and evaluation scheme is largely based on the official PyTorch ImageNet example.
The main purpose of this repository is to help beginners learn to train and test on multiple GPUs properly. It should also be helpful to non-specialists who want to take advantage of multiple GPUs but struggle to write a whole project from scratch.
Don't worry if you don't have reliable access to multiple GPUs. You can simply set NumOfGPU to 1 to run on a single GPU. Additionally, because some networks/injectors may not be compatible with DDP, scripts for a single GPU without DDP are also included.
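As a rough sketch of how the launch logic works (the names here are illustrative, not the template's actual code), the DDP scripts spawn one worker process per GPU, and setting NumOfGPU to 1 reduces everything to a single process:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

NumOfGPU = 2  # set to 1 to run on a single GPU

def worker(rank, world_size):
    # Each spawned process joins the same process group.
    # "nccl" works for Linux GPUs; Windows would need "gloo" instead.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    if NumOfGPU > 1:
        mp.spawn(worker, args=(NumOfGPU,), nprocs=NumOfGPU)
    else:
        worker(0, 1)  # world_size == 1; the repo also ships non-DDP scripts
```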
Currently, the code is still somewhat rigid, as it mainly targets classification tasks. You are free to modify it to suit your needs. An injector interface for custom metric calculation may be introduced in the future.
A simple workflow:
- Implement your dataset in Dataset.py (see the dataset sketch after this list)
- Implement your network in Network.py and the initialization in Train.py
- Set the configuration in Train.py/Test.py
- Run the script!
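For the dataset step, a class in Dataset.py would typically follow the standard torch.utils.data.Dataset interface; the sketch below uses synthetic tensors as a hypothetical stand-in for real data loading:

```python
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    """Hypothetical dataset to drop into Dataset.py."""

    def __init__(self, num_samples=1000, transform=None):
        # Replace these synthetic tensors with your own file-loading logic.
        self.data = torch.randn(num_samples, 3, 32, 32)
        self.labels = torch.randint(0, 10, (num_samples,))
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x, y = self.data[idx], self.labels[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, y
```

Under DDP, such a dataset is wrapped with a DistributedSampler so each process sees a distinct shard of the data.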
Main features:
- Scripts for single- and multi-GPU training;
- Dataloaders for MNIST, CIFAR-10/CIFAR-100, ImageNet and TinyImageNet;
- Simple path management for data and results;
- Logging and result saving;
- A detailed discussion about evaluation with DDP in Trainer.py (a brief sketch follows this list);
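On the evaluation point: under DDP each process only sees its own shard of the validation set, so per-rank metrics have to be aggregated before reporting (see Trainer.py for the repository's detailed discussion; note also that DistributedSampler pads the dataset to split it evenly across ranks, which can slightly skew counts). A minimal sketch of the usual all_reduce pattern, assuming a process group is already initialized:

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        pred = model(x).argmax(dim=1)
        correct += (pred == y).sum()
        total += y.numel()
    # Sum the per-rank counters so every process reports the global accuracy.
    dist.all_reduce(correct, op=dist.ReduceOp.SUM)
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    return (correct / total).item()
```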
The code was tested on CentOS Linux 7 and Windows 11 in an Anaconda environment with:
- python 3.10.12
- pytorch 2.0.1
- cuda-runtime 11.7.1
- torchvision 0.15.2
Distributed under the MIT License. See LICENSE for more information.