This repository contains multiple implementations of a simple Neural Network (NN) trained on the MNIST handwritten digits dataset. The purpose is to explore different parallel computing models and optimizations, from serial CPU execution to advanced GPU and OpenACC acceleration.
Repository layout:

```
.
├── data/                      # Contains MNIST dataset files
│   ├── t10k-images.idx3-ubyte
│   ├── t10k-labels.idx1-ubyte
│   ├── train-images.idx3-ubyte
│   └── train-labels.idx1-ubyte
│
├── src/
│   ├── V1/                    # Serial CPU implementation
│   │   ├── nn.c
│   │   └── Makefile
│   ├── V2/                    # Naive GPU (CUDA) implementation
│   │   ├── nn.cu
│   │   └── Makefile
│   ├── V3/                    # Optimized GPU (CUDA) implementation
│   │   ├── nn.cu
│   │   └── Makefile
│   ├── V4/                    # Tensor Core + batch processing (CUDA)
│   │   ├── nn.cu
│   │   └── Makefile
│   └── V5/                    # OpenACC implementation
│       ├── nn.cpp
│       └── Makefile
│
└── README.md
```
The goal is to evaluate and benchmark different parallel programming models for neural network training on the MNIST dataset (a kernel sketch illustrating the naive GPU approach follows this list):
- Serial execution (CPU)
- CUDA-based GPU acceleration (naive and optimized)
- Tensor Core acceleration with batching
- OpenACC-based GPU parallelism
Requirements:
- Linux/Unix-based OS
- CUDA Toolkit (for V2, V3, and V4)
- NVIDIA GPU (Tensor Cores recommended for V4; a Tensor Core sketch follows this list)
- OpenACC compiler (e.g., the NVIDIA HPC SDK, formerly PGI) for V5
- GCC for the serial version (V1)
- Make utility
Place the following MNIST files inside the data/ directory:

- train-images.idx3-ubyte
- train-labels.idx1-ubyte
- t10k-images.idx3-ubyte
- t10k-labels.idx1-ubyte
For example, to build and run the serial implementation (V1):

```bash
cd src/V1
make
./run
```