SAXPY CPU and GPGPU Benchmarks

Table of Contents:

Benchmarks
Results
Machine Specifications
- Ubuntu 16.04, NVidia GTX 1080
- Windows 10, NVidia GTX 1080

Benchmarks

The following benchmarks have been implemented:


C++ Bulk [gpu]	Bulk is yet another parallel algorithms library built on top of CUDA. It claims to have better scalability than Thrust.
C++ CUDA [gpu]	NVidia CUDA toolkit is the base library for accessing GPUs.
C++ OCL [cpu]	OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators.
C++ OCL [gpu]	OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators.
C++ OMP [cpu]	OpenMP is an API specification for shared-memory parallel programming.
C++ TensorFlow [gpu]	TensorFlow is a deep learning library from Google.
C++ Thrust [gpu]	NVidia Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust is included with the CUDA toolkit.
C++ cuBLAS [gpu]	NVidia cuBLAS is a fast GPU-accelerated implementation of the standard basic linear algebra subroutines (BLAS).
C++ loop [cpu]	Plain C++ `for` loop.
Java loop [cpu]	Plain Java loop.
Julia (loop) [cpu]	Julia loop with SIMD optimization.
Julia (vec) [cpu]	With Julia array operations.
Octave [cpu]	GNU Octave is a high-level language primarily intended for numerical computations.
Py CNTK [cpu]	CNTK is a deep learning library.
Py CNTK [gpu]	CNTK is a deep learning library.
Py MXNet [cpu]	MXNet is a deep learning library.
Py MXNet [gpu]	MXNet is a deep learning library.
Py Numpy [cpu]	With Python Numpy array.
Py Pandas [cpu]	With Python Pandas dataframe.
Py TensorFlow [cpu]	TensorFlow is a deep learning library.
Py TensorFlow [gpu]	TensorFlow is a deep learning library.
PyCUDA [gpu]	PyCUDA is a Python wrapper for CUDA.
PyOCL [cpu]	PyOpenCL is a Python wrapper for OpenCL.
PyOCL [gpu]	PyOpenCL is a Python wrapper for OpenCL.
Python loop [cpu]	Simple Python `for` loop.
R (array) [cpu]	With array in R, a free software environment for statistical computing and graphics.
R (data.frame) [cpu]	With `data.frame` in R, a free software environment for statistical computing and graphics.
R (data.table) [cpu]	With `data.table` in R, a free software environment for statistical computing and graphics.
R (loop) [cpu]	Simple loop in R, a free software environment for statistical computing and graphics.
R (matrix) [cpu]	With matrix in R, a free software environment for statistical computing and graphics.
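Every benchmark in the table computes the same kernel, SAXPY (Single-precision A·X Plus Y), which updates a vector in place as y[i] = a * x[i] + y[i] for every index i. A plain reference implementation is handy for checking that a new port produces the right numbers before its timing is trusted. The sketch below is not taken from the repository: the function names `saxpy_reference` and `matches_reference` and the tolerances are illustrative assumptions. Because Python floats are 64-bit while SAXPY is single precision, the comparison uses a tolerance rather than exact equality.

```python
import math
from typing import List, Sequence


def saxpy_reference(a: float, x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Return a new list holding a * x[i] + y[i]; the inputs are left untouched."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def matches_reference(a: float, x: Sequence[float], y_before: Sequence[float],
                      y_after: Sequence[float], rel_tol: float = 1e-6) -> bool:
    """Check a benchmark's output (y_after) against the 64-bit reference.

    The tolerances (assumed values) absorb float32 rounding in the
    implementation under test.
    """
    expected = saxpy_reference(a, x, y_before)
    if len(y_after) != len(expected):
        return False
    return all(math.isclose(got, want, rel_tol=rel_tol, abs_tol=1e-6)
               for got, want in zip(y_after, expected))


if __name__ == "__main__":
    # Hypothetical inputs, not the repository's benchmark sizes or values.
    print(saxpy_reference(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```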

Results

Python: Loop vs Numpy (CPU)

Comparison between a simple Python loop and Numpy.

Py Numpy [cpu] (src/saxpy_numpy.py)
Python loop [cpu] (src/saxpy_loop.py)
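The gap between these two Python benchmarks largely comes from where the per-element loop runs: the plain loop executes one interpreted iteration per element, while the Numpy version performs the whole update in compiled code. A minimal sketch of both styles, assuming float32 Numpy arrays and a hypothetical size of one million elements (the actual code lives in src/saxpy_loop.py and src/saxpy_numpy.py and may differ):

```python
import time

import numpy as np


def saxpy_loop(a, x, y):
    """Plain Python loop: one interpreted iteration per element, updates y in place."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]


def saxpy_numpy(a, x, y):
    """Vectorized update on Numpy arrays; the element loop runs in compiled code.

    a * x allocates a temporary array; += then writes the sum into y in place.
    """
    y += a * x


if __name__ == "__main__":
    n = 1_000_000  # hypothetical size
    a = 2.0
    x = np.ones(n, dtype=np.float32)
    y_loop = np.full(n, 3.0, dtype=np.float32)
    y_vec = y_loop.copy()

    t0 = time.perf_counter()
    saxpy_loop(a, x, y_loop)
    t1 = time.perf_counter()
    saxpy_numpy(a, x, y_vec)
    t2 = time.perf_counter()

    assert np.array_equal(y_loop, y_vec)  # both styles must agree before timings are compared
    print(f"Python loop: {t1 - t0:.4f} s")
    print(f"Numpy:       {t2 - t1:.4f} s")
```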

Python: Loop vs Numpy 2 (CPU)

Same as above, run on both Linux and Windows.

Py Numpy [cpu] (src/saxpy_numpy.py)
Python loop [cpu] (src/saxpy_loop.py)

R: Loop vs Vectorized (CPU)

Benchmarking various vectorization methods in R (array, matrix, data.frame, data.table) against a plain loop.

R (array) [cpu] (src/saxpy_array.R)
R (data.frame) [cpu] (src/saxpy_dataframe.R)
R (data.table) [cpu] (src/saxpy_datatable.R)
R (loop) [cpu] (src/saxpy_loop.R)
R (matrix) [cpu] (src/saxpy_matrix.R)

Python: Loop vs Numpy vs Pandas (CPU)

Benchmarking the performance of Numpy vs Pandas (vs a plain Python loop).

Py Numpy [cpu] (src/saxpy_numpy.py)
Py Pandas [cpu] (src/saxpy_pandas.py)
Python loop [cpu] (src/saxpy_loop.py)
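Numeric Pandas columns are backed by Numpy arrays, so column arithmetic is vectorized too, but each operation builds a new Series and aligns the operands on their index, which adds overhead compared with operating on bare arrays. A minimal sketch, assuming a DataFrame with hypothetical columns named `x` and `y` (src/saxpy_pandas.py may be organized differently):

```python
import numpy as np
import pandas as pd


def saxpy_pandas(a, df):
    """SAXPY on DataFrame columns; the column names 'x' and 'y' are assumptions."""
    df["y"] = a * df["x"] + df["y"]  # Series arithmetic, aligned on the index


def saxpy_pandas_via_numpy(a, df):
    """Same update, computed on the underlying arrays to skip index alignment."""
    df["y"] = a * df["x"].to_numpy() + df["y"].to_numpy()


if __name__ == "__main__":
    n = 1_000_000  # hypothetical size
    df = pd.DataFrame({
        "x": np.ones(n, dtype=np.float32),
        "y": np.full(n, 3.0, dtype=np.float32),
    })
    saxpy_pandas(2.0, df)
    print(df["y"].head(3).tolist())  # [5.0, 5.0, 5.0]
```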

Julia: Loop vs Vector (CPU)

Comparing the performance of a Julia loop vs Julia vector/array operations (vs C++).

C++ loop [cpu] (src/saxpy_cpu.cpp)
Julia (loop) [cpu] (src/saxpy_loop.jl)
Julia (vec) [cpu] (src/saxpy_array.jl)

Numpy vs Octave vs R vs Java vs Julia vs C++ (CPU)

Comparing the performance of SAXPY in different programming languages

C++ loop [cpu] (src/saxpy_cpu.cpp)
Java loop [cpu] (src/SaxpyLoop.java)
Julia (loop) [cpu] (src/saxpy_loop.jl)
Julia (vec) [cpu] (src/saxpy_array.jl)
Octave [cpu] (src/saxpy.m)
Py Numpy [cpu] (src/saxpy_numpy.py)
R (array) [cpu] (src/saxpy_array.R)
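Raw timings are easier to compare across languages and devices when converted into rates. Assuming float32 data, each SAXPY element costs two floating-point operations (one multiply, one add) and at least 12 bytes of memory traffic (read x[i], read y[i], write y[i], 4 bytes each). That is an arithmetic intensity of 2 / 12 ≈ 0.17 FLOP per byte, low enough that SAXPY is generally limited by memory bandwidth rather than by compute on both CPUs and GPUs. The helpers below are a sketch under those assumptions; they ignore cache effects and write-allocate traffic, and the function names are hypothetical.

```python
def saxpy_rates(n, seconds, bytes_per_element=4):
    """Convert one SAXPY pass over n elements into (GFLOP/s, GB/s).

    Counts 2 FLOPs and 3 memory accesses per element; bytes_per_element=4
    assumes float32 data.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * n
    traffic_bytes = 3 * n * bytes_per_element
    return flops / seconds / 1e9, traffic_bytes / seconds / 1e9


def arithmetic_intensity(bytes_per_element=4):
    """FLOPs per byte of compulsory memory traffic (1/6 for float32)."""
    return 2 / (3 * bytes_per_element)


if __name__ == "__main__":
    # Hypothetical measurement: 10 million elements in 5 milliseconds.
    gflops, gbytes = saxpy_rates(10_000_000, 0.005)
    print(f"{gflops:.1f} GFLOP/s, {gbytes:.1f} GB/s")  # 4.0 GFLOP/s, 24.0 GB/s
```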

Python Vectorization: Numpy vs Deep Learning Frameworks (CPU)

SAXPY array operation in Numpy vs machine learning frameworks such as TensorFlow, MXNet, and CNTK. Only tested on Linux.

Note: The CNTK result is way off and I am not sure why. Please have a look at the source code; maybe I did something wrong.

Py CNTK [cpu] (src/saxpy_cntk.py)
Py MXNet [cpu] (src/saxpy_mxnet.py)
Py Numpy [cpu] (src/saxpy_numpy.py)
Py TensorFlow [cpu] (src/saxpy_tf.py)
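When one framework's number looks implausible, it is worth checking what the timed region actually contains. Two generic pitfalls (offered as checks, not as a diagnosis of the CNTK result) are including one-time setup such as graph construction or data conversion in the measurement, and stopping the clock before asynchronously dispatched work has finished; MXNet, for example, executes NDArray operations asynchronously, and `mx.nd.waitall()` blocks until they complete. The harness below is a sketch with hypothetical callbacks: `build` does the untimed setup, `run` launches the kernel, and `fetch` converts the result to a Numpy array, which forces completion. Plain Numpy stands in for a framework in the usage example.

```python
import time

import numpy as np


def time_kernel(build, run, fetch, repeats=5):
    """Best-of-`repeats` wall time for fetch(run(state)), excluding build().

    build() -> state   one-time setup (untimed)
    run(state) -> res  launches the SAXPY computation
    fetch(res) -> arr  copies the result to a Numpy array; for asynchronous
                       frameworks this also waits for the work to finish
    """
    state = build()
    fetch(run(state))  # warm-up: absorbs first-call compilation or allocation
    best, result = float("inf"), None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fetch(run(state))
        best = min(best, time.perf_counter() - t0)
    return best, result


if __name__ == "__main__":
    n = 1_000_000  # hypothetical size
    a = np.float32(2.0)
    x = np.ones(n, dtype=np.float32)
    y = np.full(n, 3.0, dtype=np.float32)

    seconds, out = time_kernel(
        build=lambda: (a, x, y),
        run=lambda s: s[0] * s[1] + s[2],  # out-of-place, so y is unchanged across repeats
        fetch=np.asarray,
    )
    assert np.allclose(out, a * x + y)  # validate the result before trusting the timing
    print(f"best of 5: {seconds * 1e3:.3f} ms")
```

Because `fetch` sits inside the timed region, GPU timings taken this way include the device-to-host copy; timing only the kernel would mean replacing `fetch` with a synchronization call that does not copy data, a choice the benchmarks in this repository may or may not make.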

Numpy vs Deep Learning Frameworks (GPU and CPU)

Same as above, but on the GPU as well.

Py CNTK [cpu] (src/saxpy_cntk.py)
Py CNTK [gpu] (src/saxpy_cntk.py)
Py MXNet [cpu] (src/saxpy_mxnet.py)
Py MXNet [gpu] (src/saxpy_mxnet.py)
Py Numpy [cpu] (src/saxpy_numpy.py)
Py TensorFlow [cpu] (src/saxpy_tf.py)
Py TensorFlow [gpu] (src/saxpy_tf.py)

Deep Learning Frameworks GPU vs Loop CPU

Comparing frameworks running on GPU with naive C++ loop running on CPU.

C++ loop [cpu] (src/saxpy_cpu.cpp)
Py CNTK [gpu] (src/saxpy_cntk.py)
Py MXNet [gpu] (src/saxpy_mxnet.py)
Py TensorFlow [gpu] (src/saxpy_tf.py)

C++ Parallel APIs (CPU)

Comparing a naive C++ loop with two parallel programming APIs (OpenCL and OpenMP) on the CPU.

C++ OCL [cpu] (src/saxpy_ocl1.cpp)
C++ OMP [cpu] (src/saxpy_omp.cpp)
C++ loop [cpu] (src/saxpy_cpu.cpp)
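A common way to parallelize the SAXPY loop with OpenMP is a single `#pragma omp parallel for`; with `schedule(static)`, the iterations are split into contiguous chunks of roughly equal size, one per thread. The Python sketch below mimics that partitioning with a thread pool over Numpy slices. It is an analogue for illustration, not a translation of src/saxpy_omp.cpp, and `chunk_bounds` and `saxpy_threaded` are hypothetical names. Numpy releases the GIL inside its arithmetic loops, so the chunks can run concurrently, but because SAXPY is generally memory-bound, extra threads may stop helping once memory bandwidth is saturated.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunk_bounds(n, workers):
    """Split range(n) into `workers` contiguous (start, stop) chunks of near-equal size."""
    base, extra = divmod(n, workers)
    bounds, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)  # the first `extra` chunks get one more
        bounds.append((start, stop))
        start = stop
    return bounds


def saxpy_threaded(a, x, y, workers=4):
    """In-place SAXPY on float32 Numpy arrays, one contiguous chunk per thread."""
    def work(bounds):
        lo, hi = bounds
        y_chunk = y[lo:hi]          # a view: updating it updates y
        y_chunk += a * x[lo:hi]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, chunk_bounds(len(x), workers)))  # list() re-raises worker errors


if __name__ == "__main__":
    x = np.arange(10, dtype=np.float32)
    y = np.ones(10, dtype=np.float32)
    print(chunk_bounds(10, 3))  # [(0, 4), (4, 7), (7, 10)]
    saxpy_threaded(2.0, x, y, workers=3)
    print(y.tolist())           # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0]
```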

C++ GPU (vs CPU)

Comparing various C++ GPU libraries (CUDA, OpenCL, Thrust, Bulk, cuBLAS) against the naive C++ loop on the CPU.

C++ Bulk [gpu] (src/saxpy_bulk.cpp)
C++ CUDA [gpu] (src/saxpy_cuda.cpp)
C++ OCL [gpu] (src/saxpy_ocl1.cpp)
C++ Thrust [gpu] (src/saxpy_trust.cpp)
C++ cuBLAS [gpu] (src/saxpy_cublas.cpp)
C++ loop [cpu] (src/saxpy_cpu.cpp)
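On the GPU side, the common CUDA formulation of SAXPY launches one thread per element: each thread computes its global index i = blockIdx.x * blockDim.x + threadIdx.x and updates y[i] only if i < n, and the grid has ceil(n / blockDim.x) blocks so that every element is covered. The Python sketch below emulates that index mapping sequentially to make the arithmetic concrete. It does not run on a GPU, and the default block size of 256 is an assumption rather than the value used in src/saxpy_cuda.cpp.

```python
def num_blocks(n, block_size):
    """Grid size for one thread per element: ceil(n / block_size)."""
    return (n + block_size - 1) // block_size


def saxpy_cuda_style(a, x, y, block_size=256):
    """Sequential emulation of the one-thread-per-element CUDA SAXPY kernel:

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    """
    n = len(x)
    for block_idx in range(num_blocks(n, block_size)):
        for thread_idx in range(block_size):
            i = block_idx * block_size + thread_idx
            if i < n:  # threads past the end of the last block do nothing
                y[i] = a * x[i] + y[i]


if __name__ == "__main__":
    print(num_blocks(1_000_000, 256))  # 3907
```

With n = 1,000,000 and 256 threads per block, the last of the 3907 blocks has only 64 active threads; the `i < n` check keeps the other 192 from writing past the end of the arrays.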

OpenCL vs PyOpenCL (CPU & GPU)

Comparing C++ OpenCL with PyOpenCL, the OpenCL Python wrapper.

C++ OCL [cpu] (src/saxpy_ocl1.cpp)
C++ OCL [gpu] (src/saxpy_ocl1.cpp)
PyOCL [cpu] (src/saxpy_pyocl.py)
PyOCL [gpu] (src/saxpy_pyocl.py)

PyCUDA vs C++ (GPU)

Comparing PyCUDA (the Python CUDA wrapper) with native C++ CUDA on the GPU.

C++ CUDA [gpu] (src/saxpy_cuda.cpp)
PyCUDA [gpu] (src/saxpy_pycuda.py)

TensorFlow: Python vs C++ (GPU)

Comparing TensorFlow C++ and Python performance.

C++ TensorFlow [gpu] (src/saxpy_tf.cc)
Py TensorFlow [gpu] (src/saxpy_tf.py)

GPU Conclusion

Benchmarking various GPU APIs (Linux only, since it has the most APIs available).

Excluded from this chart:

Linux Conclusion

Excluded from this chart:

Python loop [cpu] (src/saxpy_loop.py)
R (loop) [cpu] (src/saxpy_loop.R)

Windows Conclusion

Excluded from this chart:

Python loop [cpu] (src/saxpy_loop.py)
R (loop) [cpu] (src/saxpy_loop.R)
C++ TensorFlow [gpu] (src/saxpy_tf.cc)
Py CNTK [gpu] (src/saxpy_cntk.py)
Py CNTK [cpu] (src/saxpy_cntk.py)

Conclusion

Excluded from this chart:

Python loop [cpu] (src/saxpy_loop.py)
R (loop) [cpu] (src/saxpy_loop.R)
C++ TensorFlow [gpu] (src/saxpy_tf.cc)
Py CNTK [gpu] (src/saxpy_cntk.py)
Py CNTK [cpu] (src/saxpy_cntk.py)

Machine Specifications

Ubuntu 16.04, NVidia GTX 1080

Note: same machine as Windows below (dual-boot)


System	Intel i7-6700 CPU @ 3.40GHz, 16GB RAM, 4x2 cores (HT)
OS	Ubuntu Linux 16.04 64bit
GPU	NVidia GeForce GTX 1080 8GB
C++ Compiler	g++ 5.4.0
Python3	3.5.2 64bit
TensorFlow	TensorFlow 1.4 (GPU)
CUDA	CUDA 9.0.61
	cuDNN 7
OpenCL	- Khronos OpenCL header 1.2
	- Intel OpenCL driver 16.1.1
	- NVidia OpenCL 1.2 driver
PyOpenCL	version 2015.1
Octave	version 4.0.0 64bit
R	version 3.2.3 64bit
MXNet	mxnet-cu90 (0.12.1)
CNTK	CNTK 2.3.1 (CUDA 8, cuDNN 6)

Windows 10, NVidia GTX 1080

Note: same machine as Linux above (dual-boot)


System	Intel i7-6700 CPU @ 3.40GHz, 16GB RAM, 4x2 cores (HT)
OS	Windows 10 64bit
GPU	NVidia GeForce GTX 1080 8GB
C++ Compiler	Visual Studio 2015 C++ compiler 64bit version
Python	2.7.12 64bit
Python3	3.5.3 64bit
TensorFlow	TensorFlow 1.4 (GPU)
CUDA	Version 8.0.61
OpenCL	- Intel OpenCL SDK Version 7.0.0.2519
	- OpenCL from CUDA SDK
PyOpenCL	version 2017.2
Octave	version 4.2.1 64bit
R	version 3.4.2 64bit

bennylp/saxpy-benchmark
