Table of Contents:
- Benchmarks
- Results
- Python: Loop vs Numpy (CPU)
- Python: Loop vs Numpy 2 (CPU)
- R: Loop vs Vectorized (CPU)
- Python: Loop vs Numpy vs Pandas (CPU)
- Julia: Loop vs Vector (CPU)
- Numpy vs Octave vs R vs Java vs Julia vs C++ (CPU)
- Python Vectorization: Numpy vs Deep Learning Frameworks (CPU)
- Numpy vs Deep Learning Frameworks (GPU and CPU)
- Deep Learning Frameworks GPU vs Loop CPU
- C++ Parallel APIs (CPU)
- C++ GPU (vs CPU)
- OpenCL vs PyOpenCL (CPU & GPU)
- PyCUDA vs C++ (GPU)
- Tensorflow: Python vs C++ (GPU)
- GPU Conclusion
- Linux Conclusion
- Windows Conclusion
- Conclusion
- Machine Specifications
The following benchmarks have been implemented:

| Benchmark | Description |
|---|---|
| C++ Bulk [gpu] | Bulk is another parallel algorithms library built on top of CUDA. It claims to have better scalability than Thrust. |
| C++ CUDA [gpu] | The NVidia CUDA toolkit is the base library for accessing GPUs. |
| C++ OCL [cpu] | OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. |
| C++ OCL [gpu] | OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. |
| C++ OMP [cpu] | OpenMP is an API specification for parallel programming. |
| C++ TensorFlow [gpu] | TensorFlow is a deep learning library from Google. |
| C++ Thrust [gpu] | NVidia Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust is included with the CUDA toolkit. |
| C++ cuBLAS [gpu] | NVidia cuBLAS is a fast GPU-accelerated implementation of the standard basic linear algebra subroutines (BLAS). |
| C++ loop [cpu] | Plain C++ for loop |
| Java loop [cpu] | Plain Java loop |
| Julia (loop) [cpu] | SIMD-optimized Julia loop. |
| Julia (vec) [cpu] | With Julia array operation. |
| Octave [cpu] | GNU Octave is a high-level language primarily intended for numerical computations. |
| Py CNTK [cpu] | CNTK is a deep learning library. |
| Py CNTK [gpu] | CNTK is a deep learning library. |
| Py MXNet [cpu] | MXNet is a deep learning library. |
| Py MXNet [gpu] | MXNet is a deep learning library. |
| Py Numpy [cpu] | With Python Numpy array. |
| Py Pandas [cpu] | With Python Pandas dataframe. |
| Py TensorFlow [cpu] | TensorFlow is a deep learning library. |
| Py TensorFlow [gpu] | TensorFlow is a deep learning library. |
| PyCUDA [gpu] | PyCUDA is a Python wrapper for CUDA. |
| PyOCL [cpu] | PyOpenCL is a Python wrapper for OpenCL. |
| PyOCL [gpu] | PyOpenCL is a Python wrapper for OpenCL. |
| Python loop [cpu] | Simple Python for loop. |
| R (array) [cpu] | With array in R, a free software environment for statistical computing and graphics. |
| R (data.frame) [cpu] | With data.frame in R, a free software environment for statistical computing and graphics. |
| R (data.table) [cpu] | With data.table in R, a free software environment for statistical computing and graphics. |
| R (loop) [cpu] | Simple loop in R, a free software environment for statistical computing and graphics. |
| R (matrix) [cpu] | With matrix in R, a free software environment for statistical computing and graphics. |
Comparison between simple Python loop and Numpy
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Python loop [cpu] (src/saxpy_loop.py)
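
For readers unfamiliar with the operation being benchmarked: SAXPY computes `y = a*x + y` over single-precision vectors. The following is a minimal sketch of the two variants compared here, not a copy of `src/saxpy_loop.py` or `src/saxpy_numpy.py`. The function names, the vector size `N`, and the input values are assumptions chosen for illustration.

```python
import time

import numpy as np


def saxpy_loop(a, x, y):
    """Pure-Python SAXPY: update y in place, one element at a time."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def saxpy_numpy(a, x, y):
    """Vectorized SAXPY on NumPy arrays, updating y in place."""
    y += a * x
    return y


N = 100000  # illustrative size, not the repository's setting
a = 2.0

# Plain Python lists for the loop version.
x_list = [1.0] * N
y_list = [3.0] * N

# float32 arrays for the NumPy version (the "S" in SAXPY is single precision).
x_arr = np.ones(N, dtype=np.float32)
y_arr = np.full(N, 3.0, dtype=np.float32)

t0 = time.perf_counter()
saxpy_loop(a, x_list, y_list)
t1 = time.perf_counter()
saxpy_numpy(np.float32(a), x_arr, y_arr)
t2 = time.perf_counter()

print("loop:  {:.6f} s".format(t1 - t0))
print("numpy: {:.6f} s".format(t2 - t1))
```

A single timed run like this is noisy; it only illustrates the shape of the comparison.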
Same as above, on both Linux and Windows
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Python loop [cpu] (src/saxpy_loop.py)
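
When running the same benchmark on more than one operating system, it helps to record the platform next to each timing. Below is a hedged sketch using only the standard library. The helper `bench`, the result keys, and the vector size are hypothetical and do not come from the repository.

```python
import platform
import timeit


def saxpy_loop(a, x, y):
    """Pure-Python SAXPY, updating y in place."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]


def bench(func, repeat=3):
    """Run func `repeat` times and report the best wall-clock time with the OS name."""
    times = timeit.repeat(func, number=1, repeat=repeat)
    return {"os": platform.system(), "best_seconds": min(times)}


N = 10000  # illustrative size
x = [1.0] * N
y = [3.0] * N

result = bench(lambda: saxpy_loop(2.0, x, y))
print(result)
```

Taking the minimum over several repeats reduces the influence of background activity on the measurement.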
Benchmarking various vectorization methods in R (array, matrix, data.frame, data.table) vs plain loop
- R (array) [cpu] (src/saxpy_array.R)
- R (data.frame) [cpu] (src/saxpy_dataframe.R)
- R (data.table) [cpu] (src/saxpy_datatable.R)
- R (loop) [cpu] (src/saxpy_loop.R)
- R (matrix) [cpu] (src/saxpy_matrix.R)
Benchmarking the performance of Numpy vs Pandas (vs plain Python loop)
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Py Pandas [cpu] (src/saxpy_pandas.py)
- Python loop [cpu] (src/saxpy_loop.py)
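
As an illustration of what a Pandas variant of SAXPY can look like, here is a minimal sketch that keeps `x` and `y` as columns of one dataframe. This is an assumption about the general approach, not the contents of `src/saxpy_pandas.py`; the column names and the size `N` are made up for the example.

```python
import numpy as np
import pandas as pd

N = 100000  # illustrative size, not the repository's setting
a = np.float32(2.0)

# Hold both vectors as float32 columns of a single dataframe.
df = pd.DataFrame({
    "x": np.ones(N, dtype=np.float32),
    "y": np.full(N, 3.0, dtype=np.float32),
})

# Column-wise SAXPY: y = a*x + y, evaluated on whole columns at once.
df["y"] = a * df["x"] + df["y"]

print(df["y"].head())
```

The arithmetic is still vectorized, but it goes through the Pandas Series layer, which adds overhead compared with operating on the raw Numpy arrays.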
Comparing the performance of Julia loop vs Julia vector/array (vs C++)
- C++ loop [cpu] (src/saxpy_cpu.cpp)
- Julia (loop) [cpu] (src/saxpy_loop.jl)
- Julia (vec) [cpu] (src/saxpy_array.jl)
Comparing the performance of SAXPY in different programming languages
- C++ loop [cpu] (src/saxpy_cpu.cpp)
- Java loop [cpu] (src/SaxpyLoop.java)
- Julia (loop) [cpu] (src/saxpy_loop.jl)
- Julia (vec) [cpu] (src/saxpy_array.jl)
- Octave [cpu] (src/saxpy.m)
- Py Numpy [cpu] (src/saxpy_numpy.py)
- R (array) [cpu] (src/saxpy_array.R)
SAXPY array operation in Numpy vs machine learning frameworks such as TensorFlow, MXNet, and CNTK. Only tested on Linux.
Note: the CNTK result is way off and I am not sure why. Please have a look at the source code; I may have done something wrong.
- Py CNTK [cpu] (src/saxpy_cntk.py)
- Py MXNet [cpu] (src/saxpy_mxnet.py)
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Py TensorFlow [cpu] (src/saxpy_tf.py)
Same as above, but on GPU as well
- Py CNTK [cpu] (src/saxpy_cntk.py)
- Py CNTK [gpu] (src/saxpy_cntk.py)
- Py MXNet [cpu] (src/saxpy_mxnet.py)
- Py MXNet [gpu] (src/saxpy_mxnet.py)
- Py Numpy [cpu] (src/saxpy_numpy.py)
- Py TensorFlow [cpu] (src/saxpy_tf.py)
- Py TensorFlow [gpu] (src/saxpy_tf.py)
Comparing frameworks running on GPU with naive C++ loop running on CPU.
- C++ loop [cpu] (src/saxpy_cpu.cpp)
- Py CNTK [gpu] (src/saxpy_cntk.py)
- Py MXNet [gpu] (src/saxpy_mxnet.py)
- Py TensorFlow [gpu] (src/saxpy_tf.py)
Comparing naive C++ loop with several parallel programming APIs (OpenCL and OpenMP) on CPU.
- C++ OCL [cpu] (src/saxpy_ocl1.cpp)
- C++ OMP [cpu] (src/saxpy_omp.cpp)
- C++ loop [cpu] (src/saxpy_cpu.cpp)
Comparing various C++ GPU libraries (CUDA, OpenCL, Thrust, Bulk, cuBLAS)
- C++ Bulk [gpu] (src/saxpy_bulk.cpp)
- C++ CUDA [gpu] (src/saxpy_cuda.cpp)
- C++ OCL [gpu] (src/saxpy_ocl1.cpp)
- C++ Thrust [gpu] (src/saxpy_trust.cpp)
- C++ cuBLAS [gpu] (src/saxpy_cublas.cpp)
- C++ loop [cpu] (src/saxpy_cpu.cpp)
Comparing C++ OpenCL with PyOpenCL, the OpenCL Python wrapper.
- C++ OCL [cpu] (src/saxpy_ocl1.cpp)
- C++ OCL [gpu] (src/saxpy_ocl1.cpp)
- PyOCL [cpu] (src/saxpy_pyocl.py)
- PyOCL [gpu] (src/saxpy_pyocl.py)
Comparing PyCUDA (Python CUDA wrapper) with native C++ CUDA GPU
- C++ CUDA [gpu] (src/saxpy_cuda.cpp)
- PyCUDA [gpu] (src/saxpy_pycuda.py)
Comparing TensorFlow C++ and Python performance
- C++ TensorFlow [gpu] (src/saxpy_tf.cc)
- Py TensorFlow [gpu] (src/saxpy_tf.py)
Benchmarking various GPU APIs (only on Linux since it has the most APIs)
Excluded from this chart:
- Python loop [cpu] (src/saxpy_loop.py)
- R (loop) [cpu] (src/saxpy_loop.R)
Excluded from this chart:
- Python loop [cpu] (src/saxpy_loop.py)
- R (loop) [cpu] (src/saxpy_loop.R)
- C++ TensorFlow [gpu] (src/saxpy_tf.cc)
- Py CNTK [gpu] (src/saxpy_cntk.py)
- Py CNTK [cpu] (src/saxpy_cntk.py)
Excluded from this chart:
- Python loop [cpu] (src/saxpy_loop.py)
- R (loop) [cpu] (src/saxpy_loop.R)
- C++ TensorFlow [gpu] (src/saxpy_tf.cc)
- Py CNTK [gpu] (src/saxpy_cntk.py)
- Py CNTK [cpu] (src/saxpy_cntk.py)
Note: same machine as Windows below (dual-boot)
| System | Intel i7-6700 CPU @ 3.40GHz, 16GB RAM, 4 cores / 8 threads (HT) |
| OS | Ubuntu Linux 16.04 64bit |
| GPU | NVidia GeForce GTX 1080 8GB |
| C++ Compiler | g++ 5.4.0 |
| Python3 | 3.5.2 64bit |
| TensorFlow | TensorFlow 1.4 (GPU) |
| CUDA | CUDA 9.0.61, CudNN7 |
| OpenCL | Khronos OpenCL header 1.2, Intel OpenCL driver 16.1.1, NVidia OpenCL 1.2 driver |
| PyOpenCL | version 2015.1 |
| Octave | version 4.0.0 64bit |
| R | version 3.2.3 64bit |
| MXNet | mxnet-cu90 (0.12.1) |
| CNTK | CNTK 2.3.1 (CUDA-8, CudNN6) |
Note: same machine as Linux above (dual-boot)
| System | Intel i7-6700 CPU @ 3.40GHz, 16GB RAM, 4 cores / 8 threads (HT) |
| OS | Windows 10 64bit |
| GPU | NVidia GeForce GTX 1080 8GB |
| C++ Compiler | Visual Studio 2015 C++ compiler 64bit version |
| Python | 2.7.12 64bit |
| Python3 | 3.5.3 64bit |
| TensorFlow | TensorFlow 1.4 (GPU) |
| CUDA | Version 8.0.61 |
| OpenCL | Intel OpenCL SDK Version 7.0.0.2519, OpenCL from CUDA SDK |
| PyOpenCL | version 2017.2 |
| Octave | version 4.2.1 64bit |
| R | version 3.4.2 64bit |