Skip to content

Commit 77afa5e

Browse files
committed
CUDA support for ImageBufAlgo (experimental and very incomplete)
First stab at this, it's experimental, the general organization may change as we extend it. * To get these features, you must build with `USE_CUDA=1`, in which case it will look for Cuda toolkit. For simplicity, we're setting a version floor of Cuda 7.0 and sm_30. * To enable at runtime (duh, still only if you built with Cuda support enabled), you can either set `OIIO::attribute("cuda",1)` or use the magic environment variable `OPENIMAGEIO_CUDA=1`. When running oiiotool, the command line argument `--cuda` turns the attribut on (or cheat with the aforementioned env variable). * When the attribute is set, ImageBuf of "local" (not ImageCache-backed) float (no other data types yet) buffers will allocate and free with cudaMallocManaged/cudaFree (other cases will use the usual malloc/free). We are thus heavily leveraging Unified Memory, never do any explicit copying of data back and forth. * Certain ImageBufAlgo functions, then, have the options of calling Cuda implementations when all the stars align -- Cuda support enabled, Cuda turned on, the ImageBufs in question all have local storage that was allocated as visible to Cuda, the buffers are all float, and other restrctions to just the most common cases (all image inputs have identical ROIs, etc.). * Implemented this for IBA::add() and sub() initially. Will extend to other operations in the future and as the need arises. Results and discussion: Perf: add and sub operations on 1920x1080 3 channel float images, on my workstation (16 core Xeon Silver 4110, it's ISA is AVX-512 but I'm only compiling for SSE4.2 support at the moment) runs in about 20ms single threaded, ~3.8ms multithreaded. With Cuda enabled (NVIDIA Quadro P5000, Pascal architecture), I am getting about 12ms (i.e., moderately faster than single core, quite a bit slower than fully using all the CPU cores). Now, this is not an especially good case for GPU -- the compute-to-memory ratio is very poor, just a single math op for every 12 bytes of transfer on or off the GPU. When I contrive to do an example with about 10x more math per pixel, the Cuda times are approximately equal to the CPU times when I take advantage of all the CPU cores. Maybe it only helps if we do a bunch of IBA operations in a row before needing the results. Maybe it's only worth Cuda-accelerating the most expensive operations (resize, area ops, etc.), but we'll never get gain from something simple like add? If anybody can point out ways in which I'm being very wasteful, please do let me know! Even after we flesh out many more image operations to be Cuda-accelerated, and even we see an improvement in all cases over CPU, I don't expect people to see much practical improvement in a typical oiiotool command line, since disk/network to read input images and write results are almost certain to dominate runtime, compared to the math. But if you have a program that's doing a whole bunch of repeated image math via IBA calls themselves, that's where the bigger payoff is going to be, I think. Note that CUDA is extremely finicky about what compilers it can use, with an especially narrow idea of which "host compiler" is required by each version of the Cuda Toolkit/nvcc. I'm still working through those issues, and am considering the merits of compiling the cuda itself with clang (if available) rather than nvcc, just to ease up on these requirements. We'll be making the rest of the build issues more robust over time as well.
1 parent c1dbc6c commit 77afa5e

File tree

15 files changed

+504
-11
lines changed

15 files changed

+504
-11
lines changed

Makefile

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -239,6 +239,10 @@ ifneq (${BUILD_OIIOUTIL_ONLY},)
239239
MY_CMAKE_FLAGS += -DBUILD_OIIOUTIL_ONLY:BOOL=${BUILD_OIIOUTIL_ONLY}
240240
endif
241241

242+
ifneq (${USE_CUDA},)
243+
MY_CMAKE_FLAGS += -DUSE_CUDA:BOOL=${USE_CUDA}
244+
endif
245+
242246
ifdef DEBUG
243247
MY_CMAKE_FLAGS += -DCMAKE_BUILD_TYPE:STRING=Debug
244248
endif
@@ -510,6 +514,7 @@ help:
510514
@echo " USE_OPENCV=0 Skip anything that needs OpenCV"
511515
@echo " USE_PTEX=0 Skip anything that needs PTex"
512516
@echo " USE_FREETYPE=0 Skip anything that needs Freetype"
517+
@echo " USE_CUDA=1 Build NVIDIA CUDA support (if found)"
513518
@echo " OIIO build-time options:"
514519
@echo " INSTALL_PREFIX=path Set installation prefix (default: ./${INSTALL_PREFIX_BRIEF})"
515520
@echo " NAMESPACE=name Override namespace base name (default: OpenImageIO)"

site/spi/Makefile-bits-arnold

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -120,6 +120,11 @@ ifeq (${SP_OS}, rhel7)
120120
-DOPENEXR_CUSTOM_LIB_DIR=/usr/lib64/OpenEXR2
121121
endif
122122

123+
# CUDA customizations
124+
MY_CMAKE_FLAGS += \
125+
-DCUDA_TOOLKIT_ROOT_DIR=/net/soft_scratch/apps/arnold/tools/nvidia/cuda9.1 \
126+
-DCUDA_HOST_COMPILER=/shots/spi/home/lib/arnold/rhel7/llvm_4.0_final/bin/clang++
127+
123128
MY_CMAKE_FLAGS += \
124129
-DOCIO_PATH=${OCIO_PATH} \
125130
-DFIELD3D_HOME=${FIELD3D_HOME} \

src/cmake/compiler.cmake

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,8 @@ set (CLANG_TIDY_CHECKS "-*" CACHE STRING "clang-tidy checks to perform")
2222
set (CLANG_TIDY_ARGS "" CACHE STRING "clang-tidy args")
2323
option (CLANG_TIDY_FIX "Have clang-tidy fix source" OFF)
2424
set (GLIBCXX_USE_CXX11_ABI "" CACHE STRING "For gcc, use the new C++11 library ABI (0|1)")
25+
option (USE_CUDA "Use CUDA if found" OFF)
26+
set (CUDA_TARGET_ARCH "sm_30" CACHE STRING "CUDA GPU architecture (e.g. sm_35)")
2527

2628
# Figure out which compiler we're using
2729
if (CMAKE_COMPILER_IS_GNUCC)

src/cmake/externalpackages.cmake

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
# When not in VERBOSE mode, try to make things as quiet as possible
55
if (NOT VERBOSE)
66
set (Boost_FIND_QUIETLY true)
7+
set (CUDA_FIND_QUIETLY true)
78
set (DCMTK_FIND_QUIETLY true)
89
set (FFmpeg_FIND_QUIETLY true)
910
set (Field3D_FIND_QUIETLY true)
@@ -624,4 +625,27 @@ endmacro()
624625

625626

626627
###########################################################################
628+
if (USE_CUDA)
629+
if (NOT CUDA_TOOLKIT_ROOT_DIR AND NOT $ENV{CUDA_TOOLKIT_ROOT_DIR} STREQUAL "")
630+
set (CUDA_TOOLKIT_ROOT_DIR $ENV{CUDA_TOOLKIT_ROOT_DIR})
631+
endif ()
632+
if (NOT CUDA_FIND_QUIETLY)
633+
message (STATUS "CUDA_TOOLKIT_ROOT_DIR = ${CUDA_TOOLKIT_ROOT_DIR}")
634+
endif ()
635+
set (CUDA_PROPAGATE_HOST_FLAGS ON)
636+
set (CUDA_VERBOSE_BUILD ${VERBOSE})
637+
find_package (CUDA 7.0 REQUIRED)
638+
list (APPEND CUDA_NVCC_FLAGS ${CSTD_FLAGS} -expt-relaxed-constexpr)
639+
set (CUDA_INCLUDE_DIR ${CUDA_TOOLKIT_ROOT_DIR}/include)
640+
message (STATUS "CUDA version = ${CUDA_VERSION}")
641+
if (NOT CUDA_FIND_QUIETLY)
642+
message (STATUS "CUDA includes = ${CUDA_INCLUDE_DIR}")
643+
message (STATUS "CUDA libraries = ${CUDA_LIBRARIES}")
644+
message (STATUS "CUDA host compiler = ${CUDA_HOST_COMPILER}")
645+
message (STATUS "CUDA nvcc flags = ${CUDA_NVCC_FLAGS}")
646+
endif ()
647+
endif ()
648+
649+
# end Cuda
650+
###########################################################################
627651

src/include/OpenImageIO/imagebuf.h

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -183,6 +183,9 @@ class OIIO_API ImageBuf {
183183
/// Which type of storage is being used for the pixels?
184184
IBStorage storage () const;
185185

186+
/// Is the pixel memory of this ImageBuf visible to Cuda?
187+
bool cuda_storage () const;
188+
186189
/// Is this ImageBuf object initialized?
187190
bool initialized () const;
188191

src/include/OpenImageIO/platform.h

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -314,8 +314,10 @@
314314
// OIIO_HOSTDEVICE is used to supply the function decorators needed when
315315
// compiling for CUDA devices.
316316
#ifdef __CUDACC__
317+
# define OIIO_HOST __host__
317318
# define OIIO_HOSTDEVICE __host__ __device__
318319
#else
320+
# define OIIO_HOST
319321
# define OIIO_HOSTDEVICE
320322
#endif
321323

src/libOpenImageIO/CMakeLists.txt

Lines changed: 22 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,12 @@ endif ()
44
file (TO_NATIVE_PATH "${PLUGIN_SEARCH_PATH}" PLUGIN_SEARCH_PATH_NATIVE)
55
configure_file (imageio_pvt.h.in "${CMAKE_BINARY_DIR}/include/imageio_pvt.h" @ONLY)
66

7+
if (BUILDSTATIC)
8+
set (OIIO_LIB_TYPE "STATIC")
9+
else ()
10+
set (OIIO_LIB_TYPE "SHARED")
11+
endif ()
12+
713
file (GLOB libOpenImageIO_hdrs ../include/OpenImageIO/*.h)
814

915
if (NOT USE_EXTERNAL_PUGIXML)
@@ -47,6 +53,7 @@ set (libOpenImageIO_srcs
4753
imageoutput.cpp iptc.cpp xmp.cpp
4854
color_ocio.cpp
4955
maketexture.cpp
56+
imageio_cuda.cpp
5057
../libutil/argparse.cpp
5158
../libutil/benchmark.cpp
5259
../libutil/errorhandler.cpp
@@ -73,6 +80,13 @@ set (libOpenImageIO_srcs
7380
${libOpenImageIO_hdrs}
7481
)
7582

83+
#set (cuda_using_srcs imagebufalgo_addsub.cpp)
84+
85+
if (USE_CUDA)
86+
file (GLOB gpu_source_files "*.cu")
87+
message (STATUS "Extra cuda files: ${gpu_source_files}")
88+
list (APPEND libOpenImageIO_srcs ${gpu_source_files})
89+
endif ()
7690

7791
# If the 'EMBEDPLUGINS' option is set, we want to compile the source for
7892
# all the plugins into libOpenImageIO.
@@ -100,10 +114,12 @@ endif ()
100114
source_group ("libutil" REGULAR_EXPRESSION ".+/libutil/.+")
101115
source_group ("libtexture" REGULAR_EXPRESSION ".+/libtexture/.+")
102116

103-
if (BUILDSTATIC)
104-
add_library (OpenImageIO STATIC ${libOpenImageIO_srcs})
117+
if (USE_CUDA)
118+
add_definitions ("-DOIIO_USE_CUDA=1")
119+
cuda_add_library (OpenImageIO ${OIIO_LIB_TYPE} ${libOpenImageIO_srcs}
120+
OPTIONS -arch ${CUDA_TARGET_ARCH})
105121
else ()
106-
add_library (OpenImageIO SHARED ${libOpenImageIO_srcs})
122+
add_library (OpenImageIO ${OIIO_LIB_TYPE} ${libOpenImageIO_srcs})
107123
endif ()
108124

109125
# if (SANITIZE AND ${CMAKE_SYSTEM_NAME} STREQUAL "Linux")
@@ -206,7 +222,9 @@ if (EXTRA_DSO_LINK_ARGS)
206222
set_target_properties (OpenImageIO PROPERTIES LINK_FLAGS ${EXTRA_DSO_LINK_ARGS})
207223
endif()
208224

209-
oiio_install_targets (OpenImageIO)
225+
oiio_install_targets (OpenImageIO
226+
#${OIIO_CUDA_LIB}
227+
)
210228

211229

212230
# Testing

src/libOpenImageIO/imagebuf.cpp

Lines changed: 44 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -144,6 +144,7 @@ class ImageBufImpl {
144144
void append_error (const std::string& message) const;
145145

146146
ImageBuf::IBStorage storage () const { return m_storage; }
147+
bool cuda_storage () const { return m_cuda_storage; }
147148

148149
TypeDesc pixeltype () const {
149150
validate_spec ();
@@ -194,6 +195,7 @@ class ImageBufImpl {
194195
m_current_subimage, m_current_miplevel);
195196
}
196197

198+
// Make sure the pixels are ready to read with an iterator.
197199
bool validate_pixels () const {
198200
if (m_pixels_valid)
199201
return true;
@@ -259,6 +261,7 @@ class ImageBufImpl {
259261
mutable spin_mutex m_valid_mutex;
260262
mutable bool m_spec_valid; ///< Is the spec valid
261263
mutable bool m_pixels_valid; ///< Image is valid
264+
mutable bool m_cuda_storage = false; ///< Is the pixel memory visible to Cuda?
262265
bool m_badfile; ///< File not found
263266
float m_pixelaspect; ///< Pixel aspect ratio of the image
264267
size_t m_pixel_bytes;
@@ -373,6 +376,7 @@ ImageBufImpl::ImageBufImpl (const ImageBufImpl &src)
373376
// Source had the image fully in memory (no cache)
374377
if (m_storage == ImageBuf::APPBUFFER) {
375378
// Source just wrapped the client app's pixels, we do the same
379+
m_allocated_size = src.m_localpixels ? src.spec().image_bytes() : 0;
376380
m_localpixels = src.m_localpixels;
377381
} else {
378382
// We own our pixels -- copy from source
@@ -498,7 +502,28 @@ ImageBufImpl::new_pixels (size_t size, const void *data)
498502
if (m_allocated_size)
499503
free_pixels();
500504
m_allocated_size = size;
501-
m_pixels.reset (size ? new char [size] : nullptr);
505+
m_cuda_storage = false;
506+
if (size) {
507+
#ifdef OIIO_USE_CUDA
508+
if (OIIO::get_int_attribute("cuda") && m_spec.format == TypeFloat) {
509+
char *cudaptr = (char *)OIIO::pvt::cuda_malloc (m_allocated_size);
510+
if (cudaptr) {
511+
OIIO::debug ("IB Cuda allocated %p\n", (void*)cudaptr);
512+
m_pixels.reset (cudaptr);
513+
m_cuda_storage = true;
514+
}
515+
else {
516+
OIIO::debug ("Requested cudaMallocManaged of %s FAILED\n",
517+
m_allocated_size);
518+
}
519+
}
520+
#endif
521+
if (! m_pixels) {
522+
m_pixels.reset (new char [size]);
523+
}
524+
} else {
525+
m_pixels.reset ();
526+
}
502527
IB_local_mem_current += m_allocated_size;
503528
if (data && size)
504529
memcpy (m_pixels.get(), data, size);
@@ -515,12 +540,21 @@ void
515540
ImageBufImpl::free_pixels ()
516541
{
517542
IB_local_mem_current -= m_allocated_size;
543+
#if OIIO_USE_CUDA
544+
if (m_cuda_storage) {
545+
OIIO::debug ("IB Cuda free %p\n", (void*)m_pixels.get());
546+
OIIO::pvt::cuda_free (m_pixels.release());
547+
m_cuda_storage = false;
548+
}
549+
#endif
518550
m_pixels.reset ();
519551
if (m_allocated_size && pvt::oiio_print_debug > 1)
520552
OIIO::debug ("IB freed %d MB, global IB memory now %d MB\n",
521553
m_allocated_size>>20, IB_local_mem_current>>20);
522554
m_allocated_size = 0;
523555
m_storage = ImageBuf::UNINITIALIZED;
556+
m_localpixels = nullptr;
557+
m_pixels_valid = false;
524558
}
525559

526560

@@ -577,6 +611,14 @@ ImageBuf::storage () const
577611

578612

579613

614+
bool
615+
ImageBuf::cuda_storage () const
616+
{
617+
return impl()->cuda_storage ();
618+
}
619+
620+
621+
580622
void
581623
ImageBufImpl::clear ()
582624
{
@@ -588,10 +630,8 @@ ImageBufImpl::clear ()
588630
m_current_miplevel = -1;
589631
m_spec = ImageSpec ();
590632
m_nativespec = ImageSpec ();
591-
m_pixels.reset ();
592-
m_localpixels = NULL;
633+
release_pixels ();
593634
m_spec_valid = false;
594-
m_pixels_valid = false;
595635
m_badfile = false;
596636
m_pixelaspect = 1;
597637
m_pixel_bytes = 0;
@@ -700,8 +740,6 @@ ImageBufImpl::realloc ()
700740
m_channel_bytes = m_spec.format.size();
701741
m_blackpixel.resize (round_to_multiple (m_pixel_bytes, OIIO_SIMD_MAX_SIZE_BYTES), 0);
702742
// NB make it big enough for SSE
703-
if (m_allocated_size)
704-
m_pixels_valid = true;
705743
if (m_spec.deep) {
706744
m_deepdata.init (m_spec);
707745
m_storage = ImageBuf::LOCALBUFFER;
@@ -2440,5 +2478,4 @@ ImageBuf::retile (int x, int y, int z, ImageCache::Tile* &tile,
24402478
}
24412479

24422480

2443-
24442481
OIIO_NAMESPACE_END

src/libOpenImageIO/imagebufalgo_addsub.cpp

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,7 @@
4444
#include <OpenImageIO/deepdata.h>
4545
#include <OpenImageIO/dassert.h>
4646
#include "imageio_pvt.h"
47+
#include "imagebufalgo_cuda.h"
4748

4849

4950

@@ -121,6 +122,15 @@ ImageBufAlgo::add (ImageBuf &dst, Image_or_Const A_, Image_or_Const B_,
121122
return false;
122123
ROI origroi = roi;
123124
roi.chend = std::min (roi.chend, std::min (A.nchannels(), B.nchannels()));
125+
126+
#ifdef OIIO_USE_CUDA
127+
if (dst.cuda_storage() && A.cuda_storage() && B.cuda_storage() &&
128+
dst.roi() == roi && A.roi() == roi && B.roi() == roi) {
129+
return pvt::add_impl_cuda (dst, A, B, dst.roi());
130+
}
131+
// make >/dev/null && OPENIMAGEIO_LOG_TIMES=2 oiiotool -cuda -frames 1-1 -pattern fill:topleft=0,0,0:topright=0.5,0,0:bottomleft=0,0.5,0:bottomright=0.5,0.5,0.5 1920x1080 3 -pattern fill:topleft=0.5,0,0:topright=0,0.5,0:bottomleft=0.5,0.5,0.5:bottomright=0,.5,.5 1920x1080 3 -add -o out.exr
132+
#endif
133+
124134
bool ok;
125135
OIIO_DISPATCH_COMMON_TYPES3 (ok, "add", add_impl, dst.spec().format,
126136
A.spec().format, B.spec().format,
@@ -209,6 +219,14 @@ ImageBufAlgo::sub (ImageBuf &dst, Image_or_Const A_, Image_or_Const B_,
209219
return false;
210220
ROI origroi = roi;
211221
roi.chend = std::min (roi.chend, std::min (A.nchannels(), B.nchannels()));
222+
223+
#ifdef OIIO_USE_CUDA
224+
if (dst.cuda_storage() && A.cuda_storage() && B.cuda_storage() &&
225+
dst.roi() == roi && A.roi() == roi && B.roi() == roi) {
226+
return pvt::sub_impl_cuda (dst, A, B, dst.roi());
227+
}
228+
#endif
229+
212230
bool ok;
213231
OIIO_DISPATCH_COMMON_TYPES3 (ok, "sub", sub_impl, dst.spec().format,
214232
A.spec().format, B.spec().format,

0 commit comments

Comments
 (0)