This tutorial shows how we can use the map clause with the target data directive to copy data to and from the GPU.
The CPU and GPU have separate memory spaces. When we want the GPU to access data allocated on the CPU (or vice versa), we need to copy the data from one memory to the other. By default OpenMP will copy all variables within lexical scope to and from the device. Notwithstanding, in certain cases we need to tell OpenMP which data we want copied.
The code below performs a vector addition. The target directive has been used to offload the vector addition task to the GPU.
%%writefile vec_add.cpp
#include<iostream>
#include<omp.h>
using namespace std;
int main() {
unsigned N = 10000;
float A[N];
float B[N];
float C[N];
for (unsigned i = 0; i < N; i++) {
A[i] = i * 2.17;
B[i] = i * 3.14;
C[i] = 0;
}
#pragma omp target
{
#pragma omp parallel for
for (unsigned i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
cout << "Computation Done!" << endl;
// verify results
for (unsigned i = 1; i < 2; i++)
cout << "C[1] = " << C[i] << endl;
return 0;
}Overwriting vec_add.cpp
Let's compile and run this code.
!g++ -o vec_add -fopenmp vec_add.cpp -fno-stack-protector -foffload=nvptx-none -fopenmp!perf stat ./vec_addComputation Done!
C[1] = 5.31
Performance counter stats for './vec_add':
224.66 msec task-clock # 0.850 CPUs utilized
58 context-switches # 0.258 K/sec
0 cpu-migrations # 0.000 K/sec
3,472 page-faults # 0.015 M/sec
629,850,790 cycles # 2.804 GHz
606,730,402 instructions # 0.96 insn per cycle
128,837,039 branches # 573.471 M/sec
2,773,679 branch-misses # 2.15% of all branches
0.264281673 seconds time elapsed
0.068586000 seconds user
0.157345000 seconds sys
The example code uses static allocation for the array A, B, and C. This is very limiting and will rarely appear in practice. Let's modify the code to do dynamic allocation of A. We will save this version of the code as vec_add_dynamic.cpp
%%writefile vec_add_dynamic.cpp
#include<iostream>
#include<omp.h>
using namespace std;
int main() {
unsigned N = 10000;
float *A = (float *) malloc(sizeof(float) * N);
float B[N];
float C[N];
for (unsigned i = 0; i < N; i++) {
A[i] = i * 2.17;
B[i] = i * 3.14;
C[i] = 0;
}
#pragma omp target
{
#pragma omp parallel for
for (unsigned i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
cout << "Computation Done!" << endl;
// verify results
for (unsigned i = 1; i < 2; i++)
cout << "C[1] = " << C[i] << endl;
return 0;
}Overwriting vec_add_dynamic.cpp
Let's compile and run this version.
!g++ -o vec_add -fopenmp vec_add_dynamic.cpp -fno-stack-protector -foffload=nvptx-none -fopenmp!./vec_addlibgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
What happened? The error message indicates that the GPU kernel is trying to access data that has not been allocated to GPU memory. Why did OpenMP not copy the A array? By default, OpenMP will copy, both to and from the device, all scalar variables and static arrays in scope. However, it will not copy dynamically allocated data. (The OpenMP 4.5 specs is a little unclear about this).
So, we need to tell OpenMP to create copy 'A' to device. Essentially, this is telling OpenMP that we are doing dynamic memory allocation, so make sure there is enough space. And this is why we are getting the error.
The map clause is used to explicitly map data to device memory. map takes a list of variables as its arugment and maps them to device memory. An optional qualifier can be specified to control how data is mapped. More on this latrer. For this example, we want to map A to device memory. When mapping dyanamically allocated data, the number of elements that need to be mapped must also be specified. Bad things will happen otheriwse.
%%writefile vec_add_dynamic.cpp
#include<iostream>
#include<omp.h>
using namespace std;
int main() {
unsigned N = 10000;
float *A = (float *) malloc(sizeof(float) * N);
float B[N];
float C[N];
for (unsigned i = 0; i < N; i++) {
A[i] = i * 2.17;
B[i] = i * 3.14;
}
#pragma omp target map(A[0:N])
{
#pragma omp parallel for
for (unsigned i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
cout << "Computation Done!" << endl;
// verify results
for (unsigned i = 1; i < 2; i++)
cout << "C[1] = " << C[i] << endl;
return 0;
}Overwriting vec_add_dynamic.cpp
Let's try out the corrected version.
!g++ -o vec_add -fopenmp vec_add_dynamic.cpp -fno-stack-protector -foffload=nvptx-none -fopenmp! /usr/local/cuda/bin/nvprof ./vec_add 2>&1 | grep main GPU activities: 97.64% 1.6623ms 1 1.6623ms 1.6623ms 1.6623ms main$_omp_fn$0
Now, let's allocate B and C in dynamic memory and add the appropiate map clauses. Note, we are still relying on OpenMP to implicitly map N
%%writefile vec_add_dynamic.cpp
#include<iostream>
#include<omp.h>
using namespace std;
int main() {
unsigned N = 10000;
float *A = (float *) malloc(sizeof(float) * N);
float *B = (float *) malloc(sizeof(float) * N);
float *C = (float *) malloc(sizeof(float) * N);
for (unsigned i = 0; i < N; i++) {
A[i] = i * 2.17;
B[i] = i * 3.14;
}
#pragma omp target map(A[0:N],B[0:N],C[0:N])
{
#pragma omp parallel for
for (unsigned i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
cout << "Computation Done!" << endl;
// verify results
for (unsigned i = 1; i < 2; i++)
cout << "C[1] = " << C[i] << endl;
return 0;
}Overwriting vec_add_dynamic.cpp
!g++ -o vec_add -fopenmp vec_add_dynamic.cpp -fno-stack-protector -foffload=nvptx-none -fopenmp!/usr/local/cuda/bin/nvprof ./vec_add 2>&1 | grep "main\|HtoD\|DtoH" GPU activities: 98.70% 1.6634ms 1 1.6634ms 1.6634ms 1.6634ms main$_omp_fn$0
0.96% 16.097us 3 5.3650us 832ns 7.6800us [CUDA memcpy HtoD]
0.34% 5.7930us 1 5.7930us 5.7930us 5.7930us [CUDA memcpy DtoH]
0.02% 60.233us 3 20.077us 12.680us 25.180us cuMemcpyHtoD
0.02% 51.392us 1 51.392us 51.392us 51.392us cuMemcpyDtoH
Be default OpenMP will copy all mapped data from the CPU to the GPU at the beginning of the offloaded and then copy everything back at the end of the taks. This may lead to many unnecessary copies. We can optimize this behavior with modifiers in the map clause.
The map clause accepts a modifier that allows us to specify the direction of data movement. In the code below, the says that the A and B arrays should be copied to device memory when executing the offloaded task while the C array should be copied from device memory to host memory. For the vector add computation, this makes sense. The initialized values in A and B are copied to the GPU. We do not need to copy them back since the GPU doesn't modify these arrays. On the other hand, we do not need to copy C to GPU but we did not to copy it back to the CPU to access the values updated by the GPU.
If there is data that is both read from and written to by the GPU, we can just use tofrom modifier.
%%writefile vec_add_dynamic.cpp
#include<iostream>
#include<omp.h>
using namespace std;
int main() {
unsigned N = 10000;
float *A = (float *) malloc(sizeof(float) * N);
float *B = (float *) malloc(sizeof(float) * N);
float *C = (float *) malloc(sizeof(float) * N);
for (unsigned i = 0; i < N; i++) {
A[i] = i * 2.17;
B[i] = i * 3.14;
}
#pragma omp target map(to:A[0:N],B[0:N]) map(from:C[0:N])
{
#pragma omp parallel for
for (unsigned i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
cout << "Computation Done!" << endl;
// verify results
for (unsigned i = 1; i < 2; i++)
cout << "C[1] = " << C[i] << endl;
return 0;
}Overwriting vec_add_dynamic.cpp
!g++ -o vec_add -fopenmp vec_add_dynamic.cpp -fno-stack-protector -foffload=nvptx-none -fopenmp!/usr/local/cuda/bin/nvprof ./vec_add 2>&1 | grep "main" GPU activities: 98.70% 1.6626ms 1 1.6626ms 1.6626ms 1.6626ms main$_omp_fn$0
In this tutorial we saw how we can use the map clause to copy data to and from the GPU device when executing an offloaded task. The map clause is necessary whenever we are accessing dynamically allocated data. For static data structures and scalar variables, OpenMP will do the mapping implicitly.