Add OpenMP support for Nvidia hardware #999
base: master

Conversation
…, started work on enter and exit data, compiles
…, exit data, and update
…e, add mappers to derived types, change how allocate is done
…types, removed rest of pure functions, fix issue with acoustic on nvfortran
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
High-level Suggestion
Refactor large, duplicated code blocks in files like m_igr.fpp and m_riemann_solvers.fpp to improve maintainability before adding OpenMP support. This could involve using macros or parameterized subroutines to handle logic that varies by direction. [High-level, importance: 7]
Solution Walkthrough:
Before:
```fortran
subroutine s_igr_source_terms(idir, ...)
    if (idir == 1) then
        ! Massive block of code for x-direction
        #:call GPU_PARALLEL_LOOP(...)
            do l = 0, p
                do k = 0, n
                    do j = -1, m
                        ! ... hundreds of lines of calculations for x-fluxes
                    end do
                end do
            end do
        #:endcall GPU_PARALLEL_LOOP
    else if (idir == 2) then
        ! Massive block of code for y-direction (very similar to x-direction)
        ...
    else if (idir == 3) then
        ! Massive block of code for z-direction (very similar to x-direction)
        ...
    end if
end subroutine
```
After:
```fortran
subroutine s_compute_directional_terms(norm_dir, ...)
    ! Generic logic for any direction
    #:call GPU_PARALLEL_LOOP(...)
        do l = ..., ...
            do k = ..., ...
                do j = ..., ...
                    ! ... hundreds of lines of calculations using norm_dir
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP
end subroutine

subroutine s_igr_source_terms(idir, ...)
    call s_compute_directional_terms(idir, ...)
end subroutine
```fortran
$:GPU_LOOP(parallelism='[seq]')
do i = 1, strxe - strxb + 1
    tau_e_L(i) = qL_prim_rs${XYZ}$_vf(j, k, l, strxb - 1 + i)
    tau_e_R(i) = qR_prim_rs${XYZ}$_vf(j + 1, k, l, strxb - 1 + i)
    ! Elastic contribution to energy if G large enough
    !TODO take out if statement if stable without
    if ((G_L > 1000) .and. (G_R > 1000)) then
        E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
        E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        ! Double for shear stresses
        if (any(strxb - 1 + i == shear_indices)) then
            E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
            E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        end if
    end if
end if
end do
end if
end do
```
Suggestion: Optimize the GPU loop by replacing the any intrinsic function with a pre-calculated boolean array for a more efficient lookup of shear indices. [general, importance: 7]
Existing code:
```fortran
$:GPU_LOOP(parallelism='[seq]')
do i = 1, strxe - strxb + 1
    tau_e_L(i) = qL_prim_rs${XYZ}$_vf(j, k, l, strxb - 1 + i)
    tau_e_R(i) = qR_prim_rs${XYZ}$_vf(j + 1, k, l, strxb - 1 + i)
    ! Elastic contribution to energy if G large enough
    !TODO take out if statement if stable without
    if ((G_L > 1000) .and. (G_R > 1000)) then
        E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
        E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        ! Double for shear stresses
        if (any(strxb - 1 + i == shear_indices)) then
            E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
            E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        end if
    end if
end if
end do
end if
end do
```

Suggested change:
```fortran
logical :: is_shear(strxe - strxb + 1)

is_shear = .false.
do ii = 1, size(shear_indices)
    if (shear_indices(ii) >= strxb .and. shear_indices(ii) <= strxe) then
        is_shear(shear_indices(ii) - strxb + 1) = .true.
    end if
end do

$:GPU_LOOP(parallelism='[seq]')
do i = 1, strxe - strxb + 1
    tau_e_L(i) = qL_prim_rs${XYZ}$_vf(j, k, l, strxb - 1 + i)
    tau_e_R(i) = qR_prim_rs${XYZ}$_vf(j + 1, k, l, strxb - 1 + i)
    ! Elastic contribution to energy if G large enough
    !TODO take out if statement if stable without
    if ((G_L > 1000) .and. (G_R > 1000)) then
        E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
        E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        ! Double for shear stresses
        if (is_shear(i)) then
            E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
            E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        end if
    end if
end do
```
```fortran
jac_sf(1)%sf => jac
$:GPU_ENTER_DATA(copyin='[jac_sf(1)%sf]')
$:GPU_ENTER_DATA(attach='[jac_sf(1)%sf]')
```
Suggestion: Remove the copyin directive for jac_sf(1)%sf to avoid overwriting the GPU-initialized jac array with uninitialized data from the host. [possible issue, importance: 9]
Existing code:
```fortran
jac_sf(1)%sf => jac
$:GPU_ENTER_DATA(copyin='[jac_sf(1)%sf]')
$:GPU_ENTER_DATA(attach='[jac_sf(1)%sf]')
```

Suggested change:
```fortran
jac_sf(1)%sf => jac
$:GPU_ENTER_DATA(attach='[jac_sf(1)%sf]')
```
This looks good from my side. It will obviously need a lot of changes for AMD and to stay up to date with openmp_rebased, but that can be done once this is merged.

Great. Can we confirm that there's no slowdown for OpenACC before merging? Benchmarking isn't working because the benchmark files have new names.

It also failed to build with CCE GPU because of parameters in declare target statements in chemistry.

Didn't it pass the test cases on Frontier?

The test suite doesn't use case optimization, so it doesn't have the same problem.

@prathi-wind does this seem ready to merge?
User description
Description
All test cases pass with OpenMP on Nvidia hardware.
IPO is causing 1D chemistry cases to fail, so I turned it off when compiling for OpenMP.
Fixes #(issue) [optional]
Type of change
Please delete options that are not relevant.
Scope
If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration
Test Configuration:
Checklist
- Documentation changes (docs/)
- Example cases in examples/ that demonstrate my new feature performing as expected. They run to completion and demonstrate "interesting physics"
- I ran ./mfc.sh format before committing my code

If your code changes any code source files (anything in src/simulation), to make sure the code is performing as expected on GPU devices, I have:

- Wrapped new code in nvtx ranges so that they can be identified in profiles
- Ran ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
- Ran ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR

PR Type
Enhancement

Description

- Add OpenMP support for Nvidia hardware alongside the existing OpenACC backend
- Refactor GPU parallel loop macros from $:GPU_PARALLEL_LOOP() syntax to #:call GPU_PARALLEL_LOOP() / #:endcall GPU_PARALLEL_LOOP block syntax for OpenMP compatibility
- Consolidate GPU macro infrastructure in parallel_macros.fpp to support both OpenACC and OpenMP backends with conditional compilation directives
- Remove the pure keyword from 20+ subroutines and functions across multiple modules to enable GPU parallelization (incompatible with OpenMP parallel regions)
- Extend NVTX profiling support to the OpenMP GPU backend by broadening preprocessor conditions from OpenACC-only to general GPU support
- Add an OpenMP configuration option to the CMake output display
- Update all simulation and post-processing modules to use the new macro syntax consistently
- Disable IPO (interprocedural optimization) for OpenMP compilation to prevent 1D chemistry test failures
- All test cases pass with OpenMP on Nvidia hardware
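The IPO workaround described above could look roughly like the following in CMake. This is a hedged sketch only, not the PR's actual diff: the target name `simulation` and the option name `MFC_OpenMP` are assumptions for illustration.

```cmake
# Sketch only (assumed target/option names): disable interprocedural
# optimization for the OpenMP GPU build, since IPO was observed to
# break 1D chemistry cases.
if (MFC_OpenMP)
  set_property(TARGET simulation
               PROPERTY INTERPROCEDURAL_OPTIMIZATION FALSE)
endif()
```

Scoping the property to one target keeps IPO enabled for builds where it is known to be safe.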
Diagram Walkthrough
File Walkthrough
9 files
m_viscous.fpp — Convert GPU parallel loop macros to OpenMP-compatible syntax
src/simulation/m_viscous.fpp
- Converted $:GPU_PARALLEL_LOOP to #:call GPU_PARALLEL_LOOP / #:endcall GPU_PARALLEL_LOOP syntax for OpenMP compatibility
- Loop bodies moved into the call/endcall structure, otherwise unchanged
- … and gradient reconstruction routines
m_derived_variables.fpp — Remove pure keyword from derived variable subroutines
src/post_process/m_derived_variables.fpp
- Removed the pure keyword from seven subroutine declarations to allow OpenMP parallelization
- Affected: s_derive_specific_heat_ratio, s_derive_liquid_stiffness, s_derive_sound_speed, s_derive_flux_limiter, s_solve_linear_system, s_derive_vorticity_component, and s_derive_qm
m_rhs.fpp — Refactor GPU parallel loop macro syntax for OpenMP support
src/simulation/m_rhs.fpp
- Converted $:GPU_PARALLEL_LOOP() syntax to #:call GPU_PARALLEL_LOOP() and #:endcall GPU_PARALLEL_LOOP syntax throughout the file
- Maintained existing loop structure; some multi-line statements consolidated to single lines for consistency
- Applies to loops handling flux calculations, RHS computations, and dimensional splitting operations
m_derived_variables.fpp — Update GPU macros and remove pure attribute for GPU support
src/simulation/m_derived_variables.fpp
- Converted $:GPU_PARALLEL_LOOP() macros to #:call GPU_PARALLEL_LOOP() and #:endcall GPU_PARALLEL_LOOP syntax in multiple loop regions
- Removed the pure keyword from s_derive_acceleration_component, f_convert_cyl_to_cart, and f_r to support GPU parallelization
- … branches for different grid geometries in some cases
m_icpp_patches.fpp — Remove pure attribute from coordinate conversion functions
src/pre_process/m_icpp_patches.fpp
- Removed the pure keyword from function f_convert_cyl_to_cart to enable GPU parallelization
- Removed the pure keyword from elemental function f_r and added a blank line after the function signature
m_cbc.fpp — Refactor GPU parallel macros to block syntax with AMD support
src/simulation/m_cbc.fpp
- Converted GPU_PARALLEL_LOOP macro calls from function-style syntax ($:GPU_PARALLEL_LOOP(...)) to block-style syntax (#:call GPU_PARALLEL_LOOP(...) ... #:endcall GPU_PARALLEL_LOOP)
- Added #:block DEF_AMD and #:block UNDEF_AMD to handle AMD compiler-specific code for molecular_weights vs molecular_weights_nonparameter
- Removed the pure keyword from subroutine s_any_cbc_boundaries to support GPU parallelization
- Maintained existing loop structure
parallel_macros.fpp — Consolidate GPU macros with OpenMP and OpenACC support
src/common/include/parallel_macros.fpp
- Consolidated macro infrastructure from shared_parallel_macros.fpp, omp_macros.fpp, and acc_macros.fpp
- GPU_PARALLEL, GPU_PARALLEL_LOOP, GPU_ROUTINE, GPU_DECLARE, GPU_LOOP, GPU_DATA, GPU_HOST_DATA, GPU_ENTER_DATA, GPU_EXIT_DATA, GPU_ATOMIC, GPU_UPDATE, and GPU_WAIT macros now support both OpenACC and OpenMP backends
- Conditional compilation (#if defined(MFC_OpenACC) / #elif defined(MFC_OpenMP)) selects the appropriate backend implementations
- Added USE_GPU_MODULE(), DEF_AMD(), UNDEF_CCE(), DEF_CCE(), UNDEF_NVIDIA() for compiler-specific code generation
- Removed inline generation helpers (GEN_* functions) in favor of delegating to backend-specific macro files
m_helper.fpp — Remove pure attribute for GPU parallelization support
src/common/m_helper.fpp
- Removed the pure keyword from 13 subroutines and functions to enable GPU parallelization compatibility
- Affected: s_comp_n_from_prim, s_comp_n_from_cons, s_transcoeff, s_int_to_str, s_swap, f_create_transform_matrix, s_transform_vec, s_transform_triangle, s_transform_model, f_create_bbox, f_xor, f_logical_to_int, unassociated_legendre, spherical_harmonic_func, associated_legendre, double_factorial, factorial

m_nvtx.f90 — Extend NVTX profiling support to OpenMP GPU backend
src/common/m_nvtx.f90
- Broadened preprocessor conditions from defined(MFC_OpenACC) && defined(__PGI) to defined(MFC_GPU) && defined(__PGI) in three locations
- … implementations on PGI/NVIDIA compilers
1 file

m_derived_types.fpp — Add trailing newline to module file
src/common/m_derived_types.fpp

1 file

configuration.cmake.in — Add OpenMP to CMake configuration display
toolchain/cmake/configuration.cmake.in
- Added the OpenMP configuration option to the CMake output display options

59 files