
Conversation

@prathi-wind (Collaborator) commented Sep 15, 2025

User description

Description

All test cases pass with OpenMP on Nvidia hardware.

IPO is causing 1D chemistry cases to fail, so I turned it off when compiling for OpenMP.

Fixes #(issue) [optional]

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration

  • Test A
  • Test B

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any code source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature in NVTX ranges so that it can be identified in profiles
  • Ran an Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Enhancement


Description

  • Add OpenMP support for Nvidia hardware alongside existing OpenACC backend

  • Refactor GPU parallel loop macros from $:GPU_PARALLEL_LOOP() syntax to #:call GPU_PARALLEL_LOOP() / #:endcall GPU_PARALLEL_LOOP block syntax for OpenMP compatibility (see the syntax sketch after this list)

  • Consolidate GPU macro infrastructure in parallel_macros.fpp to support both OpenACC and OpenMP backends with conditional compilation directives

  • Remove pure keyword from 20+ subroutines and functions across multiple modules to enable GPU parallelization (incompatible with OpenMP parallel regions)

  • Extend NVTX profiling support to OpenMP GPU backend by broadening preprocessor conditions from OpenACC-only to general GPU support

  • Add OpenMP configuration option to CMake output display

  • Update all simulation and post-processing modules to use new macro syntax consistently

  • Disable IPO (Interprocedural Optimization) for OpenMP compilation to prevent 1D chemistry test failures

  • All test cases pass with OpenMP on Nvidia hardware
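
For reference, the syntax change looks like this (a minimal sketch; the loop bounds and body are illustrative, borrowed from the review excerpts below):

    ! Old function-style Fypp macro, emitted directly before the loop:
    $:GPU_PARALLEL_LOOP(collapse=3, private='[inv_ds]')
    do q_loop = 0, p
        do l_loop = 0, n
            do k_loop = 0, m
                inv_ds = 1._wp/dx(k_loop)
            end do
        end do
    end do

    ! New block-style Fypp call, which wraps the loop so the macro can emit
    ! both an opening directive and, where a backend needs one, a closing
    ! directive after the loop:
    #:call GPU_PARALLEL_LOOP(collapse=3, private='[inv_ds]')
        do q_loop = 0, p
            do l_loop = 0, n
                do k_loop = 0, m
                    inv_ds = 1._wp/dx(k_loop)
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP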


Diagram Walkthrough

flowchart LR
  A["GPU Parallel Macros<br/>Old Syntax"] -->|Refactor| B["Unified Macro System<br/>parallel_macros.fpp"]
  B -->|Backend Selection| C["OpenACC Backend<br/>acc_macros.fpp"]
  B -->|Backend Selection| D["OpenMP Backend<br/>omp_macros.fpp"]
  E["Pure Functions<br/>Incompatible with GPU"] -->|Remove pure| F["GPU-Compatible<br/>Subroutines"]
  C -->|Compile| G["GPU Code<br/>Nvidia Hardware"]
  D -->|Compile| G
  F -->|Use| G

File Walkthrough

Relevant files
Enhancement
9 files
m_viscous.fpp
Convert GPU parallel loop macros to OpenMP-compatible syntax

src/simulation/m_viscous.fpp

  • Converted GPU parallel loop macros from $:GPU_PARALLEL_LOOP to #:call
    GPU_PARALLEL_LOOP / #:endcall GPU_PARALLEL_LOOP syntax for OpenMP
    compatibility
  • Updated indentation of loop bodies to align with new macro
    call/endcall structure
  • Maintained all computational logic and loop collapse parameters
    unchanged
  • Applied changes consistently across viscous stress tensor computation
    and gradient reconstruction routines
+788/-746
m_derived_variables.fpp
Remove pure keyword from derived variable subroutines       

src/post_process/m_derived_variables.fpp

  • Removed pure keyword from seven subroutine declarations to allow
    OpenMP parallelization
  • Affected subroutines: s_derive_specific_heat_ratio,
    s_derive_liquid_stiffness, s_derive_sound_speed,
    s_derive_flux_limiter, s_solve_linear_system,
    s_derive_vorticity_component, and s_derive_qm
  • Pure function restrictions are incompatible with OpenMP parallel
    regions
+7/-7     
m_rhs.fpp
Refactor GPU parallel loop macro syntax for OpenMP support

src/simulation/m_rhs.fpp

  • Converted GPU parallel loop macros from $:GPU_PARALLEL_LOOP() syntax
    to #:call GPU_PARALLEL_LOOP() and #:endcall GPU_PARALLEL_LOOP syntax
    throughout the file
  • Updated loop indentation to align with the new macro call/endcall
    structure
  • Reformatted private variable declarations in macro calls to be on
    single lines for consistency
  • Applied changes across approximately 50+ parallel loop regions
    handling flux calculations, RHS computations, and dimensional
    splitting operations
+572/-557
m_derived_variables.fpp
Update GPU macros and remove pure attribute for GPU support

src/simulation/m_derived_variables.fpp

  • Converted $:GPU_PARALLEL_LOOP() macros to #:call GPU_PARALLEL_LOOP()
    and #:endcall GPU_PARALLEL_LOOP syntax in multiple loop regions
  • Removed pure keyword from subroutines s_derive_acceleration_component,
    f_convert_cyl_to_cart, and f_r to support GPU parallelization
  • Updated loop indentation to match new macro call structure
  • Simplified acceleration component calculations by removing conditional
    branches for different grid geometries in some cases
+219/-203
m_icpp_patches.fpp
Remove pure attribute from coordinate conversion functions

src/pre_process/m_icpp_patches.fpp

  • Removed pure keyword from function f_convert_cyl_to_cart to enable GPU
    parallelization
  • Removed pure keyword from elemental function f_r and added blank line
    after function signature
  • These changes allow GPU routine directives to be applied to these functions
+3/-2     
m_cbc.fpp
Refactor GPU parallel macros to block syntax with AMD support

src/simulation/m_cbc.fpp

  • Converted GPU_PARALLEL_LOOP macro calls from function-style syntax
    ($:GPU_PARALLEL_LOOP(...)) to block-style syntax (#:call
    GPU_PARALLEL_LOOP(...) ... #:endcall GPU_PARALLEL_LOOP)
  • Added conditional blocks using #:block DEF_AMD and #:block UNDEF_AMD
    to handle AMD compiler-specific code for molecular_weights vs
    molecular_weights_nonparameter
  • Removed pure keyword from subroutine s_any_cbc_boundaries to support
    GPU parallelization
  • Updated indentation throughout to reflect the new block-style macro
    structure
+594/-552
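
For orientation, the block usage reads roughly as below (a hedged sketch: the enclosed statement and the names mixture_cp, T, and Ys are hypothetical, and the exact mechanism by which DEF_AMD/UNDEF_AMD swap the symbols is not shown in this walkthrough):

    #:block DEF_AMD
        ! On AMD/CCE builds, references to molecular_weights in this block
        ! are meant to resolve to molecular_weights_nonparameter, since CCE
        ! rejects parameter arrays in OpenMP declare target statements
        ! (see the CCE build failure discussed later in this conversation).
        Cp = mixture_cp(T, Ys, molecular_weights)  ! hypothetical statement
    #:endblock DEF_AMD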
parallel_macros.fpp
Consolidate GPU macros with OpenMP and OpenACC support     

src/common/include/parallel_macros.fpp

  • Replaced entire macro implementation with includes of
    shared_parallel_macros.fpp, omp_macros.fpp, and acc_macros.fpp
  • Refactored GPU_PARALLEL, GPU_PARALLEL_LOOP, GPU_ROUTINE, GPU_DECLARE,
    GPU_LOOP, GPU_DATA, GPU_HOST_DATA, GPU_ENTER_DATA, GPU_EXIT_DATA,
    GPU_ATOMIC, GPU_UPDATE, and GPU_WAIT macros to support both OpenACC
    and OpenMP backends
  • Added conditional compilation directives (#if defined(MFC_OpenACC) /
    #elif defined(MFC_OpenMP)) to select appropriate backend
    implementations
  • Added new macros: USE_GPU_MODULE(), DEF_AMD(), UNDEF_CCE(), DEF_CCE(),
    UNDEF_NVIDIA() for compiler-specific code generation
  • Removed hundreds of lines of helper macros (GEN_* functions) in favor
    of delegating to backend-specific macro files
+167/-369
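
As a rough sketch of the dispatch pattern this describes (an illustrative simplification, not the actual macro bodies: the real implementation delegates to acc_macros.fpp / omp_macros.fpp rather than inlining directives, and also forwards private/copy/reduction clauses):

    #:def GPU_PARALLEL_LOOP(code, collapse=1)
    #if defined(MFC_OpenACC)
        !$acc parallel loop gang vector collapse(${collapse}$) default(present)
        $:code
    #elif defined(MFC_OpenMP)
        !$omp target teams distribute parallel do collapse(${collapse}$)
        $:code
        !$omp end target teams distribute parallel do
    #endif
    #:enddef

A #:call GPU_PARALLEL_LOOP(collapse=3) block then expands to the directive pair appropriate for whichever backend is enabled at build time.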
m_helper.fpp
Remove pure attribute for GPU parallelization support       

src/common/m_helper.fpp

  • Removed pure keyword from 13 subroutines and functions to enable GPU
    parallelization compatibility
  • Affected routines: s_comp_n_from_prim, s_comp_n_from_cons,
    s_transcoeff, s_int_to_str, s_swap, f_create_transform_matrix,
    s_transform_vec, s_transform_triangle, s_transform_model,
    f_create_bbox, f_xor, f_logical_to_int, unassociated_legendre,
    spherical_harmonic_func, associated_legendre, double_factorial,
    factorial
+17/-17 
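
As a representative before/after, using s_swap from the list above (assuming the original spelled the attribute explicitly; note that elemental alone still implies pure semantics in Fortran unless impure is added):

    ! Before:
    pure elemental subroutine s_swap(lhs, rhs)

    ! After: the explicit pure attribute is dropped so the routine can be
    ! decorated and called inside GPU-offloaded regions under OpenMP:
    elemental subroutine s_swap(lhs, rhs)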
m_nvtx.f90
Extend NVTX profiling support to OpenMP GPU backend           

src/common/m_nvtx.f90

  • Changed preprocessor condition from defined(MFC_OpenACC) &&
    defined(__PGI) to defined(MFC_GPU) && defined(__PGI) in three
    locations
  • This broadens NVTX range support to both OpenACC and OpenMP GPU
    implementations on PGI/NVIDIA compilers
+3/-3     
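
Concretely, each of the three guards changes as follows:

    ! Before:
    #if defined(MFC_OpenACC) && defined(__PGI)
    ! After:
    #if defined(MFC_GPU) && defined(__PGI)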
Formatting
1 files
m_derived_types.fpp
Add trailing newline to module file                                           

src/common/m_derived_types.fpp

  • Added trailing newline at end of file for code style consistency
+1/-0     
Configuration changes
1 files
configuration.cmake.in
Add OpenMP to CMake configuration display                               

toolchain/cmake/configuration.cmake.in

  • Added OpenMP configuration option to CMake output display
  • Displays OpenMP support status alongside existing MPI and OpenACC
    options
+1/-0     
Additional files
59 files
bench.yml +28/-9   
bench.sh +8/-2     
build.sh +11/-3   
submit-bench.sh +2/-1     
submit.sh +2/-1     
test.sh +11/-1   
lint-source.yml +1/-1     
bench.sh +9/-1     
submit-bench.sh +3/-2     
submit.sh +4/-4     
test.sh +5/-0     
test.yml +19/-7   
CMakeLists.txt +46/-15 
gpuParallelization.md +88/-34 
acc_macros.fpp +311/-0 
macros.fpp +12/-1   
omp_macros.fpp +343/-0 
shared_parallel_macros.fpp +110/-0 
m_boundary_common.fpp +361/-347
m_chemistry.fpp +134/-125
m_compute_levelset.fpp +7/-7     
m_constants.fpp +1/-0     
m_finite_differences.fpp +34/-33 
m_helper_basic.fpp +12/-11 
m_model.fpp +22/-12 
m_mpi_common.fpp +223/-205
m_phase_change.fpp +130/-132
m_variables_conversion.fpp +349/-350
m_assign_variables.fpp +3/-3     
m_acoustic_src.fpp +147/-143
m_body_forces.fpp +49/-46 
m_bubbles.fpp +27/-27 
m_bubbles_EE.fpp +173/-166
m_bubbles_EL.fpp +291/-271
m_bubbles_EL_kernels.fpp +110/-108
m_compute_cbc.fpp +17/-13 
m_data_output.fpp +13/-12 
m_fftw.fpp +109/-96
m_global_parameters.fpp +6/-6     
m_hyperelastic.fpp +116/-116
m_hypoelastic.fpp +291/-277
m_ibm.fpp +167/-162
m_igr.fpp +1752/-1755
m_mhd.fpp +58/-57 
m_mpi_proxy.fpp +54/-48 
m_muscl.fpp +141/-133
m_pressure_relaxation.fpp +20/-19 
m_qbmm.fpp +231/-231
m_riemann_solvers.fpp +3241/-3236
m_sim_helpers.fpp +5/-5     
m_start_up.fpp +71/-62 
m_surface_tension.fpp +158/-154
m_time_steppers.fpp +103/-96
m_weno.fpp +496/-488
syscheck.fpp +21/-0   
args.py +7/-2     
build.py +6/-2     
input.py +9/-2     
state.py +36/-6   

prathi-wind and others added 30 commits July 22, 2025 13:16
…, started work on enter and exit data, compiles
…e, add mappers to derived types, change how allocate is done
…types, removed rest of pure functions, fix issue with acoustic on nvfortran
@prathi-wind prathi-wind marked this pull request as ready for review October 21, 2025 16:25
@prathi-wind prathi-wind requested review from a team and sbryngelson as code owners October 21, 2025 16:25
@qodo-merge-pro (Contributor)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

Several GPU parallel blocks use collapse with manually privatized scalars (e.g., private='[inv_ds,...]') and index-dependent accesses like dx(k_loop), dy(k), dz(k). Validate that all temporaries modified inside the loop bodies are truly thread-private, and that no loop-carried dependencies or out-of-bounds accesses occur at domain starts, especially where loops begin at 0 and read index k-1.

    do j = 1, sys_size
        do q_loop = 0, p
            do l_loop = 0, n
                do k_loop = 0, m
                    inv_ds = 1._wp/dx(k_loop)
                    flux_face1 = flux_n(1)%vf(j)%sf(k_loop - 1, l_loop, q_loop)
                    flux_face2 = flux_n(1)%vf(j)%sf(k_loop, l_loop, q_loop)
                    rhs_vf(j)%sf(k_loop, l_loop, q_loop) = inv_ds*(flux_face1 - flux_face2)
                end do
            end do
        end do
    end do
#:endcall GPU_PARALLEL_LOOP

if (model_eqns == 3) then
    #:call GPU_PARALLEL_LOOP(collapse=4,private='[inv_ds,advected_qty_val, pressure_val,flux_face1,flux_face2]')
        do q_loop = 0, p
            do l_loop = 0, n
                do k_loop = 0, m
                    do i_fluid_loop = 1, num_fluids
                        inv_ds = 1._wp/dx(k_loop)
                        advected_qty_val = q_cons_vf%vf(i_fluid_loop + advxb - 1)%sf(k_loop, l_loop, q_loop)
                        pressure_val = q_prim_vf%vf(E_idx)%sf(k_loop, l_loop, q_loop)
                        flux_face1 = flux_src_n_vf%vf(advxb)%sf(k_loop, l_loop, q_loop)
                        flux_face2 = flux_src_n_vf%vf(advxb)%sf(k_loop - 1, l_loop, q_loop)
                        rhs_vf(i_fluid_loop + intxb - 1)%sf(k_loop, l_loop, q_loop) = &
                            rhs_vf(i_fluid_loop + intxb - 1)%sf(k_loop, l_loop, q_loop) - &
                            inv_ds*advected_qty_val*pressure_val*(flux_face1 - flux_face2)
                    end do
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP
end if

call s_add_directional_advection_source_terms(idir, rhs_vf, q_cons_vf, q_prim_vf, flux_src_n_vf, Kterm)
Behavioral Change

Removing 'pure' from many routines (including elemental/recursive ones) may introduce side effects and alter compiler optimizations. Verify that no callers rely on purity (e.g., FORALL constructs, transformational intrinsics, or implicit assumptions in OpenMP) and that these routines remain safe within parallel regions.

        !! @param vftmp is the void fraction
        !! @param Rtmp is the  bubble radii
        !! @param ntmp is the output number bubble density
    subroutine s_comp_n_from_prim(vftmp, Rtmp, ntmp, weights)
        $:GPU_ROUTINE(parallelism='[seq]')
        real(wp), intent(in) :: vftmp
        real(wp), dimension(nb), intent(in) :: Rtmp
        real(wp), intent(out) :: ntmp
        real(wp), dimension(nb), intent(in) :: weights

        real(wp) :: R3

        R3 = dot_product(weights, Rtmp**3._wp)
        ntmp = (3._wp/(4._wp*pi))*vftmp/R3

    end subroutine s_comp_n_from_prim

    subroutine s_comp_n_from_cons(vftmp, nRtmp, ntmp, weights)
        $:GPU_ROUTINE(parallelism='[seq]')
        real(wp), intent(in) :: vftmp
        real(wp), dimension(nb), intent(in) :: nRtmp
        real(wp), intent(out) :: ntmp
        real(wp), dimension(nb), intent(in) :: weights

        real(wp) :: nR3

        nR3 = dot_product(weights, nRtmp**3._wp)
        ntmp = sqrt((4._wp*pi/3._wp)*nR3/vftmp)

    end subroutine s_comp_n_from_cons

    impure subroutine s_print_2D_array(A, div)

        real(wp), dimension(:, :), intent(in) :: A
        real(wp), optional, intent(in) :: div

        integer :: i, j
        integer :: local_m, local_n
        real(wp) :: c

        local_m = size(A, 1)
        local_n = size(A, 2)

        if (present(div)) then
            c = div
        else
            c = 1._wp
        end if

        print *, local_m, local_n

        do i = 1, local_m
            do j = 1, local_n
                write (*, fmt="(F12.4)", advance="no") A(i, j)/c
            end do
            write (*, fmt="(A1)") " "
        end do
        write (*, fmt="(A1)") " "

    end subroutine s_print_2D_array

    !> Initializes non-polydisperse bubble modeling
    impure subroutine s_initialize_nonpoly

        integer :: ir
        real(wp) :: rhol0, pl0, uu, D_m, temp, omega_ref
        real(wp), dimension(Nb) :: chi_vw0, cp_m0, k_m0, rho_m0, x_vw

        real(wp), parameter :: k_poly = 1._wp !<
            !! polytropic index used to compute isothermal natural frequency

        real(wp), parameter :: Ru = 8314._wp !<
            !! universal gas constant

        rhol0 = rhoref
        pl0 = pref
#ifdef MFC_SIMULATION
        @:ALLOCATE(pb0(nb), mass_n0(nb), mass_v0(nb), Pe_T(nb))
        @:ALLOCATE(k_n(nb), k_v(nb), omegaN(nb))
        @:ALLOCATE(Re_trans_T(nb), Re_trans_c(nb), Im_trans_T(nb), Im_trans_c(nb))
#else
        @:ALLOCATE(pb0(nb), mass_n0(nb), mass_v0(nb), Pe_T(nb))
        @:ALLOCATE(k_n(nb), k_v(nb), omegaN(nb))
        @:ALLOCATE(Re_trans_T(nb), Re_trans_c(nb), Im_trans_T(nb), Im_trans_c(nb))
#endif

        pb0(:) = dflt_real
        mass_n0(:) = dflt_real
        mass_v0(:) = dflt_real
        Pe_T(:) = dflt_real
        omegaN(:) = dflt_real

        mul0 = fluid_pp(1)%mul0
        ss = fluid_pp(1)%ss
        pv = fluid_pp(1)%pv
        gamma_v = fluid_pp(1)%gamma_v
        M_v = fluid_pp(1)%M_v
        mu_v = fluid_pp(1)%mu_v
        k_v(:) = fluid_pp(1)%k_v

        gamma_n = fluid_pp(2)%gamma_v
        M_n = fluid_pp(2)%M_v
        mu_n = fluid_pp(2)%mu_v
        k_n(:) = fluid_pp(2)%k_v

        gamma_m = gamma_n
        if (thermal == 2) gamma_m = 1._wp

        temp = 293.15_wp
        D_m = 0.242e-4_wp
        uu = sqrt(pl0/rhol0)

        omega_ref = 3._wp*k_poly*Ca + 2._wp*(3._wp*k_poly - 1._wp)/Web

            !!! thermal properties !!!
        ! gas constants
        R_n = Ru/M_n
        R_v = Ru/M_v
        ! phi_vn & phi_nv (phi_nn = phi_vv = 1)
        phi_vn = (1._wp + sqrt(mu_v/mu_n)*(M_n/M_v)**(0.25_wp))**2 &
                 /(sqrt(8._wp)*sqrt(1._wp + M_v/M_n))
        phi_nv = (1._wp + sqrt(mu_n/mu_v)*(M_v/M_n)**(0.25_wp))**2 &
                 /(sqrt(8._wp)*sqrt(1._wp + M_n/M_v))
        ! internal bubble pressure
        pb0(:) = pl0 + 2._wp*ss/(R0ref*R0(:))

        ! mass fraction of vapor
        chi_vw0 = 1._wp/(1._wp + R_v/R_n*(pb0/pv - 1._wp))
        ! specific heat for gas/vapor mixture
        cp_m0 = chi_vw0*R_v*gamma_v/(gamma_v - 1._wp) &
                + (1._wp - chi_vw0)*R_n*gamma_n/(gamma_n - 1._wp)
        ! mole fraction of vapor
        x_vw = M_n*chi_vw0/(M_v + (M_n - M_v)*chi_vw0)
        ! thermal conductivity for gas/vapor mixture
        k_m0 = x_vw*k_v/(x_vw + (1._wp - x_vw)*phi_vn) &
               + (1._wp - x_vw)*k_n/(x_vw*phi_nv + 1._wp - x_vw)
        ! mixture density
        rho_m0 = pv/(chi_vw0*R_v*temp)

        ! mass of gas/vapor computed using dimensional quantities
        mass_n0(:) = 4._wp*(pb0(:) - pv)*pi/(3._wp*R_n*temp*rhol0)*R0(:)**3
        mass_v0(:) = 4._wp*pv*pi/(3._wp*R_v*temp*rhol0)*R0(:)**3
        ! Peclet numbers
        Pe_T(:) = rho_m0*cp_m0(:)*uu*R0ref/k_m0(:)
        Pe_c = uu*R0ref/D_m

        Tw = temp

        ! nondimensional properties
        !if(.not. qbmm) then
        R_n = rhol0*R_n*temp/pl0
        R_v = rhol0*R_v*temp/pl0
        k_n(:) = k_n(:)/k_m0(:)
        k_v(:) = k_v(:)/k_m0(:)
        pb0 = pb0/pl0
        pv = pv/pl0
        Tw = 1._wp
        pl0 = 1._wp

        rhoref = 1._wp
        pref = 1._wp
        !end if

        ! natural frequencies
        omegaN(:) = sqrt(3._wp*k_poly*Ca + 2._wp*(3._wp*k_poly - 1._wp)/(Web*R0))/R0
        do ir = 1, Nb
            call s_transcoeff(omegaN(ir)*R0(ir), Pe_T(ir)*R0(ir), &
                              Re_trans_T(ir), Im_trans_T(ir))
            call s_transcoeff(omegaN(ir)*R0(ir), Pe_c*R0(ir), &
                              Re_trans_c(ir), Im_trans_c(ir))
        end do
        Im_trans_T = 0._wp

    end subroutine s_initialize_nonpoly

    !> Computes the transfer coefficient for the non-polytropic bubble compression process
        !! @param omega natural frequencies
        !! @param peclet Peclet number
        !! @param Re_trans Real part of the transport coefficients
        !! @param Im_trans Imaginary part of the transport coefficients
    elemental subroutine s_transcoeff(omega, peclet, Re_trans, Im_trans)

        real(wp), intent(in) :: omega, peclet
        real(wp), intent(out) :: Re_trans, Im_trans

        complex(wp) :: imag, trans, c1, c2, c3

        imag = (0._wp, 1._wp)

        c1 = imag*omega*peclet
        c2 = sqrt(c1)
        c3 = (exp(c2) - exp(-c2))/(exp(c2) + exp(-c2)) ! TANH(c2)
        trans = ((c2/c3 - 1._wp)**(-1) - 3._wp/c1)**(-1) ! transfer function

        Re_trans = trans
        Im_trans = aimag(trans)

    end subroutine s_transcoeff

    elemental subroutine s_int_to_str(i, res)

        integer, intent(in) :: i
        character(len=*), intent(inout) :: res

        write (res, '(I0)') i
        res = trim(res)
    end subroutine s_int_to_str

    !> Computes the Simpson weights for quadrature
    subroutine s_simpson(local_weight, local_R0)

        real(wp), dimension(:), intent(inout) :: local_weight
        real(wp), dimension(:), intent(inout) :: local_R0

        integer :: ir
        real(wp) :: R0mn, R0mx, dphi, tmp, sd
        real(wp), dimension(nb) :: phi

        sd = poly_sigma
        R0mn = 0.8_wp*exp(-2.8_wp*sd)
        R0mx = 0.2_wp*exp(9.5_wp*sd) + 1._wp

        ! phi = ln( R0 ) & return R0
        do ir = 1, nb
            phi(ir) = log(R0mn) &
                      + (ir - 1._wp)*log(R0mx/R0mn)/(nb - 1._wp)
            local_R0(ir) = exp(phi(ir))
        end do
        dphi = phi(2) - phi(1)

        ! weights for quadrature using Simpson's rule
        do ir = 2, nb - 1
            ! Gaussian
            tmp = exp(-0.5_wp*(phi(ir)/sd)**2)/sqrt(2._wp*pi)/sd
            if (mod(ir, 2) == 0) then
                local_weight(ir) = tmp*4._wp*dphi/3._wp
            else
                local_weight(ir) = tmp*2._wp*dphi/3._wp
            end if
        end do
        tmp = exp(-0.5_wp*(phi(1)/sd)**2)/sqrt(2._wp*pi)/sd
        local_weight(1) = tmp*dphi/3._wp
        tmp = exp(-0.5_wp*(phi(nb)/sd)**2)/sqrt(2._wp*pi)/sd
        local_weight(nb) = tmp*dphi/3._wp
    end subroutine s_simpson

    !> This procedure computes the cross product of two vectors.
    !! @param a First vector.
    !! @param b Second vector.
    !! @return The cross product of the two vectors.
    pure function f_cross(a, b) result(c)

        real(wp), dimension(3), intent(in) :: a, b
        real(wp), dimension(3) :: c

        c(1) = a(2)*b(3) - a(3)*b(2)
        c(2) = a(3)*b(1) - a(1)*b(3)
        c(3) = a(1)*b(2) - a(2)*b(1)
    end function f_cross

    !> This procedure swaps two real numbers.
    !! @param lhs Left-hand side.
    !! @param rhs Right-hand side.
    elemental subroutine s_swap(lhs, rhs)

        real(wp), intent(inout) :: lhs, rhs
        real(wp) :: ltemp

        ltemp = lhs
        lhs = rhs
        rhs = ltemp
    end subroutine s_swap

    !> This procedure creates a transformation matrix.
    !! @param  p Parameters for the transformation.
    !! @return Transformation matrix.
    function f_create_transform_matrix(param, center) result(out_matrix)

        type(ic_model_parameters), intent(in) :: param
        real(wp), dimension(1:3), optional, intent(in) :: center
        real(wp), dimension(1:4, 1:4) :: sc, rz, rx, ry, tr, t_back, t_to_origin, out_matrix

        sc = transpose(reshape([ &
                               param%scale(1), 0._wp, 0._wp, 0._wp, &
                               0._wp, param%scale(2), 0._wp, 0._wp, &
                               0._wp, 0._wp, param%scale(3), 0._wp, &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(sc)))

        rz = transpose(reshape([ &
                               cos(param%rotate(3)), -sin(param%rotate(3)), 0._wp, 0._wp, &
                               sin(param%rotate(3)), cos(param%rotate(3)), 0._wp, 0._wp, &
                               0._wp, 0._wp, 1._wp, 0._wp, &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(rz)))

        rx = transpose(reshape([ &
                               1._wp, 0._wp, 0._wp, 0._wp, &
                               0._wp, cos(param%rotate(1)), -sin(param%rotate(1)), 0._wp, &
                               0._wp, sin(param%rotate(1)), cos(param%rotate(1)), 0._wp, &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(rx)))

        ry = transpose(reshape([ &
                               cos(param%rotate(2)), 0._wp, sin(param%rotate(2)), 0._wp, &
                               0._wp, 1._wp, 0._wp, 0._wp, &
                               -sin(param%rotate(2)), 0._wp, cos(param%rotate(2)), 0._wp, &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(ry)))

        tr = transpose(reshape([ &
                               1._wp, 0._wp, 0._wp, param%translate(1), &
                               0._wp, 1._wp, 0._wp, param%translate(2), &
                               0._wp, 0._wp, 1._wp, param%translate(3), &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(tr)))

        if (present(center)) then
            ! Translation matrix to move center to the origin
            t_to_origin = transpose(reshape([ &
                                            1._wp, 0._wp, 0._wp, -center(1), &
                                            0._wp, 1._wp, 0._wp, -center(2), &
                                            0._wp, 0._wp, 1._wp, -center(3), &
                                            0._wp, 0._wp, 0._wp, 1._wp], shape(tr)))

            ! Translation matrix to move center back to original position
            t_back = transpose(reshape([ &
                                       1._wp, 0._wp, 0._wp, center(1), &
                                       0._wp, 1._wp, 0._wp, center(2), &
                                       0._wp, 0._wp, 1._wp, center(3), &
                                       0._wp, 0._wp, 0._wp, 1._wp], shape(tr)))

            out_matrix = matmul(tr, matmul(t_back, matmul(ry, matmul(rx, matmul(rz, matmul(sc, t_to_origin))))))
        else
            out_matrix = matmul(ry, matmul(rx, rz))
        end if

    end function f_create_transform_matrix

    !> This procedure transforms a vector by a matrix.
    !! @param vec Vector to transform.
    !! @param matrix Transformation matrix.
    subroutine s_transform_vec(vec, matrix)

        real(wp), dimension(1:3), intent(inout) :: vec
        real(wp), dimension(1:4, 1:4), intent(in) :: matrix

        real(wp), dimension(1:4) :: tmp

        tmp = matmul(matrix, [vec(1), vec(2), vec(3), 1._wp])
        vec = tmp(1:3)

    end subroutine s_transform_vec

    !> This procedure transforms a triangle by a matrix, one vertex at a time.
    !! @param triangle Triangle to transform.
    !! @param matrix   Transformation matrix.
    subroutine s_transform_triangle(triangle, matrix, matrix_n)

        type(t_triangle), intent(inout) :: triangle
        real(wp), dimension(1:4, 1:4), intent(in) :: matrix, matrix_n

        integer :: i

        do i = 1, 3
            call s_transform_vec(triangle%v(i, :), matrix)
        end do

        call s_transform_vec(triangle%n(1:3), matrix_n)

    end subroutine s_transform_triangle

    !> This procedure transforms a model by a matrix, one triangle at a time.
    !! @param model  Model to transform.
    !! @param matrix Transformation matrix.
    subroutine s_transform_model(model, matrix, matrix_n)

        type(t_model), intent(inout) :: model
        real(wp), dimension(1:4, 1:4), intent(in) :: matrix, matrix_n

        integer :: i

        do i = 1, size(model%trs)
            call s_transform_triangle(model%trs(i), matrix, matrix_n)
        end do

    end subroutine s_transform_model

    !> This procedure creates a bounding box for a model.
    !! @param model Model to create bounding box for.
    !! @return Bounding box.
    function f_create_bbox(model) result(bbox)

        type(t_model), intent(in) :: model
        type(t_bbox) :: bbox

        integer :: i, j

        if (size(model%trs) == 0) then
            bbox%min = 0._wp
            bbox%max = 0._wp
            return
        end if

        bbox%min = model%trs(1)%v(1, :)
        bbox%max = model%trs(1)%v(1, :)

        do i = 1, size(model%trs)
            do j = 1, 3
                bbox%min = min(bbox%min, model%trs(i)%v(j, :))
                bbox%max = max(bbox%max, model%trs(i)%v(j, :))
            end do
        end do

    end function f_create_bbox

    !> This procedure performs xor on lhs and rhs.
    !! @param lhs logical input.
    !! @param rhs other logical input.
    !! @return xored result.
    elemental function f_xor(lhs, rhs) result(res)

        logical, intent(in) :: lhs, rhs
        logical :: res

        res = (lhs .and. .not. rhs) .or. (.not. lhs .and. rhs)
    end function f_xor

    !> This procedure converts logical to 1 or 0.
    !! @param predicate A logical argument.
    !! @return 1 if .true., 0 if .false..
    elemental function f_logical_to_int(predicate) result(int)

        logical, intent(in) :: predicate
        integer :: int

        if (predicate) then
            int = 1
        else
            int = 0
        end if
    end function f_logical_to_int

    !> This function generates the unassociated legendre polynomials
    !! @param x is the input value
    !! @param l is the degree
    !! @return P is the unassociated legendre polynomial evaluated at x
    recursive function unassociated_legendre(x, l) result(result_P)

        integer, intent(in) :: l
        real(wp), intent(in) :: x
        real(wp) :: result_P

        if (l == 0) then
            result_P = 1._wp
        else if (l == 1) then
            result_P = x
        else
            result_P = ((2*l - 1)*x*unassociated_legendre(x, l - 1) - (l - 1)*unassociated_legendre(x, l - 2))/l
        end if

    end function unassociated_legendre

    !> This function calculates the spherical harmonic function evaluated at x and phi
    !! @param x is the x coordinate
    !! @param phi is the phi coordinate
    !! @param l is the degree
    !! @param m_order is the order
    !! @return Y is the spherical harmonic function evaluated at x and phi
    recursive function spherical_harmonic_func(x, phi, l, m_order) result(Y)

        integer, intent(in) :: l, m_order
        real(wp), intent(in) :: x, phi
        real(wp) :: Y, prefactor, local_pi

        local_pi = acos(-1._wp)
        prefactor = sqrt((2*l + 1)/(4*local_pi)*factorial(l - m_order)/factorial(l + m_order)); 
        if (m_order == 0) then
            Y = prefactor*associated_legendre(x, l, m_order); 
        elseif (m_order > 0) then
            Y = (-1._wp)**m_order*sqrt(2._wp)*prefactor*associated_legendre(x, l, m_order)*cos(m_order*phi); 
        end if

    end function spherical_harmonic_func

    !> This function generates the associated legendre polynomials evaluated
    !! at x with inputs l and m
    !! @param x is the input value
    !! @param l is the degree
    !! @param m_order is the order
    !! @return P is the associated legendre polynomial evaluated at x
    recursive function associated_legendre(x, l, m_order) result(result_P)

        integer, intent(in) :: l, m_order
        real(wp), intent(in) :: x
        real(wp) :: result_P

        if (m_order <= 0 .and. l <= 0) then
            result_P = 1; 
        elseif (l == 1 .and. m_order <= 0) then
            result_P = x; 
        elseif (l == 1 .and. m_order == 1) then
            result_P = -(1 - x**2)**(1._wp/2._wp); 
        elseif (m_order == l) then
            result_P = (-1)**l*double_factorial(2*l - 1)*(1 - x**2)**(l/2); 
        elseif (m_order == l - 1) then
            result_P = x*(2*l - 1)*associated_legendre(x, l - 1, l - 1); 
        else
            result_P = ((2*l - 1)*x*associated_legendre(x, l - 1, m_order) - (l + m_order - 1)*associated_legendre(x, l - 2, m_order))/(l - m_order); 
        end if

    end function associated_legendre

    !> This function calculates the double factorial value of an integer
    !! @param n_in is the input integer
    !! @return R is the double factorial value of n
    elemental function double_factorial(n_in) result(R_result)

        integer, intent(in) :: n_in
        integer, parameter :: int64_kind = selected_int_kind(18) ! at least 18 decimal digits (a 64-bit integer)
        integer(kind=int64_kind) :: R_result
        integer :: i

        R_result = product((/(i, i=n_in, 1, -2)/))

    end function double_factorial

    !> The following function calculates the factorial value of an integer
    !! @param n_in is the input integer
    !! @return R is the factorial value of n
    elemental function factorial(n_in) result(R_result)

        integer, intent(in) :: n_in
        integer, parameter :: int64_kind = selected_int_kind(18) ! at least 18 decimal digits (a 64-bit integer)
Duplicate Code

Many nearly identical GPU_PARALLEL_LOOP sections for x/y/z directions and alt_soundspeed branches repeat patterns with only index/order differences. Consider consolidating via helper kernels/macros to reduce maintenance risk and ensure backend parity between OpenACC and OpenMP.

real(wp) :: advected_qty_val, pressure_val, velocity_val

if (alt_soundspeed) then
    #:call GPU_PARALLEL_LOOP(collapse=3)
        do q_loop = 0, p
            do l_loop = 0, n
                do k_loop = 0, m
                    blkmod1(k_loop, l_loop, q_loop) = ((gammas(1) + 1._wp)*q_prim_vf%vf(E_idx)%sf(k_loop, l_loop, q_loop) + &
                                                       pi_infs(1))/gammas(1)
                    blkmod2(k_loop, l_loop, q_loop) = ((gammas(2) + 1._wp)*q_prim_vf%vf(E_idx)%sf(k_loop, l_loop, q_loop) + &
                                                       pi_infs(2))/gammas(2)
                    alpha1(k_loop, l_loop, q_loop) = q_cons_vf%vf(advxb)%sf(k_loop, l_loop, q_loop)

                    if (bubbles_euler) then
                        alpha2(k_loop, l_loop, q_loop) = q_cons_vf%vf(alf_idx - 1)%sf(k_loop, l_loop, q_loop)
                    else
                        alpha2(k_loop, l_loop, q_loop) = q_cons_vf%vf(advxe)%sf(k_loop, l_loop, q_loop)
                    end if

                    Kterm(k_loop, l_loop, q_loop) = alpha1(k_loop, l_loop, q_loop)*alpha2(k_loop, l_loop, q_loop)* &
                                                    (blkmod2(k_loop, l_loop, q_loop) - blkmod1(k_loop, l_loop, q_loop))/ &
                                                    (alpha1(k_loop, l_loop, q_loop)*blkmod2(k_loop, l_loop, q_loop) + &
                                                     alpha2(k_loop, l_loop, q_loop)*blkmod1(k_loop, l_loop, q_loop))
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP
end if

select case (idir)


High-level Suggestion

Refactor large, duplicated code blocks in files like m_igr.fpp and m_riemann_solvers.fpp to improve maintainability before adding OpenMP support. This could involve using macros or parameterized subroutines to handle logic that varies by direction. [High-level, importance: 7]

Solution Walkthrough:

Before:

subroutine s_igr_source_terms(idir, ...)
    if (idir == 1) then
        ! Massive block of code for x-direction
        #:call GPU_PARALLEL_LOOP(...)
            do l = 0, p
                do k = 0, n
                    do j = -1, m
                        ! ... hundreds of lines of calculations for x-fluxes
                    end do
                end do
            end do
        #:endcall GPU_PARALLEL_LOOP
    else if (idir == 2) then
        ! Massive block of code for y-direction (very similar to x-direction)
        ...
    else if (idir == 3) then
        ! Massive block of code for z-direction (very similar to x-direction)
        ...
    end if
end subroutine

After:

subroutine s_compute_directional_terms(norm_dir, ...)
    ! Generic logic for any direction
    #:call GPU_PARALLEL_LOOP(...)
        do l = ..., ...
            do k = ..., ...
                do j = ..., ...
                    ! ... hundreds of lines of calculations using norm_dir
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP
end subroutine

subroutine s_igr_source_terms(idir, ...)
    call s_compute_directional_terms(idir, ...)
end subroutine

Comment on lines +574 to +589
$:GPU_LOOP(parallelism='[seq]')
do i = 1, strxe - strxb + 1
    tau_e_L(i) = qL_prim_rs${XYZ}$_vf(j, k, l, strxb - 1 + i)
    tau_e_R(i) = qR_prim_rs${XYZ}$_vf(j + 1, k, l, strxb - 1 + i)
    ! Elastic contribution to energy if G large enough
    !TODO take out if statement if stable without
    if ((G_L > 1000) .and. (G_R > 1000)) then
        E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
        E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        ! Double for shear stresses
        if (any(strxb - 1 + i == shear_indices)) then
            E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
            E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        end if
    end if
end if
end do
end if
end do

Suggestion: Optimize the GPU loop by replacing the any intrinsic function with a pre-calculated boolean array for a more efficient lookup of shear indices. [general, importance: 7]

Suggested change
logical :: is_shear(strxe - strxb + 1)
integer :: ii  ! index for building the lookup (declaration added for completeness)

is_shear = .false.
do ii = 1, size(shear_indices)
    if (shear_indices(ii) >= strxb .and. shear_indices(ii) <= strxe) then
        is_shear(shear_indices(ii) - strxb + 1) = .true.
    end if
end do
$:GPU_LOOP(parallelism='[seq]')
do i = 1, strxe - strxb + 1
    tau_e_L(i) = qL_prim_rs${XYZ}$_vf(j, k, l, strxb - 1 + i)
    tau_e_R(i) = qR_prim_rs${XYZ}$_vf(j + 1, k, l, strxb - 1 + i)
    ! Elastic contribution to energy if G large enough
    !TODO take out if statement if stable without
    if ((G_L > 1000) .and. (G_R > 1000)) then
        E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
        E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        ! Double for shear stresses
        if (is_shear(i)) then
            E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
            E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        end if
    end if
end do

Comment on lines +222 to +224
jac_sf(1)%sf => jac
$:GPU_ENTER_DATA(copyin='[jac_sf(1)%sf]')
$:GPU_ENTER_DATA(attach='[jac_sf(1)%sf]')

Suggestion: Remove the copyin directive for jac_sf(1)%sf to avoid overwriting the GPU-initialized jac array with uninitialized data from the host. [possible issue, importance: 9]

Suggested change
jac_sf(1)%sf => jac
$:GPU_ENTER_DATA(attach='[jac_sf(1)%sf]')

@anandrdbz (Contributor)

This looks good from my side. It will obviously need a lot of changes for AMD and to stay up to date with openmp_rebased, but that can be done once this is merged.

@sbryngelson (Member)

Great, can we confirm that there's no slowdown for OpenACC before merging? Benchmarking isn't working because the benchmark files have new names.

@wilfonba (Collaborator)

It also failed to build with CCE GPU because of parameters in declare target statements in chemistry

@anandrdbz (Contributor)

It also failed to build with CCE GPU because of parameters in declare target statements in chemistry

Didn't it pass the test cases on Frontier?

@wilfonba (Collaborator)

It also failed to build with CCE GPU because of parameters in declare target statements in chemistry

Didn't it pass the test cases on Frontier?

The test suite doesn't use case optimization, so it doesn't have the same problem

@sbryngelson (Member)

@prathi-wind does this seem ready to merge?
