
Conversation

@prathi-wind (Collaborator) commented Sep 15, 2025

User description

Description

All test cases pass with OpenMP on Nvidia hardware.

IPO is causing 1D chemistry cases to fail, so I turned it off when compiling for OpenMP.

Fixes #(issue) [optional]

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration

  • Test A
  • Test B

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any code source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on either V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature in NVTX ranges so that it can be identified in profiles
  • Ran an Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Enhancement


Description

  • Add OpenMP support for Nvidia hardware alongside existing OpenACC backend

  • Refactor GPU parallel loop macros from $:GPU_PARALLEL_LOOP() syntax to #:call GPU_PARALLEL_LOOP() / #:endcall GPU_PARALLEL_LOOP block syntax for OpenMP compatibility (see the syntax sketch after this list)

  • Consolidate GPU macro infrastructure in parallel_macros.fpp to support both OpenACC and OpenMP backends with conditional compilation directives

  • Remove pure keyword from 20+ subroutines and functions across multiple modules to enable GPU parallelization (incompatible with OpenMP parallel regions)

  • Extend NVTX profiling support to OpenMP GPU backend by broadening preprocessor conditions from OpenACC-only to general GPU support

  • Add OpenMP configuration option to CMake output display

  • Update all simulation and post-processing modules to use new macro syntax consistently

  • Disable IPO (Interprocedural Optimization) for OpenMP compilation to prevent 1D chemistry test failures

  • All test cases pass with OpenMP on Nvidia hardware
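
For reference, the syntax change looks like this (a minimal sketch; the loop bounds and body are illustrative, borrowed from the review excerpts below):

    ! Old function-style Fypp macro, emitted directly before the loop:
    $:GPU_PARALLEL_LOOP(collapse=3, private='[inv_ds]')
    do q_loop = 0, p
        do l_loop = 0, n
            do k_loop = 0, m
                inv_ds = 1._wp/dx(k_loop)
            end do
        end do
    end do

    ! New block-style Fypp call, which wraps the loop so the macro can emit
    ! both an opening directive and, where a backend needs one, a closing
    ! directive after the loop:
    #:call GPU_PARALLEL_LOOP(collapse=3, private='[inv_ds]')
        do q_loop = 0, p
            do l_loop = 0, n
                do k_loop = 0, m
                    inv_ds = 1._wp/dx(k_loop)
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP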


Diagram Walkthrough

flowchart LR
  A["GPU Parallel Macros<br/>Old Syntax"] -->|Refactor| B["Unified Macro System<br/>parallel_macros.fpp"]
  B -->|Backend Selection| C["OpenACC Backend<br/>acc_macros.fpp"]
  B -->|Backend Selection| D["OpenMP Backend<br/>omp_macros.fpp"]
  E["Pure Functions<br/>Incompatible with GPU"] -->|Remove pure| F["GPU-Compatible<br/>Subroutines"]
  C -->|Compile| G["GPU Code<br/>Nvidia Hardware"]
  D -->|Compile| G
  F -->|Use| G

File Walkthrough

Relevant files
Enhancement
9 files
m_viscous.fpp
Convert GPU parallel loop macros to OpenMP-compatible syntax

src/simulation/m_viscous.fpp

  • Converted GPU parallel loop macros from $:GPU_PARALLEL_LOOP to #:call
    GPU_PARALLEL_LOOP / #:endcall GPU_PARALLEL_LOOP syntax for OpenMP
    compatibility
  • Updated indentation of loop bodies to align with new macro
    call/endcall structure
  • Maintained all computational logic and loop collapse parameters
    unchanged
  • Applied changes consistently across viscous stress tensor computation
    and gradient reconstruction routines
+788/-746
m_derived_variables.fpp
Remove pure keyword from derived variable subroutines       

src/post_process/m_derived_variables.fpp

  • Removed pure keyword from seven subroutine declarations to allow
    OpenMP parallelization
  • Affected subroutines: s_derive_specific_heat_ratio,
    s_derive_liquid_stiffness, s_derive_sound_speed,
    s_derive_flux_limiter, s_solve_linear_system,
    s_derive_vorticity_component, and s_derive_qm
  • Pure function restrictions are incompatible with OpenMP parallel
    regions
+7/-7     
m_rhs.fpp
Refactor GPU parallel loop macro syntax for OpenMP support

src/simulation/m_rhs.fpp

  • Converted GPU parallel loop macros from $:GPU_PARALLEL_LOOP() syntax
    to #:call GPU_PARALLEL_LOOP() and #:endcall GPU_PARALLEL_LOOP syntax
    throughout the file
  • Updated loop indentation to align with the new macro call/endcall
    structure
  • Reformatted private variable declarations in macro calls to be on
    single lines for consistency
  • Applied changes across approximately 50+ parallel loop regions
    handling flux calculations, RHS computations, and dimensional
    splitting operations
+572/-557
m_derived_variables.fpp
Update GPU macros and remove pure attribute for GPU support

src/simulation/m_derived_variables.fpp

  • Converted $:GPU_PARALLEL_LOOP() macros to #:call GPU_PARALLEL_LOOP()
    and #:endcall GPU_PARALLEL_LOOP syntax in multiple loop regions
  • Removed pure keyword from subroutines s_derive_acceleration_component,
    f_convert_cyl_to_cart, and f_r to support GPU parallelization
  • Updated loop indentation to match new macro call structure
  • Simplified acceleration component calculations by removing conditional
    branches for different grid geometries in some cases
+219/-203
m_icpp_patches.fpp
Remove pure attribute from coordinate conversion functions

src/pre_process/m_icpp_patches.fpp

  • Removed pure keyword from function f_convert_cyl_to_cart to enable GPU
    parallelization
  • Removed pure keyword from elemental function f_r and added blank line
    after function signature
  • These changes allow GPU routine directives to be applied to these functions
+3/-2     
m_cbc.fpp
Refactor GPU parallel macros to block syntax with AMD support

src/simulation/m_cbc.fpp

  • Converted GPU_PARALLEL_LOOP macro calls from function-style syntax
    ($:GPU_PARALLEL_LOOP(...)) to block-style syntax (#:call
    GPU_PARALLEL_LOOP(...) ... #:endcall GPU_PARALLEL_LOOP)
  • Added conditional blocks using #:block DEF_AMD and #:block UNDEF_AMD
    to handle AMD compiler-specific code for molecular_weights vs
    molecular_weights_nonparameter
  • Removed pure keyword from subroutine s_any_cbc_boundaries to support
    GPU parallelization
  • Updated indentation throughout to reflect the new block-style macro
    structure
+594/-552
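
For orientation, the block usage reads roughly as below (a hedged sketch: the enclosed statement and the names mixture_cp, T, and Ys are hypothetical, and the exact mechanism by which DEF_AMD/UNDEF_AMD swap the symbols is not shown in this walkthrough):

    #:block DEF_AMD
        ! On AMD/CCE builds, references to molecular_weights in this block
        ! are meant to resolve to molecular_weights_nonparameter, since CCE
        ! rejects parameter arrays in OpenMP declare target statements
        ! (see the CCE build failure discussed later in this conversation).
        Cp = mixture_cp(T, Ys, molecular_weights)  ! hypothetical statement
    #:endblock DEF_AMD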
parallel_macros.fpp
Consolidate GPU macros with OpenMP and OpenACC support     

src/common/include/parallel_macros.fpp

  • Replaced entire macro implementation with includes of
    shared_parallel_macros.fpp, omp_macros.fpp, and acc_macros.fpp
  • Refactored GPU_PARALLEL, GPU_PARALLEL_LOOP, GPU_ROUTINE, GPU_DECLARE,
    GPU_LOOP, GPU_DATA, GPU_HOST_DATA, GPU_ENTER_DATA, GPU_EXIT_DATA,
    GPU_ATOMIC, GPU_UPDATE, and GPU_WAIT macros to support both OpenACC
    and OpenMP backends
  • Added conditional compilation directives (#if defined(MFC_OpenACC) /
    #elif defined(MFC_OpenMP)) to select appropriate backend
    implementations
  • Added new macros: USE_GPU_MODULE(), DEF_AMD(), UNDEF_CCE(), DEF_CCE(),
    UNDEF_NVIDIA() for compiler-specific code generation
  • Removed hundreds of lines of helper macros (GEN_* functions) in favor
    of delegating to backend-specific macro files
+167/-369
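
As a rough sketch of the dispatch pattern this describes (an illustrative simplification, not the actual macro bodies: the real implementation delegates to acc_macros.fpp / omp_macros.fpp rather than inlining directives, and also forwards private/copy/reduction clauses):

    #:def GPU_PARALLEL_LOOP(code, collapse=1)
    #if defined(MFC_OpenACC)
        !$acc parallel loop gang vector collapse(${collapse}$) default(present)
        $:code
    #elif defined(MFC_OpenMP)
        !$omp target teams distribute parallel do collapse(${collapse}$)
        $:code
        !$omp end target teams distribute parallel do
    #endif
    #:enddef

A #:call GPU_PARALLEL_LOOP(collapse=3) block then expands to the directive pair appropriate for whichever backend is enabled at build time.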
m_helper.fpp
Remove pure attribute for GPU parallelization support       

src/common/m_helper.fpp

  • Removed pure keyword from 13 subroutines and functions to enable GPU
    parallelization compatibility
  • Affected routines: s_comp_n_from_prim, s_comp_n_from_cons,
    s_transcoeff, s_int_to_str, s_swap, f_create_transform_matrix,
    s_transform_vec, s_transform_triangle, s_transform_model,
    f_create_bbox, f_xor, f_logical_to_int, unassociated_legendre,
    spherical_harmonic_func, associated_legendre, double_factorial,
    factorial
+17/-17 
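
As a representative before/after, using s_swap from the list above (assuming the original spelled the attribute explicitly; note that elemental alone still implies pure semantics in Fortran unless impure is added):

    ! Before:
    pure elemental subroutine s_swap(lhs, rhs)

    ! After: the explicit pure attribute is dropped so the routine can be
    ! decorated and called inside GPU-offloaded regions under OpenMP:
    elemental subroutine s_swap(lhs, rhs)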
m_nvtx.f90
Extend NVTX profiling support to OpenMP GPU backend           

src/common/m_nvtx.f90

  • Changed preprocessor condition from defined(MFC_OpenACC) &&
    defined(__PGI) to defined(MFC_GPU) && defined(__PGI) in three
    locations
  • This broadens NVTX range support to both OpenACC and OpenMP GPU
    implementations on PGI/NVIDIA compilers
+3/-3     
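
Concretely, each of the three guards changes as follows:

    ! Before:
    #if defined(MFC_OpenACC) && defined(__PGI)
    ! After:
    #if defined(MFC_GPU) && defined(__PGI)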
Formatting
1 files
m_derived_types.fpp
Add trailing newline to module file                                           

src/common/m_derived_types.fpp

  • Added trailing newline at end of file for code style consistency
+1/-0     
Configuration changes
1 files
configuration.cmake.in
Add OpenMP to CMake configuration display                               

toolchain/cmake/configuration.cmake.in

  • Added OpenMP configuration option to CMake output display
  • Displays OpenMP support status alongside existing MPI and OpenACC
    options
+1/-0     
Additional files
59 files
bench.yml +28/-9   
bench.sh +8/-2     
build.sh +11/-3   
submit-bench.sh +2/-1     
submit.sh +2/-1     
test.sh +11/-1   
lint-source.yml +1/-1     
bench.sh +9/-1     
submit-bench.sh +3/-2     
submit.sh +4/-4     
test.sh +5/-0     
test.yml +19/-7   
CMakeLists.txt +46/-15 
gpuParallelization.md +88/-34 
acc_macros.fpp +311/-0 
macros.fpp +12/-1   
omp_macros.fpp +343/-0 
shared_parallel_macros.fpp +110/-0 
m_boundary_common.fpp +361/-347
m_chemistry.fpp +134/-125
m_compute_levelset.fpp +7/-7     
m_constants.fpp +1/-0     
m_finite_differences.fpp +34/-33 
m_helper_basic.fpp +12/-11 
m_model.fpp +22/-12 
m_mpi_common.fpp +223/-205
m_phase_change.fpp +130/-132
m_variables_conversion.fpp +349/-350
m_assign_variables.fpp +3/-3     
m_acoustic_src.fpp +147/-143
m_body_forces.fpp +49/-46 
m_bubbles.fpp +27/-27 
m_bubbles_EE.fpp +173/-166
m_bubbles_EL.fpp +291/-271
m_bubbles_EL_kernels.fpp +110/-108
m_compute_cbc.fpp +17/-13 
m_data_output.fpp +13/-12 
m_fftw.fpp +109/-96
m_global_parameters.fpp +6/-6     
m_hyperelastic.fpp +116/-116
m_hypoelastic.fpp +291/-277
m_ibm.fpp +167/-162
m_igr.fpp +1752/-1755
m_mhd.fpp +58/-57 
m_mpi_proxy.fpp +54/-48 
m_muscl.fpp +141/-133
m_pressure_relaxation.fpp +20/-19 
m_qbmm.fpp +231/-231
m_riemann_solvers.fpp +3241/-3236
m_sim_helpers.fpp +5/-5     
m_start_up.fpp +71/-62 
m_surface_tension.fpp +158/-154
m_time_steppers.fpp +103/-96
m_weno.fpp +496/-488
syscheck.fpp +21/-0   
args.py +7/-2     
build.py +6/-2     
input.py +9/-2     
state.py +36/-6   

prathi-wind and others added 30 commits July 22, 2025 13:16
…, started work on enter and exit data, compiles
…e, add mappers to derived types, change how allocate is done
…types, removed rest of pure functions, fix issue with acoustic on nvfortran
@prathi-wind prathi-wind marked this pull request as ready for review October 21, 2025 16:25
@prathi-wind prathi-wind requested review from a team and sbryngelson as code owners October 21, 2025 16:25
@qodo-merge-pro (Contributor)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

Several GPU parallel blocks use collapse with manually privatized scalars (e.g., private='[inv_ds,...]') and index-dependent accesses like dx(k_loop), dy(k), dz(k). Validate that all temporaries modified inside the loop bodies are truly thread-private, and that no loop-carried dependencies or out-of-bounds accesses occur at domain starts, especially where loops begin at 0 and read index k-1.

    do j = 1, sys_size
        do q_loop = 0, p
            do l_loop = 0, n
                do k_loop = 0, m
                    inv_ds = 1._wp/dx(k_loop)
                    flux_face1 = flux_n(1)%vf(j)%sf(k_loop - 1, l_loop, q_loop)
                    flux_face2 = flux_n(1)%vf(j)%sf(k_loop, l_loop, q_loop)
                    rhs_vf(j)%sf(k_loop, l_loop, q_loop) = inv_ds*(flux_face1 - flux_face2)
                end do
            end do
        end do
    end do
#:endcall GPU_PARALLEL_LOOP

if (model_eqns == 3) then
    #:call GPU_PARALLEL_LOOP(collapse=4,private='[inv_ds,advected_qty_val, pressure_val,flux_face1,flux_face2]')
        do q_loop = 0, p
            do l_loop = 0, n
                do k_loop = 0, m
                    do i_fluid_loop = 1, num_fluids
                        inv_ds = 1._wp/dx(k_loop)
                        advected_qty_val = q_cons_vf%vf(i_fluid_loop + advxb - 1)%sf(k_loop, l_loop, q_loop)
                        pressure_val = q_prim_vf%vf(E_idx)%sf(k_loop, l_loop, q_loop)
                        flux_face1 = flux_src_n_vf%vf(advxb)%sf(k_loop, l_loop, q_loop)
                        flux_face2 = flux_src_n_vf%vf(advxb)%sf(k_loop - 1, l_loop, q_loop)
                        rhs_vf(i_fluid_loop + intxb - 1)%sf(k_loop, l_loop, q_loop) = &
                            rhs_vf(i_fluid_loop + intxb - 1)%sf(k_loop, l_loop, q_loop) - &
                            inv_ds*advected_qty_val*pressure_val*(flux_face1 - flux_face2)
                    end do
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP
end if

call s_add_directional_advection_source_terms(idir, rhs_vf, q_cons_vf, q_prim_vf, flux_src_n_vf, Kterm)
Behavioral Change

Removing 'pure' from many routines (including elemental/recursive ones) may introduce side effects and alter compiler optimizations. Verify that no callers rely on purity (e.g., FORALL constructs, transformational intrinsics, or implicit assumptions in OpenMP) and that these routines remain safe within parallel regions.

        !! @param vftmp is the void fraction
        !! @param Rtmp is the  bubble radii
        !! @param ntmp is the output number bubble density
    subroutine s_comp_n_from_prim(vftmp, Rtmp, ntmp, weights)
        $:GPU_ROUTINE(parallelism='[seq]')
        real(wp), intent(in) :: vftmp
        real(wp), dimension(nb), intent(in) :: Rtmp
        real(wp), intent(out) :: ntmp
        real(wp), dimension(nb), intent(in) :: weights

        real(wp) :: R3

        R3 = dot_product(weights, Rtmp**3._wp)
        ntmp = (3._wp/(4._wp*pi))*vftmp/R3

    end subroutine s_comp_n_from_prim

    subroutine s_comp_n_from_cons(vftmp, nRtmp, ntmp, weights)
        $:GPU_ROUTINE(parallelism='[seq]')
        real(wp), intent(in) :: vftmp
        real(wp), dimension(nb), intent(in) :: nRtmp
        real(wp), intent(out) :: ntmp
        real(wp), dimension(nb), intent(in) :: weights

        real(wp) :: nR3

        nR3 = dot_product(weights, nRtmp**3._wp)
        ntmp = sqrt((4._wp*pi/3._wp)*nR3/vftmp)

    end subroutine s_comp_n_from_cons

    impure subroutine s_print_2D_array(A, div)

        real(wp), dimension(:, :), intent(in) :: A
        real(wp), optional, intent(in) :: div

        integer :: i, j
        integer :: local_m, local_n
        real(wp) :: c

        local_m = size(A, 1)
        local_n = size(A, 2)

        if (present(div)) then
            c = div
        else
            c = 1._wp
        end if

        print *, local_m, local_n

        do i = 1, local_m
            do j = 1, local_n
                write (*, fmt="(F12.4)", advance="no") A(i, j)/c
            end do
            write (*, fmt="(A1)") " "
        end do
        write (*, fmt="(A1)") " "

    end subroutine s_print_2D_array

    !> Initializes non-polydisperse bubble modeling
    impure subroutine s_initialize_nonpoly

        integer :: ir
        real(wp) :: rhol0, pl0, uu, D_m, temp, omega_ref
        real(wp), dimension(Nb) :: chi_vw0, cp_m0, k_m0, rho_m0, x_vw

        real(wp), parameter :: k_poly = 1._wp !<
            !! polytropic index used to compute isothermal natural frequency

        real(wp), parameter :: Ru = 8314._wp !<
            !! universal gas constant

        rhol0 = rhoref
        pl0 = pref
#ifdef MFC_SIMULATION
        @:ALLOCATE(pb0(nb), mass_n0(nb), mass_v0(nb), Pe_T(nb))
        @:ALLOCATE(k_n(nb), k_v(nb), omegaN(nb))
        @:ALLOCATE(Re_trans_T(nb), Re_trans_c(nb), Im_trans_T(nb), Im_trans_c(nb))
#else
        @:ALLOCATE(pb0(nb), mass_n0(nb), mass_v0(nb), Pe_T(nb))
        @:ALLOCATE(k_n(nb), k_v(nb), omegaN(nb))
        @:ALLOCATE(Re_trans_T(nb), Re_trans_c(nb), Im_trans_T(nb), Im_trans_c(nb))
#endif

        pb0(:) = dflt_real
        mass_n0(:) = dflt_real
        mass_v0(:) = dflt_real
        Pe_T(:) = dflt_real
        omegaN(:) = dflt_real

        mul0 = fluid_pp(1)%mul0
        ss = fluid_pp(1)%ss
        pv = fluid_pp(1)%pv
        gamma_v = fluid_pp(1)%gamma_v
        M_v = fluid_pp(1)%M_v
        mu_v = fluid_pp(1)%mu_v
        k_v(:) = fluid_pp(1)%k_v

        gamma_n = fluid_pp(2)%gamma_v
        M_n = fluid_pp(2)%M_v
        mu_n = fluid_pp(2)%mu_v
        k_n(:) = fluid_pp(2)%k_v

        gamma_m = gamma_n
        if (thermal == 2) gamma_m = 1._wp

        temp = 293.15_wp
        D_m = 0.242e-4_wp
        uu = sqrt(pl0/rhol0)

        omega_ref = 3._wp*k_poly*Ca + 2._wp*(3._wp*k_poly - 1._wp)/Web

            !!! thermal properties !!!
        ! gas constants
        R_n = Ru/M_n
        R_v = Ru/M_v
        ! phi_vn & phi_nv (phi_nn = phi_vv = 1)
        phi_vn = (1._wp + sqrt(mu_v/mu_n)*(M_n/M_v)**(0.25_wp))**2 &
                 /(sqrt(8._wp)*sqrt(1._wp + M_v/M_n))
        phi_nv = (1._wp + sqrt(mu_n/mu_v)*(M_v/M_n)**(0.25_wp))**2 &
                 /(sqrt(8._wp)*sqrt(1._wp + M_n/M_v))
        ! internal bubble pressure
        pb0(:) = pl0 + 2._wp*ss/(R0ref*R0(:))

        ! mass fraction of vapor
        chi_vw0 = 1._wp/(1._wp + R_v/R_n*(pb0/pv - 1._wp))
        ! specific heat for gas/vapor mixture
        cp_m0 = chi_vw0*R_v*gamma_v/(gamma_v - 1._wp) &
                + (1._wp - chi_vw0)*R_n*gamma_n/(gamma_n - 1._wp)
        ! mole fraction of vapor
        x_vw = M_n*chi_vw0/(M_v + (M_n - M_v)*chi_vw0)
        ! thermal conductivity for gas/vapor mixture
        k_m0 = x_vw*k_v/(x_vw + (1._wp - x_vw)*phi_vn) &
               + (1._wp - x_vw)*k_n/(x_vw*phi_nv + 1._wp - x_vw)
        ! mixture density
        rho_m0 = pv/(chi_vw0*R_v*temp)

        ! mass of gas/vapor computed using dimensional quantities
        mass_n0(:) = 4._wp*(pb0(:) - pv)*pi/(3._wp*R_n*temp*rhol0)*R0(:)**3
        mass_v0(:) = 4._wp*pv*pi/(3._wp*R_v*temp*rhol0)*R0(:)**3
        ! Peclet numbers
        Pe_T(:) = rho_m0*cp_m0(:)*uu*R0ref/k_m0(:)
        Pe_c = uu*R0ref/D_m

        Tw = temp

        ! nondimensional properties
        !if(.not. qbmm) then
        R_n = rhol0*R_n*temp/pl0
        R_v = rhol0*R_v*temp/pl0
        k_n(:) = k_n(:)/k_m0(:)
        k_v(:) = k_v(:)/k_m0(:)
        pb0 = pb0/pl0
        pv = pv/pl0
        Tw = 1._wp
        pl0 = 1._wp

        rhoref = 1._wp
        pref = 1._wp
        !end if

        ! natural frequencies
        omegaN(:) = sqrt(3._wp*k_poly*Ca + 2._wp*(3._wp*k_poly - 1._wp)/(Web*R0))/R0
        do ir = 1, Nb
            call s_transcoeff(omegaN(ir)*R0(ir), Pe_T(ir)*R0(ir), &
                              Re_trans_T(ir), Im_trans_T(ir))
            call s_transcoeff(omegaN(ir)*R0(ir), Pe_c*R0(ir), &
                              Re_trans_c(ir), Im_trans_c(ir))
        end do
        Im_trans_T = 0._wp

    end subroutine s_initialize_nonpoly

    !> Computes the transfer coefficient for the non-polytropic bubble compression process
        !! @param omega natural frequencies
        !! @param peclet Peclet number
        !! @param Re_trans Real part of the transport coefficients
        !! @param Im_trans Imaginary part of the transport coefficients
    elemental subroutine s_transcoeff(omega, peclet, Re_trans, Im_trans)

        real(wp), intent(in) :: omega, peclet
        real(wp), intent(out) :: Re_trans, Im_trans

        complex(wp) :: imag, trans, c1, c2, c3

        imag = (0._wp, 1._wp)

        c1 = imag*omega*peclet
        c2 = sqrt(c1)
        c3 = (exp(c2) - exp(-c2))/(exp(c2) + exp(-c2)) ! TANH(c2)
        trans = ((c2/c3 - 1._wp)**(-1) - 3._wp/c1)**(-1) ! transfer function

        Re_trans = trans
        Im_trans = aimag(trans)

    end subroutine s_transcoeff

    elemental subroutine s_int_to_str(i, res)

        integer, intent(in) :: i
        character(len=*), intent(inout) :: res

        write (res, '(I0)') i
        res = trim(res)
    end subroutine s_int_to_str

    !> Computes the Simpson weights for quadrature
    subroutine s_simpson(local_weight, local_R0)

        real(wp), dimension(:), intent(inout) :: local_weight
        real(wp), dimension(:), intent(inout) :: local_R0

        integer :: ir
        real(wp) :: R0mn, R0mx, dphi, tmp, sd
        real(wp), dimension(nb) :: phi

        sd = poly_sigma
        R0mn = 0.8_wp*exp(-2.8_wp*sd)
        R0mx = 0.2_wp*exp(9.5_wp*sd) + 1._wp

        ! phi = ln( R0 ) & return R0
        do ir = 1, nb
            phi(ir) = log(R0mn) &
                      + (ir - 1._wp)*log(R0mx/R0mn)/(nb - 1._wp)
            local_R0(ir) = exp(phi(ir))
        end do
        dphi = phi(2) - phi(1)

        ! weights for quadrature using Simpson's rule
        do ir = 2, nb - 1
            ! Gaussian
            tmp = exp(-0.5_wp*(phi(ir)/sd)**2)/sqrt(2._wp*pi)/sd
            if (mod(ir, 2) == 0) then
                local_weight(ir) = tmp*4._wp*dphi/3._wp
            else
                local_weight(ir) = tmp*2._wp*dphi/3._wp
            end if
        end do
        tmp = exp(-0.5_wp*(phi(1)/sd)**2)/sqrt(2._wp*pi)/sd
        local_weight(1) = tmp*dphi/3._wp
        tmp = exp(-0.5_wp*(phi(nb)/sd)**2)/sqrt(2._wp*pi)/sd
        local_weight(nb) = tmp*dphi/3._wp
    end subroutine s_simpson

    !> This procedure computes the cross product of two vectors.
    !! @param a First vector.
    !! @param b Second vector.
    !! @return The cross product of the two vectors.
    pure function f_cross(a, b) result(c)

        real(wp), dimension(3), intent(in) :: a, b
        real(wp), dimension(3) :: c

        c(1) = a(2)*b(3) - a(3)*b(2)
        c(2) = a(3)*b(1) - a(1)*b(3)
        c(3) = a(1)*b(2) - a(2)*b(1)
    end function f_cross

    !> This procedure swaps two real numbers.
    !! @param lhs Left-hand side.
    !! @param rhs Right-hand side.
    elemental subroutine s_swap(lhs, rhs)

        real(wp), intent(inout) :: lhs, rhs
        real(wp) :: ltemp

        ltemp = lhs
        lhs = rhs
        rhs = ltemp
    end subroutine s_swap

    !> This procedure creates a transformation matrix.
    !! @param  p Parameters for the transformation.
    !! @return Transformation matrix.
    function f_create_transform_matrix(param, center) result(out_matrix)

        type(ic_model_parameters), intent(in) :: param
        real(wp), dimension(1:3), optional, intent(in) :: center
        real(wp), dimension(1:4, 1:4) :: sc, rz, rx, ry, tr, t_back, t_to_origin, out_matrix

        sc = transpose(reshape([ &
                               param%scale(1), 0._wp, 0._wp, 0._wp, &
                               0._wp, param%scale(2), 0._wp, 0._wp, &
                               0._wp, 0._wp, param%scale(3), 0._wp, &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(sc)))

        rz = transpose(reshape([ &
                               cos(param%rotate(3)), -sin(param%rotate(3)), 0._wp, 0._wp, &
                               sin(param%rotate(3)), cos(param%rotate(3)), 0._wp, 0._wp, &
                               0._wp, 0._wp, 1._wp, 0._wp, &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(rz)))

        rx = transpose(reshape([ &
                               1._wp, 0._wp, 0._wp, 0._wp, &
                               0._wp, cos(param%rotate(1)), -sin(param%rotate(1)), 0._wp, &
                               0._wp, sin(param%rotate(1)), cos(param%rotate(1)), 0._wp, &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(rx)))

        ry = transpose(reshape([ &
                               cos(param%rotate(2)), 0._wp, sin(param%rotate(2)), 0._wp, &
                               0._wp, 1._wp, 0._wp, 0._wp, &
                               -sin(param%rotate(2)), 0._wp, cos(param%rotate(2)), 0._wp, &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(ry)))

        tr = transpose(reshape([ &
                               1._wp, 0._wp, 0._wp, param%translate(1), &
                               0._wp, 1._wp, 0._wp, param%translate(2), &
                               0._wp, 0._wp, 1._wp, param%translate(3), &
                               0._wp, 0._wp, 0._wp, 1._wp], shape(tr)))

        if (present(center)) then
            ! Translation matrix to move center to the origin
            t_to_origin = transpose(reshape([ &
                                            1._wp, 0._wp, 0._wp, -center(1), &
                                            0._wp, 1._wp, 0._wp, -center(2), &
                                            0._wp, 0._wp, 1._wp, -center(3), &
                                            0._wp, 0._wp, 0._wp, 1._wp], shape(tr)))

            ! Translation matrix to move center back to original position
            t_back = transpose(reshape([ &
                                       1._wp, 0._wp, 0._wp, center(1), &
                                       0._wp, 1._wp, 0._wp, center(2), &
                                       0._wp, 0._wp, 1._wp, center(3), &
                                       0._wp, 0._wp, 0._wp, 1._wp], shape(tr)))

            out_matrix = matmul(tr, matmul(t_back, matmul(ry, matmul(rx, matmul(rz, matmul(sc, t_to_origin))))))
        else
            out_matrix = matmul(ry, matmul(rx, rz))
        end if

    end function f_create_transform_matrix

    !> This procedure transforms a vector by a matrix.
    !! @param vec Vector to transform.
    !! @param matrix Transformation matrix.
    subroutine s_transform_vec(vec, matrix)

        real(wp), dimension(1:3), intent(inout) :: vec
        real(wp), dimension(1:4, 1:4), intent(in) :: matrix

        real(wp), dimension(1:4) :: tmp

        tmp = matmul(matrix, [vec(1), vec(2), vec(3), 1._wp])
        vec = tmp(1:3)

    end subroutine s_transform_vec

    !> This procedure transforms a triangle by a matrix, one vertex at a time.
    !! @param triangle Triangle to transform.
    !! @param matrix   Transformation matrix.
    subroutine s_transform_triangle(triangle, matrix, matrix_n)

        type(t_triangle), intent(inout) :: triangle
        real(wp), dimension(1:4, 1:4), intent(in) :: matrix, matrix_n

        integer :: i

        do i = 1, 3
            call s_transform_vec(triangle%v(i, :), matrix)
        end do

        call s_transform_vec(triangle%n(1:3), matrix_n)

    end subroutine s_transform_triangle

    !> This procedure transforms a model by a matrix, one triangle at a time.
    !! @param model  Model to transform.
    !! @param matrix Transformation matrix.
    subroutine s_transform_model(model, matrix, matrix_n)

        type(t_model), intent(inout) :: model
        real(wp), dimension(1:4, 1:4), intent(in) :: matrix, matrix_n

        integer :: i

        do i = 1, size(model%trs)
            call s_transform_triangle(model%trs(i), matrix, matrix_n)
        end do

    end subroutine s_transform_model

    !> This procedure creates a bounding box for a model.
    !! @param model Model to create bounding box for.
    !! @return Bounding box.
    function f_create_bbox(model) result(bbox)

        type(t_model), intent(in) :: model
        type(t_bbox) :: bbox

        integer :: i, j

        if (size(model%trs) == 0) then
            bbox%min = 0._wp
            bbox%max = 0._wp
            return
        end if

        bbox%min = model%trs(1)%v(1, :)
        bbox%max = model%trs(1)%v(1, :)

        do i = 1, size(model%trs)
            do j = 1, 3
                bbox%min = min(bbox%min, model%trs(i)%v(j, :))
                bbox%max = max(bbox%max, model%trs(i)%v(j, :))
            end do
        end do

    end function f_create_bbox

    !> This procedure performs xor on lhs and rhs.
    !! @param lhs logical input.
    !! @param rhs other logical input.
    !! @return xored result.
    elemental function f_xor(lhs, rhs) result(res)

        logical, intent(in) :: lhs, rhs
        logical :: res

        res = (lhs .and. .not. rhs) .or. (.not. lhs .and. rhs)
    end function f_xor

    !> This procedure converts logical to 1 or 0.
    !! @param predicate A logical argument.
    !! @return 1 if .true., 0 if .false..
    elemental function f_logical_to_int(predicate) result(int)

        logical, intent(in) :: predicate
        integer :: int

        if (predicate) then
            int = 1
        else
            int = 0
        end if
    end function f_logical_to_int

    !> This function generates the unassociated legendre polynomials
    !! @param x is the input value
    !! @param l is the degree
    !! @return P is the unassociated legendre polynomial evaluated at x
    recursive function unassociated_legendre(x, l) result(result_P)

        integer, intent(in) :: l
        real(wp), intent(in) :: x
        real(wp) :: result_P

        if (l == 0) then
            result_P = 1._wp
        else if (l == 1) then
            result_P = x
        else
            result_P = ((2*l - 1)*x*unassociated_legendre(x, l - 1) - (l - 1)*unassociated_legendre(x, l - 2))/l
        end if

    end function unassociated_legendre

    !> This function calculates the spherical harmonic function evaluated at x and phi
    !! @param x is the x coordinate
    !! @param phi is the phi coordinate
    !! @param l is the degree
    !! @param m_order is the order
    !! @return Y is the spherical harmonic function evaluated at x and phi
    recursive function spherical_harmonic_func(x, phi, l, m_order) result(Y)

        integer, intent(in) :: l, m_order
        real(wp), intent(in) :: x, phi
        real(wp) :: Y, prefactor, local_pi

        local_pi = acos(-1._wp)
        prefactor = sqrt((2*l + 1)/(4*local_pi)*factorial(l - m_order)/factorial(l + m_order)); 
        if (m_order == 0) then
            Y = prefactor*associated_legendre(x, l, m_order); 
        elseif (m_order > 0) then
            Y = (-1._wp)**m_order*sqrt(2._wp)*prefactor*associated_legendre(x, l, m_order)*cos(m_order*phi); 
        end if

    end function spherical_harmonic_func

    !> This function generates the associated legendre polynomials evaluated
    !! at x with inputs l and m
    !! @param x is the input value
    !! @param l is the degree
    !! @param m_order is the order
    !! @return P is the associated legendre polynomial evaluated at x
    recursive function associated_legendre(x, l, m_order) result(result_P)

        integer, intent(in) :: l, m_order
        real(wp), intent(in) :: x
        real(wp) :: result_P

        if (m_order <= 0 .and. l <= 0) then
            result_P = 1; 
        elseif (l == 1 .and. m_order <= 0) then
            result_P = x; 
        elseif (l == 1 .and. m_order == 1) then
            result_P = -(1 - x**2)**(1._wp/2._wp); 
        elseif (m_order == l) then
            result_P = (-1)**l*double_factorial(2*l - 1)*(1 - x**2)**(l/2); 
        elseif (m_order == l - 1) then
            result_P = x*(2*l - 1)*associated_legendre(x, l - 1, l - 1); 
        else
            result_P = ((2*l - 1)*x*associated_legendre(x, l - 1, m_order) - (l + m_order - 1)*associated_legendre(x, l - 2, m_order))/(l - m_order); 
        end if

    end function associated_legendre

    !> This function calculates the double factorial value of an integer
    !! @param n_in is the input integer
    !! @return R is the double factorial value of n
    elemental function double_factorial(n_in) result(R_result)

        integer, intent(in) :: n_in
        integer, parameter :: int64_kind = selected_int_kind(18) ! at least 18 decimal digits (a 64-bit integer)
        integer(kind=int64_kind) :: R_result
        integer :: i

        R_result = product((/(i, i=n_in, 1, -2)/))

    end function double_factorial

    !> The following function calculates the factorial value of an integer
    !! @param n_in is the input integer
    !! @return R is the factorial value of n
    elemental function factorial(n_in) result(R_result)

        integer, intent(in) :: n_in
        integer, parameter :: int64_kind = selected_int_kind(18) ! at least 18 decimal digits (a 64-bit integer)
Duplicate Code

Many nearly identical GPU_PARALLEL_LOOP sections for x/y/z directions and alt_soundspeed branches repeat patterns with only index/order differences. Consider consolidating via helper kernels/macros to reduce maintenance risk and ensure backend parity between OpenACC and OpenMP.

real(wp) :: advected_qty_val, pressure_val, velocity_val

if (alt_soundspeed) then
    #:call GPU_PARALLEL_LOOP(collapse=3)
        do q_loop = 0, p
            do l_loop = 0, n
                do k_loop = 0, m
                    blkmod1(k_loop, l_loop, q_loop) = ((gammas(1) + 1._wp)*q_prim_vf%vf(E_idx)%sf(k_loop, l_loop, q_loop) + &
                                                       pi_infs(1))/gammas(1)
                    blkmod2(k_loop, l_loop, q_loop) = ((gammas(2) + 1._wp)*q_prim_vf%vf(E_idx)%sf(k_loop, l_loop, q_loop) + &
                                                       pi_infs(2))/gammas(2)
                    alpha1(k_loop, l_loop, q_loop) = q_cons_vf%vf(advxb)%sf(k_loop, l_loop, q_loop)

                    if (bubbles_euler) then
                        alpha2(k_loop, l_loop, q_loop) = q_cons_vf%vf(alf_idx - 1)%sf(k_loop, l_loop, q_loop)
                    else
                        alpha2(k_loop, l_loop, q_loop) = q_cons_vf%vf(advxe)%sf(k_loop, l_loop, q_loop)
                    end if

                    Kterm(k_loop, l_loop, q_loop) = alpha1(k_loop, l_loop, q_loop)*alpha2(k_loop, l_loop, q_loop)* &
                                                    (blkmod2(k_loop, l_loop, q_loop) - blkmod1(k_loop, l_loop, q_loop))/ &
                                                    (alpha1(k_loop, l_loop, q_loop)*blkmod2(k_loop, l_loop, q_loop) + &
                                                     alpha2(k_loop, l_loop, q_loop)*blkmod1(k_loop, l_loop, q_loop))
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP
end if

select case (idir)


High-level Suggestion

Refactor large, duplicated code blocks in files like m_igr.fpp and m_riemann_solvers.fpp to improve maintainability before adding OpenMP support. This could involve using macros or parameterized subroutines to handle logic that varies by direction. [High-level, importance: 7]

Solution Walkthrough:

Before:

subroutine s_igr_source_terms(idir, ...)
    if (idir == 1) then
        ! Massive block of code for x-direction
        #:call GPU_PARALLEL_LOOP(...)
            do l = 0, p
                do k = 0, n
                    do j = -1, m
                        ! ... hundreds of lines of calculations for x-fluxes
                    end do
                end do
            end do
        #:endcall GPU_PARALLEL_LOOP
    else if (idir == 2) then
        ! Massive block of code for y-direction (very similar to x-direction)
        ...
    else if (idir == 3) then
        ! Massive block of code for z-direction (very similar to x-direction)
        ...
    end if
end subroutine

After:

subroutine s_compute_directional_terms(norm_dir, ...)
    ! Generic logic for any direction
    #:call GPU_PARALLEL_LOOP(...)
        do l = ..., ...
            do k = ..., ...
                do j = ..., ...
                    ! ... hundreds of lines of calculations using norm_dir
                end do
            end do
        end do
    #:endcall GPU_PARALLEL_LOOP
end subroutine

subroutine s_igr_source_terms(idir, ...)
    call s_compute_directional_terms(idir, ...)
end subroutine

Comment on lines +574 to +589
$:GPU_LOOP(parallelism='[seq]')
do i = 1, strxe - strxb + 1
    tau_e_L(i) = qL_prim_rs${XYZ}$_vf(j, k, l, strxb - 1 + i)
    tau_e_R(i) = qR_prim_rs${XYZ}$_vf(j + 1, k, l, strxb - 1 + i)
    ! Elastic contribution to energy if G large enough
    !TODO take out if statement if stable without
    if ((G_L > 1000) .and. (G_R > 1000)) then
        E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
        E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        ! Double for shear stresses
        if (any(strxb - 1 + i == shear_indices)) then
            E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
            E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        end if
    end if
end if
end do
end if
end do

Suggestion: Optimize the GPU loop by replacing the any intrinsic function with a pre-calculated boolean array for a more efficient lookup of shear indices. [general, importance: 7]

Suggested change
logical :: is_shear(strxe - strxb + 1)
integer :: ii  ! index for building the lookup (declaration added for completeness)

is_shear = .false.
do ii = 1, size(shear_indices)
    if (shear_indices(ii) >= strxb .and. shear_indices(ii) <= strxe) then
        is_shear(shear_indices(ii) - strxb + 1) = .true.
    end if
end do
$:GPU_LOOP(parallelism='[seq]')
do i = 1, strxe - strxb + 1
    tau_e_L(i) = qL_prim_rs${XYZ}$_vf(j, k, l, strxb - 1 + i)
    tau_e_R(i) = qR_prim_rs${XYZ}$_vf(j + 1, k, l, strxb - 1 + i)
    ! Elastic contribution to energy if G large enough
    !TODO take out if statement if stable without
    if ((G_L > 1000) .and. (G_R > 1000)) then
        E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
        E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        ! Double for shear stresses
        if (is_shear(i)) then
            E_L = E_L + (tau_e_L(i)*tau_e_L(i))/(4._wp*G_L)
            E_R = E_R + (tau_e_R(i)*tau_e_R(i))/(4._wp*G_R)
        end if
    end if
end do

Comment on lines +222 to +224
jac_sf(1)%sf => jac
$:GPU_ENTER_DATA(copyin='[jac_sf(1)%sf]')
$:GPU_ENTER_DATA(attach='[jac_sf(1)%sf]')

Suggestion: Remove the copyin directive for jac_sf(1)%sf to avoid overwriting the GPU-initialized jac array with uninitialized data from the host. [possible issue, importance: 9]

Suggested change
jac_sf(1)%sf => jac
$:GPU_ENTER_DATA(attach='[jac_sf(1)%sf]')

@anandrdbz (Contributor)

This looks good from my side. It will obviously need a lot of changes for AMD and to stay up to date with openmp_rebased, but that can be done once this is merged.

@sbryngelson (Member)

Great, can we confirm that there's no slowdown for OpenACC before merging? Benchmarking isn't working because the benchmark files have new names.

@wilfonba (Collaborator)

It also failed to build with CCE GPU because of parameters in declare target statements in chemistry

@anandrdbz (Contributor)

It also failed to build with CCE GPU because of parameters in declare target statements in chemistry

Didn't it pass the test cases on Frontier?

@wilfonba (Collaborator)

It also failed to build with CCE GPU because of parameters in declare target statements in chemistry

Didn't it pass the test cases on Frontier?

The test suite doesn't use case optimization, so it doesn't have the same problem

@sbryngelson (Member)

@prathi-wind does this seem ready to merge?
