A CUDA-based 3D topology optimization solver using a multigrid-preconditioned conjugate gradient method.
- Matrix-free stencil operators for memory-efficient computation
- Multigrid V-cycle preconditioner for faster convergence
- Conjugate gradient solver with diagonal preconditioning
- Multi-GPU support for large-scale problems
- Double precision computation for numerical stability
- VTK output for visualization in ParaView
- NVIDIA GPU with CUDA Compute Capability 8.9+ (tested on RTX 4090)
- CUDA Toolkit 12.2+
- NVCC compiler
- Make
```
make clean
make
```
This will generate the executable `top3d_cuda`.
```
./top3d_cuda <nelx> <nely> <nelz> <num_mg_levels>
```
Parameters:
- `nelx`, `nely`, `nelz`: number of elements in the X, Y, and Z directions
- `num_mg_levels`: number of multigrid levels (typically 4)
Example:
```
# 16x8x8 problem with 4 multigrid levels
./top3d_cuda 16 8 8 4

# 64x32x32 problem
./top3d_cuda 64 32 32 4
```
The solver generates VTK files for visualization:
- `result_<nelx>_<nely>_<nelz>.vtk` - Optimized density field
Open the VTK file in ParaView to visualize the optimized structure.
- Method: SIMP (Solid Isotropic Material with Penalization)
- Optimizer: Optimality Criteria (OC) method
- Volume fraction: 12%
- Penalty factor: 3.0
- Filter radius: 1.5
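The OC update scales each density by a factor derived from its sensitivity and bisects the volume Lagrange multiplier until the volume constraint is met. A simplified host-side sketch follows (the real update runs in `optimization_kernels.cu`; the move limit `move = 0.2` and damping exponent `eta = 0.5` are the usual textbook choices, not values read from this solver):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>

// Optimality Criteria update: scale each density by (-dc/(lambda*dv))^eta,
// clamp to a move limit and to [0, 1], and bisect lambda until the volume
// constraint sum(rho)/n == volfrac is satisfied.
std::vector<double> ocUpdate(const std::vector<double>& rho,
                             const std::vector<double>& dc,   // compliance sensitivities (<= 0)
                             const std::vector<double>& dv,   // volume sensitivities (> 0)
                             double volfrac, double move = 0.2, double eta = 0.5) {
    std::vector<double> rhoNew(rho.size());
    double l1 = 1e-9, l2 = 1e9;
    while ((l2 - l1) / (l1 + l2) > 1e-6) {
        double lmid = 0.5 * (l1 + l2), vol = 0.0;
        for (size_t e = 0; e < rho.size(); ++e) {
            double scale = std::pow(-dc[e] / (lmid * dv[e]), eta);
            double r = std::clamp(rho[e] * scale, rho[e] - move, rho[e] + move);
            rhoNew[e] = std::clamp(r, 0.0, 1.0);
            vol += rhoNew[e];
        }
        // Too much material: the multiplier was too small.
        if (vol > volfrac * rho.size()) l1 = lmid; else l2 = lmid;
    }
    return rhoNew;
}
```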
- Method: Preconditioned Conjugate Gradient (PCG)
- Preconditioner: Diagonal (Jacobi)
- Convergence tolerance: 1e-5
- Max iterations: 500
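The PCG loop with a diagonal (Jacobi) preconditioner reduces to the standard recurrence with `z = r / diag`. A dense host-side sketch under those settings (the actual solver applies the matrix-free stencil on the GPU; `applyA` and the variable names here are illustrative):

```cpp
#include <vector>
#include <cmath>
#include <functional>

// Jacobi-preconditioned CG: solves A x = b given a matrix-free operator
// applyA and the diagonal of A. Returns the number of iterations used.
int pcgJacobi(const std::function<void(const std::vector<double>&, std::vector<double>&)>& applyA,
              const std::vector<double>& diag, const std::vector<double>& b,
              std::vector<double>& x, double tol = 1e-5, int maxIter = 500) {
    const size_t n = b.size();
    std::vector<double> r(n), z(n), p(n), Ap(n);
    applyA(x, Ap);
    double rz = 0.0, bnorm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        r[i] = b[i] - Ap[i];
        z[i] = r[i] / diag[i];          // apply M^{-1} = D^{-1}
        rz += r[i] * z[i];
        bnorm += b[i] * b[i];
    }
    bnorm = std::sqrt(bnorm);
    p = z;
    for (int it = 0; it < maxIter; ++it) {
        applyA(p, Ap);
        double pAp = 0.0;
        for (size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];
        const double alpha = rz / pAp;
        double rnorm = 0.0;
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rnorm += r[i] * r[i];
        }
        if (std::sqrt(rnorm) < tol * bnorm) return it + 1;   // relative residual test
        double rzNew = 0.0;
        for (size_t i = 0; i < n; ++i) { z[i] = r[i] / diag[i]; rzNew += r[i] * z[i]; }
        const double beta = rzNew / rz;
        rz = rzNew;
        for (size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return maxIter;
}
```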
- Levels: 4 (configurable)
- Smoother: Damped Jacobi (ω = 0.6)
- Coarse grid: 200 Jacobi iterations
- Restriction: Full-weighting
- Prolongation: Trilinear interpolation
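The damped Jacobi smoother used at each level (and, run 200 times, as the coarse-grid solver) is a single fused update, x ← x + ω D⁻¹(b − Ax). A host-side sketch with ω = 0.6, where `applyA` again stands in for the matrix-free stencil:

```cpp
#include <vector>
#include <functional>

// One damped Jacobi sweep: x <- x + omega * D^{-1} (b - A x).
// omega = 0.6 targets the high-frequency error modes that the
// coarser multigrid levels cannot represent.
void dampedJacobiSweep(const std::function<void(const std::vector<double>&, std::vector<double>&)>& applyA,
                       const std::vector<double>& diag, const std::vector<double>& b,
                       std::vector<double>& x, double omega = 0.6) {
    std::vector<double> Ax(x.size());
    applyA(x, Ax);
    for (size_t i = 0; i < x.size(); ++i)
        x[i] += omega * (b[i] - Ax[i]) / diag[i];
}
```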
- Elements: 1-indexed, range [1, nelx] × [1, nely] × [1, nelz]
- Nodes: 1-indexed, range [1, nelx+1] × [1, nely+1] × [1, nelz+1]
- Element index: `i * (wrapy-1) * (wrapz-1) + k * (wrapy-1) + j`
- Node index: `i * wrapy * wrapz + k * wrapy + j`
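These formulas can be written as small helpers. `wrapy` and `wrapz` are the padded grid extents in Y and Z; they are treated here as plain parameters, since their exact relation to `nely`/`nelz` depends on the solver's padding scheme:

```cpp
#include <cstdint>

// Element index in the padded grid: (wrapy-1) x (wrapz-1) elements
// per X-slice. i, j, k are 1-indexed, per the conventions above.
inline uint64_t elementIndex(uint64_t i, uint64_t j, uint64_t k,
                             uint64_t wrapy, uint64_t wrapz) {
    return i * (wrapy - 1) * (wrapz - 1) + k * (wrapy - 1) + j;
}

// Node index in the padded grid: wrapy x wrapz nodes per X-slice.
inline uint64_t nodeIndex(uint64_t i, uint64_t j, uint64_t k,
                          uint64_t wrapy, uint64_t wrapz) {
    return i * wrapy * wrapz + k * wrapy + j;
}
```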
- Fixed: Left face (i=1), all DOFs = 0
- Load: Right face edge (i=nelx, k=1, j=1 to nely+1), force in -Z direction
- 8-node hexahedral elements with 24 DOFs per element
- 27-point stencil for each node (8 neighboring elements)
- Material interpolation: E(ρ) = Emin + ρ³(E0 - Emin)
This CUDA implementation is based on the OpenMP multi-GPU version but with some key differences:
| Feature | OpenMP Version | CUDA Version |
|---|---|---|
| Coarse grid solver | CHOLMOD (direct) | Jacobi (iterative) |
| CG iterations | 5-20 per step | 200-230 per step |
| Precision | Mixed (float/double) | Double precision |
| Final compliance (16x8x8) | ~8,235 | ~6,447 |
- Coarse grid solver: The OpenMP version uses the CHOLMOD sparse direct solver on the coarsest multigrid level, while the CUDA version runs 200 Jacobi iterations. This leads to:
  - Different convergence behavior
  - Different optimization paths
  - Different final structures
- Numerical precision: The CUDA version uses double precision for all computations to improve stability.
- Results: Both versions produce valid optimized structures, but they may differ due to the different coarse-grid solvers and the resulting optimization paths.
```
TopOpt-CUDA/
├── src/                         # Source files
│   ├── main.cu                  # Main entry point
│   ├── solver.cu                # Solver initialization
│   ├── cg_solver.cu             # CG solver with multigrid
│   ├── optimization.cu          # Topology optimization loop
│   ├── stencil_kernels.cu       # Matrix-free stencil operators
│   ├── multigrid_kernels.cu     # Restriction/prolongation
│   ├── optimization_kernels.cu  # Compliance/sensitivity
│   ├── vector_ops.cu            # Vector operations
│   ├── vtk_output.cu            # VTK file generation
│   ├── multi_gpu.cu             # Multi-GPU support
│   ├── halo_exchange.cu         # GPU communication
│   ├── distributed_cg.cu        # Distributed CG solver
│   └── distributed_stencil.cu   # Distributed stencil
├── include/                     # Header files
│   ├── definitions.h            # Type definitions
│   ├── solver.h                 # Solver data structures
│   ├── cuda_kernels.cuh         # Kernel declarations
│   └── multi_gpu.h              # Multi-GPU declarations
├── docs/                        # Documentation
├── Makefile                     # Build configuration
└── README.md                    # This file
```
Tested on dual NVIDIA RTX 4090 GPUs:
| Problem Size | Elements | DOFs | Time/Iteration | Memory |
|---|---|---|---|---|
| 16×8×8 | 1,024 | 11,286 | ~0.5s | <1 GB |
| 64×32×32 | 65,536 | 295,470 | ~2s | ~2 GB |
| 128×64×64 | 524,288 | 2,146,689 | ~15s | ~10 GB |
- Implement cuSOLVER direct solver for coarse grid to match OpenMP version performance
- Optimize memory access patterns for better GPU utilization
- Add support for different boundary conditions and load cases
- Implement continuation method for better convergence
- Add support for stress constraints and multiple materials
- Andreassen, E., Clausen, A., Schevenels, M., Lazarov, B. S., & Sigmund, O. (2011). Efficient topology optimization in MATLAB using 88 lines of code. Structural and Multidisciplinary Optimization, 43(1), 1-16.
- Aage, N., Andreassen, E., & Lazarov, B. S. (2015). Topology optimization using PETSc: An easy-to-use, fully parallel, open source topology optimization framework. Structural and Multidisciplinary Optimization, 51(3), 565-572.
- Liu, H., Zong, H., Shi, T., & Xia, Q. (2020). M-VCUT level set method for optimizing cellular structures. Computer Methods in Applied Mechanics and Engineering, 367, 113154.
This project is for research and educational purposes.
- Based on the OpenMP multi-GPU topology optimization framework
- Developed with assistance from Claude (Anthropic)
For questions or issues, please open an issue on GitHub.