CUDA/ROCm kernels can be built when you configure with -DUSE_CUDA=ON or -DUSE_ROCM=ON (see python/CMakeLists.txt). The bindings expose t81lib.where, t81lib.clamp, t81lib.lerp, and t81lib.addcmul, which accept either NumPy buffers or PyTorch tensors and dispatch directly to the GPU kernels.
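The element-wise semantics of these four ops can be sketched with plain NumPy; this is a minimal illustration with hypothetical input values, not a call into t81lib itself (the GPU kernels compute the same per-element results on device):

```python
import numpy as np

# Hypothetical inputs for illustration only.
cond = np.array([True, False, True])
a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([10.0, 20.0, 30.0], dtype=np.float32)
t = np.float32(0.25)

where_out = np.where(cond, a, b)   # pick a[i] where cond[i], else b[i]
clamp_out = np.clip(b, 5.0, 25.0)  # bound each value to [min, max]
lerp_out = a + t * (b - a)         # linear interpolation between a and b
addcmul_out = a + t * (a * b)      # fused input + value * tensor1 * tensor2
```

When PyTorch is available, passing CUDA tensors for `a`, `b`, and `cond` dispatches the same operations to the GPU kernels instead.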
The dispatcher is driven entirely by t81::TensorMetadata (include/t81/tensor_metadata.hpp): a lightweight struct that carries device tags, dtype codes, shape, strides, and data_ptr so the runtime can call the right CUDA/HIP kernel without copies. Torch-aware helpers in python/bindings.cpp create metadata for GPU tensors (including a requires_sync flag when needed) and fall back to contiguous CPU buffers when torch is unavailable.
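A rough Python mirror of that metadata, assuming only the fields named above (the authoritative definition is the C++ struct in include/t81/tensor_metadata.hpp; the field names and the builder below are illustrative, not the bindings' actual API):

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class TensorMetadata:
    """Hypothetical Python analogue of t81::TensorMetadata."""
    device: str                # device tag, e.g. "cpu", "cuda", "hip"
    dtype_code: int            # runtime dtype code (assumed to be an int enum)
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]   # in elements here; NumPy reports bytes
    data_ptr: int              # raw buffer address the kernel uses, no copy
    requires_sync: bool = False  # set for GPU tensors that need a sync first

def metadata_from_numpy(arr: np.ndarray) -> TensorMetadata:
    """Sketch of the contiguous-CPU fallback path described above."""
    arr = np.ascontiguousarray(arr)  # bindings fall back to contiguous buffers
    return TensorMetadata(
        device="cpu",
        dtype_code=arr.dtype.num,
        shape=arr.shape,
        strides=tuple(s // arr.itemsize for s in arr.strides),
        data_ptr=arr.ctypes.data,
        requires_sync=False,
    )
```

The key design point this captures: the dispatcher only ever sees this small descriptor, so NumPy buffers and Torch tensors look identical to the kernel-selection logic.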
t81lib.gemm_ternary now shares the same metadata plumbing. The CUDA/HIP kernels view ScalarType::TernaryLimb buffers as packed core::limb rows (TRYTES_PER_LIMB trytes packed into 16 bytes) and require contiguous layouts: np.dtype('V16') rows or torch.uint8 views, shaped (M, K_limbs) and (K_limbs, N) respectively. The accumulator C must remain float32 and contiguous. With Backend::Auto, the binding dispatches to CUDA/ROCm when available and otherwise falls back to the CPU path. Review the GPU dispatch diagram to see how metadata flows from NumPy/Torch -> CUDA/HIP -> back to Python.
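The buffer layout gemm_ternary expects can be sketched as follows. The sizes are hypothetical and the buffers are left zeroed; this shows only the dtype and shape constraints, not the actual tryte packing or the kernel call:

```python
import numpy as np

# Hypothetical problem sizes for illustration.
M, N, K_limbs = 4, 5, 2
LIMB_BYTES = 16  # one core::limb: TRYTES_PER_LIMB trytes packed into 16 bytes

limb = np.dtype('V16')                    # opaque 16-byte limb element
A = np.zeros((M, K_limbs), dtype=limb)    # packed ternary operand, contiguous
B = np.zeros((K_limbs, N), dtype=limb)    # packed ternary operand, contiguous
C = np.zeros((M, N), dtype=np.float32)    # accumulator: float32 and contiguous

assert A.dtype.itemsize == LIMB_BYTES and A.flags['C_CONTIGUOUS']
assert C.dtype == np.float32 and C.flags['C_CONTIGUOUS']

# With buffers laid out like this, one would call the binding, e.g.:
#   t81lib.gemm_ternary(A, B, C)   # Backend::Auto picks CUDA/ROCm or CPU
```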