High-performance inter-node communication over AWS Elastic Fabric Adapter (EFA) is a key enabler for scaling large-language-model (LLM) training efficiently. Existing benchmarking tools primarily focus on collective communication libraries such as NCCL or NVSHMEM, making it difficult to isolate and understand the raw performance characteristics of EFA itself. At the same time, GPU-Initiated Networking (GIN) has gained significant attention following the release of DeepEP, which demonstrated substantial MoE performance gains by enabling GPU-driven communication.
This repository provides a focused benchmarking framework for EFA, designed to analyze low-level inter-node communication performance. It complements existing tools such as nccl-tests by enabling direct measurement of EFA latency, bandwidth, and GIN behavior, helping engineers and researchers optimize distributed training pipelines on AWS.
The following snippets demonstrate how to build the source code for a simple test. To save time on environment setup and dependency management, this repository provides a Dockerfile that can be used to build the project in a consistent and reproducible environment.
```bash
# build a Docker image
docker build -f Dockerfile -t cuda:latest .

# build examples
make build
```

If enroot is available in your environment, you can launch the experiment using the following commands:
```bash
# build an enroot sqsh file
make sqush

# launch an interactive enroot environment
enroot create --name cuda cuda+latest.sqsh
enroot start --mount /fsx:/fsx cuda /bin/bash

# run a test via enroot on a Slurm cluster
srun -N 1 \
  --container-image "${PWD}/cuda+latest.sqsh" \
  --container-mounts /fsx:/fsx \
  --container-name cuda \
  --mpi=pmix \
  --ntasks-per-node=1 \
"${PWD}/build/experiments/affinity/affinity"When implementing custom algorithms directly over EFA, developers often face the complexity of asynchronous RDMA APIs and event-driven scheduling. To simplify this workflow, this repository includes a coroutine-based scheduler built on C++20 coroutine, enabling a more straightforward programming model without manual callback management. The example below shows how to build a PoC using pure libfabric and MPI.
```cpp
#include <vector>

#include <io/runner.h>
#include <rdma/fabric/memory.h>

#include <bench/mpi/fabric.cuh>

struct PairBench {
  int target;

  template <typename T>
  void operator()(FabricBench& peer, FabricBench::Buffers<T>& send, FabricBench::Buffers<T>& recv) {
    // Register each EFA endpoint with the coroutine scheduler before running.
    for (auto& efa : peer.efas) IO::Get().Join<FabricSelector>(efa);
    Run([&]() -> Coro<> {
      size_t channel = 0;
      if (peer.mpi.GetWorldRank() == 0) {
        // Rank 0 sends to the target rank and waits for the echo.
        co_await send[target]->Sendall(channel);
        co_await recv[target]->Recvall(channel);
      } else if (peer.mpi.GetWorldRank() == target) {
        // The target rank receives from rank 0 and echoes the data back.
        co_await recv[0]->Recvall(channel);
        co_await send[0]->Sendall(channel);
      }
      for (auto& efa : peer.efas) IO::Get().Quit<FabricSelector>(efa);
    }());
  }
};

template <typename BufType>
struct Test {
  static std::vector<BenchResult> Run(size_t size) {
    FabricBench peer;
    peer.Exchange();  // exchange RDMA connection details via MPI
    peer.Connect();
    int rank = peer.mpi.GetWorldRank();
    int world = peer.mpi.GetWorldSize();
    auto send = peer.Alloc<BufType>(size, rank);
    auto recv = peer.Alloc<BufType>(size, -1);
    auto noop = [](auto&, auto&) {};
    std::vector<BenchResult> res;
    // Benchmark rank 0 against every other rank, 100 iterations per pair.
    for (int t = 1; t < world; ++t) res.emplace_back(peer.Bench(send, recv, PairBench{t}, noop, 100));
    return res;
  }
};

using DeviceTest = Test<SymmetricDMAMemory>;

// mpirun -np 2 --npernode 1 example
int main(int argc, char *argv[]) {
  size_t bufsize = 128 << 10;  // 128 KiB
  auto results = DeviceTest::Run(bufsize);
  return 0;
}
```

To learn how to use the library provided in this repository, please refer to the following example experiments, which illustrate common usage patterns and benchmarking scenarios:
- Affinity: Demonstrates how to query and enumerate GPU device information.
- EFA: Shows how to discover and inspect available EFA devices.
- Echo: Implements a simple TCP echo server/client to illustrate usage of the coroutine-based scheduler.
- Bootstrap: Illustrates exchanging RDMA details via MPI communication.
- Send/Recv: Benchmarks libfabric SEND/RECV operations over EFA.
- Write: Benchmarks libfabric WRITE operations over EFA.
- Alltoall: Benchmarks a simple all-to-all communication pattern over EFA.
- Queue: Benchmarks a multi-producer, single-consumer (MPSC) queue between GPU and CPU.
- Proxy: Benchmarks GPU-initiated RDMA writes via a CPU proxy coroutine; a conceptual sketch of this GPU-to-CPU handoff follows this list.
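The Queue and Proxy experiments share one underlying pattern: GPU threads publish work descriptors into a host-visible MPSC ring, and a CPU proxy drains the ring and performs the network operations on the GPU's behalf. The sketch below is a minimal, self-contained illustration of that handoff only; it does not use this repository's API, and the `Ring`/`WorkItem` types, the slot-publishing scheme, and the `printf` stand-in for the RDMA write are assumptions made for the example.

```cuda
#include <cuda_runtime.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

constexpr int kSlots = 1024;  // ring capacity; must exceed in-flight items in this toy example

struct WorkItem {
  uint64_t addr;   // placeholder for a remote address
  uint32_t bytes;  // payload size
  uint32_t valid;  // set to 1 once the slot is fully written
};

struct Ring {
  WorkItem items[kSlots];
  unsigned long long tail;  // next free slot, advanced by GPU producers
};

// Multi-producer side: each GPU thread atomically claims a slot and publishes a descriptor.
__global__ void Produce(Ring* ring, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  unsigned long long slot = atomicAdd(&ring->tail, 1ULL) % kSlots;
  ring->items[slot].addr = 0x1000 + 128ULL * i;  // dummy payload
  ring->items[slot].bytes = 128;
  __threadfence_system();      // make the payload visible to the host first
  ring->items[slot].valid = 1; // then publish the slot
}

int main() {
  // Host-mapped pinned memory so the GPU can write descriptors the CPU can poll
  // (error handling omitted for brevity).
  Ring* ring = nullptr;
  cudaHostAlloc(reinterpret_cast<void**>(&ring), sizeof(Ring), cudaHostAllocMapped);
  std::memset(ring, 0, sizeof(Ring));

  const int n = 256;
  Produce<<<(n + 127) / 128, 128>>>(ring, n);

  // Single-consumer side: the CPU proxy drains slots in order. A real proxy
  // would translate each WorkItem into an RDMA write instead of printing it.
  int head = 0;
  for (int drained = 0; drained < n;) {
    volatile WorkItem* w = &ring->items[head];
    if (w->valid) {
      std::printf("proxy: would RDMA-write %u bytes to 0x%llx\n",
                  (unsigned)w->bytes, (unsigned long long)w->addr);
      w->valid = 0;
      head = (head + 1) % kSlots;
      ++drained;
    }
  }

  cudaDeviceSynchronize();
  cudaFreeHost(ring);
  return 0;
}
```

Writing the descriptor payload before setting `valid`, with a system-wide fence in between, is what lets the host poll without locks; in the Proxy experiment that consumer role is played by the CPU proxy coroutine issuing the actual RDMA writes.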
See CITATION.cff for machine-readable citation information.
```bibtex
@software{tsai2025aws_efa_gpu_benchmark,
  title    = {AWS EFA GPU Benchmark},
  author   = {Tsai, Chang-Ning},
  year     = {2025},
  month    = {12},
  url      = {https://github.com/crazyguitar/Libefaxx},
  version  = {1.0.0},
  abstract = {High-performance RDMA communication experiments using CUDA and Amazon Elastic Fabric Adapter (EFA)},
  keywords = {RDMA, CUDA, EFA, High-Performance Computing, GPU Communication, Amazon EFA, Fabric, MPI}
}
```

Tsai, C.-N. (2025). AWS EFA GPU Benchmark (Version 1.0.0) [Computer software]. https://github.com/crazyguitar/Libefaxx
- Q. Le, "Libfabric EFA Series," 2024. [link]
- K. Punniyamurthy et al., "Optimizing Distributed ML Communication," arXiv:2305.06942, 2023. [link]
- S. Liu et al., "GPU-Initiated Networking," arXiv:2511.15076, 2025. [link]
- Netcan, "asyncio: C++20 coroutine library," GitHub. [link]
- UCCL Project, "UCCL: User-space Collective Communication Library," GitHub. [link]
- Microsoft, "MSCCL++: Multi-Scale Collective Communication Library," GitHub. [link]
- DeepSeek-AI, "DeepEP: Expert parallelism with GPU-initiated communication," GitHub. [link]