An LLM serving engine and custom CUDA kernels built from scratch, documenting every step along the way.
- Flash Attention 1 (done)
  - between 2 and 4x faster than PyTorch naive MHA
- Brought up GPT-2
- Flash Attention 2
- Flash Attention 3
- KV Cache
- Paged Attention
- Tensor Parallelism
- MoE support
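The "between 2 and 4x faster than PyTorch naive MHA" figure above is presumably measured against an unfused baseline that materializes the full attention matrix. A rough NumPy sketch of such a naive MHA (illustrative only; the actual benchmark baseline may differ):

```python
import numpy as np

def naive_mha(q, k, v):
    """Unfused multi-head attention: softmax(QK^T / sqrt(d)) V.

    q, k, v: (batch, heads, seq, head_dim). This materializes the full
    (seq, seq) score matrix, which is exactly what fused FMHA kernels avoid.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)   # (b, h, s, s)
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v                                    # (b, h, s, d)

# Tiny smoke test with the repo's supported head dim of 64
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 2, 8, 64))
k = rng.standard_normal((1, 2, 8, 64))
v = rng.standard_normal((1, 2, 8, 64))
print(naive_mha(q, k, v).shape)  # (1, 2, 8, 64)
```

A fused kernel produces the same result without ever writing the (seq, seq) score matrix to global memory, which is where most of the speedup comes from.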
Current limitations:
- No FP16, FP8, or FP4 support (fp32 only)
- Head dim must equal 64
- Can only handle a batch size of 1 (unless all sequences in the batch are evenly sized)
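These limits could be guarded at the Python boundary before dispatching to a kernel; a hypothetical helper (`check_fmha_inputs` and its signature are illustrative, not CobraML2's actual API):

```python
def check_fmha_inputs(seq_lens, head_dim, dtype="float32"):
    """Hypothetical pre-flight validation mirroring the limitations above.

    seq_lens: per-sequence lengths in the batch. Illustrative only; the
    real CobraML2 entry points may validate differently.
    """
    if dtype != "float32":
        raise ValueError("only fp32 is supported (no FP16/FP8/FP4 yet)")
    if head_dim != 64:
        raise ValueError("head dim must equal 64")
    if len(seq_lens) > 1 and len(set(seq_lens)) != 1:
        raise ValueError("batch > 1 requires all sequences to be evenly sized")

check_fmha_inputs([512], 64)        # single sequence: ok
check_fmha_inputs([512, 512], 64)   # evenly sized batch: ok
try:
    check_fmha_inputs([512], 128)
except ValueError as e:
    print(e)  # head dim must equal 64
```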
Requirements:
- Python >= 3.10
- CUDA toolkit with nvcc
- An NVIDIA GPU with compute capability >= 8.0 (Ampere+)

So far, all code has only been tested on systems with CUDA >= 12.8 and Ubuntu 22.04.
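A quick sanity check for the first two requirements, using only the standard library (this sketch checks the interpreter version and that nvcc is on PATH; it cannot verify GPU compute capability):

```python
import shutil
import sys

def check_environment(version_info=sys.version_info, which=shutil.which):
    """Return a list of missing prerequisites (parameters injected for testability)."""
    problems = []
    if version_info < (3, 10):
        problems.append("Python >= 3.10 required")
    if which("nvcc") is None:
        problems.append("CUDA toolkit with nvcc not found on PATH")
    return problems

for problem in check_environment():
    print(problem)
```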
```bash
git clone https://github.com/govindansriram/CobraML2.git
cd CobraML2
sudo chmod +x ./runner.sh
```

Install torch for your CUDA version, then build cobraml:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install --no-build-isolation -e ".[dev]"
```

Replace `cu130` with your CUDA version (`cu124`, `cu126`, `cu128`, etc.); check with `nvcc --version`.
The C++ build uses CMake to locate PyTorch headers from the .venv. You don't need to build the full Python package first, but torch must be installed in the venv.
```bash
# Build and run a specific test
./runner.sh -r test_fmha_cc

# Run all tests
./runner.sh -a

# Run with benchmarking enabled
./runner.sh -c -b -r test_fmha_cc

# Filter specific test cases
./runner.sh -r test_fmha_cc -- --gtest_filter=*causal*
```

The Python tests require the Python package to be built first.
```bash
# Run all tests
pytest

# Run with benchmarking
pytest --benchmark

# Filter specific test cases
pytest -k "test_fmha_fp32[4-512-16-64-True]"
```

The runner.sh script is the main entry point for building, testing, profiling, and formatting the project.
| Flag | Description |
|---|---|
| `-h, --help` | Show help message |
| `-c, --clean` | Clean build (removes build directory) |
| `-b, --benchmark` | Enable benchmarking |
| `-t, --target <name>` | Build specific target |
| `-r, --run <name>` | Build and run specific target |
| `-a, --run-all` | Build and run all tests via ctest |
| `-f, --format [file]` | Format all files, or a specific file |
| `-p, --profile <name>` | Build and profile target with ncu |
| `-o, --output <name>` | Custom name for `.ncu-rep` file |
| `--profile-opts <opts>` | Additional ncu options |
| `--no-tests` | Disable building tests |
| `--` | Pass remaining args to executable |
```bash
# Build everything
./runner.sh

# Clean build
./runner.sh -c

# Build specific target
./runner.sh -t test_fmha_cc

# Build and run a test
./runner.sh -r test_fmha_cc

# Run all tests
./runner.sh -a

# Run with gtest filter
./runner.sh -r test_fmha_cc -- --gtest_filter=*Perf*

# Clean build with benchmarking enabled
./runner.sh -c -b -r test_fmha_cc

# Profile a kernel with ncu
./runner.sh -p test_fmha_cc

# Profile with custom output name
./runner.sh -p test_fmha_cc -o my_profile

# Profile specific kernel
./runner.sh -p test_fmha_cc --profile-opts '--kernel-name fmha'
```

All C++ files must be formatted with clang-format:
```bash
./runner.sh -f
./runner.sh -f include/cobraml2/kernels/fmha_cc.cuh
```

Python code is checked and formatted with ruff:

```bash
ruff check python/
ruff format python/
```
