
Performant kernels, and other ML Systems integrations


govindansriram/CobraML2



About

LLM serving engine and custom kernels built from scratch, documenting every step along the way.

Accomplishments

  1. Flash Attention 1 (done)
    • 2–4× faster than PyTorch's naive MHA
  2. Brought up GPT-2

Milestones

  1. Flash Attention 2
  2. Flash Attention 3
  3. KV Cache
  4. Paged Attention
  5. Tensor Parallelism
  6. MoE support

Current limitations

  1. No FP16, FP8, or FP4 support (FP32 only)
  2. Head dim must equal 64
  3. Only one batch can be processed at a time (unless all batches are evenly sized)
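These constraints can be checked up front before dispatching to the kernel. A minimal sketch (the helper name, dtype strings, and shape convention below are illustrative, not part of CobraML2's API; they mirror only the limits listed above):

```python
# Current kernel limits, per the list above (illustrative helper, not CobraML2 API)
SUPPORTED_DTYPES = {"float32"}   # no FP16/FP8/FP4 yet
REQUIRED_HEAD_DIM = 64

def check_fmha_inputs(shape: tuple, dtype: str) -> None:
    """Raise if a (batch, seq_len, n_heads, head_dim) input violates current limits."""
    if dtype not in SUPPORTED_DTYPES:
        raise TypeError(f"dtype {dtype!r} unsupported; only {sorted(SUPPORTED_DTYPES)}")
    if shape[-1] != REQUIRED_HEAD_DIM:
        raise ValueError(f"head dim must equal {REQUIRED_HEAD_DIM}, got {shape[-1]}")

check_fmha_inputs((4, 512, 16, 64), "float32")  # passes silently
```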

Installation

Prerequisites

  • Python >= 3.10
  • CUDA toolkit with nvcc
  • An NVIDIA GPU with compute capability >= 8.0 (Ampere+)

So far, all code has been tested only on systems with CUDA >= 12.8 and Ubuntu 22.04.
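To confirm your GPU meets the compute-capability floor, torch.cuda.get_device_capability returns a (major, minor) tuple that compares lexicographically against the minimum. A small sketch (the helper name is illustrative):

```python
MIN_CAPABILITY = (8, 0)  # Ampere or newer

def meets_min_capability(cap: tuple) -> bool:
    """Tuple comparison is lexicographic, so e.g. (8, 6) >= (8, 0)."""
    return cap >= MIN_CAPABILITY

# With PyTorch installed and a GPU present:
#   import torch
#   assert meets_min_capability(torch.cuda.get_device_capability(0))

print(meets_min_capability((8, 6)))  # True  (Ampere, e.g. RTX 3090)
print(meets_min_capability((7, 5)))  # False (Turing)
```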

Build from source

Initial setup

git clone https://github.com/govindansriram/CobraML2.git
cd CobraML2

chmod +x ./runner.sh

Python package

Install torch for your CUDA version, then build cobraml:

python3 -m venv .venv
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu130
pip install --no-build-isolation -e ".[dev]"

Replace cu130 with the tag matching your CUDA version (cu124, cu126, cu128, etc.). Check with nvcc --version.
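The wheel tag is just "cu" plus the CUDA version with the dot dropped. A sketch of that mapping (illustrative helper; torch.version.cuda reports the CUDA version your torch build targets):

```python
def cuda_wheel_tag(version: str) -> str:
    """Map a CUDA version string like '12.8' to a PyTorch wheel tag like 'cu128'."""
    major, minor = version.split(".")[:2]
    return f"cu{major}{minor}"

print(cuda_wheel_tag("12.8"))  # cu128
print(cuda_wheel_tag("13.0"))  # cu130
```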

C++ targets

The C++ build uses CMake to locate PyTorch headers from the .venv. You don't need to build the full Python package first, but torch must be installed in the venv.
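If you prefer to drive CMake directly instead of through runner.sh, pointing CMAKE_PREFIX_PATH at the CMake config files bundled with the torch wheel is typically sufficient. A sketch under that assumption (the project's actual cache variables and targets may differ):

```shell
source .venv/bin/activate
# torch.utils.cmake_prefix_path points at the CMake configs bundled with the wheel
cmake -B build -S . \
  -DCMAKE_PREFIX_PATH="$(python -c 'import torch; print(torch.utils.cmake_prefix_path)')"
cmake --build build
```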

Testing

C++ tests
# Build and run a specific test
./runner.sh -r test_fmha_cc

# Run all tests
./runner.sh -a

# Run with benchmarking enabled
./runner.sh -c -b -r test_fmha_cc

# Filter specific test cases
./runner.sh -r test_fmha_cc -- --gtest_filter=*causal*

Python tests

Requires the Python package to be built first.

# Run all tests
pytest

# Run with benchmarking
pytest --benchmark

# Filter specific test cases
pytest -k "test_fmha_fp32[4-512-16-64-True]"

Using the Runner

The runner.sh script is the main entry point for building, testing, profiling, and formatting the project.

Options

Flag                    Description
-h, --help              Show help message
-c, --clean             Clean build (removes build directory)
-b, --benchmark         Enable benchmarking
-t, --target <name>     Build a specific target
-r, --run <name>        Build and run a specific target
-a, --run-all           Build and run all tests via ctest
-f, --format [file]     Format all files, or a specific file
-p, --profile <name>    Build and profile a target with ncu
-o, --output <name>     Custom name for the .ncu-rep file
--profile-opts <opts>   Additional ncu options
--no-tests              Disable building tests
--                      Pass remaining args to the executable

Examples

Build everything:

./runner.sh

Clean build:

./runner.sh -c

Build specific target:

./runner.sh -t test_fmha_cc

Build and run a test:

./runner.sh -r test_fmha_cc

Run all tests:

./runner.sh -a

Run with gtest filter:

./runner.sh -r test_fmha_cc -- --gtest_filter=*Perf*

Clean build with benchmarking enabled:

./runner.sh -c -b -r test_fmha_cc

Profile a kernel with ncu:

./runner.sh -p test_fmha_cc

Profile with custom output name:

./runner.sh -p test_fmha_cc -o my_profile

Profile specific kernel:

./runner.sh -p test_fmha_cc --profile-opts '--kernel-name fmha'

Linting

C++

All C++ files must be formatted with clang-format.

./runner.sh -f
./runner.sh -f include/cobraml2/kernels/fmha_cc.cuh

Python

ruff check python/
ruff format python/

Contributing

...
