
Writing Kernels

A kernel is a function that runs in parallel across many threads. SpawnDev.ILGPU compiles your C# kernel code into the target backend's native language — WGSL or GLSL for browser GPUs, PTX or OpenCL C for desktop GPUs, or WebAssembly / native threads for CPU backends.

Kernel Basics

A kernel is typically a static void method. The first parameter is an index type that identifies which thread is running. Think of it as the body of a massively parallel for loop:

// This kernel runs once per element — each thread gets a unique index
static void MyKernel(Index1D index, ArrayView<float> data, float multiplier)
{
    data[index] = data[index] * multiplier;
}

When you launch this kernel over 1000 elements, 1000 threads run in parallel, each receiving a different index value from 0 to 999.
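
Putting it together, loading and launching this kernel might look like the following. This is a sketch: it assumes an initialized accelerator and a 1000-element float buffer, using the loading API covered in detail later on this page.

```csharp
// Load once (compiled and cached), then launch with one thread per element
var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index1D, ArrayView<float>, float>(MyKernel);

kernel((Index1D)1000, buffer.View, 2.0f); // queue the work

// Launches are asynchronous; synchronize before reading results back
await accelerator.SynchronizeAsync();
```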

Lambda Kernels

You can also write kernels as C# lambdas that capture local variables. Captured scalar values (int, float, long, etc.) are automatically passed to the GPU at dispatch time:

int multiplier = 5;
float offset = 0.5f;
var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(
    (index, buf) => { buf[index] = index * multiplier + offset; });
kernel((Index1D)length, buffer.View);

Note: Only scalar value types can be captured. ArrayView captures are not supported — pass them as explicit kernel parameters instead.

Higher-Order Kernels with DelegateSpecialization

DelegateSpecialization<T> lets you write one kernel that accepts different operations as parameters. The delegate is resolved at dispatch time and inlined at compile time — the GPU never sees a function pointer:

static int Negate(int x) => -x;
static int Square(int x) => x * x;

static void MapKernel(Index1D index, ArrayView<int> buf,
    DelegateSpecialization<Func<int, int>> transform)
{
    buf[index] = transform.Value(buf[index]);
}

var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index1D, ArrayView<int>, DelegateSpecialization<Func<int, int>>>(MapKernel);

// Same kernel, different operations
kernel(size, buffer, new DelegateSpecialization<Func<int, int>>(Negate));
kernel(size, buffer, new DelegateSpecialization<Func<int, int>>(Square));

Each unique target method produces a cached specialized kernel compilation. Target methods must be static.

Kernel Rules

These rules apply to all kernel code — they come from ILGPU's design and the constraints of GPU execution:

| Rule | Details |
| --- | --- |
| Must be static (or a lambda) | Instance methods are not supported (capturing lambdas are the exception) |
| Must return void | Kernels don't return values; use output buffers |
| First parameter is the index | Index1D, Index2D, or Index3D |
| Value types only | No classes, no string, no reference types |
| No throw | No backend supports exception handling in kernels |
| No ref / out | Parameters are passed by value |
| No recursion | GPU hardware doesn't support call stacks |
| No dynamic allocation | No new inside kernels (except fixed-size structs) |
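
For example, the "use output buffers" rule means a computation that would naturally return a value writes its per-thread result to a view instead. A minimal sketch:

```csharp
// Kernels return void, so each thread writes its result to an output view
static void MultiplyPartials(
    Index1D index, ArrayView<float> a, ArrayView<float> b, ArrayView<float> result)
{
    result[index] = a[index] * b[index];
}
```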

Index Types

The index type determines the dimensionality of the kernel's execution grid:

Index1D — Linear Processing

static void Process1D(Index1D index, ArrayView<float> data, float value)
{
    data[index] = value;
}

// Launch: each element gets one thread
kernel((Index1D)data.Length, data.View, 42.0f);

Index2D — Image/Matrix Processing

static void Process2D(
    Index2D index,
    ArrayView2D<uint, Stride2D.DenseX> pixels,
    int width, int height)
{
    int x = index.X;
    int y = index.Y;
    if (x >= width || y >= height) return;

    // Process pixel at (x, y)
    uint r = (uint)(255 * x / width);
    uint g = (uint)(255 * y / height);
    pixels[index] = (0xFFu << 24) | (r << 16) | (g << 8) | 0xFF;
}

// Launch with 2D extent
kernel(buffer.IntExtent, buffer.View, width, height);

Index3D — Volume/Voxel Processing

static void Process3D(
    Index3D index,
    ArrayView<float> volume,
    int width, int height, int depth)
{
    int x = index.X, y = index.Y, z = index.Z;
    int i = x + y * width + z * width * height;
    volume[i] = x + y + z;
}
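
The launch mirrors the 2D case, but with a three-component extent. A sketch, assuming volume was allocated with width * height * depth elements:

```csharp
// Launch with a 3D extent: one thread per voxel
kernel(new Index3D(width, height, depth), volume.View, width, height, depth);
```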

Loading and Launching Kernels

LoadAutoGroupedStreamKernel (Recommended)

The simplest way to load a kernel. ILGPU automatically determines the optimal workgroup size:

// Load once (compile + cache)
var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index1D, ArrayView<float>, ArrayView<float>, ArrayView<float>>(VectorAddKernel);

// Launch (fire-and-forget — work is queued)
kernel((Index1D)length, bufA.View, bufB.View, bufC.View);

// Wait for completion
await accelerator.SynchronizeAsync();

Kernel Delegate Caching

For render loops and repeated invocations, cache the kernel delegate:

// Declare as a field
private Action<Index2D, ArrayView2D<uint, Stride2D.DenseX>, float, float>? _renderKernel;

// Load once
_renderKernel = accelerator.LoadAutoGroupedStreamKernel<
    Index2D, ArrayView2D<uint, Stride2D.DenseX>, float, float>(RenderKernel);

// Invoke repeatedly (no stream argument needed for auto-grouped)
_renderKernel(buffer.IntExtent, buffer.View, time, zoom);

Note: The delegate type for LoadAutoGroupedStreamKernel does not include an AcceleratorStream parameter. The index type is the first argument when calling.

Explicitly Grouped Kernels

For full control over workgroup size (required for shared memory and barriers):

static void GroupedKernel(ArrayView<int> data, ArrayView<int> output)
{
    var globalIdx = Grid.GlobalIndex.X;
    var localIdx = Group.IdxX;
    var groupSize = Group.DimX;

    // Use shared memory
    var sharedMem = SharedMemory.Allocate<int>(64);
    sharedMem[localIdx] = data[globalIdx];

    Group.Barrier(); // Wait for all threads in group

    // Process with shared data...
    output[globalIdx] = sharedMem[(localIdx + 1) % groupSize];
}
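
Explicitly grouped kernels have no index parameter. They are loaded with LoadStreamKernel and launched with an explicit (numGroups, groupSize) configuration. A sketch assuming ILGPU's standard loading API; the group size must match the SharedMemory.Allocate<int>(64) call above:

```csharp
var kernel = accelerator.LoadStreamKernel<ArrayView<int>, ArrayView<int>>(GroupedKernel);

int groupSize = 64;                                   // must match the shared allocation
int numGroups = (length + groupSize - 1) / groupSize; // round up to cover all elements
kernel((numGroups, groupSize), data.View, output.View);
```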

Parameter Types

Scalar Parameters

Scalars (int, float, double, etc.) are passed by value:

static void ScalarKernel(Index1D index, ArrayView<float> data, float multiplier, int offset)
{
    data[index] = data[index] * multiplier + offset;
}

Struct Parameters

Custom structs work if they are value types with fixed size:

public struct SimParams
{
    public float DeltaTime;
    public float Gravity;
    public int MaxIterations;
}

static void PhysicsKernel(Index1D index, ArrayView<float> positions, SimParams p)
{
    positions[index] += p.Gravity * p.DeltaTime;
}
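
On the host, the struct is populated like any other value and passed directly at launch. A sketch assuming a positions buffer of count floats:

```csharp
var p = new SimParams { DeltaTime = 1f / 60f, Gravity = -9.81f, MaxIterations = 10 };

var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index1D, ArrayView<float>, SimParams>(PhysicsKernel);
kernel((Index1D)count, positions.View, p);
```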

GpuMatrix4x4 — GPU-Friendly 4×4 Matrix

SpawnDev.ILGPU includes GpuMatrix4x4, a GPU-friendly 4×4 matrix struct that auto-transposes from .NET's row-major System.Numerics.Matrix4x4 to GPU column-major order. Use it for 3D transformations inside kernels:

using SpawnDev.ILGPU;
using System.Numerics;

// On the host: create from a .NET Matrix4x4 (auto-transposes to GPU column-major)
var viewMatrix = Matrix4x4.CreateLookAt(
    new Vector3(0, 0, 5),   // eye
    Vector3.Zero,            // target
    Vector3.UnitY);          // up
var gpuMatrix = GpuMatrix4x4.FromMatrix4x4(viewMatrix);

// Pass directly as a kernel parameter
kernel((Index1D)count, positionsView, outputView, gpuMatrix);
// In the kernel: use static transform methods
static void TransformKernel(
    Index1D index,
    ArrayView<float> positions,
    ArrayView<float> output,
    GpuMatrix4x4 matrix)
{
    int i = index * 3;
    float x = positions[i], y = positions[i + 1], z = positions[i + 2];

    // Transform point (rotation + translation)
    GpuMatrix4x4.TransformPoint(matrix, x, y, z, out float rx, out float ry, out float rz);

    output[i] = rx;
    output[i + 1] = ry;
    output[i + 2] = rz;
}

| Method | Description |
| --- | --- |
| GpuMatrix4x4.FromMatrix4x4(Matrix4x4) | Auto-transposes from .NET row-major to GPU column-major |
| GpuMatrix4x4.Identity | Returns the identity matrix |
| GpuMatrix4x4.TransformPoint(m, x, y, z, out rx, out ry, out rz) | Applies rotation + translation |
| GpuMatrix4x4.TransformDirection(m, x, y, z, out rx, out ry, out rz) | Applies rotation only (no translation) |

Why not System.Numerics.Matrix4x4? .NET uses row-major layout with v * M convention, while GPUs use column-major with M * v. GpuMatrix4x4 handles this transpose automatically so your transforms work correctly on all backends.

ArrayView Parameters

ArrayView<T> is the primary way to access GPU memory from kernels:

static void CopyKernel(Index1D index, ArrayView<float> source, ArrayView<float> dest)
{
    dest[index] = source[index];
}

Multi-dimensional views:

static void MatrixKernel(
    Index2D index,
    ArrayView2D<float, Stride2D.DenseX> matrix,
    ArrayView<float> result)
{
    int x = index.X, y = index.Y;
    result[y * matrix.IntExtent.X + x] = matrix[index] * 2.0f;
}
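
Host-side, 2D views come from 2D buffer allocations. A sketch using ILGPU's dense-X allocation helper, assuming width, height, and an initialized accelerator:

```csharp
using var matrix = accelerator.Allocate2DDenseX<float>(new Index2D(width, height));
using var result = accelerator.Allocate1D<float>(width * height);

// Launch over the full 2D extent
kernel(matrix.IntExtent, matrix.View, result.View);
```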

Math Functions

Supported Functions

ILGPU maps standard .NET math to GPU-native operations:

| C# | GPU Mapping | Notes |
| --- | --- | --- |
| MathF.Sin(x) | sin(x) | ✅ All backends |
| MathF.Cos(x) | cos(x) | ✅ All backends |
| MathF.Tan(x) | tan(x) | ✅ All backends |
| MathF.Sqrt(x) | sqrt(x) | ✅ All backends |
| MathF.Pow(x, y) | pow(x, y) | ✅ All backends |
| MathF.Log(x) | log(x) | ✅ All backends |
| MathF.Exp(x) | exp(x) | ✅ All backends |
| MathF.Abs(x) | abs(x) | ✅ All backends |
| MathF.Floor(x) | floor(x) | ✅ All backends |
| MathF.Ceiling(x) | ceil(x) | ✅ All backends |
| Math.Min(a, b) | min(a, b) | ✅ All backends |
| Math.Max(a, b) | max(a, b) | ✅ All backends |
| MathF.FusedMultiplyAdd | fma(a, b, c) | ✅ All backends |
| MathF.Atan2(y, x) | atan2(y, x) | ✅ All backends |

Previously Unsupported Functions (Now Auto-Redirected)

These .NET methods contain internal throw statements, but all browser backends now include throw-free redirects that handle them automatically:

| C# | Status | Notes |
| --- | --- | --- |
| Math.Clamp(val, min, max) | ✅ Auto-redirected | Replaced with Min(Max(val, min), max) |
| Math.Round(x) | ✅ Auto-redirected | Throw-free wrapper |
| Math.Truncate(x) | ✅ Auto-redirected | Throw-free wrapper |
| Math.Sign(x) | ✅ Auto-redirected | Throw-free wrapper |
| MathF.FusedMultiplyAdd | ✅ Auto-redirected | Throw-free wrapper |

Safe to use: These functions work directly in kernels on all backends thanks to RegisterMathIntrinsics(). See Limitations for the general throw constraint.
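
So a kernel can call these directly. For instance, clamping every element into a range:

```csharp
static void ClampKernel(Index1D index, ArrayView<float> data, float min, float max)
{
    // Auto-redirected to Min(Max(val, min), max) on browser backends
    data[index] = Math.Clamp(data[index], min, max);
}
```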

Shared Memory

Shared memory allows threads within a workgroup to share data. It's much faster than global memory but limited in size.

Availability: Supported on WebGPU, Wasm, CUDA, OpenCL, and CPU backends. WebGL does not support shared memory.

Static Shared Memory

static void SharedMemKernel(ArrayView<int> data, ArrayView<int> output)
{
    // Allocate shared memory (compile-time size)
    var shared = SharedMemory.Allocate<int>(64);

    var localIdx = Group.IdxX;
    var globalIdx = Grid.GlobalIndex.X;

    // Load data into shared memory
    shared[localIdx] = data[globalIdx];

    // Wait for all threads
    Group.Barrier();

    // Read from neighbor in shared memory
    output[globalIdx] = shared[(localIdx + 1) % Group.DimX];
}

Dynamic Shared Memory

Dynamic shared memory is sized at launch time:

static void DynamicSharedKernel(ArrayView<int> data)
{
    var shared = SharedMemory.GetDynamic<int>();
    // Size is determined by the launch configuration
}

// Launch with dynamic shared memory config
var config = SharedMemoryConfig.RequestDynamic<int>(groupSize);
kernel((gridDim, groupDim, config), data.View);

Control Flow

Standard C# control flow works in kernels:

static void ControlFlowKernel(Index1D index, ArrayView<float> data, float threshold)
{
    float val = data[index];

    // If/else
    if (val > threshold)
        data[index] = threshold;
    else
        data[index] = val * 2.0f;

    // Loops
    float sum = 0;
    for (int i = 0; i < 10; i++)
        sum += val * i;

    data[index] = sum;
}

Performance tip: Avoid divergent branches within a workgroup. When threads in the same workgroup take different paths, performance degrades because the GPU executes both paths sequentially.
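
Simple value selections can often be written branch-free with Min/Max so every thread follows the same path. A sketch illustrating the technique with a clamp (not the exact kernel above):

```csharp
static void BranchFreeClamp(Index1D index, ArrayView<float> data, float lo, float hi)
{
    // Min/Max map to GPU select-style operations: no divergent branches
    data[index] = MathF.Max(lo, MathF.Min(data[index], hi));
}
```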

Common Patterns

Stencil (Neighbor Access)

static void Stencil1D(Index1D index, ArrayView<float> input, ArrayView<float> output)
{
    int i = index;
    int len = (int)input.Length;

    float left  = i > 0 ? input[i - 1] : 0;
    float center = input[i];
    float right = i < len - 1 ? input[i + 1] : 0;

    output[i] = (left + center + right) / 3.0f;
}

Bounds Checking

Always guard against out-of-bounds access when the dispatch size may exceed the data size:

static void SafeKernel(Index1D index, ArrayView<float> data, int actualLength)
{
    if (index >= actualLength) return;
    data[index] = data[index] * 2.0f;
}

Packed Parameters

When you need many parameters, pack them into a struct or encode multiple values into fewer parameters:

// Pack width and height into a single int (assumes height < 65536)
int packedSize = width * 65536 + height;

// In the kernel:
int width = packedSize / 65536;
int height = packedSize - width * 65536;
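
The struct form is usually the cleaner choice when more than two values are involved, and it avoids the range limit of arithmetic packing. A sketch using a hypothetical ImageSize struct:

```csharp
public struct ImageSize
{
    public int Width;
    public int Height;
}

static void SizedKernel(Index1D index, ArrayView<uint> pixels, ImageSize size)
{
    int i = index;
    int x = i % size.Width; // column
    int y = i / size.Width; // row
    // Process pixel at (x, y)...
}
```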