A kernel is a function that runs in parallel across many threads. SpawnDev.ILGPU compiles your C# kernel code into the target backend's native language — WGSL or GLSL for browser GPUs, PTX or OpenCL C for desktop GPUs, or WebAssembly / native threads for CPU backends.
A kernel is typically a static void method. The first parameter is an index type that identifies which thread is running. Think of it as the body of a massively parallel for loop:
```csharp
// This kernel runs once per element — each thread gets a unique index
static void MyKernel(Index1D index, ArrayView<float> data, float multiplier)
{
    data[index] = data[index] * multiplier;
}
```

When you launch this kernel with 1000 elements, 1000 threads execute simultaneously, each with a different index value from 0 to 999.
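As a mental model (not how it actually executes), the launch above behaves like a plain sequential loop whose iterations all run at once, one per thread:

```csharp
// CPU mental model of the 1000-thread launch — illustrative only
float[] data = new float[1000];
float multiplier = 2.0f;
for (int index = 0; index < 1000; index++)
{
    // body of MyKernel, executed once per "thread"
    data[index] = data[index] * multiplier;
}
```

The GPU version differs in one important way: there is no guaranteed ordering between iterations, so the body must not depend on any other element having been processed first.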
You can also write kernels as C# lambdas that capture local variables. Captured scalar values (int, float, long, etc.) are automatically passed to the GPU at dispatch time:
```csharp
int multiplier = 5;
float offset = 0.5f;
var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(
    (index, buf) => { buf[index] = index * multiplier + offset; });
kernel((Index1D)length, buffer.View);
```

> Note: Only scalar value types can be captured. `ArrayView` captures are not supported — pass views as explicit kernel parameters instead.
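For example, a view that might tempt you into a capture is instead passed as a parameter (a sketch; `buffer` is assumed to be an allocated `MemoryBuffer1D<float, Stride1D.Dense>`):

```csharp
// ❌ Not supported: capturing the view in the lambda
// var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D>(
//     index => { view[index] = 0f; });

// ✅ Supported: the view is an explicit kernel parameter
var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(
    (index, view) => { view[index] = 0f; });
kernel((Index1D)length, buffer.View);
```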
`DelegateSpecialization<T>` lets you write one kernel that accepts different operations as parameters. The delegate is resolved at dispatch time and inlined at compile time — the GPU never sees a function pointer:
```csharp
static int Negate(int x) => -x;
static int Square(int x) => x * x;

static void MapKernel(Index1D index, ArrayView<int> buf,
    DelegateSpecialization<Func<int, int>> transform)
{
    buf[index] = transform.Value(buf[index]);
}

var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index1D, ArrayView<int>, DelegateSpecialization<Func<int, int>>>(MapKernel);

// Same kernel, different operations
kernel(size, buffer, new DelegateSpecialization<Func<int, int>>(Negate));
kernel(size, buffer, new DelegateSpecialization<Func<int, int>>(Square));
```

Each unique target method produces its own cached kernel compilation. Target methods must be static.
These rules apply to all kernel code — they come from ILGPU's design and the constraints of GPU execution:
| Rule | Details |
|---|---|
| Must be `static` (or a lambda) | Instance methods are not supported (capturing lambdas are the exception) |
| Must return `void` | Kernels don't return values — use output buffers |
| First parameter is the index | `Index1D`, `Index2D`, or `Index3D` |
| Value types only | No classes, no `string`, no reference types |
| No `throw` | No backend supports exception handling in kernels |
| No `ref` / `out` | Parameters are passed by value |
| No recursion | GPU hardware doesn't support call stacks |
| No dynamic allocation | No `new` inside kernels (except fixed-size structs) |
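The "must return void" rule in practice means every result goes through an output view. A minimal sketch of the pattern (the kernel name and coefficients are illustrative, not part of the library):

```csharp
// Instead of a per-element function that returns a value...
//   static float Brightness(float r, float g, float b) => ...;
// ...write the result into an output buffer:
static void BrightnessKernel(
    Index1D index,
    ArrayView<float> r, ArrayView<float> g, ArrayView<float> b,
    ArrayView<float> result) // the output buffer replaces the return value
{
    result[index] = 0.299f * r[index] + 0.587f * g[index] + 0.114f * b[index];
}
```

After synchronizing, the host reads the result back from the buffer that backs `result`.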
The index type determines the dimensionality of the kernel's execution grid:
```csharp
static void Process1D(Index1D index, ArrayView<float> data, float value)
{
    data[index] = value;
}

// Launch: each element gets one thread
kernel((Index1D)data.Length, data.View, 42.0f);
```

```csharp
static void Process2D(
    Index2D index,
    ArrayView2D<uint, Stride2D.DenseX> pixels,
    int width, int height)
{
    int x = index.X;
    int y = index.Y;
    if (x >= width || y >= height) return;

    // Process pixel at (x, y)
    uint r = (uint)(255 * x / width);
    uint g = (uint)(255 * y / height);
    pixels[index] = (0xFFu << 24) | (r << 16) | (g << 8) | 0xFF;
}

// Launch with 2D extent
kernel(buffer.IntExtent, buffer.View, width, height);
```

```csharp
static void Process3D(
    Index3D index,
    ArrayView<float> volume,
    int width, int height, int depth)
{
    int x = index.X, y = index.Y, z = index.Z;
    int i = x + y * width + z * width * height;
    volume[i] = x + y + z;
}
```

The simplest way to load a kernel is `LoadAutoGroupedStreamKernel`. ILGPU automatically determines the optimal workgroup size:
```csharp
// Load once (compile + cache)
var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index1D, ArrayView<float>, ArrayView<float>, ArrayView<float>>(VectorAddKernel);

// Launch (fire-and-forget — work is queued)
kernel((Index1D)length, bufA.View, bufB.View, bufC.View);

// Wait for completion
await accelerator.SynchronizeAsync();
```

For render loops and repeated invocations, cache the kernel delegate:
```csharp
// Declare as a field
private Action<Index2D, ArrayView2D<uint, Stride2D.DenseX>, float, float>? _renderKernel;

// Load once
_renderKernel = accelerator.LoadAutoGroupedStreamKernel<
    Index2D, ArrayView2D<uint, Stride2D.DenseX>, float, float>(RenderKernel);

// Invoke repeatedly (no stream argument needed for auto-grouped)
_renderKernel(buffer.IntExtent, buffer.View, time, zoom);
```

> Note: The delegate type returned by `LoadAutoGroupedStreamKernel` does not include an `AcceleratorStream` parameter. The index type is the first argument when calling.
For full control over workgroup size (required for shared memory and barriers):
```csharp
static void GroupedKernel(ArrayView<int> data, ArrayView<int> output)
{
    var globalIdx = Grid.GlobalIndex.X;
    var localIdx = Group.IdxX;
    var groupSize = Group.DimX;

    // Use shared memory
    var sharedMem = SharedMemory.Allocate<int>(64);
    sharedMem[localIdx] = data[globalIdx];
    Group.Barrier(); // Wait for all threads in the group

    // Process with shared data...
    output[globalIdx] = sharedMem[(localIdx + 1) % groupSize];
}
```

Scalars (`int`, `float`, `double`, etc.) are passed by value:
```csharp
static void ScalarKernel(Index1D index, ArrayView<float> data, float multiplier, int offset)
{
    data[index] = data[index] * multiplier + offset;
}
```

Custom structs work as kernel parameters as long as they are fixed-size value types:
```csharp
public struct SimParams
{
    public float DeltaTime;
    public float Gravity;
    public int MaxIterations;
}

static void PhysicsKernel(Index1D index, ArrayView<float> positions, SimParams p)
{
    positions[index] += p.Gravity * p.DeltaTime;
}
```

SpawnDev.ILGPU includes `GpuMatrix4x4`, a GPU-friendly 4×4 matrix struct that automatically transposes .NET's row-major `System.Numerics.Matrix4x4` into GPU column-major order. Use it for 3D transformations inside kernels:
```csharp
using SpawnDev.ILGPU;
using System.Numerics;

// On the host: create from a .NET Matrix4x4 (auto-transposes to GPU column-major)
var viewMatrix = Matrix4x4.CreateLookAt(
    new Vector3(0, 0, 5), // eye
    Vector3.Zero,         // target
    Vector3.UnitY);       // up
var gpuMatrix = GpuMatrix4x4.FromMatrix4x4(viewMatrix);

// Pass directly as a kernel parameter
kernel((Index1D)count, positionsView, outputView, gpuMatrix);
```

```csharp
// In the kernel: use static transform methods
static void TransformKernel(
    Index1D index,
    ArrayView<float> positions,
    ArrayView<float> output,
    GpuMatrix4x4 matrix)
{
    int i = index * 3;
    float x = positions[i], y = positions[i + 1], z = positions[i + 2];

    // Transform point (rotation + translation)
    GpuMatrix4x4.TransformPoint(matrix, x, y, z, out float rx, out float ry, out float rz);
    output[i] = rx;
    output[i + 1] = ry;
    output[i + 2] = rz;
}
```

| Method | Description |
|---|---|
| `GpuMatrix4x4.FromMatrix4x4(Matrix4x4)` | Auto-transposes from .NET row-major to GPU column-major |
| `GpuMatrix4x4.Identity` | Returns the identity matrix |
| `GpuMatrix4x4.TransformPoint(m, x, y, z, out rx, ry, rz)` | Applies rotation + translation |
| `GpuMatrix4x4.TransformDirection(m, x, y, z, out rx, ry, rz)` | Applies rotation only (no translation) |
> Why not `System.Numerics.Matrix4x4`? .NET uses row-major layout with the `v * M` convention, while GPUs use column-major with `M * v`. `GpuMatrix4x4` handles this transpose automatically so your transforms work correctly on all backends.
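To see concretely why the transpose matters, look at where a translation lives in each layout. A small sketch using only `System.Numerics` (no GPU involved):

```csharp
using System;
using System.Numerics;

// Row-major .NET matrix: translation sits in the fourth row (M41..M43)
var m = Matrix4x4.CreateTranslation(10, 20, 30);
Console.WriteLine((m.M41, m.M42, m.M43)); // (10, 20, 30)

// Column-major GPU layout expects translation in the fourth column,
// which is exactly what transposing produces (M14..M34)
var t = Matrix4x4.Transpose(m);
Console.WriteLine((t.M14, t.M24, t.M34)); // (10, 20, 30)
```

The two conventions compose consistently: `v * M` in row-major equals `transpose(M) * v` in column-major, which is why a single transpose at upload time is all that is needed.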
`ArrayView<T>` is the primary way to access GPU memory from kernels:
```csharp
static void CopyKernel(Index1D index, ArrayView<float> source, ArrayView<float> dest)
{
    dest[index] = source[index];
}
```

Multi-dimensional views:
```csharp
static void MatrixKernel(
    Index2D index,
    ArrayView2D<float, Stride2D.DenseX> matrix,
    ArrayView<float> result)
{
    int x = index.X, y = index.Y;
    result[y * matrix.IntExtent.X + x] = matrix[index] * 2.0f;
}
```

ILGPU maps standard .NET math to GPU-native operations:
| C# | GPU Mapping | Notes |
|---|---|---|
| `MathF.Sin(x)` | `sin(x)` | ✅ All backends |
| `MathF.Cos(x)` | `cos(x)` | ✅ All backends |
| `MathF.Tan(x)` | `tan(x)` | ✅ All backends |
| `MathF.Sqrt(x)` | `sqrt(x)` | ✅ All backends |
| `MathF.Pow(x, y)` | `pow(x, y)` | ✅ All backends |
| `MathF.Log(x)` | `log(x)` | ✅ All backends |
| `MathF.Exp(x)` | `exp(x)` | ✅ All backends |
| `MathF.Abs(x)` | `abs(x)` | ✅ All backends |
| `MathF.Floor(x)` | `floor(x)` | ✅ All backends |
| `MathF.Ceiling(x)` | `ceil(x)` | ✅ All backends |
| `Math.Min(a, b)` | `min(a, b)` | ✅ All backends |
| `Math.Max(a, b)` | `max(a, b)` | ✅ All backends |
| `MathF.FusedMultiplyAdd` | `fma(a, b, c)` | ✅ All backends |
| `MathF.Atan2(y, x)` | `atan2(y, x)` | ✅ All backends |
These .NET methods contain internal `throw` statements, but all browser backends now include throw-free redirects that handle them automatically:

| C# | Status | Notes |
|---|---|---|
| `Math.Clamp(val, min, max)` | ✅ Auto-redirected | Replaced with `Min(Max(val, min), max)` |
| `Math.Round(x)` | ✅ Auto-redirected | Throw-free wrapper |
| `Math.Truncate(x)` | ✅ Auto-redirected | Throw-free wrapper |
| `Math.Sign(x)` | ✅ Auto-redirected | Throw-free wrapper |
| `MathF.FusedMultiplyAdd` | ✅ Auto-redirected | Throw-free wrapper |
> Safe to use: These functions work directly in kernels on all backends thanks to `RegisterMathIntrinsics()`. See Limitations for the general `throw` constraint.
Shared memory allows threads within a workgroup to share data. It's much faster than global memory but limited in size.
Availability: Supported on WebGPU, Wasm, CUDA, OpenCL, and CPU backends. WebGL does not support shared memory.
```csharp
static void SharedMemKernel(ArrayView<int> data, ArrayView<int> output)
{
    // Allocate shared memory (compile-time size)
    var shared = SharedMemory.Allocate<int>(64);
    var localIdx = Group.IdxX;
    var globalIdx = Grid.GlobalIndex.X;

    // Load data into shared memory
    shared[localIdx] = data[globalIdx];

    // Wait for all threads
    Group.Barrier();

    // Read from a neighbor in shared memory
    output[globalIdx] = shared[(localIdx + 1) % Group.DimX];
}
```

Dynamic shared memory is sized at launch time instead:
```csharp
static void DynamicSharedKernel(ArrayView<int> data)
{
    var shared = SharedMemory.GetDynamic<int>();
    // Size is determined by the launch configuration
}

// Launch with a dynamic shared memory config
var config = SharedMemoryConfig.RequestDynamic<int>(groupSize);
kernel((gridDim, groupDim, config), data.View);
```

Standard C# control flow works in kernels:
```csharp
static void ControlFlowKernel(Index1D index, ArrayView<float> data, float threshold)
{
    float val = data[index];

    // If/else: clamp or double the value
    if (val > threshold)
        val = threshold;
    else
        val = val * 2.0f;

    // Loops
    float sum = 0;
    for (int i = 0; i < 10; i++)
        sum += val * i;
    data[index] = sum;
}
```

Performance tip: Avoid divergent branches within a workgroup. When threads in the same workgroup take different paths, the GPU executes both paths sequentially (masking off the inactive threads), so performance degrades.
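Where a branch is short, it can often be rewritten as arithmetic so every thread executes the same instructions. A hedged sketch of a branchless rewrite of the clamp-or-double logic above (illustrative only; GPU compilers frequently turn short `?:` expressions into selects on their own, so profile before assuming this is faster):

```csharp
static void ClampOrDoubleKernel(Index1D index, ArrayView<float> data, float threshold)
{
    float val = data[index];

    // 1.0f when val > threshold, 0.0f otherwise — typically a select, not a jump
    float mask = val > threshold ? 1.0f : 0.0f;

    // Blend both results; no divergent control flow
    data[index] = mask * threshold + (1.0f - mask) * (val * 2.0f);
}
```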
```csharp
static void Stencil1D(Index1D index, ArrayView<float> input, ArrayView<float> output)
{
    int i = index;
    int len = (int)input.Length;
    float left = i > 0 ? input[i - 1] : 0;
    float center = input[i];
    float right = i < len - 1 ? input[i + 1] : 0;
    output[i] = (left + center + right) / 3.0f;
}
```

Always guard against out-of-bounds access when the dispatch size may exceed the data size:
```csharp
static void SafeKernel(Index1D index, ArrayView<float> data, int actualLength)
{
    if (index >= actualLength) return;
    data[index] = data[index] * 2.0f;
}
```

When you need many parameters, pack them into a struct or encode multiple values into fewer parameters:
```csharp
// Pack width and height into a single int (valid while both fit in 16 bits)
int packedSize = width * 65536 + height;

// In the kernel:
int width = packedSize / 65536;
int height = packedSize - width * 65536;
```