*A hardware-accelerated compute engine that leverages Apple's Metal Performance Shaders to dramatically speed up tensor operations in Go.*
rnxa (pronounced "RNA") provides hardware-accelerated tensor operations for machine learning workloads in Go. Initially developed to accelerate the relux neural network framework, rnxa is designed as a universal compute backend that can integrate with any Go ML framework.
Performance that Scales:
- 20-50x speedup for large matrix operations on Apple Silicon
- ⚡ GPU acceleration via Metal Performance Shaders
- Smart fallbacks to optimized CPU implementations
- Zero overhead when GPU acceleration isn't beneficial
Framework Agnostic:
- Universal interface - works with any Go ML framework
- 🧩 Clean abstractions - no framework-specific dependencies
- 🛡️ Production ready - comprehensive error handling and resource management
- Scalable - from small models to large neural networks
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│      Go ML      │     │      rnxa       │     │    Hardware     │
│    Framework    │────▶│  ComputeEngine  │────▶│  Acceleration   │
│  (relux, etc.)  │     │    Interface    │     │   (Metal/CPU)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
- ComputeEngine Interface: Framework-agnostic tensor operations
- Metal Backend: GPU acceleration via Apple's Metal Performance Shaders
- CPU Backend: Optimized fallback with SIMD vectorization
- Device Management: Automatic detection and selection of compute devices
- Tensor Abstraction: Efficient n-dimensional array representation
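
As a quick illustration of how these components surface in the API, the sketch below uses only calls documented elsewhere in this README (rnxa.DetectDevices, rnxa.NewEngine, rnxa.NewTensor, VectorMul); the row-major layout of Data() is an assumption:

package main

import (
    "context"
    "fmt"

    "github.com/xDarkicex/rnxa"
)

func main() {
    // Device Management: enumerate compute devices before committing to one.
    for i, d := range rnxa.DetectDevices() {
        fmt.Printf("device %d: %s (%s)\n", i, d.Name, d.Platform)
    }

    // ComputeEngine Interface: one handle, regardless of backend.
    engine, err := rnxa.NewEngine()
    if err != nil {
        panic(err)
    }
    defer engine.Close()

    // Tensor Abstraction: flat data plus dimensions (a 2x2 matrix here).
    // Data() is assumed to be row-major: [1 2; 3 4].
    t := rnxa.NewTensor([]float64{1, 2, 3, 4}, 2, 2)

    // The Metal or CPU backend is chosen internally; the call site is identical.
    sq, err := engine.VectorMul(context.Background(), t, t)
    if err != nil {
        panic(err)
    }
    fmt.Println(sq.Data()) // element-wise square: [1 4 9 16]
}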
Requirements:
- macOS with Apple Silicon (M1/M2/M3) or Intel Mac with Metal support
- Go 1.25+
- Xcode Command Line Tools (xcode-select version 2410+)
# Install Xcode Command Line Tools
xcode-select --install
# Verify installation
xcode-select --version
# Should output: xcode-select version 2410 or higher

Install rnxa:

go get github.com/xDarkicex/rnxa

Quick start:

package main
import (
"context"
"fmt"
"github.com/xDarkicex/rnxa"
)
func main() {
// Create compute engine (auto-detects best device)
engine, err := rnxa.NewEngine()
if err != nil {
panic(err)
}
defer engine.Close()
fmt.Printf("Using device: %s (%s)\n",
engine.Device().Name, engine.Device().Platform)
// Create tensors
A := rnxa.NewTensor([]float64{1, 2, 3, 4, 5, 6}, 2, 3)
B := rnxa.NewTensor([]float64{7, 8, 9, 10, 11, 12}, 3, 2)
// Perform matrix multiplication
ctx := context.Background()
C, err := engine.MatMul(ctx, A, B)
if err != nil {
panic(err)
}
fmt.Printf("Result: %v\n", C.Data())
// Output: [58 64 139 154]
}

Using rnxa in a neural-network forward pass:

func neuralLayerForward(engine rnxa.ComputeEngine,
inputs []float64,
weights [][]float64,
bias []float64) ([]float64, error) {
ctx := context.Background()
// Convert to tensors
inputTensor := rnxa.NewTensor(inputs, 1, len(inputs))
weightTensor := convertToTensor(weights)
biasTensor := rnxa.NewTensor(bias, len(bias))
// Matrix multiplication: inputs × weights
matmulResult, err := engine.MatMul(ctx, inputTensor, weightTensor)
if err != nil {
return nil, err
}
// Add bias
biasResult, err := engine.VectorAdd(ctx, matmulResult, biasTensor)
if err != nil {
return nil, err
}
// Apply ReLU activation
activated, err := engine.ReLU(ctx, biasResult)
if err != nil {
return nil, err
}
return activated.Data(), nil
}
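
The example above calls a convertToTensor helper that this README never defines. A minimal sketch, assuming a rectangular weight matrix and row-major tensor layout:

// convertToTensor flattens a [][]float64 weight matrix into a rank-2 tensor.
// Assumes every row has the same length and tensors are row-major.
func convertToTensor(m [][]float64) *rnxa.Tensor {
    rows, cols := len(m), len(m[0])
    flat := make([]float64, 0, rows*cols)
    for _, row := range m {
        flat = append(flat, row...)
    }
    return rnxa.NewTensor(flat, rows, cols)
}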
Comparing activation functions:

func compareActivations() {
    engine, _ := rnxa.NewEngine()
defer engine.Close()
ctx := context.Background()
input := rnxa.NewTensor([]float64{-2, -1, 0, 1, 2})
// Compare different activation functions
activations := map[string]func(context.Context, *rnxa.Tensor) (*rnxa.Tensor, error){
"ReLU": engine.ReLU,
"Sigmoid": engine.Sigmoid,
"Tanh": engine.Tanh,
}
fmt.Printf("Input: %v\n", input.Data())
for name, fn := range activations {
result, _ := fn(ctx, input)
fmt.Printf("%s: %v\n", name, result.Data())
}
}
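
For this input, ReLU should return [0 0 0 1 2]; Sigmoid maps 0 to 0.5 and saturates toward 0 and 1 at the extremes; Tanh maps 0 to 0 with outputs in (-1, 1). Note that Go iterates maps in random order, so the three lines may print in any order.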
Inspecting devices and memory:

func deviceInfo() {
    // List all available devices
devices := rnxa.DetectDevices()
for i, device := range devices {
fmt.Printf("Device %d: %s\n", i, device.Name)
fmt.Printf(" Platform: %s\n", device.Platform)
fmt.Printf(" Cores: %d\n", device.Cores)
fmt.Printf(" Memory: %.1fGB\n", float64(device.Memory)/1e9)
}
// Create engine and check memory
engine, _ := rnxa.NewEngine()
defer engine.Close()
memory := engine.Memory()
fmt.Printf("\nActive Device Memory:\n")
fmt.Printf(" Total: %.1fGB\n", float64(memory.Total)/1e9)
fmt.Printf(" Available: %.1fGB\n", float64(memory.Available)/1e9)
}

rnxa exposes low-level tensor operations that any ML framework can use:
// Core operations available to any framework
type ComputeEngine interface {
// Matrix operations
MatMul(ctx context.Context, A, B *Tensor) (*Tensor, error)
// Element-wise operations
VectorAdd(ctx context.Context, A, B *Tensor) (*Tensor, error)
VectorSub(ctx context.Context, A, B *Tensor) (*Tensor, error)
VectorMul(ctx context.Context, A, B *Tensor) (*Tensor, error)
// Activation functions
ReLU(ctx context.Context, X *Tensor) (*Tensor, error)
Sigmoid(ctx context.Context, X *Tensor) (*Tensor, error)
Tanh(ctx context.Context, X *Tensor) (*Tensor, error)
Softmax(ctx context.Context, X *Tensor) (*Tensor, error)
// Reduction operations
Sum(ctx context.Context, X *Tensor, axis int) (*Tensor, error)
Mean(ctx context.Context, X *Tensor, axis int) (*Tensor, error)
// Device management
Device() Device
Available() bool
Memory() MemoryInfo
Close() error
}
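
The reduction and softmax operations aren't exercised elsewhere in this README, so here is a hedged sketch. The axis convention (axis 0 collapsing rows, axis 1 collapsing columns) is an assumption, not something the interface above specifies:

func reductions(engine rnxa.ComputeEngine) error {
    ctx := context.Background()

    // 2×3 matrix: [1 2 3; 4 5 6]
    x := rnxa.NewTensor([]float64{1, 2, 3, 4, 5, 6}, 2, 3)

    // Column sums (assuming axis 0 collapses rows): [5 7 9]
    sums, err := engine.Sum(ctx, x, 0)
    if err != nil {
        return err
    }
    fmt.Println("sums:", sums.Data())

    // Row means (assuming axis 1 collapses columns): [2 5]
    means, err := engine.Mean(ctx, x, 1)
    if err != nil {
        return err
    }
    fmt.Println("means:", means.Data())

    // Softmax turns logits into a distribution that sums to 1.
    probs, err := engine.Softmax(ctx, rnxa.NewTensor([]float64{1, 2, 3}, 3))
    if err != nil {
        return err
    }
    fmt.Println("probs:", probs.Data()) // ≈ [0.09 0.24 0.67]
    return nil
}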
With relux (Native Integration):

// relux automatically uses rnxa when available
net, _ := relux.NewNetwork(
relux.WithConfig(config),
relux.WithBackend("auto"), // Uses rnxa if available
)

With Custom Frameworks:
// Any framework can use rnxa as a compute backend
type MyFramework struct {
engine rnxa.ComputeEngine
}
func (f *MyFramework) ForwardPass(inputs []float64) []float64 {
// Use rnxa for heavy computation
tensor := rnxa.NewTensor(inputs)
result, _ := f.engine.ReLU(context.Background(), tensor)
return result.Data()
}
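
Because MyFramework depends only on the rnxa.ComputeEngine interface rather than on a concrete backend, the Metal/CPU decision stays an implementation detail, and a stub engine can stand in for the GPU during unit tests.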
Matrix multiplication benchmarks:

| Matrix Size | Pure Go | rnxa (CPU) | rnxa (Metal) | Speedup |
|---|---|---|---|---|
| 32×32 | 0.12ms | 0.08ms | 0.15ms | 1.5x (CPU) |
| 128×128 | 2.1ms | 0.9ms | 0.3ms | 7x (Metal) |
| 512×512 | 85ms | 28ms | 4.2ms | 20x (Metal) |
| 1024×1024 | 680ms | 195ms | 28ms | 24x (Metal) |
Neural Network Training (Iris Dataset):
- Pure Go: 0.85 seconds
- rnxa (Metal): 0.12 seconds
- Speedup: 7.1x
Large Model Inference:
- Pure Go: 45 seconds
- rnxa (Metal): 2.8 seconds
- Speedup: 16x
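
To gather comparable numbers on your own hardware, a standard Go benchmark along these lines should do (a sketch: the 512×512 size and random inputs are illustrative choices, not part of rnxa's API):

package rnxa_test

import (
    "context"
    "math/rand"
    "testing"

    "github.com/xDarkicex/rnxa"
)

func BenchmarkMatMul512(b *testing.B) {
    engine, err := rnxa.NewEngine()
    if err != nil {
        b.Fatalf("engine: %v", err)
    }
    defer engine.Close()

    // Build two 512×512 tensors with random contents.
    const n = 512
    data := make([]float64, n*n)
    for i := range data {
        data[i] = rand.Float64()
    }
    A := rnxa.NewTensor(data, n, n)
    B := rnxa.NewTensor(data, n, n)

    ctx := context.Background()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := engine.MatMul(ctx, A, B); err != nil {
            b.Fatal(err)
        }
    }
}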
Device selection:

// Force CPU backend
engine, err := rnxa.NewEngineWithDevice(1) // Assuming device 1 is CPU
// Check device capabilities
if rnxa.IsMetalAvailable() {
fmt.Println("Metal acceleration available!")
device := rnxa.GetBestDevice()
fmt.Printf("Best device: %s\n", device.Name)
}

Error handling:

func robustComputation(A, B *rnxa.Tensor) (*rnxa.Tensor, error) {
engine, err := rnxa.NewEngine()
if err != nil {
return nil, fmt.Errorf("failed to create engine: %w", err)
}
defer engine.Close()
ctx := context.Background()
// Attempt Metal computation
result, err := engine.MatMul(ctx, A, B)
if err != nil {
// Automatic fallback to CPU is handled internally
return nil, fmt.Errorf("computation failed: %w", err)
}
return result, nil
}
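
Since every ComputeEngine method takes a context.Context, long-running GPU work can also be bounded with a deadline. A sketch, assuming the backends honor cancellation (the interface signatures suggest it, but this README doesn't explicitly promise it):

func boundedMatMul(engine rnxa.ComputeEngine, A, B *rnxa.Tensor) (*rnxa.Tensor, error) {
    // Give the operation at most two seconds before giving up.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    result, err := engine.MatMul(ctx, A, B)
    if err != nil {
        // err should cover both compute failures and context expiry.
        return nil, fmt.Errorf("bounded matmul: %w", err)
    }
    return result, nil
}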
Run the comprehensive test suite:

# Run all tests
go test -v
# Run benchmarks
go test -bench=. -benchmem
# Test specific operations
go test -run TestMatrixMultiplication -v
go test -run TestActivationFunctions -v

Sample Test Output:
=== RUN TestDeviceDetection
metal_test.go:29: Found 2 devices:
metal_test.go:31: Device 0: (Metal) - 10 cores, 17.2GB memory
metal_test.go:31: Device 1: CPU (CPU) - 8 cores, 8.6GB memory
--- PASS: TestDeviceDetection (0.05s)
=== RUN TestMatrixMultiplication
metal_test.go:80: MatMul test passed on Metal
--- PASS: TestMatrixMultiplication (0.00s)
- ✅ Metal Performance Shaders integration
- ✅ Automatic GPU/CPU selection
- ✅ Comprehensive activation functions
- ✅ Production-ready error handling
- Linux CUDA Support
- NVIDIA GPU acceleration on Linux
- cuBLAS and cuDNN integration
- Docker containerization support
- Windows CUDA Support
- Native Windows CUDA integration
- DirectML fallback support
- Visual Studio compatibility
- 🔮 Multi-GPU Support - Distributed computation across devices
- 🔮 Custom Kernels - User-defined compute shaders
- 🔮 Memory Optimization - Smart memory pooling and reuse
- 🔮 Mixed Precision - FP16/BF16 support for faster training
- 🔮 TensorFlow Go Bindings - Interoperability with TensorFlow models
- 🔮 ONNX Runtime Integration - Load and run ONNX models
- 🔮 Distributed Training - Multi-node training support
We welcome contributions to make rnxa the premier ML acceleration framework for Go!
- Performance Optimization - CUDA kernels, memory management
- 🧪 Testing - Cross-platform testing, edge case validation
- Documentation - Tutorials, integration guides
- Debugging - Profiling tools, performance analysis
- Platform Support - AMD ROCm, Intel oneAPI
git clone https://github.com/xDarkicex/rnxa.git
cd rnxa
go mod tidy
go test ./...

macOS (Supported)
- OS: macOS 10.15+ (Catalina or later)
- Hardware: Apple Silicon (M1/M2/M3) or Intel Mac with Metal support
- Tools: Xcode Command Line Tools (xcode-select 2410+)
- Go: Version 1.25+
Linux (Planned Q2 2026)
- OS: Ubuntu 20.04+, CentOS 8+, RHEL 8+
- Hardware: NVIDIA GPU with Compute Capability 6.0+
- CUDA: Version 11.0+
- Drivers: NVIDIA Driver 450.80.02+
Windows (Planned Q3 2026)
- OS: Windows 10/11 (64-bit)
- Hardware: NVIDIA GPU with Compute Capability 6.0+
- CUDA: Version 11.0+
- Visual Studio: 2019 or later
MIT License - see LICENSE for details.
- Apple Metal Team - For creating the Metal Performance Shaders framework
- Go Team - For building a language that makes systems programming approachable
- relux Project - The original motivation and testing ground for rnxa
- Go ML Community - For driving innovation in Go-based machine learning
- Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Email: gentry@xdarkicex.codes
- Documentation: docs.xdarkicex.codes
⚡ Accelerate your Go ML workloads with rnxa ⚡
Built with ❤️ for the Go ML community
rnxa: Because your ML deserves better performance.