*A hardware-accelerated compute engine that leverages Apple's Metal Performance Shaders to dramatically speed up tensor operations in Go.*
rnxa (pronounced "RNA") provides hardware-accelerated tensor operations for machine learning workloads in Go. Initially developed to accelerate the relux neural network framework, rnxa is designed as a universal compute backend that can integrate with any Go ML framework.
Performance that Scales:
- 20-50x speedup for large matrix operations on Apple Silicon
- ⚡ GPU acceleration via Metal Performance Shaders
- Smart fallbacks to optimized CPU implementations
- Zero overhead when GPU acceleration isn't beneficial
Framework Agnostic:
- Universal interface - works with any Go ML framework
- 🧩 Clean abstractions - no framework-specific dependencies
- 🛡️ Production ready - comprehensive error handling and resource management
- Scalable - from small models to large neural networks
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│      Go ML      │     │      rnxa       │     │    Hardware     │
│    Framework    │────▶│  ComputeEngine  │────▶│  Acceleration   │
│  (relux, etc.)  │     │    Interface    │     │   (Metal/CPU)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
- ComputeEngine Interface: Framework-agnostic tensor operations
- Metal Backend: GPU acceleration via Apple's Metal Performance Shaders
- CPU Backend: Optimized fallback with SIMD vectorization
- Device Management: Automatic detection and selection of compute devices
- Tensor Abstraction: Efficient n-dimensional array representation
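
As a quick illustration of how these components surface in the API, the sketch below uses only calls documented elsewhere in this README (rnxa.DetectDevices, rnxa.NewEngine, rnxa.NewTensor, VectorMul); the row-major layout of Data() is an assumption:

package main

import (
    "context"
    "fmt"

    "github.com/xDarkicex/rnxa"
)

func main() {
    // Device Management: enumerate compute devices before committing to one.
    for i, d := range rnxa.DetectDevices() {
        fmt.Printf("device %d: %s (%s)\n", i, d.Name, d.Platform)
    }

    // ComputeEngine Interface: one handle, regardless of backend.
    engine, err := rnxa.NewEngine()
    if err != nil {
        panic(err)
    }
    defer engine.Close()

    // Tensor Abstraction: flat data plus dimensions (a 2x2 matrix here).
    // Data() is assumed to be row-major: [1 2; 3 4].
    t := rnxa.NewTensor([]float64{1, 2, 3, 4}, 2, 2)

    // The Metal or CPU backend is chosen internally; the call site is identical.
    sq, err := engine.VectorMul(context.Background(), t, t)
    if err != nil {
        panic(err)
    }
    fmt.Println(sq.Data()) // element-wise square: [1 4 9 16]
}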
Requirements:
- macOS with Apple Silicon (M1/M2/M3) or Intel Mac with Metal support
- Go 1.25+
- Xcode Command Line Tools (xcode-select version 2410+)
# Install Xcode Command Line Tools
xcode-select --install
# Verify installation
xcode-select --version
# Should output: xcode-select version 2410 or higher

Install rnxa:

go get github.com/xDarkicex/rnxa

Quick start:

package main
import (
"context"
"fmt"
"github.com/xDarkicex/rnxa"
)
func main() {
// Create compute engine (auto-detects best device)
engine, err := rnxa.NewEngine()
if err != nil {
panic(err)
}
defer engine.Close()
fmt.Printf("Using device: %s (%s)\n",
engine.Device().Name, engine.Device().Platform)
// Create tensors
A := rnxa.NewTensor([]float64{1, 2, 3, 4, 5, 6}, 2, 3)
B := rnxa.NewTensor([]float64{7, 8, 9, 10, 11, 12}, 3, 2)
// Perform matrix multiplication
ctx := context.Background()
C, err := engine.MatMul(ctx, A, B)
if err != nil {
panic(err)
}
fmt.Printf("Result: %v\n", C.Data())
// Output: [58 64 139 154]
}

Using rnxa in a neural-network forward pass:

func neuralLayerForward(engine rnxa.ComputeEngine,
inputs []float64,
weights [][]float64,
bias []float64) ([]float64, error) {
ctx := context.Background()
// Convert to tensors
inputTensor := rnxa.NewTensor(inputs, 1, len(inputs))
weightTensor := convertToTensor(weights)
biasTensor := rnxa.NewTensor(bias, len(bias))
// Matrix multiplication: inputs × weights
matmulResult, err := engine.MatMul(ctx, inputTensor, weightTensor)
if err != nil {
return nil, err
}
// Add bias
biasResult, err := engine.VectorAdd(ctx, matmulResult, biasTensor)
if err != nil {
return nil, err
}
// Apply ReLU activation
activated, err := engine.ReLU(ctx, biasResult)
if err != nil {
return nil, err
}
return activated.Data(), nil
}
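
The example above calls a convertToTensor helper that this README never defines. A minimal sketch, assuming a rectangular weight matrix and row-major tensor layout:

// convertToTensor flattens a [][]float64 weight matrix into a rank-2 tensor.
// Assumes every row has the same length and tensors are row-major.
func convertToTensor(m [][]float64) *rnxa.Tensor {
    rows, cols := len(m), len(m[0])
    flat := make([]float64, 0, rows*cols)
    for _, row := range m {
        flat = append(flat, row...)
    }
    return rnxa.NewTensor(flat, rows, cols)
}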
Comparing activation functions:

func compareActivations() {
    engine, _ := rnxa.NewEngine()
defer engine.Close()
ctx := context.Background()
input := rnxa.NewTensor([]float64{-2, -1, 0, 1, 2})
// Compare different activation functions
activations := map[string]func(context.Context, *rnxa.Tensor) (*rnxa.Tensor, error){
"ReLU": engine.ReLU,
"Sigmoid": engine.Sigmoid,
"Tanh": engine.Tanh,
}
fmt.Printf("Input: %v\n", input.Data())
for name, fn := range activations {
result, _ := fn(ctx, input)
fmt.Printf("%s: %v\n", name, result.Data())
}
}
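
For this input, ReLU should return [0 0 0 1 2]; Sigmoid maps 0 to 0.5 and saturates toward 0 and 1 at the extremes; Tanh maps 0 to 0 with outputs in (-1, 1). Note that Go iterates maps in random order, so the three lines may print in any order.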
Inspecting devices and memory:

func deviceInfo() {
    // List all available devices
devices := rnxa.DetectDevices()
for i, device := range devices {
fmt.Printf("Device %d: %s\n", i, device.Name)
fmt.Printf(" Platform: %s\n", device.Platform)
fmt.Printf(" Cores: %d\n", device.Cores)
fmt.Printf(" Memory: %.1fGB\n", float64(device.Memory)/1e9)
}
// Create engine and check memory
engine, _ := rnxa.NewEngine()
defer engine.Close()
memory := engine.Memory()
fmt.Printf("\nActive Device Memory:\n")
fmt.Printf(" Total: %.1fGB\n", float64(memory.Total)/1e9)
fmt.Printf(" Available: %.1fGB\n", float64(memory.Available)/1e9)
}

rnxa exposes low-level tensor operations that any ML framework can use:
// Core operations available to any framework
type ComputeEngine interface {
// Matrix operations
MatMul(ctx context.Context, A, B *Tensor) (*Tensor, error)
// Element-wise operations
VectorAdd(ctx context.Context, A, B *Tensor) (*Tensor, error)
VectorSub(ctx context.Context, A, B *Tensor) (*Tensor, error)
VectorMul(ctx context.Context, A, B *Tensor) (*Tensor, error)
// Activation functions
ReLU(ctx context.Context, X *Tensor) (*Tensor, error)
Sigmoid(ctx context.Context, X *Tensor) (*Tensor, error)
Tanh(ctx context.Context, X *Tensor) (*Tensor, error)
Softmax(ctx context.Context, X *Tensor) (*Tensor, error)
// Reduction operations
Sum(ctx context.Context, X *Tensor, axis int) (*Tensor, error)
Mean(ctx context.Context, X *Tensor, axis int) (*Tensor, error)
// Device management
Device() Device
Available() bool
Memory() MemoryInfo
Close() error
}
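
The reduction and softmax operations aren't exercised elsewhere in this README, so here is a hedged sketch. The axis convention (axis 0 collapsing rows, axis 1 collapsing columns) is an assumption, not something the interface above specifies:

func reductions(engine rnxa.ComputeEngine) error {
    ctx := context.Background()

    // 2×3 matrix: [1 2 3; 4 5 6]
    x := rnxa.NewTensor([]float64{1, 2, 3, 4, 5, 6}, 2, 3)

    // Column sums (assuming axis 0 collapses rows): [5 7 9]
    sums, err := engine.Sum(ctx, x, 0)
    if err != nil {
        return err
    }
    fmt.Println("sums:", sums.Data())

    // Row means (assuming axis 1 collapses columns): [2 5]
    means, err := engine.Mean(ctx, x, 1)
    if err != nil {
        return err
    }
    fmt.Println("means:", means.Data())

    // Softmax turns logits into a distribution that sums to 1.
    probs, err := engine.Softmax(ctx, rnxa.NewTensor([]float64{1, 2, 3}, 3))
    if err != nil {
        return err
    }
    fmt.Println("probs:", probs.Data()) // ≈ [0.09 0.24 0.67]
    return nil
}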
With relux (Native Integration):

// relux automatically uses rnxa when available
net, _ := relux.NewNetwork(
relux.WithConfig(config),
relux.WithBackend("auto"), // Uses rnxa if available
)

With Custom Frameworks:
// Any framework can use rnxa as a compute backend
type MyFramework struct {
engine rnxa.ComputeEngine
}
func (f *MyFramework) ForwardPass(inputs []float64) []float64 {
// Use rnxa for heavy computation
tensor := rnxa.NewTensor(inputs)
result, _ := f.engine.ReLU(context.Background(), tensor)
return result.Data()
}
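
Because MyFramework depends only on the rnxa.ComputeEngine interface rather than on a concrete backend, the Metal/CPU decision stays an implementation detail, and a stub engine can stand in for the GPU during unit tests.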
Matrix multiplication benchmarks:

| Matrix Size | Pure Go | rnxa (CPU) | rnxa (Metal) | Speedup |
|---|---|---|---|---|
| 32×32 | 0.12ms | 0.08ms | 0.15ms | 1.5x (CPU) |
| 128×128 | 2.1ms | 0.9ms | 0.3ms | 7x (Metal) |
| 512×512 | 85ms | 28ms | 4.2ms | 20x (Metal) |
| 1024×1024 | 680ms | 195ms | 28ms | 24x (Metal) |
Neural Network Training (Iris Dataset):
- Pure Go: 0.85 seconds
- rnxa (Metal): 0.12 seconds
- Speedup: 7.1x
Large Model Inference:
- Pure Go: 45 seconds
- rnxa (Metal): 2.8 seconds
- Speedup: 16x
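
To gather comparable numbers on your own hardware, a standard Go benchmark along these lines should do (a sketch: the 512×512 size and random inputs are illustrative choices, not part of rnxa's API):

package rnxa_test

import (
    "context"
    "math/rand"
    "testing"

    "github.com/xDarkicex/rnxa"
)

func BenchmarkMatMul512(b *testing.B) {
    engine, err := rnxa.NewEngine()
    if err != nil {
        b.Fatalf("engine: %v", err)
    }
    defer engine.Close()

    // Build two 512×512 tensors with random contents.
    const n = 512
    data := make([]float64, n*n)
    for i := range data {
        data[i] = rand.Float64()
    }
    A := rnxa.NewTensor(data, n, n)
    B := rnxa.NewTensor(data, n, n)

    ctx := context.Background()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        if _, err := engine.MatMul(ctx, A, B); err != nil {
            b.Fatal(err)
        }
    }
}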
Device selection:

// Force CPU backend
engine, err := rnxa.NewEngineWithDevice(1) // Assuming device 1 is CPU
// Check device capabilities
if rnxa.IsMetalAvailable() {
fmt.Println("Metal acceleration available!")
device := rnxa.GetBestDevice()
fmt.Printf("Best device: %s\n", device.Name)
}

Error handling:

func robustComputation(A, B *rnxa.Tensor) (*rnxa.Tensor, error) {
engine, err := rnxa.NewEngine()
if err != nil {
return nil, fmt.Errorf("failed to create engine: %w", err)
}
defer engine.Close()
ctx := context.Background()
// Attempt Metal computation
result, err := engine.MatMul(ctx, A, B)
if err != nil {
// Automatic fallback to CPU is handled internally
return nil, fmt.Errorf("computation failed: %w", err)
}
return result, nil
}
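
Since every ComputeEngine method takes a context.Context, long-running GPU work can also be bounded with a deadline. A sketch, assuming the backends honor cancellation (the interface signatures suggest it, but this README doesn't explicitly promise it):

func boundedMatMul(engine rnxa.ComputeEngine, A, B *rnxa.Tensor) (*rnxa.Tensor, error) {
    // Give the operation at most two seconds before giving up.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    result, err := engine.MatMul(ctx, A, B)
    if err != nil {
        // err should cover both compute failures and context expiry.
        return nil, fmt.Errorf("bounded matmul: %w", err)
    }
    return result, nil
}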
Run the comprehensive test suite:

# Run all tests
go test -v
# Run benchmarks
go test -bench=. -benchmem
# Test specific operations
go test -run TestMatrixMultiplication -v
go test -run TestActivationFunctions -v

Sample Test Output:
=== RUN TestDeviceDetection
metal_test.go:29: Found 2 devices:
metal_test.go:31: Device 0: (Metal) - 10 cores, 17.2GB memory
metal_test.go:31: Device 1: CPU (CPU) - 8 cores, 8.6GB memory
--- PASS: TestDeviceDetection (0.05s)
=== RUN TestMatrixMultiplication
metal_test.go:80: MatMul test passed on Metal
--- PASS: TestMatrixMultiplication (0.00s)
- ✅ Metal Performance Shaders integration
- ✅ Automatic GPU/CPU selection
- ✅ Comprehensive activation functions
- ✅ Production-ready error handling
- Linux CUDA Support
- NVIDIA GPU acceleration on Linux
- cuBLAS and cuDNN integration
- Docker containerization support
- Windows CUDA Support
- Native Windows CUDA integration
- DirectML fallback support
- Visual Studio compatibility
- 🔮 Multi-GPU Support - Distributed computation across devices
- 🔮 Custom Kernels - User-defined compute shaders
- 🔮 Memory Optimization - Smart memory pooling and reuse
- 🔮 Mixed Precision - FP16/BF16 support for faster training
- 🔮 TensorFlow Go Bindings - Interoperability with TensorFlow models
- 🔮 ONNX Runtime Integration - Load and run ONNX models
- 🔮 Distributed Training - Multi-node training support
We welcome contributions to make rnxa the premier ML acceleration framework for Go!
- Performance Optimization - CUDA kernels, memory management
- 🧪 Testing - Cross-platform testing, edge case validation
- Documentation - Tutorials, integration guides
- Debugging - Profiling tools, performance analysis
- Platform Support - AMD ROCm, Intel oneAPI
git clone https://github.com/xDarkicex/rnxa.git
cd rnxa
go mod tidy
go test ./...

macOS (Supported)
- OS: macOS 10.15+ (Catalina or later)
- Hardware: Apple Silicon (M1/M2/M3) or Intel Mac with Metal support
- Tools: Xcode Command Line Tools (xcode-select 2410+)
- Go: Version 1.25+
Linux (Planned Q2 2026)
- OS: Ubuntu 20.04+, CentOS 8+, RHEL 8+
- Hardware: NVIDIA GPU with Compute Capability 6.0+
- CUDA: Version 11.0+
- Drivers: NVIDIA Driver 450.80.02+
Windows (Planned Q3 2026)
- OS: Windows 10/11 (64-bit)
- Hardware: NVIDIA GPU with Compute Capability 6.0+
- CUDA: Version 11.0+
- Visual Studio: 2019 or later
MIT License - see LICENSE for details.
- Apple Metal Team - For creating the Metal Performance Shaders framework
- Go Team - For building a language that makes systems programming approachable
- relux Project - The original motivation and testing ground for rnxa
- Go ML Community - For driving innovation in Go-based machine learning
- Issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Email: gentry@xdarkicex.codes
- Documentation: docs.xdarkicex.codes
⚡ Accelerate your Go ML workloads with rnxa ⚡
Built with ❤️ for the Go ML community
rnxa: Because your ML deserves better performance.