@shivasankarka (Collaborator) commented Oct 26, 2025

This PR introduces initial GPU support for NuMojo (#273).

It adds unified device and storage abstractions, a basic matrix representation for GPU computations, and several core GPU kernels (elementwise add/sub/mul, matmul, fill, and a block-level reduction). This work lays the foundation for using Mojo GPU features to accelerate array operations.

The design is inspired by PyTorch's Tensor while keeping NumPy-like API choices where possible.

Notes

  • StaticMatrix is still a very basic structure, with only a few getter and setter functions, and serves as a proof of concept for a GPU backend in NuMojo. We will expand it in the future to cover all features of the Matrix type.
  • It is named StaticMatrix because a compile-time shape and strides would help optimize many of the loops and GPU kernels. The goal is a Matrix type that takes advantage of Mojo's compile-time capabilities as much as possible. We will modify the API to support compile-time optimisations in future updates.
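To illustrate why a compile-time shape matters, here is a minimal sketch of how strides and flat offsets follow from a 2-D shape. It is a plain Python reference model, not NuMojo code; the names `strides_for` and `flat_offset` are hypothetical and do not come from this PR. When the shape is known at compile time, the values computed here become constants that a compiler can fold into the indexing arithmetic.

```python
def strides_for(shape, order="C"):
    """Return element strides for a 2-D shape.

    "C" is row-major (last index varies fastest),
    "F" is column-major (first index varies fastest).
    """
    rows, cols = shape
    if order == "C":
        return (cols, 1)
    if order == "F":
        return (1, rows)
    raise ValueError("order must be 'C' or 'F'")


def flat_offset(index, strides):
    """Map a (row, col) index to an offset into the flat buffer."""
    i, j = index
    return i * strides[0] + j * strides[1]


if __name__ == "__main__":
    c_strides = strides_for((3, 4), order="C")
    f_strides = strides_for((3, 4), order="F")
    print(c_strides, flat_offset((2, 1), c_strides))
    print(f_strides, flat_offset((2, 1), f_strides))
```

The same element lands at a different buffer position depending on the order, which is why both the shape and the order must be fixed before a kernel's index arithmetic can be specialized.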

What’s Included

Device & context abstraction

  • numojo/core/gpu/device.mojo — device and context primitives to target GPU

Unified storage

  • numojo/core/gpu/storage.mojo — unified CPU/GPU memory management for buffers

Matrix primitives

  • numojo/core/staticmatrix.mojo — adds a StaticMatrix struct to prototype GPU usage before extending to N-D arrays

GPU kernels

  • numojo/core/gpu/matrix_kernels.mojo — implements:
    • Vectorized elementwise kernels: add, sub, mul, fill
    • Tiled matmul helpers
    • matrix_reduce_sum_kernel (per-block reduction)
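As a reference for what the tiled matmul and per-block reduction compute, the following is a minimal sketch in plain Python. It is not the code in matrix_kernels.mojo: the function names, the tile loop order, and the tree-style reduction are assumptions chosen because they are common GPU patterns, and the PR text does not say which variant the kernels use.

```python
def tiled_matmul(a, b, tile):
    """Multiply a (n x k) by b (k x m), visiting the operands tile by tile.

    On a GPU, each (i0, j0) output tile would typically map to one thread
    block, and each k0 step would stage one tile of a and b in fast memory.
    """
    n, k, m = len(a), len(a[0]), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def block_reduce_sum(values, block_size):
    """Return one partial sum per block of block_size consecutive elements.

    Each block is reduced with a halving-stride tree, mimicking a
    shared-memory reduction. block_size must be a power of two.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a positive power of two")
    partials = []
    for start in range(0, len(values), block_size):
        shared = list(values[start:start + block_size])
        # Pad a short final block with zeros so the tree stays balanced.
        shared += [0.0] * (block_size - len(shared))
        stride = block_size // 2
        while stride > 0:
            for tid in range(stride):
                shared[tid] += shared[tid + stride]
            stride //= 2
        partials.append(shared[0])
    return partials


if __name__ == "__main__":
    print(tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]], tile=2))
    partials = block_reduce_sum(list(range(1, 11)), block_size=4)
    print(partials, sum(partials))
```

A per-block reduction yields one partial sum per block, so a second step (another kernel launch or a host-side sum) is still needed to obtain the final total.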

Other

  • Launch-parameter helpers and dtype-specialization hooks for future optimizations
  • Updated for latest Mojo nightly (pixi updates)
  • Small dtype fixes and deprecation error fixes
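A launch-parameter helper usually reduces to a ceiling division: choose enough blocks that every element is covered by at least one thread. The sketch below shows that arithmetic in plain Python. The name `launch_params` and the default of 256 threads per block are assumptions for illustration, not the helpers added in this PR.

```python
def launch_params(num_elements, threads_per_block=256):
    """Return (num_blocks, threads_per_block) covering num_elements threads."""
    if num_elements < 0 or threads_per_block <= 0:
        raise ValueError("num_elements must be >= 0 and threads_per_block > 0")
    # Ceiling division: round up so a partial final block is still launched.
    num_blocks = (num_elements + threads_per_block - 1) // threads_per_block
    return num_blocks, threads_per_block


if __name__ == "__main__":
    print(launch_params(1024 * 1024))
    print(launch_params(1000))
```

Because the last block may be only partly used, a kernel launched this way needs a bounds check (global index < num_elements) before it reads or writes.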

Example

fn main() raises:
    alias SIZE: Int = 1024
    alias cpu: Device = Device.CPU
    alias mps: Device = Device.MPS

    # CPU: build two SIZE x SIZE matrices and multiply them.
    var arr_cpu_1 = StaticMatrix[DType.float32](shape=(SIZE, SIZE), order="C", fill_value=1.0)
    var arr_cpu_2 = StaticMatrix[DType.float32]((SIZE, SIZE), fill_value=2.0)
    var matmul_cpu = arr_cpu_1 @ arr_cpu_2
    print(matmul_cpu)

    # GPU: allocate directly on the MPS device and multiply there.
    var arr_gpu_1 = StaticMatrix[DType.float32, device=mps](shape=(SIZE, SIZE), order="C", fill_value=1.0)
    var arr_gpu_2 = StaticMatrix[DType.float32, device=mps](shape=(SIZE, SIZE), order="C", fill_value=2.0)
    var matmul_gpu = arr_gpu_1 @ arr_gpu_2
    print(matmul_gpu)

    # CPU to GPU: move the existing CPU matrices to the MPS device, then multiply.
    var arr_gpu_fromcpu_1 = arr_cpu_1.to[mps]()
    var arr_gpu_fromcpu_2 = arr_cpu_2.to[mps]()
    var matmul_gpu_fromcpu = arr_gpu_fromcpu_1 @ arr_gpu_fromcpu_2
    print(matmul_gpu_fromcpu)
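All three products in the example should agree. Multiplying a matrix filled with 1.0 by one filled with 2.0 gives a matrix whose every entry is SIZE × 1.0 × 2.0, which is 2048.0 for SIZE = 1024. The sketch below checks this on a small size in plain Python. The helper names are hypothetical and the code does not use NuMojo.

```python
def expected_entry(size, a_fill, b_fill):
    """Each entry of (a_fill-filled) @ (b_fill-filled) sums `size` equal products."""
    return size * a_fill * b_fill


def naive_matmul(a, b):
    """Textbook triple-loop matrix product, used here only as a checker."""
    n, k, m = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
        for i in range(n)
    ]


if __name__ == "__main__":
    size = 4  # small stand-in for SIZE = 1024
    ones = [[1.0] * size for _ in range(size)]
    twos = [[2.0] * size for _ in range(size)]
    product = naive_matmul(ones, twos)
    assert all(v == expected_entry(size, 1.0, 2.0) for row in product for v in row)
    print(expected_entry(1024, 1.0, 2.0))
```

Comparing the CPU, GPU, and CPU-to-GPU results against this constant is a quick sanity check that each code path computes the same product.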

shivasankarka and others added 23 commits October 21, 2025 23:29
…and-Algorithms-group#275)

## Pull Request Overview (From Copilot)

This PR enhances ComplexNDArray functionality by adding comparison
operators, trait methods, statistical/reduction methods, and array
manipulation capabilities. It also introduces temporary Int conversions
for strides/shape operations and implements SIMD load/store methods for
vectorized calculations.

### Key Changes
- Added trait implementations (ImplicitlyCopyable, Movable) and
conversion methods (__bool__, __int__, __float__) for ComplexNDArray
- Implemented magnitude-based comparison operators (__lt__, __le__,
__gt__, __ge__) for complex arrays
- Added statistical methods (all, any, sum, prod, mean, max, min,
argmax, argmin, cumsum, cumprod) and array manipulation methods
(flatten, fill, row, col, clip, round, T, diagonal, trace, tolist,
resize)
- Changed internal buffer types from `UnsafePointer[Int]` to
`UnsafePointer[Scalar[DType.int]]` in NDArrayShape, NDArrayStrides, and
Item structs
- Added SIMD load/store methods (load, store, unsafe_load, unsafe_store)
for Item, Shape, and Strides

<details>
<summary>Show a summary per file</summary>

| File | Description |
| ---- | ----------- |
| numojo/routines/indexing.mojo | Added Int conversions for stride operations in compress function |
| numojo/routines/creation.mojo | Removed duplicate import statements |
| numojo/core/ndstrides.mojo | Changed buffer type to Scalar[DType.int], updated __setitem__ validation, added SIMD load/store methods |
| numojo/core/ndshape.mojo | Changed buffer type to Scalar[DType.int], updated __setitem__ validation, added SIMD load/store methods, modified size_of_array calculation |
| numojo/core/ndarray.mojo | Added Int conversions for stride/shape buffer accesses throughout |
| numojo/core/item.mojo | Changed buffer type to Scalar[DType.int], removed Item.__init__(idx, shape) constructor and offset() method, added SIMD load/store methods |
| numojo/core/complex/complex_simd.mojo | Added ImplicitlyCopyable and Movable traits to ComplexSIMD |
| numojo/core/complex/complex_ndarray.mojo | Added comparison operators, conversion methods, power operations, statistical methods, and array manipulation methods; added Int conversions for stride operations |
</details>

---------

Co-authored-by: ZHU Yuhao 朱宇浩 <dr.yuhao.zhu@outlook.com>
@shivasankarka shivasankarka marked this pull request as draft October 26, 2025 11:11

2 participants