tensorax/
├── 📄 README.md # Main project documentation
├── 📄 LICENSE # MIT License
├── 📄 setup.py # Build configuration
├── 📄 pyproject.toml # Python project metadata
├── 📄 MANIFEST.in # Package manifest
├── 📄 requirements.txt # Runtime dependencies (pybind11)
├── 📄 requirements-dev.txt # Development dependencies
├── 📄 .gitignore # Git ignore rules
├── 🔧 build.sh # Quick build script
├── 📄 demo.py # Comprehensive demo script
├── 📄 PROJECT_STRUCTURE.md # This file
├── 📄 REFACTORING_SUMMARY.md # NumPy removal summary
│
├── 📁 csrc/ # C++ and CUDA source code
│ ├── 📄 tensor_ops.h # Operation declarations
│ ├── 📄 tensor_ops.cpp # Python bindings (pybind11)
│ ├── 📁 cpu/ # CPU implementations
│ │ └── 📄 tensor_cpu.cpp # CPU operations (add, mul, matmul, etc.)
│ └── 📁 cuda/ # CUDA implementations
│ ├── 📄 cuda_utils.cuh # CUDA utilities and macros
│ ├── 📄 tensor_cuda.cu # CUDA memory management
│ └── 📁 kernels/ # Optimized CUDA kernels
│ ├── 📄 elementwise.cu # Element-wise ops (add, mul, sqrt, etc.)
│ ├── 📄 reduction.cu # Reduction ops (sum, max, etc.)
│ └── 📄 matmul.cu # Tiled matrix multiplication
│
├── 📁 tensorax/ # Python package
│ ├── 📄 __init__.py # Package initialization
│ ├── 📄 tensor.py # Core Tensor class with autograd
│ ├── 📄 functional.py # Functional API (F.relu, losses, etc.)
│ ├── 📄 optim.py # Optimizers (SGD, Adam)
│ └── 📁 nn/ # Neural network modules
│ ├── 📄 __init__.py
│ ├── 📄 module.py # Base Module class
│ └── 📄 layers.py # Layers (Linear, ReLU, Sequential, etc.)
│
├── 📁 tests/ # Test suite
│ ├── 📄 __init__.py
│ ├── 📄 test_tensor.py # Tensor operation tests
│ ├── 📄 test_nn.py # Neural network tests
│ ├── 📄 test_optim.py # Optimizer tests
│ └── 📄 test_functional.py # Functional API tests
│
├── 📁 examples/ # Usage examples
│ ├── 📄 README.md
│ ├── 📄 basic_operations.py # Basic tensor ops demo
│ ├── 📄 simple_nn.py # Neural network example
│ └── 📄 cuda_example.py # GPU acceleration demo
│
└── 📁 docs/ # Documentation
├── 📄 ARCHITECTURE.md # System architecture details
├── 📄 DEVELOPMENT.md # Development workflow
└── 📄 GUIDE.md # Complete development roadmap
Purpose: High-performance tensor operations with zero PyTorch/NumPy dependency
Key Files:
- tensor_ops.cpp: Python bindings using pybind11, high-level operation wrappers
- tensor_ops.h: Operation declarations and the TensorImpl class
- cuda/kernels/*.cu: Optimized CUDA kernels
- cpu/tensor_cpu.cpp: CPU fallback implementations
Operations Fully Implemented:
- Element-wise: add, subtract, multiply, divide, sqrt
- Matrix: matmul (tiled CUDA algorithm with 448x speedup), transpose
- Activations: ReLU, sigmoid, tanh, softmax
- Losses: MSE, cross-entropy
- Utilities: random normal distribution, device transfers (CPU↔CUDA)
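The tiled matmul mentioned above gets its speed from processing the matrices in small blocks that fit in GPU shared memory. A minimal pure-Python sketch of the blocking idea (illustrative only; the real kernel is the CUDA implementation in csrc/cuda/kernels/matmul.cu):

```python
def tiled_matmul(A, B, tile=2):
    """Blocked matrix multiply: work on TILE x TILE sub-blocks, the same
    access pattern a CUDA kernel uses to stage tiles in shared memory."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # tile row of C
        for j0 in range(0, m, tile):        # tile column of C
            for k0 in range(0, k, tile):    # slide tiles along shared dim
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C

print(tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

The result is identical to a naive triple loop; on a GPU the blocking pays off because each tile of A and B is loaded from global memory once and reused `tile` times.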
Purpose: User-friendly PyTorch-like interface with automatic differentiation
Key Classes:
Tensor: Core multi-dimensional array with full autograd support
- Operations: +, -, *, /, @ (matmul)
- Properties: .T (transpose), .shape, .device, .requires_grad
- Methods: .backward(), .zero_grad(), .sqrt(), .cuda(), .cpu()
- Factory methods: .zeros(), .ones(), .full(), .randn()
Module: Base class for neural network layers
- Parameter management
- Device transfer support
Optimizer: Base class for optimization algorithms
- Parameter updates with gradient descent
Neural Network Modules (nn/):
- Linear: Fully connected layer with Xavier initialization
- ReLU/Sigmoid/Tanh: Activation layers
- Sequential: Layer container accepting a list or varargs
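Xavier initialization keeps activation variance roughly constant across layers by scaling the weight range to the layer's fan-in and fan-out. A sketch of one common variant (Glorot uniform; the exact variant Linear uses is not specified here):

```python
import math
import random

def xavier_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform init: sample W uniformly from
    [-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[random.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]

W = xavier_uniform(128, 64)   # weight matrix for a 128 -> 64 Linear layer
```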
Optimizers (optim.py):
- SGD: Stochastic Gradient Descent with momentum support
- Adam: Adaptive moment estimation with bias correction
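Adam's bias correction compensates for the zero-initialized moment estimates, which would otherwise bias early updates toward zero. A minimal single-parameter sketch of the standard update rule (names b1, b2, eps are illustrative, not tensorax's API):

```python
def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: EMAs of the gradient and squared gradient,
    plus bias correction for step count t (1-indexed)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    param -= lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v

# On step 1 the corrections cancel exactly, so the step size is ~lr:
p, m, v = adam_step(1.0, 0.5, 0.0, 0.0, t=1)  # p ≈ 0.999
```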
Purpose: Stateless operations for functional programming style
Functions:
- Activations: relu(), sigmoid(), tanh(), softmax()
- Losses: mse_loss(), cross_entropy_loss()
- Operations: linear()
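For reference on the semantics these functions follow, a numerically stable softmax and the cross-entropy it induces can be sketched in plain Python (the standard formulas, not the library's implementation):

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating
    so large logits cannot overflow exp()."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target])

probs = softmax([1.0, 2.0, 3.0])   # sums to 1, largest logit wins
```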
Complete backpropagation system supporting:
- All arithmetic operations (+, -, *, /)
- Matrix operations (matmul, transpose)
- Activation functions (ReLU, sigmoid, tanh)
- Loss functions (MSE)
- Gradient accumulation and parameter updates
Gradient flow tracked through computational graph with proper chain rule application.
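The chain-rule mechanics described above can be illustrated with a tiny scalar autograd sketch (a conceptual model of reverse-mode differentiation, not tensorax's actual Tensor class):

```python
class Scalar:
    """Minimal reverse-mode autograd node in the spirit of Tensor.backward()."""
    def __init__(self, value, parents=(), grad_fns=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents      # nodes this value was computed from
        self._grad_fns = grad_fns    # local derivative w.r.t. each parent

    def __mul__(self, other):
        return Scalar(self.value * other.value,
                      (self, other),
                      (lambda g: g * other.value,   # d(xy)/dx = y
                       lambda g: g * self.value))   # d(xy)/dy = x

    def __add__(self, other):
        return Scalar(self.value + other.value,
                      (self, other),
                      (lambda g: g, lambda g: g))   # d(x+y)/dx = 1

    def backward(self, grad=1.0):
        self.grad += grad            # accumulate, as with repeated use
        for parent, fn in zip(self._parents, self._grad_fns):
            parent.backward(fn(grad))

x, y = Scalar(2.0), Scalar(3.0)
z = x * y + x        # z = xy + x, so dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Note that x appears in the graph twice, so its gradient accumulates across both paths; that is the same accumulation behavior .zero_grad() exists to reset.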
Test Coverage:
- Tensor operations and device transfers
- Gradient computation and backpropagation
- Layer functionality
- Optimizer behavior (SGD with momentum, Adam)
- Functional API
- ARCHITECTURE.md: System design, memory management, kernel design
- DEVELOPMENT.md: Build process, testing, debugging
- GUIDE.md: Complete development roadmap
- Python 3.8+
- C++17 compiler (g++ or clang++)
- CUDA Toolkit 11.0+ (optional, for GPU support)
- pybind11 (automatically installed)
From PyPI:

```bash
pip install tensorax
```

From Source:

```bash
# Clone repository
git clone https://github.com/NotShrirang/tensorax.git
cd tensorax

# Quick build (automatically detects CUDA)
bash build.sh

# Install in development mode
pip install -e .
```

Run the demo and tests:

```bash
# Comprehensive demonstration of all features
python demo.py

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=tensorax --cov-report=html
```

Run the examples:

```bash
# Basic operations
python examples/basic_operations.py

# Neural network
python examples/simple_nn.py

# CUDA demo (requires GPU)
python examples/cuda_example.py
```

- Project structure and build system
- Basic tensor operations (CPU/CUDA)
- Tensor class with device management
- Module system for neural networks
- Common layers (Linear, activations)
- Optimizers (SGD, Adam)
- Test infrastructure
- Documentation and examples
- Complete autograd system
- Convolution and pooling layers
- More reduction operations
- Batch normalization
- Learning rate schedulers
- Model serialization
- Multi-GPU support
- Mixed precision training
Runtime:
- Python 3.8+
- pybind11 (no NumPy or PyTorch dependency)
Build:
- C++17 compiler
- pybind11
- CUDA Toolkit 11.0+ (optional)
Build configuration:
- setup.py: Main build script
- pyproject.toml: Modern Python packaging
- MANIFEST.in: Files to include in distribution
Automatically detected via CUDA_HOME environment variable.
Falls back to CPU-only if CUDA not available.
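A sketch of how such detection commonly works in a setup script (assumed logic for illustration, not necessarily what setup.py does):

```python
import os
import shutil

def find_cuda_home():
    """Locate the CUDA toolkit: trust CUDA_HOME if it points at a real
    directory, otherwise look for nvcc on PATH; None means CPU-only."""
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home and os.path.isdir(cuda_home):
        return cuda_home
    nvcc = shutil.which("nvcc")
    if nvcc:
        # .../cuda/bin/nvcc -> .../cuda
        return os.path.dirname(os.path.dirname(nvcc))
    return None  # build falls back to a CPU-only extension

print("CUDA toolkit:", find_cuda_home() or "not found (CPU-only build)")
```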
- CUDA C Programming Guide
- CUDA Best Practices Guide
- Nsight Compute/Systems profiling tools
- PyTorch source code (reference implementation)
- CS231n course (Stanford)
- Deep Learning book (Goodfellow et al.)
- PyTorch
- TinyGrad
- JAX
See docs/DEVELOPMENT.md for:
- Development environment setup
- Code style guidelines
- Testing procedures
- Pull request process
- Use tiled algorithms for matrix ops
- Maximize shared memory usage
- Ensure coalesced memory access
- Profile with Nsight Compute
- Consider kernel fusion
- Minimize Python/C++ boundary crossings
- Batch operations when possible
- Keep hot numeric loops in C++ rather than Python
- Consider Cython for critical paths
- Check CUDA_HOME is set correctly
- Verify C++ compiler supports C++17
- Ensure pybind11 is installed
- Check CUDA availability with cuda_is_available()
- Verify tensor devices match for operations
- Check array shapes for broadcasting
- CUDA memory leaks: Check cudaFree calls
- CPU memory: Let Python GC handle it
- Use smaller batch sizes if OOM
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: docs/ directory
- Examples: examples/ directory
Next Steps: Follow the development roadmap in docs/GUIDE.md to implement remaining features!