Merged
131 changes: 16 additions & 115 deletions README.md
@@ -14,8 +14,7 @@ Module 3 focuses on **optimizing tensor operations** through parallel computing
- **CPU Parallelization**: Implement parallel tensor operations with Numba
- **GPU Programming**: Write CUDA kernels for tensor operations
- **Performance Optimization**: Achieve significant speedup through hardware acceleration
- **Matrix Multiplication**: Optimize the most computationally intensive operations
- **Backend Architecture**: Build multiple computational backends for flexible performance
- **Matrix Multiplication**: Optimize the most computationally intensive operations with operator fusion
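The operator-fusion objective above can be sketched in plain Python (a hypothetical illustration, not minitorch's actual API): fusing two elementwise passes into one avoids materializing an intermediate buffer.

```python
def map_twice_unfused(f, g, xs):
    # Two passes: materializes an intermediate list holding g(x) for every x.
    tmp = [g(x) for x in xs]
    return [f(y) for y in tmp]


def map_twice_fused(f, g, xs):
    # One fused pass: applies g then f per element, no intermediate buffer.
    return [f(g(x)) for x in xs]
```

Both return the same result, e.g. `map_twice_fused(lambda x: x + 1.0, lambda x: x * x, [1.0, 2.0, 3.0])` gives `[2.0, 5.0, 10.0]`; the fused version touches memory once per element, which is exactly what fused tensor kernels exploit.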

## Tasks Overview

@@ -27,15 +26,15 @@ Feel free to use numpy functions like `np.array_equal()` and `np.zeros()`.
File to edit: `minitorch/fast_ops.py`
Implement optimized batched matrix multiplication with parallel outer loops.
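The loop structure for this task can be sketched in pure Python (a simplified nested-list sketch, not the strided-storage version you will write; in `minitorch/fast_ops.py` the outer batch and row loops become Numba `prange` loops):

```python
def batched_matmul(a, b):
    """Naive batched matmul: a is (B, I, K), b is (B, K, J), as nested lists.

    In the Numba version the two outer loops use prange so each (batch, row)
    pair can run on its own thread; the accumulator stays in a local variable
    to avoid repeated writes to the output storage.
    """
    n_batch, n_rows, inner = len(a), len(a[0]), len(a[0][0])
    n_cols = len(b[0][0])
    out = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_batch)]
    for n in range(n_batch):            # prange in the Numba version
        for i in range(n_rows):         # prange in the Numba version
            for j in range(n_cols):
                acc = 0.0
                for k in range(inner):  # innermost loop: sequential reduction
                    acc += a[n][i][k] * b[n][k][j]
                out[n][i][j] = acc
    return out
```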

**Task 3.3**: GPU Operations
**Task 3.3**: GPU Operations (requires GPU)
File to edit: `minitorch/cuda_ops.py`
Implement CUDA kernels for tensor map, zip, and reduce operations.
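As a rough CPU analogue of the thread-indexing pattern these kernels use (a hypothetical sketch; the real kernels in `minitorch/cuda_ops.py` are decorated with `numba.cuda.jit` and run one thread per element):

```python
import math


def cuda_style_map(fn, in_storage, threads_per_block=32):
    """Simulate a 1-D CUDA map launch on the CPU."""
    n = len(in_storage)
    out = [0.0] * n
    blocks = math.ceil(n / threads_per_block)  # grid size, as in the launch config
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Global position, as in:
            # i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # boundary guard: the last block may be partially filled
                out[i] = fn(in_storage[i])
    return out
```

The boundary guard is the detail most often missed: the grid is rounded up to whole blocks, so threads past the end of the storage must do nothing.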

**Task 3.4**: GPU Matrix Multiplication
**Task 3.4**: GPU Matrix Multiplication (requires GPU)
File to edit: `minitorch/cuda_ops.py`
Implement CUDA matrix multiplication with shared memory optimization for maximum performance.
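The shared-memory idea can be illustrated with a pure-Python tiled matmul (a CPU sketch under stated assumptions, not the CUDA kernel itself): each tile of the inputs is copied into a small local buffer once and reused, which is what shared memory buys on the GPU.

```python
def tiled_matmul(a, b, tile=2):
    """Square matmul with tiling, mimicking the CUDA shared-memory pattern.

    On a GPU, each block cooperatively copies one `tile x tile` sub-matrix of
    a and b into fast shared memory, then accumulates partial products, so
    each global-memory element is read once per tile instead of once per
    output element.
    """
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # "Shared memory" copies of the current input tiles.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                for i in range(len(a_tile)):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for k in range(len(b_tile)):
                            acc += a_tile[i][k] * b_tile[k][j]
                        out[i0 + i][j0 + j] += acc
    return out
```

In the real kernel the tile copy is done cooperatively by the threads of a block, with a `cuda.syncthreads()` barrier before (and after) the accumulation loop.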

**Task 3.5**: Training
**Task 3.5**: Training (requires GPU)
File to edit: `project/run_fast_tensor.py`
Implement missing functions and train models on all datasets to demonstrate performance improvements.

@@ -44,95 +43,12 @@ Implement missing functions and train models on all datasets to demonstrate perf
- **[Installation Guide](installation.md)** - Setup instructions including GPU configuration
- **[Testing Guide](testing.md)** - How to run tests locally and handle GPU requirements

## Quick Start

### 1. Environment Setup
```bash
# Clone and navigate to your assignment
git clone <your-assignment-repo>
cd <assignment-directory>

# Create virtual environment (recommended)
conda create --name minitorch python
conda activate minitorch

# Install dependencies
pip install -e ".[dev,extra]"
```

### 2. Sync Previous Module Files
```bash
# Sync required files from your Module 2 solution
python sync_previous_module.py <path-to-module-2> .

# Example:
python sync_previous_module.py ../Module-2 .
```

### 3. Run Tests
```bash
# CPU tasks (run anywhere)
pytest -m task3_1 # CPU parallel operations
pytest -m task3_2 # CPU matrix multiplication

# GPU tasks (require CUDA-compatible GPU)
pytest -m task3_3 # GPU operations
pytest -m task3_4 # GPU matrix multiplication

# Style checks
pre-commit run --all-files
```

## GPU Setup

### Option 1: Google Colab (Recommended)
Most students should use Google Colab for GPU tasks:

1. Upload assignment files to Colab
2. Change runtime to GPU (Runtime → Change runtime type → GPU)
3. Install packages:
```python
!pip install -e ".[dev,extra]"
!python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```

### Option 2: Local GPU (If you have NVIDIA GPU)
For students with NVIDIA GPUs and CUDA-compatible hardware:

```bash
# Install CUDA toolkit
# Visit: https://developer.nvidia.com/cuda-downloads

# Install GPU packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install numba[cuda]

# Verify GPU support
python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```

## Testing Strategy

### CI/CD (GitHub Actions)
- **Task 3.1**: CPU parallel operations
- **Task 3.2**: CPU matrix multiplication
- **Style Check**: Code quality and formatting

### GPU Testing (Colab/Local GPU)
- **Task 3.3**: GPU operations (use Colab or local NVIDIA GPU)
- **Task 3.4**: GPU matrix multiplication (use Colab or local NVIDIA GPU)

### Performance Validation
```bash
# Compare backend performance
python project/run_fast_tensor.py # Optimized backends
python project/run_tensor.py # Basic tensor backend
python project/run_scalar.py # Scalar baseline
```
Follow this [link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing): save the Colab notebook to your Drive, select the T4 GPU runtime, and follow the instructions in the notebook.

## Development Tools

### Code Quality
## Code Quality
```bash
# Automatic style checking
pre-commit install
@@ -156,25 +72,9 @@ NUMBA_CUDA_DEBUG=1 pytest -m task3_3 -v
nvidia-smi -l 1 # Update every second
```

## Implementation Focus

### Task 3.1 & 3.2 (CPU Optimization)
- Implement `tensor_map`, `tensor_zip`, `tensor_reduce` with Numba parallel loops
- Optimize matrix multiplication with efficient loop ordering
- Focus on cache locality and parallel execution patterns

### Task 3.3 & 3.4 (GPU Acceleration)
- Write CUDA kernels for element-wise operations
- Implement efficient GPU matrix multiplication with shared memory
- Optimize thread block organization and memory coalescing

## Task 3.5 Training Results

### Performance Targets
- **CPU Backend**: Below 2 seconds per epoch
- **GPU Backend**: Below 1 second per epoch (on standard Colab GPU)

### Training Commands

#### Local Environment
```bash
# CPU Backend
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05
@@ -187,6 +87,14 @@ python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --R
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
```

#### Google Colab (Recommended)
```bash
# GPU Backend examples
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
```

### Student Results
**TODO: Add your training results here**

@@ -201,10 +109,3 @@ python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RAT
#### XOR Dataset
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

## Important Notes

- **GPU Limitations**: Tasks 3.3 and 3.4 cannot run in GitHub CI due to hardware requirements
- **GPU Testing**: Use Google Colab (recommended) or local NVIDIA GPU for GPU tasks
- **Performance Critical**: Implementations must show measurable speedup over sequential versions
- **Memory Management**: Be careful with GPU memory allocation and deallocation
58 changes: 2 additions & 56 deletions installation.md
@@ -83,60 +83,6 @@ Install all packages in your virtual environment:

## GPU Setup (Required for Tasks 3.3 and 3.4)

Tasks 3.3 and 3.4 require GPU support and won't run on GitHub CI.
Tasks 3.3 and 3.4 require GPU support. Use Google Colab for GPU access (sign up for the student version if needed).

### Option 1: Google Colab (Recommended)

Most students should use Google Colab as it provides free GPU access:

1. Upload your assignment files to Colab
2. Change runtime to GPU (Runtime → Change runtime type → GPU)
3. Install packages in Colab:
```python
!pip install -e ".[dev,extra]"
!python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```

### Option 2: Local GPU Setup (If you have NVIDIA GPU)

For students with NVIDIA GPUs and CUDA-compatible hardware:

1. **Install CUDA Toolkit**
```bash
# Visit: https://developer.nvidia.com/cuda-downloads
# Follow instructions for your OS
```

2. **Verify CUDA Installation**
```bash
>>> nvcc --version
>>> nvidia-smi
```

3. **Install GPU-compatible packages**
```bash
>>> pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
>>> pip install numba[cuda]
```

## Verification

Make sure everything is installed by running:

```bash
>>> python -c "import minitorch; print('Success!')"
```

Verify that the tensor functionality is available:

```bash
>>> python -c "from minitorch import tensor; print('Module 3 ready!')"
```

Check if CUDA support is available (for GPU tasks):

```bash
>>> python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```

You're ready to start Module 3!
Follow this [Google Colab link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing), save the file to your drive, select T4 GPU runtime, and follow the instructions in the notebook.
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -10,8 +10,8 @@ requires-python = ">=3.8"
dependencies = [
"colorama==0.4.6",
"hypothesis==6.138.2",
"numba==0.61.2",
"numpy>=1.24,<2.3",
"numba-cuda[cu12]>=0.4.0", ## cu12 is for CUDA 12.0 cu13 is for CUDA 13.0
"numpy<2.0",
"pytest==8.4.1",
"pytest-env==1.1.5",
"typing_extensions",
89 changes: 5 additions & 84 deletions testing.md
@@ -5,11 +5,9 @@
This project uses pytest for testing. Tests are organized by task:

```bash
# Run all tests for a specific task
# CPU Tasks (3.1 & 3.2) - Run locally
pytest -m task3_1 # CPU parallel operations
pytest -m task3_2 # CPU matrix multiplication
pytest -m task3_3 # GPU operations (requires CUDA)
pytest -m task3_4 # GPU matrix multiplication (requires CUDA)

# Run all tests
pytest
@@ -31,26 +29,12 @@ pytest tests/test_tensor_general.py::test_matrix_multiply
- GitHub Actions CI only runs tasks 3.1 and 3.2 (CPU only)
- Tasks 3.3 and 3.4 require local GPU or Google Colab

**Option 1: Google Colab Testing (Recommended):**
```python
# In Colab notebook
!pip install -e ".[dev,extra]"
!python -m pytest -m task3_3 -v
!python -m pytest -m task3_4 -v
!python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```
**GPU Tasks (3.3 & 3.4) - Google Colab (Recommended):**

**Option 2: Local GPU Testing (If you have NVIDIA GPU):**
Follow instructions on the [Google Colab link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing) and run tests like this:
```bash
# Verify CUDA is available
python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"

# Test GPU tasks locally
pytest -m task3_3 # GPU operations
pytest -m task3_4 # GPU matrix multiplication

# Debug GPU issues
NUMBA_DISABLE_JIT=1 pytest -m task3_3 -v # Disable JIT for debugging
!cd $DIR; python3.11 -m pytest -m task3_3 -v
!cd $DIR; python3.11 -m pytest -m task3_4 -v
```

### Style and Code Quality Checks
@@ -67,18 +51,6 @@ ruff format . # Code formatting
pyright . # Type checking
```

### Task 3.5 - Performance Evaluation

**Training Scripts:**
```bash
# Run optimized training (CPU parallel)
python project/run_fast_tensor.py

# Compare with previous implementations
python project/run_tensor.py # Basic tensor implementation
python project/run_scalar.py # Scalar implementation
```

### Parallel Diagnostics (Tasks 3.1 & 3.2)

**Running Parallel Check:**
@@ -87,20 +59,6 @@ python project/run_scalar.py # Scalar implementation
python project/parallel_check.py
```

**Expected Output for Task 3.1:**
- **MAP**: Should show parallel loops for both fast path and general case with allocation hoisting for `np.zeros()` calls
- **ZIP**: Should show parallel loops for both fast path and general case with optimized memory allocations
- **REDUCE**: Should show main parallel loop with proper allocation hoisting

**Expected Output for Task 3.2:**
- **MATRIX MULTIPLY**: Should show nested parallel loops for batch and row dimensions with no allocation hoisting (since no index buffers are used)

**Key Success Indicators:**
- Parallel loops detected with `prange()`
- Memory allocations hoisted out of parallel regions
- Loop optimizations applied by Numba
- No unexpected function calls in critical paths

### Pre-commit Hooks (Automatic Style Checking)

The project uses pre-commit hooks that run automatically before each commit:
@@ -111,41 +69,4 @@ pre-commit install

# Now style checks run automatically on every commit
git commit -m "your message" # Will run style checks first
```

### Debugging Tools

**Numba Debugging:**
```bash
# Disable JIT compilation for debugging
NUMBA_DISABLE_JIT=1 pytest -m task3_1 -v

# Enable Numba debugging output
NUMBA_DEBUG=1 python project/run_fast_tensor.py
```

**CUDA Debugging:**
```bash
# Check CUDA device properties
python -c "import numba.cuda; print(numba.cuda.gpus)"

# Monitor GPU memory usage
nvidia-smi -l 1 # Update every second

# Debug CUDA kernel launches
NUMBA_CUDA_DEBUG=1 python -m pytest -m task3_3 -v
```

**Performance Profiling:**
```bash
# Time specific operations
python -c "
import time
import minitorch
backend = minitorch.TensorBackend(minitorch.FastOps)
# Time your operations here
"

# Profile memory usage
python -m memory_profiler project/run_fast_tensor.py
```
3 changes: 2 additions & 1 deletion tests/test_tensor_general.py
@@ -15,7 +15,8 @@

one_arg, two_arg, red_arg = MathTestVariable._comp_testing()


from numba import config
config.CUDA_ENABLE_PYNVJITLINK = 1
# The tests in this file only run the main mathematical functions.
# The difference is that they run with different tensor ops backends.
