Merged
131 changes: 16 additions & 115 deletions README.md
@@ -14,8 +14,7 @@ Module 3 focuses on **optimizing tensor operations** through parallel computing
- **CPU Parallelization**: Implement parallel tensor operations with Numba
- **GPU Programming**: Write CUDA kernels for tensor operations
- **Performance Optimization**: Achieve significant speedup through hardware acceleration
- **Matrix Multiplication**: Optimize the most computationally intensive operations
- **Backend Architecture**: Build multiple computational backends for flexible performance
- **Matrix Multiplication**: Optimize the most computationally intensive operations with operator fusion
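The operator-fusion objective above can be sketched in plain Python (a hypothetical illustration, not minitorch's actual API): fusing two elementwise passes into one avoids materializing an intermediate buffer.

```python
def map_twice_unfused(f, g, xs):
    # Two passes: materializes an intermediate list holding g(x) for every x.
    tmp = [g(x) for x in xs]
    return [f(y) for y in tmp]


def map_twice_fused(f, g, xs):
    # One fused pass: applies g then f per element, no intermediate buffer.
    return [f(g(x)) for x in xs]
```

Both return the same result, e.g. `map_twice_fused(lambda x: x + 1.0, lambda x: x * x, [1.0, 2.0, 3.0])` gives `[2.0, 5.0, 10.0]`; the fused version touches memory once per element, which is exactly what fused tensor kernels exploit.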

## Tasks Overview

@@ -27,15 +26,15 @@ Feel free to use numpy functions like `np.array_equal()` and `np.zeros()`.
File to edit: `minitorch/fast_ops.py`
Implement optimized batched matrix multiplication with parallel outer loops.
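The loop structure for this task can be sketched in pure Python (a simplified nested-list sketch, not the strided-storage version you will write; in `minitorch/fast_ops.py` the outer batch and row loops become Numba `prange` loops):

```python
def batched_matmul(a, b):
    """Naive batched matmul: a is (B, I, K), b is (B, K, J), as nested lists.

    In the Numba version the two outer loops use prange so each (batch, row)
    pair can run on its own thread; the accumulator stays in a local variable
    to avoid repeated writes to the output storage.
    """
    n_batch, n_rows, inner = len(a), len(a[0]), len(a[0][0])
    n_cols = len(b[0][0])
    out = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_batch)]
    for n in range(n_batch):            # prange in the Numba version
        for i in range(n_rows):         # prange in the Numba version
            for j in range(n_cols):
                acc = 0.0
                for k in range(inner):  # innermost loop: sequential reduction
                    acc += a[n][i][k] * b[n][k][j]
                out[n][i][j] = acc
    return out
```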

**Task 3.3**: GPU Operations
**Task 3.3**: GPU Operations (requires GPU)
File to edit: `minitorch/cuda_ops.py`
Implement CUDA kernels for tensor map, zip, and reduce operations.
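As a rough CPU analogue of the thread-indexing pattern these kernels use (a hypothetical sketch; the real kernels in `minitorch/cuda_ops.py` are decorated with `numba.cuda.jit` and run one thread per element):

```python
import math


def cuda_style_map(fn, in_storage, threads_per_block=32):
    """Simulate a 1-D CUDA map launch on the CPU."""
    n = len(in_storage)
    out = [0.0] * n
    blocks = math.ceil(n / threads_per_block)  # grid size, as in the launch config
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Global position, as in:
            # i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # boundary guard: the last block may be partially filled
                out[i] = fn(in_storage[i])
    return out
```

The boundary guard is the detail most often missed: the grid is rounded up to whole blocks, so threads past the end of the storage must do nothing.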

**Task 3.4**: GPU Matrix Multiplication
**Task 3.4**: GPU Matrix Multiplication (requires GPU)
File to edit: `minitorch/cuda_ops.py`
Implement CUDA matrix multiplication with shared memory optimization for maximum performance.
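The shared-memory idea can be illustrated with a pure-Python tiled matmul (a CPU sketch under stated assumptions, not the CUDA kernel itself): each tile of the inputs is copied into a small local buffer once and reused, which is what shared memory buys on the GPU.

```python
def tiled_matmul(a, b, tile=2):
    """Square matmul with tiling, mimicking the CUDA shared-memory pattern.

    On a GPU, each block cooperatively copies one `tile x tile` sub-matrix of
    a and b into fast shared memory, then accumulates partial products, so
    each global-memory element is read once per tile instead of once per
    output element.
    """
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # "Shared memory" copies of the current input tiles.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                for i in range(len(a_tile)):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for k in range(len(b_tile)):
                            acc += a_tile[i][k] * b_tile[k][j]
                        out[i0 + i][j0 + j] += acc
    return out
```

In the real kernel the tile copy is done cooperatively by the threads of a block, with a `cuda.syncthreads()` barrier before (and after) the accumulation loop.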

**Task 3.5**: Training
**Task 3.5**: Training (requires GPU)
File to edit: `project/run_fast_tensor.py`
Implement missing functions and train models on all datasets to demonstrate performance improvements.

@@ -44,95 +43,12 @@ Implement missing functions and train models on all datasets to demonstrate perf
- **[Installation Guide](installation.md)** - Setup instructions including GPU configuration
- **[Testing Guide](testing.md)** - How to run tests locally and handle GPU requirements

## Quick Start

### 1. Environment Setup
```bash
# Clone and navigate to your assignment
git clone <your-assignment-repo>
cd <assignment-directory>

# Create virtual environment (recommended)
conda create --name minitorch python
conda activate minitorch

# Install dependencies
pip install -e ".[dev,extra]"
```

### 2. Sync Previous Module Files
```bash
# Sync required files from your Module 2 solution
python sync_previous_module.py <path-to-module-2> .

# Example:
python sync_previous_module.py ../Module-2 .
```

### 3. Run Tests
```bash
# CPU tasks (run anywhere)
pytest -m task3_1 # CPU parallel operations
pytest -m task3_2 # CPU matrix multiplication

# GPU tasks (require CUDA-compatible GPU)
pytest -m task3_3 # GPU operations
pytest -m task3_4 # GPU matrix multiplication

# Style checks
pre-commit run --all-files
```

## GPU Setup

### Option 1: Google Colab (Recommended)
Most students should use Google Colab for GPU tasks:

1. Upload assignment files to Colab
2. Change runtime to GPU (Runtime → Change runtime type → GPU)
3. Install packages:
```python
!pip install -e ".[dev,extra]"
!python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```

### Option 2: Local GPU (If you have NVIDIA GPU)
For students with NVIDIA GPUs and CUDA-compatible hardware:

```bash
# Install CUDA toolkit
# Visit: https://developer.nvidia.com/cuda-downloads

# Install GPU packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install numba[cuda]

# Verify GPU support
python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```

## Testing Strategy

### CI/CD (GitHub Actions)
- **Task 3.1**: CPU parallel operations
- **Task 3.2**: CPU matrix multiplication
- **Style Check**: Code quality and formatting

### GPU Testing (Colab/Local GPU)
- **Task 3.3**: GPU operations (use Colab or local NVIDIA GPU)
- **Task 3.4**: GPU matrix multiplication (use Colab or local NVIDIA GPU)

### Performance Validation
```bash
# Compare backend performance
python project/run_fast_tensor.py # Optimized backends
python project/run_tensor.py # Basic tensor backend
python project/run_scalar.py # Scalar baseline
```
Follow this [link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing): save the Colab notebook to your Drive, select the T4 GPU runtime, and follow the instructions in the notebook.

## Development Tools

### Code Quality
## Code Quality
```bash
# Automatic style checking
pre-commit install
@@ -156,25 +72,9 @@ NUMBA_CUDA_DEBUG=1 pytest -m task3_3 -v
nvidia-smi -l 1 # Update every second
```

## Implementation Focus

### Task 3.1 & 3.2 (CPU Optimization)
- Implement `tensor_map`, `tensor_zip`, `tensor_reduce` with Numba parallel loops
- Optimize matrix multiplication with efficient loop ordering
- Focus on cache locality and parallel execution patterns

### Task 3.3 & 3.4 (GPU Acceleration)
- Write CUDA kernels for element-wise operations
- Implement efficient GPU matrix multiplication with shared memory
- Optimize thread block organization and memory coalescing

## Task 3.5 Training Results

### Performance Targets
- **CPU Backend**: Below 2 seconds per epoch
- **GPU Backend**: Below 1 second per epoch (on standard Colab GPU)

### Training Commands

#### Local Environment
```bash
# CPU Backend
python project/run_fast_tensor.py --BACKEND cpu --HIDDEN 100 --DATASET simple --RATE 0.05
@@ -187,6 +87,14 @@ python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --R
python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
```

#### Google Colab (Recommended)
```bash
# GPU Backend examples
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET simple --RATE 0.05
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET split --RATE 0.05
!cd $DIR; PYTHONPATH=/content/$DIR python3.11 project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RATE 0.05
```

### Student Results
**TODO: Add your training results here**

@@ -201,10 +109,3 @@ python project/run_fast_tensor.py --BACKEND gpu --HIDDEN 100 --DATASET xor --RAT
#### XOR Dataset
- CPU Backend: [Add time per epoch and accuracy]
- GPU Backend: [Add time per epoch and accuracy]

## Important Notes

- **GPU Limitations**: Tasks 3.3 and 3.4 cannot run in GitHub CI due to hardware requirements
- **GPU Testing**: Use Google Colab (recommended) or local NVIDIA GPU for GPU tasks
- **Performance Critical**: Implementations must show measurable speedup over sequential versions
- **Memory Management**: Be careful with GPU memory allocation and deallocation
58 changes: 2 additions & 56 deletions installation.md
@@ -83,60 +83,6 @@ Install all packages in your virtual environment:

## GPU Setup (Required for Tasks 3.3 and 3.4)

Tasks 3.3 and 3.4 require GPU support and won't run on GitHub CI.
Tasks 3.3 and 3.4 require GPU support. Use Google Colab for GPU access (sign up for the student version if needed).

### Option 1: Google Colab (Recommended)

Most students should use Google Colab as it provides free GPU access:

1. Upload your assignment files to Colab
2. Change runtime to GPU (Runtime → Change runtime type → GPU)
3. Install packages in Colab:
```python
!pip install -e ".[dev,extra]"
!python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```

### Option 2: Local GPU Setup (If you have NVIDIA GPU)

For students with NVIDIA GPUs and CUDA-compatible hardware:

1. **Install CUDA Toolkit**
```bash
# Visit: https://developer.nvidia.com/cuda-downloads
# Follow instructions for your OS
```

2. **Verify CUDA Installation**
```bash
>>> nvcc --version
>>> nvidia-smi
```

3. **Install GPU-compatible packages**
```bash
>>> pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
>>> pip install numba[cuda]
```

## Verification

Make sure everything is installed by running:

```bash
>>> python -c "import minitorch; print('Success!')"
```

Verify that the tensor functionality is available:

```bash
>>> python -c "from minitorch import tensor; print('Module 3 ready!')"
```

Check if CUDA support is available (for GPU tasks):

```bash
>>> python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```

You're ready to start Module 3!
Follow this [Google Colab link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing), save the file to your drive, select T4 GPU runtime, and follow the instructions in the notebook.
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -10,8 +10,8 @@ requires-python = ">=3.8"
dependencies = [
"colorama==0.4.6",
"hypothesis==6.138.2",
"numba==0.61.2",
"numpy>=1.24,<2.3",
"numba-cuda[cu12]>=0.4.0", ## cu12 is for CUDA 12.0 cu13 is for CUDA 13.0
"numpy<2.0",
"pytest==8.4.1",
"pytest-env==1.1.5",
"typing_extensions",
89 changes: 5 additions & 84 deletions testing.md
@@ -5,11 +5,9 @@
This project uses pytest for testing. Tests are organized by task:

```bash
# Run all tests for a specific task
# CPU Tasks (3.1 & 3.2) - Run locally
pytest -m task3_1 # CPU parallel operations
pytest -m task3_2 # CPU matrix multiplication
pytest -m task3_3 # GPU operations (requires CUDA)
pytest -m task3_4 # GPU matrix multiplication (requires CUDA)

# Run all tests
pytest
@@ -31,26 +29,12 @@ pytest tests/test_tensor_general.py::test_matrix_multiply
- GitHub Actions CI only runs tasks 3.1 and 3.2 (CPU only)
- Tasks 3.3 and 3.4 require local GPU or Google Colab

**Option 1: Google Colab Testing (Recommended):**
```python
# In Colab notebook
!pip install -e ".[dev,extra]"
!python -m pytest -m task3_3 -v
!python -m pytest -m task3_4 -v
!python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"
```
**GPU Tasks (3.3 & 3.4) - Google Colab (Recommended):**

**Option 2: Local GPU Testing (If you have NVIDIA GPU):**
Follow instructions on the [Google Colab link](https://colab.research.google.com/drive/1gyUFUrCXdlIBz9DYItH9YN3gQ2DvUMsI?usp=sharing) and run tests like this:
```bash
# Verify CUDA is available
python -c "import numba.cuda; print('CUDA available:', numba.cuda.is_available())"

# Test GPU tasks locally
pytest -m task3_3 # GPU operations
pytest -m task3_4 # GPU matrix multiplication

# Debug GPU issues
NUMBA_DISABLE_JIT=1 pytest -m task3_3 -v # Disable JIT for debugging
!cd $DIR; python3.11 -m pytest -m task3_3 -v
!cd $DIR; python3.11 -m pytest -m task3_4 -v
```

### Style and Code Quality Checks
@@ -67,18 +51,6 @@ ruff format . # Code formatting
pyright . # Type checking
```

### Task 3.5 - Performance Evaluation

**Training Scripts:**
```bash
# Run optimized training (CPU parallel)
python project/run_fast_tensor.py

# Compare with previous implementations
python project/run_tensor.py # Basic tensor implementation
python project/run_scalar.py # Scalar implementation
```

### Parallel Diagnostics (Tasks 3.1 & 3.2)

**Running Parallel Check:**
@@ -87,20 +59,6 @@ python project/run_scalar.py # Scalar implementation
python project/parallel_check.py
```

**Expected Output for Task 3.1:**
- **MAP**: Should show parallel loops for both fast path and general case with allocation hoisting for `np.zeros()` calls
- **ZIP**: Should show parallel loops for both fast path and general case with optimized memory allocations
- **REDUCE**: Should show main parallel loop with proper allocation hoisting

**Expected Output for Task 3.2:**
- **MATRIX MULTIPLY**: Should show nested parallel loops for batch and row dimensions with no allocation hoisting (since no index buffers are used)

**Key Success Indicators:**
- Parallel loops detected with `prange()`
- Memory allocations hoisted out of parallel regions
- Loop optimizations applied by Numba
- No unexpected function calls in critical paths

### Pre-commit Hooks (Automatic Style Checking)

The project uses pre-commit hooks that run automatically before each commit:
@@ -111,41 +69,4 @@ pre-commit install

# Now style checks run automatically on every commit
git commit -m "your message" # Will run style checks first
```

### Debugging Tools

**Numba Debugging:**
```bash
# Disable JIT compilation for debugging
NUMBA_DISABLE_JIT=1 pytest -m task3_1 -v

# Enable Numba debugging output
NUMBA_DEBUG=1 python project/run_fast_tensor.py
```

**CUDA Debugging:**
```bash
# Check CUDA device properties
python -c "import numba.cuda; print(numba.cuda.gpus)"

# Monitor GPU memory usage
nvidia-smi -l 1 # Update every second

# Debug CUDA kernel launches
NUMBA_CUDA_DEBUG=1 python -m pytest -m task3_3 -v
```

**Performance Profiling:**
```bash
# Time specific operations
python -c "
import time
import minitorch
backend = minitorch.TensorBackend(minitorch.FastOps)
# Time your operations here
"

# Profile memory usage
python -m memory_profiler project/run_fast_tensor.py
```
3 changes: 2 additions & 1 deletion tests/test_tensor_general.py
@@ -15,7 +15,8 @@

one_arg, two_arg, red_arg = MathTestVariable._comp_testing()


from numba import config
config.CUDA_ENABLE_PYNVJITLINK = 1
# The tests in this file only run the main mathematical functions.
# The difference is that they run with different tensor ops backends.
