Custom GPT from Scratch

This project implements a highly optimized GPT (Generative Pre-trained Transformer) model from scratch using PyTorch. The implementation includes state-of-the-art architectural improvements, a complete training pipeline, and flexible text generation capabilities.

🚀 Features

Core Architecture

  • Rotary Position Embeddings (RoPE) - Better length generalization than learned positional embeddings
  • Grouped Query Attention (GQA) - Memory-efficient attention with fewer KV heads
  • SwiGLU Activation - Superior performance compared to GELU
  • RMSNorm - Faster normalization than LayerNorm
  • Flash Attention Support - Optimized attention computation when available
  • Weight Tying - Shared embeddings between input and output layers

Training Optimizations

  • Mixed Precision Training (FP16/BF16) - Faster training with lower memory usage
  • Gradient Checkpointing - Trade compute for memory on large models
  • Fused AdamW - Optimized optimizer for CUDA devices
  • Learning Rate Scheduling - Warmup and cosine decay (sketched after this list)
  • Gradient Accumulation - Simulate larger batch sizes
  • torch.compile() Support - JIT compilation for PyTorch 2.0+
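
For reference, a minimal sketch of the warmup-plus-cosine schedule; the function name and constants are illustrative, not the exact values hard-coded in train.py:

import math

def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=200, decay_iters=5000):
    """Linear warmup followed by cosine decay (illustrative constants)."""
    if it < warmup_iters:                       # 1) linear warmup
        return max_lr * (it + 1) / warmup_iters
    if it > decay_iters:                        # 2) floor after decay
        return min_lr
    # 3) cosine decay between warmup_iters and decay_iters
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (max_lr - min_lr)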

Generation Features

  • Temperature Sampling - Control randomness in generation
  • Top-k Filtering - Sample from top k most likely tokens
  • Top-p (Nucleus) Sampling - Dynamic vocabulary truncation
  • KV-Cache - Efficient autoregressive generation
  • Interactive Mode - Chat-like interface for generation

Model Presets

  • Tiny (~10M params) - Fast testing and prototyping
  • Small (~25M params) - Runnable on CPU/Mac
  • Medium (~80M params) - GPU recommended
  • Large (~350M params) - Requires substantial GPU memory

🛠️ Installation

Python Requirements

Recommended: Python 3.11
Minimum: Python 3.8

Check your version: python3 --version

# If you have multiple Python versions, use the specific one:
python3.11 --version           # Check if 3.11 is available
python3.11 -m venv venv        # Create venv with 3.11
python3.11 gpt.py info         # Use 3.11 for all commands

# Or check which version to use:
python check_python_version.py

Automated Setup (Easiest!)

# One command setup (Mac/Linux)
./setup.sh

# Or on Windows
setup.bat

The setup script automatically:

  • Checks/installs Python (if needed)
  • Verifies Python version (3.11 recommended)
  • Creates virtual environment
  • Installs dependencies
  • Detects hardware
  • Prepares dataset
  • Gets you ready to train!

Manual Installation

  1. Clone the repository:

    git clone https://github.com/emadnahed/custom-gpt-from-scratch.git
    cd custom-gpt-from-scratch

  2. Create and activate a virtual environment (recommended):

    python3.11 -m venv venv
    source venv/bin/activate  # On Windows use: venv\Scripts\activate

  3. Install the required dependencies:

    pip install -r requirements.txt

🖥️ Hardware Auto-Detection

This project includes comprehensive hardware detection for seamless training across different platforms:

Supported Hardware

  • NVIDIA CUDA - NVIDIA GPUs with CUDA support
  • AMD ROCm - AMD GPUs on Linux
  • Apple Metal (MPS) - Apple Silicon (M1/M2/M3)
  • Intel XPU - Intel GPUs with Intel Extension for PyTorch
  • CPU - Universal fallback

Quick Hardware Check

Check what hardware is available on your system:

# Show all detected hardware
python check_hardware.py

# Interactive hardware selection
python check_hardware.py --interactive

# Show only recommended device
python check_hardware.py --recommended

# JSON output for scripting
python check_hardware.py --json

Using with Training

The training script automatically detects and uses the best available hardware:

# Auto-detect hardware (recommended)
python train.py

# Interactively choose hardware
python train.py --interactive

# Show hardware options without training
python train.py --show-hardware

Hardware Detection Features

  • Automatic detection of best available hardware
  • Display of available and unavailable hardware (color-coded)
  • Device capabilities (memory, compute capability, precision support)
  • Optimal precision selection (bfloat16, float16, float32)
  • Platform-specific optimizations

For detailed information, see:

  • START_HERE.md - Ultra quick start guide
  • GETTING_STARTED.md - Comprehensive beginner's guide
  • HARDWARE_FEATURE_SUMMARY.md - Hardware detection features

🎮 Command Center (Like npm scripts!)

This project includes an intuitive command center (gpt.py) - think of it as your "package.json scripts" for GPT training!

Quick Commands

# Most used commands (like npm run)
python gpt.py train          # Interactive training setup
python gpt.py generate       # Generate text from trained model
python gpt.py info           # Check your setup status
python gpt.py hardware       # View available hardware

# Management commands
python gpt.py config         # Create custom configurations
python gpt.py dataset        # Manage datasets (add/prepare/switch)

Interactive Training Workflow

python gpt.py train

# You'll be asked:
# 1. Which hardware? (auto-detected!)
# 2. Which dataset? (Shakespeare, or your own)
# 3. Model size? (tiny/small/medium/large or custom)
# 4. Number of layers? (4, 8, 12, 24, or custom)
# 5. How long? (quick/short/medium/long)
# 6. Start now? (yes!)

Custom Model Architecture

Easily customize the number of layers and other parameters:

python gpt.py config

# When prompted:
# - Choose custom architecture
# - Set n_layer (number of transformer layers):
#   * 4 layers: Fast, good for testing
#   * 8 layers: Balanced
#   * 12 layers: Good quality (recommended)
#   * 24 layers: Best quality (needs good hardware)
# - Adjust other parameters (heads, embedding size, etc.)

Dataset Management

python gpt.py dataset

# Options:
# 1. List available datasets
# 2. Prepare Shakespeare (default)
# 3. Add your own text file
# 4. View dataset info

Traditional Commands (Still Supported)

# Traditional training
python train.py --config config/my_config.py

# Traditional generation
python generate_demo.py

# Hardware check
python check_hardware.py

See QUICK_REFERENCE.md for complete command documentation.

🏗️ Project Structure

custom-gpt-from-scratch/
│
├── gpt_from_scratch/        # Main Python package
│   ├── __init__.py          # Package initialization
│   ├── cli.py               # Command-line interface
│   ├── model/               # Model architecture
│   │   ├── __init__.py
│   │   └── transformer.py   # GPT implementation
│   ├── utils/               # Utility modules
│   │   ├── __init__.py
│   │   ├── hardware_detector.py  # Hardware detection
│   │   └── python_utils.py  # Python utilities
│   └── data/                # Data processing
│       ├── __init__.py
│       └── utils.py
│
├── config/                  # Training configurations
│   ├── train_default.py     # Default training config
│   ├── train_demo.py        # Demo configuration
│   └── train_*.py           # Custom configurations
│
├── data/                    # Data directory
│   └── prepare.py           # Data preparation script
│
├── out/                     # Training outputs (created during training)
│   └── ckpt.pt             # Saved model checkpoints
│
├── .claude/                 # IDE/editor configuration
│   └── settings.local.json
│
├── check_hardware.py        # Hardware detection script
├── check_python_version.py  # Python version checker
├── config_builder.py        # Interactive config builder
├── dataset_manager.py       # Dataset management
├── generate_demo.py         # Text generation demo
├── generate_interactive.py  # Interactive generation
├── gpt.py                   # Main entry point
├── requirements.txt         # Python dependencies
├── setup.py                 # Package installation
├── setup.sh                 # Setup script (Linux/macOS)
├── setup.bat                # Setup script (Windows)
├── test_system.py           # System test
└── train.py                 # Training script

🚦 Getting Started

1. Prepare Your Data

First, prepare a dataset for training. The easiest way to start is with the Shakespeare dataset:

python data/prepare.py

This downloads and prepares the Tiny Shakespeare dataset (~1MB) for quick experimentation.

For custom datasets, modify data/prepare.py to load your text data (see the sketch after this list). The script supports:

  • Character-level tokenization (built-in)
  • Hugging Face datasets integration
  • Custom text files
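
As a reference point, here is a minimal sketch of character-level preparation for a custom file; the paths, dtype, and 90/10 split are assumptions for illustration, not necessarily what data/prepare.py does:

import numpy as np

# Hypothetical input path; point this at your own text file.
text = open('data/my_corpus.txt', 'r', encoding='utf-8').read()

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # string -> int
itos = {i: ch for ch, i in stoi.items()}       # int -> string

ids = np.array([stoi[c] for c in text], dtype=np.uint16)
n = int(0.9 * len(ids))                        # assumed 90/10 train/val split
ids[:n].tofile('data/train.bin')
ids[n:].tofile('data/val.bin')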

2. Train the Model

Train with the default configuration:

python train.py

The default configuration trains a small model (~25M parameters) on the Shakespeare dataset. Training should take 5-15 minutes on a modern GPU or 30-60 minutes on CPU.

Training Output

During training, you'll see:

  • Loss metrics (train/val) every N iterations
  • Training speed (tokens/sec)
  • Model FLOPS Utilization (MFU) - efficiency metric
  • Checkpoints saved to out/ckpt.pt

Custom Configuration

Modify config/train_default.py to customize:

  • Model size (tiny/small/medium/large presets)
  • Training hyperparameters (learning rate, batch size, etc.)
  • Hardware settings (device, mixed precision)
  • Optimization features (gradient checkpointing, compilation)

Example configurations:

# Fast training on CPU
model_preset = 'tiny'
batch_size = 4
max_iters = 1000
device = 'cpu'

# GPU training with larger model
model_preset = 'medium'
batch_size = 32
max_iters = 10000
device = 'cuda'
dtype = 'bfloat16'
compile_model = True  # PyTorch 2.0+ for speedup

3. Generate Text

After training, generate text with your model:

# Basic generation
python sample.py --prompt "To be or not to be" --max_tokens 100

# Control creativity
python sample.py --prompt "Once upon a time" --temperature 0.8 --top_k 50

# Interactive mode
python sample.py --interactive

Generation Parameters

  • --prompt: Starting text (empty for random start)
  • --max_tokens: Number of tokens to generate (default: 100)
  • --temperature: Sampling temperature - higher = more random (default: 0.8)
    • 0.1-0.5: Conservative, coherent
    • 0.6-0.9: Balanced creativity
    • 1.0+: Very creative, potentially incoherent
  • --top_k: Top-k filtering - only sample from top k tokens (default: 200)
  • --top_p: Nucleus sampling - cumulative probability threshold (default: 0.9)
  • --seed: Random seed for reproducibility
  • --interactive: Launch interactive generation mode
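
To make these controls concrete, here is a minimal single-step sketch, assuming 1-D logits over the vocabulary; it is illustrative rather than the exact code behind the generation scripts:

import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=200, top_p=0.9):
    """One decoding step: temperature, then top-k, then nucleus filtering."""
    logits = logits / max(temperature, 1e-8)           # temperature scaling
    if top_k is not None:                              # keep only the k best tokens
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits[logits < kth] = float('-inf')
    if top_p is not None:                              # smallest set with mass >= top_p
        sorted_logits, idx = torch.sort(logits, descending=True)
        cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cut = cum > top_p
        cut[..., 1:] = cut[..., :-1].clone()           # always keep the top token
        cut[..., 0] = False
        logits[idx[cut]] = float('-inf')
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)     # sample one token id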

📊 Dependencies

  • Python 3.8+ (3.11 recommended)
  • PyTorch 2.2.2+
  • torchvision
  • torchaudio
  • NumPy
  • tqdm
  • Hugging Face Datasets (for data loading)

All dependencies are listed in requirements.txt.

🏛️ Architecture Deep Dive

This section provides a detailed breakdown of the model architecture at different levels of abstraction, from the simplest overview to the complete technical implementation.

Level 1: Basic Flow (Simplest View)

Input → Tokenization → Token Embeddings → Transformer Blocks (×N layers) → Final Norm → Linear Head → Logits → Loss (if training)

Level 2: Components Breakdown

Input Tokens → Token Embeddings → Dropout → 
[Grouped Query Attention → Residual Add → Feedforward (SwiGLU/MLP) → Residual Add] (×N layers) → 
Final Norm → Linear Head → Logits → Softmax (during generation)

Level 3: With Residual Connections

Input Tokens → Token Embeddings → Dropout → 
[x + GQA(x) → (x + GQA(x)) + MLP(x + GQA(x))] (×N layers) → 
Final Norm → Linear Head → Logits

Level 4: With Normalization (Complete Flow - Pre-LN Architecture)

Input Tokens → Token Embeddings → Dropout → 
[RMSNorm → GQA → Add & Residual → RMSNorm → SwiGLU → Add & Residual] (×N layers) → 
Final RMSNorm → Linear Head (weight-tied) → Logits → Cross-Entropy Loss

Level 5: Advanced - Full Technical Flow

Input Tokens (B, T) 
  ↓
Token Embeddings: wte(idx) → (B, T, n_embd)
  ↓
Dropout(p=0.1)
  ↓
┌─────────────────────────────────────────────────────────────────────┐
│ Transformer Block (×n_layer) - Pre-Norm Architecture:              │
│                                                                     │
│  x_norm = RMSNorm(x)  [or LayerNorm if configured]                │
│    ↓                                                                │
│  Grouped Query Attention (GQA):                                    │
│    • Q = Linear(x_norm) → (B, n_head, T, head_dim)                │
│    • K = Linear(x_norm) → (B, n_kv_head, T, head_dim)             │
│    • V = Linear(x_norm) → (B, n_kv_head, T, head_dim)             │
│    • Q, K = RoPE(Q, K)  [Rotary Position Embeddings]              │
│    • K, V = repeat_interleave(K, V, n_rep)  [if GQA]              │
│    • attn_out = scaled_dot_product_attention(Q, K, V, causal=True)│
│    • attn_out = Linear_o(attn_out) + Dropout                      │
│  x = x + attn_out  [Residual Connection 1]                        │
│    ↓                                                                │
│  x_norm2 = RMSNorm(x)                                              │
│    ↓                                                                │
│  SwiGLU Feedforward:                                               │
│    • gate = SiLU(W1(x_norm2))                                      │
│    • hidden = W3(x_norm2)                                          │
│    • mlp_out = W2(gate * hidden) + Dropout                        │
│  x = x + mlp_out  [Residual Connection 2]                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
  ↓
Final Normalization: RMSNorm(x) → (B, T, n_embd)
  ↓
Language Model Head: Linear(x) → (B, T, vocab_size) [Weight-Tied with wte]
  ↓
Output Logits (Training) OR Logits[:, -1, :] (Inference)
  ↓
Loss Calculation (if targets provided):
  • Flatten: logits → (B*T, vocab_size), targets → (B*T)
  • Cross-Entropy Loss with ignore_index=-1

Key Architecture Features

1. Pre-Norm Architecture

  • Normalization before each sub-layer (attention and feedforward)
  • More stable training than Post-Norm
  • Formula: x = x + SubLayer(Norm(x))

2. RoPE (Rotary Position Embeddings)

  • Applied inside attention mechanism (not as separate layer)
  • No learned positional parameters
  • Better length generalization
  • Applied to Q and K before attention computation
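
A minimal sketch of the rotation, assuming adjacent channel pairs are rotated with a base of 10000; the real implementation lives in gpt_from_scratch/model/transformer.py and may differ in detail:

import torch

def rope(x, base=10000.0):
    """Rotate channel pairs by position-dependent angles.
    x: (B, n_head, T, head_dim) with even head_dim."""
    *_, T, D = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2, dtype=torch.float32) / D))
    angles = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]            # split even/odd channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2-D rotation per channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out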

3. Grouped Query Attention (GQA)

  • Query heads: n_head (e.g., 6)
  • Key/Value heads: n_kv_head (e.g., 3)
  • Repetition factor: n_rep = n_head / n_kv_head
  • Memory efficient: reduces KV cache size by 2-4x
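
The head-count bookkeeping reduces to one repeat before the attention call; a minimal sketch, with shapes as noted in the comments:

import torch.nn.functional as F

def gqa_attention(q, k, v, n_head, n_kv_head):
    """q: (B, n_head, T, head_dim); k, v: (B, n_kv_head, T, head_dim)."""
    n_rep = n_head // n_kv_head                # e.g. 6 // 3 = 2
    k = k.repeat_interleave(n_rep, dim=1)      # share each KV head across n_rep query heads
    v = v.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)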

4. SwiGLU Activation

  • Three projection matrices: W1, W2, W3
  • Formula: W2(SiLU(W1(x)) ⊙ W3(x))
  • Better performance than standard GELU
  • Used in PaLM and LLaMA models
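
A minimal module implementing the formula above; the names W1, W2, W3 mirror the bullets, and the hidden size is left as a parameter:

import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """W2(SiLU(W1(x)) * W3(x)) - a sketch of the formula above."""
    def __init__(self, n_embd, hidden_dim, bias=False):
        super().__init__()
        self.w1 = nn.Linear(n_embd, hidden_dim, bias=bias)  # gate projection
        self.w3 = nn.Linear(n_embd, hidden_dim, bias=bias)  # value projection
        self.w2 = nn.Linear(hidden_dim, n_embd, bias=bias)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))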

5. RMSNorm

  • Faster than LayerNorm (no mean centering)
  • Formula: x * rsqrt(mean(x²) + ε) * γ
  • Single learnable parameter: weight (γ)
  • ~10-15% faster than LayerNorm
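
A minimal module matching the formula above (the eps value is an assumption):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """x * rsqrt(mean(x^2) + eps) * gamma - no mean centering, no bias."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # the single gain gamma

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight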

6. Weight Tying

  • Token embeddings and output head share weights
  • Reduces parameters by ~vocab_size * n_embd
  • Formula: lm_head.weight = wte.weight
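
In code, the tie is a single assignment; a minimal sketch with the GPTConfig default sizes, where the attribute names follow the formula above rather than the exact ones in transformer.py:

import torch.nn as nn

vocab_size, n_embd = 8192, 384                       # GPTConfig defaults
wte = nn.Embedding(vocab_size, n_embd)               # input token embeddings
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # output head
lm_head.weight = wte.weight                          # one shared (vocab_size, n_embd) matrix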

7. Optimization Features

  • Flash Attention support (when available)
  • Gradient Checkpointing option
  • Mixed Precision compatible (FP16/BF16)
  • KV-Cache for efficient generation

Data Flow Dimensions

Input: (Batch, Sequence) → (B, T)
  ↓
Embeddings: (B, T, n_embd)
  ↓
Attention Reshaping:
  Q: (B, T, n_embd) → (B, n_head, T, head_dim)
  K: (B, T, n_kv_head*head_dim) → (B, n_kv_head, T, head_dim)
  V: (B, T, n_kv_head*head_dim) → (B, n_kv_head, T, head_dim)
  ↓
After Attention: (B, n_head, T, head_dim) → (B, T, n_embd)
  ↓
MLP: (B, T, n_embd) → (B, T, hidden_dim) → (B, T, n_embd)
  ↓
Final: (B, T, n_embd) → (B, T, vocab_size)

Where:

  • B = Batch size
  • T = Sequence length (context window)
  • n_embd = Embedding dimension
  • n_head = Number of query attention heads
  • n_kv_head = Number of key/value attention heads (for GQA)
  • head_dim = n_embd / n_head
  • hidden_dim = n_embd * mlp_ratio (typically 4.0)

Note: The core architecture is identical during training and inference, but the execution flow and some behaviors differ between the two modes.

🏗️ Architecture Details

What Makes This Implementation Efficient?

  1. RoPE (Rotary Position Embeddings)

    • Better extrapolation to longer sequences than learned embeddings
    • Relative position encoding with rotation matrices
    • No learned parameters for positions
  2. Grouped Query Attention (GQA)

    • Reduces KV cache memory by 2-4x
    • Fewer key/value heads than query heads
    • Near-identical performance to full Multi-Head Attention
  3. SwiGLU Activation

    • Combination of Swish activation and gating mechanism
    • Empirically better than GELU for language modeling
    • Used in PaLM, LLaMA models
  4. RMSNorm

    • Simpler than LayerNorm (no mean centering)
    • ~10-15% faster
    • Same performance as LayerNorm
  5. Flash Attention

    • Automatically used if available (PyTorch 2.0+)
    • 2-4x speedup on attention computation
    • Reduced memory usage

Model Configuration

The model is highly configurable through GPTConfig:

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256        # Context length
    vocab_size: int = 8192       # Vocabulary size
    n_layer: int = 6             # Number of transformer layers
    n_head: int = 6              # Number of attention heads
    n_kv_head: int = 3           # Number of KV heads (GQA)
    n_embd: int = 384            # Embedding dimension
    mlp_ratio: float = 4.0       # MLP expansion ratio
    dropout: float = 0.1         # Dropout probability
    bias: bool = False           # Use bias in linear layers
    use_rms_norm: bool = True    # Use RMSNorm vs LayerNorm
    use_swiglu: bool = True      # Use SwiGLU vs GELU MLP
    gradient_checkpointing: bool = False  # Memory optimization

💡 Performance Tips

Training Faster

  1. Use mixed precision: Set dtype = 'bfloat16' or 'float16' in config (see the sketch after this list)
  2. Enable compilation: Set compile_model = True (PyTorch 2.0+ with CUDA)
  3. Increase batch size: Max out your GPU memory with larger batches
  4. Use gradient accumulation: Simulate larger batches without memory increase
  5. Optimize data loading: Use memory-mapped files for large datasets
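
Tip 1 in the inner loop, as a minimal sketch; model, optimizer, and the batches iterable are assumed to exist, and this is not the exact loop in train.py:

import torch

# bfloat16 autocast needs no GradScaler; float16 would also wrap backward()
# and step() with torch.cuda.amp.GradScaler.
for x, y in batches:                         # hypothetical (input, target) iterable
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits, loss = model(x, y)           # forward ops run in bf16 where safe
    loss.backward()                          # parameters and grads stay float32
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)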

Training on Limited Hardware

  1. Use smaller model: Start with 'tiny' or 'small' presets
  2. Enable gradient checkpointing: Trades compute for memory
  3. Reduce batch size: Compensate with gradient accumulation (sketched after this list)
  4. Reduce context length: Shorter sequences use less memory
  5. Use CPU: Training is slower but works for small models
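
Tip 3 pairs naturally with gradient accumulation; a minimal sketch, where model, optimizer, and batches are assumed to exist:

accum_steps = 4                                   # effective batch = batch_size * 4

optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(batches):           # hypothetical (input, target) iterable
    logits, loss = model(x, y)
    (loss / accum_steps).backward()               # scale so grads average over micro-batches
    if (step + 1) % accum_steps == 0:             # update once per accum_steps micro-batches
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)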

Example: Training on M1/M2 Mac

# config/train_default.py
model_preset = 'small'
batch_size = 8
max_iters = 5000
device = 'mps'  # Metal Performance Shaders
dtype = 'float32'  # MPS doesn't support bfloat16 yet
compile_model = False  # Not supported on MPS

Example: Large Model on GPU

# config/train_default.py
model_preset = 'large'
batch_size = 16
gradient_accumulation_steps = 4  # Effective batch size: 64
max_iters = 50000
device = 'cuda'
dtype = 'bfloat16'
compile_model = True
gradient_checkpointing = True  # If OOM

🔬 Advanced Usage

Custom Model Architecture

Instead of using presets, you can define custom architectures:

from gpt_from_scratch.model import GPT, GPTConfig

config = GPTConfig(
    block_size=512,
    vocab_size=50257,  # GPT-2 vocab size
    n_layer=12,
    n_head=12,
    n_kv_head=4,       # GQA with 4 KV heads
    n_embd=768,
    dropout=0.1,
)

model = GPT(config)

Using the Model in Your Code

import torch
from gpt_from_scratch.model import GPT, create_model

# Create a model
model = create_model('small')

# Forward pass
input_ids = torch.randint(0, 8192, (1, 128))
logits, loss = model(input_ids)

# Generation
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50
)

Monitoring Training

The training script outputs Model FLOPS Utilization (MFU), which estimates how efficiently you're using your hardware:

  • MFU < 10%: Bottleneck in data loading or system
  • MFU 10-30%: Normal for small models or CPU training
  • MFU 30-50%: Good GPU utilization
  • MFU 50%+: Excellent (difficult to achieve)
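
As a rough cross-check, MFU can be approximated with the common ~6 FLOPs per parameter per token rule; a back-of-the-envelope sketch, with your accelerator's peak throughput substituted for peak_flops:

def estimate_mfu(n_params, tokens_per_sec, peak_flops):
    """Approximate MFU: ~6 FLOPs per parameter per token for fwd+bwd."""
    flops_achieved = 6 * n_params * tokens_per_sec
    return flops_achieved / peak_flops

# e.g. a ~25M-param model at 100k tokens/sec on a GPU with 71 TFLOPs bf16 peak
print(f"MFU: {estimate_mfu(25e6, 100_000, 71e12):.1%}")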

Loading Checkpoints

import torch
from gpt_from_scratch.model import GPT

# Load checkpoint
checkpoint = torch.load('out/ckpt.pt', map_location='cpu')  # load to CPU, then move as needed
model_config = checkpoint['model_config']
model = GPT(model_config)
model.load_state_dict(checkpoint['model'])

# Access vocabulary
vocab = checkpoint['vocab']
stoi = vocab['stoi']  # string to int
itos = vocab['itos']  # int to string
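
Continuing from the snippet above, the stored vocabulary can decode generated ids; a small usage sketch, assuming a character-level vocab as produced by the Shakespeare preparation:

# Generate from the loaded checkpoint and decode with the stored vocab.
model.eval()
start = torch.tensor([[stoi[c] for c in "ROMEO:"]], dtype=torch.long)
out = model.generate(start, max_new_tokens=100, temperature=0.8, top_k=50)
print(''.join(itos[int(i)] for i in out[0]))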

🧪 Testing the Model

Test the model implementation:

python -m gpt_from_scratch.model.transformer

This runs a forward pass and generation test to verify everything works.

📈 Expected Results

With the default Shakespeare dataset:

  • Training loss: Should drop from ~4.0 to ~1.0-1.5 after 5000 iterations
  • Validation loss: Should be similar to training loss (little overfitting)
  • Generation quality: After 2000-3000 iterations, should generate recognizable Shakespeare-like text
  • Training time (small model on RTX 3090): ~10 minutes for 5000 iterations

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Areas for improvement:

  • BPE tokenizer integration
  • Multi-GPU training support
  • Weights & Biases logging
  • More efficient data loading
  • Additional sampling strategies

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This implementation incorporates techniques from:

  • Attention Is All You Need (Vaswani et al., 2017) - Original Transformer
  • RoFormer (Su et al., 2021) - Rotary Position Embeddings
  • PaLM (Chowdhery et al., 2022) - SwiGLU activation
  • LLaMA (Touvron et al., 2023) - RMSNorm, GQA architecture
  • GPT-2 (Radford et al., 2019) - Language model pretraining
  • Flash Attention (Dao et al., 2022) - Efficient attention
