This project implements a highly optimized GPT (Generative Pre-trained Transformer) model from scratch using PyTorch. The implementation includes state-of-the-art architectural improvements, a complete training pipeline, and flexible text generation capabilities.
- Rotary Position Embeddings (RoPE) - Better length generalization than learned positional embeddings
- Grouped Query Attention (GQA) - Memory-efficient attention with fewer KV heads
- SwiGLU Activation - Superior performance compared to GELU
- RMSNorm - Faster normalization than LayerNorm
- Flash Attention Support - Optimized attention computation when available
- Weight Tying - Shared embeddings between input and output layers
- Mixed Precision Training (FP16/BF16) - Faster training with lower memory usage
- Gradient Checkpointing - Trade compute for memory on large models
- Fused AdamW - Optimized optimizer for CUDA devices
- Learning Rate Scheduling - Warmup and cosine decay
- Gradient Accumulation - Simulate larger batch sizes
- torch.compile() Support - JIT compilation for PyTorch 2.0+
- Temperature Sampling - Control randomness in generation
- Top-k Filtering - Sample from top k most likely tokens
- Top-p (Nucleus) Sampling - Dynamic vocabulary truncation
- KV-Cache - Efficient autoregressive generation
- Interactive Mode - Chat-like interface for generation
- Tiny (~10M params) - Fast testing and prototyping
- Small (~25M params) - Runnable on CPU/Mac
- Medium (~80M params) - GPU recommended
- Large (~350M params) - Requires substantial GPU memory
Recommended: Python 3.11
Minimum: Python 3.8
Check your version: `python3 --version`
```bash
# If you have multiple Python versions, use the specific one:
python3.11 --version      # Check if 3.11 is available
python3.11 -m venv venv   # Create venv with 3.11
python3.11 gpt.py info    # Use 3.11 for all commands

# Or check which version to use:
python check_python_version.py
```

```bash
# One-command setup (Mac/Linux)
./setup.sh

# Or on Windows
setup.bat
```

The setup script automatically:
- Checks/installs Python (if needed)
- Verifies Python version (3.11 recommended)
- Creates virtual environment
- Installs dependencies
- Detects hardware
- Prepares dataset
- Gets you ready to train!
- Clone the repository:

  ```bash
  git clone https://github.com/emadnahed/custom-gpt-from-scratch.git
  cd custom-gpt-from-scratch
  ```

- Create and activate a virtual environment (recommended):

  ```bash
  python3.11 -m venv venv
  source venv/bin/activate   # On Windows use: venv\Scripts\activate
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
This project includes comprehensive hardware detection for seamless training across different platforms:
- NVIDIA CUDA - NVIDIA GPUs with CUDA support
- AMD ROCm - AMD GPUs on Linux
- Apple Metal (MPS) - Apple Silicon (M1/M2/M3)
- Intel XPU - Intel GPUs with Intel Extension for PyTorch
- CPU - Universal fallback
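As a mental model, the detection logic reduces to a few availability checks. Here is a minimal standalone sketch covering only CUDA, MPS, and CPU; the project's full logic lives in `gpt_from_scratch/utils/hardware_detector.py` and handles more backends:

```python
import torch

def pick_device() -> str:
    """Pick the best available PyTorch device, in rough order of speed (sketch)."""
    if torch.cuda.is_available():      # NVIDIA CUDA (also AMD ROCm builds of PyTorch)
        return "cuda"
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"                   # Apple Silicon (Metal)
    return "cpu"                       # Universal fallback

print(pick_device())
```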
Check what hardware is available on your system:
```bash
# Show all detected hardware
python check_hardware.py

# Interactive hardware selection
python check_hardware.py --interactive

# Show only recommended device
python check_hardware.py --recommended

# JSON output for scripting
python check_hardware.py --json
```

The training script automatically detects and uses the best available hardware:
```bash
# Auto-detect hardware (recommended)
python train.py

# Interactively choose hardware
python train.py --interactive

# Show hardware options without training
python train.py --show-hardware
```

- Automatic detection of best available hardware
- Display of available and unavailable hardware (color-coded)
- Device capabilities (memory, compute capability, precision support)
- Optimal precision selection (bfloat16, float16, float32)
- Platform-specific optimizations
For detailed information, see:
- `START_HERE.md` - Ultra quick start guide
- `GETTING_STARTED.md` - Comprehensive beginner's guide
- `HARDWARE_FEATURE_SUMMARY.md` - Hardware detection features
This project includes an intuitive command center (`gpt.py`) - think of it as your "package.json scripts" for GPT training!
```bash
# Most used commands (like npm run)
python gpt.py train      # Interactive training setup
python gpt.py generate   # Generate text from trained model
python gpt.py info       # Check your setup status
python gpt.py hardware   # View available hardware

# Management commands
python gpt.py config     # Create custom configurations
python gpt.py dataset    # Manage datasets (add/prepare/switch)
```

```bash
python gpt.py train

# You'll be asked:
# 1. Which hardware? (auto-detected!)
# 2. Which dataset? (Shakespeare, or your own)
# 3. Model size? (tiny/small/medium/large or custom)
# 4. Number of layers? (4, 8, 12, 24, or custom)
# 5. How long? (quick/short/medium/long)
# 6. Start now? (yes!)
```

Easily customize the number of layers and other parameters:
```bash
python gpt.py config

# When prompted:
# - Choose custom architecture
# - Set n_layer (number of transformer layers):
#   * 4 layers: Fast, good for testing
#   * 8 layers: Balanced
#   * 12 layers: Good quality (recommended)
#   * 24 layers: Best quality (needs good hardware)
# - Adjust other parameters (heads, embedding size, etc.)
```

```bash
python gpt.py dataset

# Options:
# 1. List available datasets
# 2. Prepare Shakespeare (default)
# 3. Add your own text file
# 4. View dataset info
```

```bash
# Traditional training
python train.py --config config/my_config.py

# Traditional generation
python generate_demo.py

# Hardware check
python check_hardware.py
```

See `QUICK_REFERENCE.md` for complete command documentation.
```
custom-gpt-from-scratch/
│
├── gpt_from_scratch/            # Main Python package
│   ├── __init__.py              # Package initialization
│   ├── cli.py                   # Command-line interface
│   ├── model/                   # Model architecture
│   │   ├── __init__.py
│   │   └── transformer.py       # GPT implementation
│   ├── utils/                   # Utility modules
│   │   ├── __init__.py
│   │   ├── hardware_detector.py # Hardware detection
│   │   └── python_utils.py      # Python utilities
│   └── data/                    # Data processing
│       ├── __init__.py
│       └── utils.py
│
├── config/                      # Training configurations
│   ├── train_default.py         # Default training config
│   ├── train_demo.py            # Demo configuration
│   └── train_*.py               # Custom configurations
│
├── data/                        # Data directory
│   └── prepare.py               # Data preparation script
│
├── out/                         # Training outputs (created during training)
│   └── ckpt.pt                  # Saved model checkpoints
│
├── .claude/                     # IDE/editor configuration
│   └── settings.local.json
│
├── check_hardware.py            # Hardware detection script
├── check_python_version.py      # Python version checker
├── config_builder.py            # Interactive config builder
├── dataset_manager.py           # Dataset management
├── generate_demo.py             # Text generation demo
├── generate_interactive.py      # Interactive generation
├── gpt.py                       # Main entry point
├── requirements.txt             # Python dependencies
├── setup.py                     # Package installation
├── setup.sh                     # Setup script (Linux/macOS)
├── setup.bat                    # Setup script (Windows)
├── test_system.py               # System test
└── train.py                     # Training script
```
First, prepare a dataset for training. The easiest way to start is with the Shakespeare dataset:
```bash
python data/prepare.py
```

This downloads and prepares the Tiny Shakespeare dataset (~1MB) for quick experimentation.
For custom datasets, modify `data/prepare.py` to load your text data. The script supports:
- Character-level tokenization (built-in)
- Hugging Face datasets integration
- Custom text files
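Character-level tokenization is simple enough to sketch in a few lines. This is a generic illustration of the approach (it assumes an `input.txt` in the working directory), not the exact code in `data/prepare.py`:

```python
# Build a character-level vocabulary from raw text (illustrative sketch).
text = open("input.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> string

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Round-trips as long as every character appears in the corpus:
assert decode(encode(text[:20])) == text[:20]
```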
Train with the default configuration:
```bash
python train.py
```

The default configuration trains a small model (~25M parameters) on the Shakespeare dataset. Training should take 5-15 minutes on a modern GPU or 30-60 minutes on CPU.
During training, you'll see:
- Loss metrics (train/val) every N iterations
- Training speed (tokens/sec)
- Model FLOPS Utilization (MFU) - efficiency metric
- Checkpoints saved to `out/ckpt.pt`
Modify `config/train_default.py` to customize:
- Model size (tiny/small/medium/large presets)
- Training hyperparameters (learning rate, batch size, etc.)
- Hardware settings (device, mixed precision)
- Optimization features (gradient checkpointing, compilation)
Example configurations:
```python
# Fast training on CPU
model_preset = 'tiny'
batch_size = 4
max_iters = 1000
device = 'cpu'
```

```python
# GPU training with larger model
model_preset = 'medium'
batch_size = 32
max_iters = 10000
device = 'cuda'
dtype = 'bfloat16'
compile_model = True  # PyTorch 2.0+ for speedup
```

After training, generate text with your model:
```bash
# Basic generation
python sample.py --prompt "To be or not to be" --max_tokens 100

# Control creativity
python sample.py --prompt "Once upon a time" --temperature 0.8 --top_k 50

# Interactive mode
python sample.py --interactive
```

- `--prompt`: Starting text (empty for random start)
- `--max_tokens`: Number of tokens to generate (default: 100)
- `--temperature`: Sampling temperature; higher = more random (default: 0.8)
  - 0.1-0.5: Conservative, coherent
  - 0.6-0.9: Balanced creativity
  - 1.0+: Very creative, potentially incoherent
- `--top_k`: Top-k filtering; only sample from the top k tokens (default: 200)
- `--top_p`: Nucleus sampling; cumulative probability threshold (default: 0.9)
- `--seed`: Random seed for reproducibility
- `--interactive`: Launch interactive generation mode
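To make temperature, top-k, and top-p concrete, here is a minimal sampling step over a single logits vector. This is a sketch of the standard technique, not necessarily the project's exact implementation:

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=200, top_p=0.9):
    """Sample one token id from a (vocab_size,) logits vector (sketch)."""
    logits = logits / temperature                       # temperature scaling
    if top_k is not None:                               # keep only the k best logits
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits[logits < kth] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    if top_p is not None:                               # nucleus: smallest set with mass >= top_p
        sorted_probs, idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cum - sorted_probs > top_p] = 0.0  # drop tokens outside the nucleus
        sorted_probs /= sorted_probs.sum()              # renormalize over the nucleus
        return idx[torch.multinomial(sorted_probs, 1)].item()
    return torch.multinomial(probs, 1).item()
```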
- Python 3.11+
- PyTorch 2.2.2+
- torchvision
- torchaudio
- NumPy
- tqdm
- Hugging Face Datasets (for data loading)
All dependencies are listed in requirements.txt.
This section provides a detailed breakdown of the model architecture at different levels of abstraction, from the simplest overview to the complete technical implementation.
```
Input → Tokenization → Token Embeddings → Transformer Blocks (×N layers) → Final Norm → Linear Head → Logits → Loss (if training)
```
```
Input Tokens → Token Embeddings → Dropout →
[Grouped Query Attention → Residual Add → Feedforward (SwiGLU/MLP) → Residual Add] (×N layers) →
Final Norm → Linear Head → Logits → Softmax (during generation)
```
```
Input Tokens → Token Embeddings → Dropout →
[x + GQA(x) → (x + GQA(x)) + MLP(x + GQA(x))] (×N layers) →
Final Norm → Linear Head → Logits
```
```
Input Tokens → Token Embeddings → Dropout →
[RMSNorm → GQA → Add & Residual → RMSNorm → SwiGLU → Add & Residual] (×N layers) →
Final RMSNorm → Linear Head (weight-tied) → Logits → Cross-Entropy Loss
```
```
Input Tokens (B, T)
↓
Token Embeddings: wte(idx) → (B, T, n_embd)
↓
Dropout(p=0.1)
↓
┌─────────────────────────────────────────────────────────────────────┐
│ Transformer Block (×n_layer) - Pre-Norm Architecture: │
│ │
│ x_norm = RMSNorm(x) [or LayerNorm if configured] │
│ ↓ │
│ Grouped Query Attention (GQA): │
│ • Q = Linear(x_norm) → (B, n_head, T, head_dim) │
│ • K = Linear(x_norm) → (B, n_kv_head, T, head_dim) │
│ • V = Linear(x_norm) → (B, n_kv_head, T, head_dim) │
│ • Q, K = RoPE(Q, K) [Rotary Position Embeddings] │
│ • K, V = repeat_interleave(K, V, n_rep) [if GQA] │
│ • attn_out = scaled_dot_product_attention(Q, K, V, causal=True)│
│ • attn_out = Linear_o(attn_out) + Dropout │
│ x = x + attn_out [Residual Connection 1] │
│ ↓ │
│ x_norm2 = RMSNorm(x) │
│ ↓ │
│ SwiGLU Feedforward: │
│ • gate = SiLU(W1(x_norm2)) │
│ • hidden = W3(x_norm2) │
│ • mlp_out = W2(gate * hidden) + Dropout │
│ x = x + mlp_out [Residual Connection 2] │
│ │
└─────────────────────────────────────────────────────────────────────┘
↓
Final Normalization: RMSNorm(x) → (B, T, n_embd)
↓
Language Model Head: Linear(x) → (B, T, vocab_size) [Weight-Tied with wte]
↓
Output Logits (Training) OR Logits[:, -1, :] (Inference)
↓
Loss Calculation (if targets provided):
• Flatten: logits → (B*T, vocab_size), targets → (B*T)
  • Cross-Entropy Loss with ignore_index=-1
```
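The loss step at the bottom of the diagram corresponds to a few lines of PyTorch. A sketch with dummy tensors:

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 2, 8, 8192
logits = torch.randn(B, T, vocab_size)           # model output
targets = torch.randint(0, vocab_size, (B, T))   # next-token ids; -1 marks ignored positions

loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # flatten to (B*T, vocab_size)
    targets.view(-1),             # flatten to (B*T,)
    ignore_index=-1,
)
```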
- Normalization before each sub-layer (attention and feedforward)
- More stable training than Post-Norm
- Formula: `x = x + SubLayer(Norm(x))`
- Applied inside attention mechanism (not as separate layer)
- No learned positional parameters
- Better length generalization
- Applied to Q and K before attention computation
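A minimal RoPE sketch using the "rotate-half" channel pairing common in LLaMA-style code; the project's exact pairing convention may differ:

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate channel pairs of x (B, n_head, T, head_dim) by position-dependent angles (sketch)."""
    B, H, T, D = x.shape
    half = D // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]  # split channels into rotation pairs
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1)

# Applied to Q and K (not V) before attention scores are computed:
q = apply_rope(torch.randn(1, 6, 16, 64))
```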
- Query heads: `n_head` (e.g., 6)
- Key/Value heads: `n_kv_head` (e.g., 3)
- Repetition factor: `n_rep = n_head / n_kv_head`
- Memory efficient: reduces KV cache size by 2-4x
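A minimal GQA sketch showing how K/V heads are expanded with `repeat_interleave` before standard scaled-dot-product attention (shapes are illustrative):

```python
import torch

B, T, head_dim = 2, 16, 64
n_head, n_kv_head = 6, 3
n_rep = n_head // n_kv_head                # each KV head serves n_rep query heads

q = torch.randn(B, n_head, T, head_dim)
k = torch.randn(B, n_kv_head, T, head_dim)
v = torch.randn(B, n_kv_head, T, head_dim)

k = k.repeat_interleave(n_rep, dim=1)      # (B, n_head, T, head_dim)
v = v.repeat_interleave(n_rep, dim=1)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```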
- Three projection matrices: W1, W2, W3
- Formula: `W2(SiLU(W1(x)) ⊙ W3(x))`
- Better performance than standard GELU
- Used in PaLM and LLaMA models
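A minimal SwiGLU feedforward module matching the formula above (a sketch; `hidden_dim` would typically be derived from `n_embd * mlp_ratio`):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feedforward: W2(SiLU(W1(x)) * W3(x))."""
    def __init__(self, n_embd, hidden_dim, bias=False):
        super().__init__()
        self.w1 = nn.Linear(n_embd, hidden_dim, bias=bias)  # gate projection
        self.w3 = nn.Linear(n_embd, hidden_dim, bias=bias)  # value projection
        self.w2 = nn.Linear(hidden_dim, n_embd, bias=bias)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))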
- Faster than LayerNorm (no mean centering)
- Formula: `x * rsqrt(mean(x²) + ε) * γ`
- Single learnable parameter: `weight` (γ)
- ~10-15% faster than LayerNorm
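A minimal RMSNorm module matching the formula above (sketch):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: x * rsqrt(mean(x^2) + eps) * gamma."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # the single learnable gain (γ)

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```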
- Token embeddings and output head share weights
- Reduces parameters by ~`vocab_size * n_embd`
- Formula: `lm_head.weight = wte.weight`
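In PyTorch, weight tying is a one-line assignment (sketch):

```python
import torch.nn as nn

vocab_size, n_embd = 8192, 384
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
lm_head.weight = wte.weight  # one shared (vocab_size, n_embd) parameter
```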
- Flash Attention support (when available)
- Gradient Checkpointing option
- Mixed Precision compatible (FP16/BF16)
- KV-Cache for efficient generation
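To illustrate the KV-cache idea: each generation step computes K/V only for the newest token and appends them to a cache, so per-step attention cost stays linear in the prefix length instead of quadratic. A toy sketch with random tensors (names are illustrative, not the project's API):

```python
import torch

B, n_head, head_dim = 1, 6, 64
cache_k = torch.empty(B, n_head, 0, head_dim)  # grows along the time axis
cache_v = torch.empty(B, n_head, 0, head_dim)

for step in range(4):
    q_new = torch.randn(B, n_head, 1, head_dim)  # query for the newest token only
    k_new = torch.randn(B, n_head, 1, head_dim)  # in a real model: projections of x_new
    v_new = torch.randn(B, n_head, 1, head_dim)
    cache_k = torch.cat([cache_k, k_new], dim=2)
    cache_v = torch.cat([cache_v, v_new], dim=2)
    # No causal mask needed: the single query may attend to every cached position.
    out = torch.nn.functional.scaled_dot_product_attention(q_new, cache_k, cache_v)
```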
```
Input: (Batch, Sequence) → (B, T)
↓
Embeddings: (B, T, n_embd)
↓
Attention Reshaping:
Q: (B, T, n_embd) → (B, n_head, T, head_dim)
K: (B, T, n_kv_head*head_dim) → (B, n_kv_head, T, head_dim)
V: (B, T, n_kv_head*head_dim) → (B, n_kv_head, T, head_dim)
↓
After Attention: (B, n_head, T, head_dim) → (B, T, n_embd)
↓
MLP: (B, T, n_embd) → (B, T, hidden_dim) → (B, T, n_embd)
↓
Final: (B, T, n_embd) → (B, T, vocab_size)
```
Where:
- `B` = Batch size
- `T` = Sequence length (context window)
- `n_embd` = Embedding dimension
- `n_head` = Number of query attention heads
- `n_kv_head` = Number of key/value attention heads (for GQA)
- `head_dim` = `n_embd / n_head`
- `hidden_dim` = `n_embd * mlp_ratio` (typically 4.0)
Note: The core architecture is identical in training and in generation/inference, but the execution flow and some behaviors differ by mode.
- RoPE (Rotary Position Embeddings)
  - Better extrapolation to longer sequences than learned embeddings
  - Relative position encoding with rotation matrices
  - No learned parameters for positions
- Grouped Query Attention (GQA)
  - Reduces KV cache memory by 2-4x
  - Fewer key/value heads than query heads
  - Near-identical performance to full Multi-Head Attention
- SwiGLU Activation
  - Combination of Swish activation and gating mechanism
  - Empirically better than GELU for language modeling
  - Used in PaLM, LLaMA models
- RMSNorm
  - Simpler than LayerNorm (no mean centering)
  - ~10-15% faster
  - Same performance as LayerNorm
- Flash Attention
  - Automatically used if available (PyTorch 2.0+)
  - 2-4x speedup on attention computation
  - Reduced memory usage
The model is highly configurable through `GPTConfig`:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256        # Context length
    vocab_size: int = 8192       # Vocabulary size
    n_layer: int = 6             # Number of transformer layers
    n_head: int = 6              # Number of attention heads
    n_kv_head: int = 3           # Number of KV heads (GQA)
    n_embd: int = 384            # Embedding dimension
    mlp_ratio: float = 4.0       # MLP expansion ratio
    dropout: float = 0.1         # Dropout probability
    bias: bool = False           # Use bias in linear layers
    use_rms_norm: bool = True    # Use RMSNorm vs LayerNorm
    use_swiglu: bool = True      # Use SwiGLU vs GELU MLP
    gradient_checkpointing: bool = False  # Memory optimization
```

- Use mixed precision: Set `dtype = 'bfloat16'` or `'float16'` in config
- Enable compilation: Set `compile_model = True` (PyTorch 2.0+ with CUDA)
- Increase batch size: Max out your GPU memory with larger batches
- Use gradient accumulation: Simulate larger batches without memory increase (see the sketch after this list)
- Optimize data loading: Use memory-mapped files for large datasets
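A minimal gradient-accumulation loop, assuming a `model` that returns `(logits, loss)`, an `optimizer`, and a `get_batch` helper (names are illustrative, not the project's exact API):

```python
# Average gradients over N micro-batches before taking one optimizer step.
accum_steps = 4                      # effective batch = batch_size * accum_steps
optimizer.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    x, y = get_batch('train')        # assumed data-loading helper
    logits, loss = model(x, y)
    (loss / accum_steps).backward()  # scale so accumulated grads average, not sum
optimizer.step()
```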
- Use smaller model: Start with 'tiny' or 'small' presets
- Enable gradient checkpointing: Trades compute for memory
- Reduce batch size: Compensate with gradient accumulation
- Reduce context length: Shorter sequences use less memory
- Use CPU: Training is slower but works for small models
```python
# config/train_default.py (Apple Silicon)
model_preset = 'small'
batch_size = 8
max_iters = 5000
device = 'mps'         # Metal Performance Shaders
dtype = 'float32'      # MPS doesn't support bfloat16 yet
compile_model = False  # Not supported on MPS
```

```python
# config/train_default.py (NVIDIA GPU)
model_preset = 'large'
batch_size = 16
gradient_accumulation_steps = 4  # Effective batch size: 64
max_iters = 50000
device = 'cuda'
dtype = 'bfloat16'
compile_model = True
gradient_checkpointing = True    # If OOM
```

Instead of using presets, you can define custom architectures:
```python
from gpt_from_scratch.model import GPT, GPTConfig

config = GPTConfig(
    block_size=512,
    vocab_size=50257,  # GPT-2 vocab size
    n_layer=12,
    n_head=12,
    n_kv_head=4,       # GQA with 4 KV heads
    n_embd=768,
    dropout=0.1,
)
model = GPT(config)
```

```python
import torch
from gpt_from_scratch.model import GPT, create_model

# Create a model
model = create_model('small')

# Forward pass
input_ids = torch.randint(0, 8192, (1, 128))
logits, loss = model(input_ids)

# Generation
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
)
```

The training script outputs Model FLOPS Utilization (MFU), which estimates how efficiently you're using your hardware:
- MFU < 10%: Bottleneck in data loading or system
- MFU 10-30%: Normal for small models or CPU training
- MFU 30-50%: Good GPU utilization
- MFU 50%+: Excellent (difficult to achieve)
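As a rough sketch of how such a number can be estimated (using the common ~6 × n_params FLOPs-per-token approximation, which ignores the attention term and is therefore only approximate):

```python
# MFU sketch: achieved training FLOPs/sec divided by the hardware's peak FLOPs/sec.
def estimate_mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    achieved = 6.0 * n_params * tokens_per_sec  # ~6N FLOPs per trained token
    return achieved / peak_flops

# Hypothetical example: 25M params at 100k tokens/sec on a GPU with 71 TFLOPs peak
print(f"MFU ≈ {estimate_mfu(25e6, 1e5, 71e12):.1%}")
```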
Load a saved checkpoint and its vocabulary like this:

```python
import torch
from gpt_from_scratch.model import GPT

# Load checkpoint
checkpoint = torch.load('out/ckpt.pt')
model_config = checkpoint['model_config']
model = GPT(model_config)
model.load_state_dict(checkpoint['model'])

# Access vocabulary
vocab = checkpoint['vocab']
stoi = vocab['stoi']  # string to int
itos = vocab['itos']  # int to string
```

Test the model implementation:

```bash
python -m gpt_from_scratch.model.transformer
```

This runs a forward pass and generation test to verify everything works.
With the default Shakespeare dataset:
- Training loss: Should drop from ~4.0 to ~1.0-1.5 after 5000 iterations
- Validation loss: Should be similar to training loss (little overfitting)
- Generation quality: After 2000-3000 iterations, should generate recognizable Shakespeare-like text
- Training time (small model on RTX 3090): ~10 minutes for 5000 iterations
Contributions are welcome! Please feel free to submit a Pull Request.
Areas for improvement:
- BPE tokenizer integration
- Multi-GPU training support
- Weights & Biases logging
- More efficient data loading
- Additional sampling strategies
This project is licensed under the MIT License - see the LICENSE file for details.
This implementation incorporates techniques from:
- Attention Is All You Need (Vaswani et al., 2017) - Original Transformer
- RoFormer (Su et al., 2021) - Rotary Position Embeddings
- PaLM (Chowdhery et al., 2022) - SwiGLU activation
- LLaMA (Touvron et al., 2023) - RMSNorm, GQA architecture
- GPT-2 (Radford et al., 2019) - Language model pretraining
- Flash Attention (Dao et al., 2022) - Efficient attention
- Attention Is All You Need
- GPT-2 Paper
- LLaMA Paper
- Karpathy's nanoGPT - Inspiration for this project
- The Illustrated Transformer