This project implements a highly optimized GPT (Generative Pre-trained Transformer) model from scratch using PyTorch. The implementation includes state-of-the-art architectural improvements, a complete training pipeline, and flexible text generation capabilities.
- Rotary Position Embeddings (RoPE) - Better length generalization than learned positional embeddings
- Grouped Query Attention (GQA) - Memory-efficient attention with fewer KV heads
- SwiGLU Activation - Superior performance compared to GELU
- RMSNorm - Faster normalization than LayerNorm
- Flash Attention Support - Optimized attention computation when available
- Weight Tying - Shared embeddings between input and output layers
- Mixed Precision Training (FP16/BF16) - Faster training with lower memory usage
- Gradient Checkpointing - Trade compute for memory on large models
- Fused AdamW - Optimized optimizer for CUDA devices
- Learning Rate Scheduling - Warmup and cosine decay
- Gradient Accumulation - Simulate larger batch sizes
- torch.compile() Support - JIT compilation for PyTorch 2.0+
- Temperature Sampling - Control randomness in generation
- Top-k Filtering - Sample from top k most likely tokens
- Top-p (Nucleus) Sampling - Dynamic vocabulary truncation
- KV-Cache - Efficient autoregressive generation
- Interactive Mode - Chat-like interface for generation
- Tiny (~10M params) - Fast testing and prototyping
- Small (~25M params) - Runnable on CPU/Mac
- Medium (~80M params) - GPU recommended
- Large (~350M params) - Requires substantial GPU memory
Recommended: Python 3.11
Minimum: Python 3.8
Check your version: `python3 --version`
```bash
# If you have multiple Python versions, use the specific one:
python3.11 --version      # Check if 3.11 is available
python3.11 -m venv venv   # Create venv with 3.11
python3.11 gpt.py info    # Use 3.11 for all commands

# Or check which version to use:
python check_python_version.py
```

```bash
# One-command setup (Mac/Linux)
./setup.sh

# Or on Windows
setup.bat
```

The setup script automatically:
- Checks/installs Python (if needed)
- Verifies Python version (3.11 recommended)
- Creates virtual environment
- Installs dependencies
- Detects hardware
- Prepares dataset
- Gets you ready to train!
- Clone the repository:

  ```bash
  git clone https://github.com/emadnahed/custom-gpt-from-scratch.git
  cd custom-gpt-from-scratch
  ```

- Create and activate a virtual environment (recommended):

  ```bash
  python3.11 -m venv venv
  source venv/bin/activate   # On Windows use: venv\Scripts\activate
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
This project includes comprehensive hardware detection for seamless training across different platforms:
- NVIDIA CUDA - NVIDIA GPUs with CUDA support
- AMD ROCm - AMD GPUs on Linux
- Apple Metal (MPS) - Apple Silicon (M1/M2/M3)
- Intel XPU - Intel GPUs with Intel Extension for PyTorch
- CPU - Universal fallback
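As a mental model, the detection logic reduces to a few availability checks. Here is a minimal standalone sketch covering only CUDA, MPS, and CPU; the project's full logic lives in `gpt_from_scratch/utils/hardware_detector.py` and handles more backends:

```python
import torch

def pick_device() -> str:
    """Pick the best available PyTorch device, in rough order of speed (sketch)."""
    if torch.cuda.is_available():      # NVIDIA CUDA (also AMD ROCm builds of PyTorch)
        return "cuda"
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"                   # Apple Silicon (Metal)
    return "cpu"                       # Universal fallback

print(pick_device())
```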
Check what hardware is available on your system:
```bash
# Show all detected hardware
python check_hardware.py

# Interactive hardware selection
python check_hardware.py --interactive

# Show only recommended device
python check_hardware.py --recommended

# JSON output for scripting
python check_hardware.py --json
```

The training script automatically detects and uses the best available hardware:
```bash
# Auto-detect hardware (recommended)
python train.py

# Interactively choose hardware
python train.py --interactive

# Show hardware options without training
python train.py --show-hardware
```

- Automatic detection of best available hardware
- Display of available and unavailable hardware (color-coded)
- Device capabilities (memory, compute capability, precision support)
- Optimal precision selection (bfloat16, float16, float32)
- Platform-specific optimizations
For detailed information, see:
- `START_HERE.md` - Ultra quick start guide
- `GETTING_STARTED.md` - Comprehensive beginner's guide
- `HARDWARE_FEATURE_SUMMARY.md` - Hardware detection features
This project includes an intuitive command center (`gpt.py`) - think of it as your "package.json scripts" for GPT training!
```bash
# Most used commands (like npm run)
python gpt.py train      # Interactive training setup
python gpt.py generate   # Generate text from trained model
python gpt.py info       # Check your setup status
python gpt.py hardware   # View available hardware

# Management commands
python gpt.py config     # Create custom configurations
python gpt.py dataset    # Manage datasets (add/prepare/switch)
```

```bash
python gpt.py train

# You'll be asked:
# 1. Which hardware? (auto-detected!)
# 2. Which dataset? (Shakespeare, or your own)
# 3. Model size? (tiny/small/medium/large or custom)
# 4. Number of layers? (4, 8, 12, 24, or custom)
# 5. How long? (quick/short/medium/long)
# 6. Start now? (yes!)
```

Easily customize the number of layers and other parameters:
```bash
python gpt.py config

# When prompted:
# - Choose custom architecture
# - Set n_layer (number of transformer layers):
#   * 4 layers: Fast, good for testing
#   * 8 layers: Balanced
#   * 12 layers: Good quality (recommended)
#   * 24 layers: Best quality (needs good hardware)
# - Adjust other parameters (heads, embedding size, etc.)
```

```bash
python gpt.py dataset

# Options:
# 1. List available datasets
# 2. Prepare Shakespeare (default)
# 3. Add your own text file
# 4. View dataset info
```

```bash
# Traditional training
python train.py --config config/my_config.py

# Traditional generation
python generate_demo.py

# Hardware check
python check_hardware.py
```

See `QUICK_REFERENCE.md` for complete command documentation.
```
custom-gpt-from-scratch/
│
├── gpt_from_scratch/            # Main Python package
│   ├── __init__.py              # Package initialization
│   ├── cli.py                   # Command-line interface
│   ├── model/                   # Model architecture
│   │   ├── __init__.py
│   │   └── transformer.py       # GPT implementation
│   ├── utils/                   # Utility modules
│   │   ├── __init__.py
│   │   ├── hardware_detector.py # Hardware detection
│   │   └── python_utils.py      # Python utilities
│   └── data/                    # Data processing
│       ├── __init__.py
│       └── utils.py
│
├── config/                      # Training configurations
│   ├── train_default.py         # Default training config
│   ├── train_demo.py            # Demo configuration
│   └── train_*.py               # Custom configurations
│
├── data/                        # Data directory
│   └── prepare.py               # Data preparation script
│
├── out/                         # Training outputs (created during training)
│   └── ckpt.pt                  # Saved model checkpoints
│
├── .claude/                     # IDE/editor configuration
│   └── settings.local.json
│
├── check_hardware.py            # Hardware detection script
├── check_python_version.py      # Python version checker
├── config_builder.py            # Interactive config builder
├── dataset_manager.py           # Dataset management
├── generate_demo.py             # Text generation demo
├── generate_interactive.py      # Interactive generation
├── gpt.py                       # Main entry point
├── requirements.txt             # Python dependencies
├── setup.py                     # Package installation
├── setup.sh                     # Setup script (Linux/macOS)
├── setup.bat                    # Setup script (Windows)
├── test_system.py               # System test
└── train.py                     # Training script
```
First, prepare a dataset for training. The easiest way to start is with the Shakespeare dataset:
```bash
python data/prepare.py
```

This downloads and prepares the Tiny Shakespeare dataset (~1MB) for quick experimentation.
For custom datasets, modify `data/prepare.py` to load your text data. The script supports:
- Character-level tokenization (built-in)
- Hugging Face datasets integration
- Custom text files
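Character-level tokenization is simple enough to sketch in a few lines. This is a generic illustration of the approach (it assumes an `input.txt` in the working directory), not the exact code in `data/prepare.py`:

```python
# Build a character-level vocabulary from raw text (illustrative sketch).
text = open("input.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> string

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# Round-trips as long as every character appears in the corpus:
assert decode(encode(text[:20])) == text[:20]
```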
Train with the default configuration:
```bash
python train.py
```

The default configuration trains a small model (~25M parameters) on the Shakespeare dataset. Training should take 5-15 minutes on a modern GPU or 30-60 minutes on CPU.
During training, you'll see:
- Loss metrics (train/val) every N iterations
- Training speed (tokens/sec)
- Model FLOPS Utilization (MFU) - efficiency metric
- Checkpoints saved to `out/ckpt.pt`
Modify `config/train_default.py` to customize:
- Model size (tiny/small/medium/large presets)
- Training hyperparameters (learning rate, batch size, etc.)
- Hardware settings (device, mixed precision)
- Optimization features (gradient checkpointing, compilation)
Example configurations:
```python
# Fast training on CPU
model_preset = 'tiny'
batch_size = 4
max_iters = 1000
device = 'cpu'
```

```python
# GPU training with larger model
model_preset = 'medium'
batch_size = 32
max_iters = 10000
device = 'cuda'
dtype = 'bfloat16'
compile_model = True  # PyTorch 2.0+ for speedup
```

After training, generate text with your model:
```bash
# Basic generation
python sample.py --prompt "To be or not to be" --max_tokens 100

# Control creativity
python sample.py --prompt "Once upon a time" --temperature 0.8 --top_k 50

# Interactive mode
python sample.py --interactive
```

- `--prompt`: Starting text (empty for random start)
- `--max_tokens`: Number of tokens to generate (default: 100)
- `--temperature`: Sampling temperature; higher = more random (default: 0.8)
  - 0.1-0.5: Conservative, coherent
  - 0.6-0.9: Balanced creativity
  - 1.0+: Very creative, potentially incoherent
- `--top_k`: Top-k filtering; only sample from the top k tokens (default: 200)
- `--top_p`: Nucleus sampling; cumulative probability threshold (default: 0.9)
- `--seed`: Random seed for reproducibility
- `--interactive`: Launch interactive generation mode
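To make temperature, top-k, and top-p concrete, here is a minimal sampling step over a single logits vector. This is a sketch of the standard technique, not necessarily the project's exact implementation:

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=200, top_p=0.9):
    """Sample one token id from a (vocab_size,) logits vector (sketch)."""
    logits = logits / temperature                       # temperature scaling
    if top_k is not None:                               # keep only the k best logits
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits[logits < kth] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    if top_p is not None:                               # nucleus: smallest set with mass >= top_p
        sorted_probs, idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_probs, dim=-1)
        sorted_probs[cum - sorted_probs > top_p] = 0.0  # drop tokens outside the nucleus
        sorted_probs /= sorted_probs.sum()              # renormalize over the nucleus
        return idx[torch.multinomial(sorted_probs, 1)].item()
    return torch.multinomial(probs, 1).item()
```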
- Python 3.11+
- PyTorch 2.2.2+
- torchvision
- torchaudio
- NumPy
- tqdm
- Hugging Face Datasets (for data loading)
All dependencies are listed in requirements.txt.
This section provides a detailed breakdown of the model architecture at different levels of abstraction, from the simplest overview to the complete technical implementation.
```
Input → Tokenization → Token Embeddings → Transformer Blocks (×N layers) → Final Norm → Linear Head → Logits → Loss (if training)
```
```
Input Tokens → Token Embeddings → Dropout →
[Grouped Query Attention → Residual Add → Feedforward (SwiGLU/MLP) → Residual Add] (×N layers) →
Final Norm → Linear Head → Logits → Softmax (during generation)
```
```
Input Tokens → Token Embeddings → Dropout →
[x + GQA(x) → (x + GQA(x)) + MLP(x + GQA(x))] (×N layers) →
Final Norm → Linear Head → Logits
```
```
Input Tokens → Token Embeddings → Dropout →
[RMSNorm → GQA → Add & Residual → RMSNorm → SwiGLU → Add & Residual] (×N layers) →
Final RMSNorm → Linear Head (weight-tied) → Logits → Cross-Entropy Loss
```
```
Input Tokens (B, T)
↓
Token Embeddings: wte(idx) → (B, T, n_embd)
↓
Dropout(p=0.1)
↓
┌─────────────────────────────────────────────────────────────────────┐
│ Transformer Block (×n_layer) - Pre-Norm Architecture: │
│ │
│ x_norm = RMSNorm(x) [or LayerNorm if configured] │
│ ↓ │
│ Grouped Query Attention (GQA): │
│ • Q = Linear(x_norm) → (B, n_head, T, head_dim) │
│ • K = Linear(x_norm) → (B, n_kv_head, T, head_dim) │
│ • V = Linear(x_norm) → (B, n_kv_head, T, head_dim) │
│ • Q, K = RoPE(Q, K) [Rotary Position Embeddings] │
│ • K, V = repeat_interleave(K, V, n_rep) [if GQA] │
│ • attn_out = scaled_dot_product_attention(Q, K, V, causal=True)│
│ • attn_out = Linear_o(attn_out) + Dropout │
│ x = x + attn_out [Residual Connection 1] │
│ ↓ │
│ x_norm2 = RMSNorm(x) │
│ ↓ │
│ SwiGLU Feedforward: │
│ • gate = SiLU(W1(x_norm2)) │
│ • hidden = W3(x_norm2) │
│ • mlp_out = W2(gate * hidden) + Dropout │
│ x = x + mlp_out [Residual Connection 2] │
│ │
└─────────────────────────────────────────────────────────────────────┘
↓
Final Normalization: RMSNorm(x) → (B, T, n_embd)
↓
Language Model Head: Linear(x) → (B, T, vocab_size) [Weight-Tied with wte]
↓
Output Logits (Training) OR Logits[:, -1, :] (Inference)
↓
Loss Calculation (if targets provided):
• Flatten: logits → (B*T, vocab_size), targets → (B*T)
  • Cross-Entropy Loss with ignore_index=-1
```
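The loss step at the bottom of the diagram corresponds to a few lines of PyTorch. A sketch with dummy tensors:

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 2, 8, 8192
logits = torch.randn(B, T, vocab_size)           # model output
targets = torch.randint(0, vocab_size, (B, T))   # next-token ids; -1 marks ignored positions

loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # flatten to (B*T, vocab_size)
    targets.view(-1),             # flatten to (B*T,)
    ignore_index=-1,
)
```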
- Normalization before each sub-layer (attention and feedforward)
- More stable training than Post-Norm
- Formula: `x = x + SubLayer(Norm(x))`
- Applied inside attention mechanism (not as separate layer)
- No learned positional parameters
- Better length generalization
- Applied to Q and K before attention computation
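A minimal RoPE sketch using the "rotate-half" channel pairing common in LLaMA-style code; the project's exact pairing convention may differ:

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate channel pairs of x (B, n_head, T, head_dim) by position-dependent angles (sketch)."""
    B, H, T, D = x.shape
    half = D // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]  # split channels into rotation pairs
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1)

# Applied to Q and K (not V) before attention scores are computed:
q = apply_rope(torch.randn(1, 6, 16, 64))
```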
- Query heads: `n_head` (e.g., 6)
- Key/Value heads: `n_kv_head` (e.g., 3)
- Repetition factor: `n_rep = n_head / n_kv_head`
- Memory efficient: reduces KV cache size by 2-4x
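A minimal GQA sketch showing how K/V heads are expanded with `repeat_interleave` before standard scaled-dot-product attention (shapes are illustrative):

```python
import torch

B, T, head_dim = 2, 16, 64
n_head, n_kv_head = 6, 3
n_rep = n_head // n_kv_head                # each KV head serves n_rep query heads

q = torch.randn(B, n_head, T, head_dim)
k = torch.randn(B, n_kv_head, T, head_dim)
v = torch.randn(B, n_kv_head, T, head_dim)

k = k.repeat_interleave(n_rep, dim=1)      # (B, n_head, T, head_dim)
v = v.repeat_interleave(n_rep, dim=1)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```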
- Three projection matrices: W1, W2, W3
- Formula: `W2(SiLU(W1(x)) ⊙ W3(x))`
- Better performance than standard GELU
- Used in PaLM and LLaMA models
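A minimal SwiGLU feedforward module matching the formula above (a sketch; `hidden_dim` would typically be derived from `n_embd * mlp_ratio`):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feedforward: W2(SiLU(W1(x)) * W3(x))."""
    def __init__(self, n_embd, hidden_dim, bias=False):
        super().__init__()
        self.w1 = nn.Linear(n_embd, hidden_dim, bias=bias)  # gate projection
        self.w3 = nn.Linear(n_embd, hidden_dim, bias=bias)  # value projection
        self.w2 = nn.Linear(hidden_dim, n_embd, bias=bias)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))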
- Faster than LayerNorm (no mean centering)
- Formula: `x * rsqrt(mean(x²) + ε) * γ`
- Single learnable parameter: `weight` (γ)
- ~10-15% faster than LayerNorm
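A minimal RMSNorm module matching the formula above (sketch):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: x * rsqrt(mean(x^2) + eps) * gamma."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # the single learnable gain (γ)

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```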
- Token embeddings and output head share weights
- Reduces parameters by ~`vocab_size * n_embd`
- Formula: `lm_head.weight = wte.weight`
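In PyTorch, weight tying is a one-line assignment (sketch):

```python
import torch.nn as nn

vocab_size, n_embd = 8192, 384
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
lm_head.weight = wte.weight  # one shared (vocab_size, n_embd) parameter
```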
- Flash Attention support (when available)
- Gradient Checkpointing option
- Mixed Precision compatible (FP16/BF16)
- KV-Cache for efficient generation
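To illustrate the KV-cache idea: each generation step computes K/V only for the newest token and appends them to a cache, so per-step attention cost stays linear in the prefix length instead of quadratic. A toy sketch with random tensors (names are illustrative, not the project's API):

```python
import torch

B, n_head, head_dim = 1, 6, 64
cache_k = torch.empty(B, n_head, 0, head_dim)  # grows along the time axis
cache_v = torch.empty(B, n_head, 0, head_dim)

for step in range(4):
    q_new = torch.randn(B, n_head, 1, head_dim)  # query for the newest token only
    k_new = torch.randn(B, n_head, 1, head_dim)  # in a real model: projections of x_new
    v_new = torch.randn(B, n_head, 1, head_dim)
    cache_k = torch.cat([cache_k, k_new], dim=2)
    cache_v = torch.cat([cache_v, v_new], dim=2)
    # No causal mask needed: the single query may attend to every cached position.
    out = torch.nn.functional.scaled_dot_product_attention(q_new, cache_k, cache_v)
```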
```
Input: (Batch, Sequence) → (B, T)
↓
Embeddings: (B, T, n_embd)
↓
Attention Reshaping:
Q: (B, T, n_embd) → (B, n_head, T, head_dim)
K: (B, T, n_kv_head*head_dim) → (B, n_kv_head, T, head_dim)
V: (B, T, n_kv_head*head_dim) → (B, n_kv_head, T, head_dim)
↓
After Attention: (B, n_head, T, head_dim) → (B, T, n_embd)
↓
MLP: (B, T, n_embd) → (B, T, hidden_dim) → (B, T, n_embd)
↓
Final: (B, T, n_embd) → (B, T, vocab_size)
```
Where:
- `B` = Batch size
- `T` = Sequence length (context window)
- `n_embd` = Embedding dimension
- `n_head` = Number of query attention heads
- `n_kv_head` = Number of key/value attention heads (for GQA)
- `head_dim` = `n_embd / n_head`
- `hidden_dim` = `n_embd * mlp_ratio` (typically 4.0)
Note: The core architecture is identical in training and in generation/inference, but the execution flow and some behaviors differ by mode.
- RoPE (Rotary Position Embeddings)
  - Better extrapolation to longer sequences than learned embeddings
  - Relative position encoding with rotation matrices
  - No learned parameters for positions
- Grouped Query Attention (GQA)
  - Reduces KV cache memory by 2-4x
  - Fewer key/value heads than query heads
  - Near-identical performance to full Multi-Head Attention
- SwiGLU Activation
  - Combination of Swish activation and gating mechanism
  - Empirically better than GELU for language modeling
  - Used in PaLM, LLaMA models
- RMSNorm
  - Simpler than LayerNorm (no mean centering)
  - ~10-15% faster
  - Same performance as LayerNorm
- Flash Attention
  - Automatically used if available (PyTorch 2.0+)
  - 2-4x speedup on attention computation
  - Reduced memory usage
The model is highly configurable through `GPTConfig`:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 256        # Context length
    vocab_size: int = 8192       # Vocabulary size
    n_layer: int = 6             # Number of transformer layers
    n_head: int = 6              # Number of attention heads
    n_kv_head: int = 3           # Number of KV heads (GQA)
    n_embd: int = 384            # Embedding dimension
    mlp_ratio: float = 4.0       # MLP expansion ratio
    dropout: float = 0.1         # Dropout probability
    bias: bool = False           # Use bias in linear layers
    use_rms_norm: bool = True    # Use RMSNorm vs LayerNorm
    use_swiglu: bool = True      # Use SwiGLU vs GELU MLP
    gradient_checkpointing: bool = False  # Memory optimization
```

- Use mixed precision: Set `dtype = 'bfloat16'` or `'float16'` in config
- Enable compilation: Set `compile_model = True` (PyTorch 2.0+ with CUDA)
- Increase batch size: Max out your GPU memory with larger batches
- Use gradient accumulation: Simulate larger batches without memory increase (see the sketch after this list)
- Optimize data loading: Use memory-mapped files for large datasets
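A minimal gradient-accumulation loop, assuming a `model` that returns `(logits, loss)`, an `optimizer`, and a `get_batch` helper (names are illustrative, not the project's exact API):

```python
# Average gradients over N micro-batches before taking one optimizer step.
accum_steps = 4                      # effective batch = batch_size * accum_steps
optimizer.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    x, y = get_batch('train')        # assumed data-loading helper
    logits, loss = model(x, y)
    (loss / accum_steps).backward()  # scale so accumulated grads average, not sum
optimizer.step()
```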
- Use smaller model: Start with 'tiny' or 'small' presets
- Enable gradient checkpointing: Trades compute for memory
- Reduce batch size: Compensate with gradient accumulation
- Reduce context length: Shorter sequences use less memory
- Use CPU: Training is slower but works for small models
```python
# config/train_default.py (Apple Silicon)
model_preset = 'small'
batch_size = 8
max_iters = 5000
device = 'mps'         # Metal Performance Shaders
dtype = 'float32'      # MPS doesn't support bfloat16 yet
compile_model = False  # Not supported on MPS
```

```python
# config/train_default.py (NVIDIA GPU)
model_preset = 'large'
batch_size = 16
gradient_accumulation_steps = 4  # Effective batch size: 64
max_iters = 50000
device = 'cuda'
dtype = 'bfloat16'
compile_model = True
gradient_checkpointing = True    # If OOM
```

Instead of using presets, you can define custom architectures:
```python
from gpt_from_scratch.model import GPT, GPTConfig

config = GPTConfig(
    block_size=512,
    vocab_size=50257,  # GPT-2 vocab size
    n_layer=12,
    n_head=12,
    n_kv_head=4,       # GQA with 4 KV heads
    n_embd=768,
    dropout=0.1,
)
model = GPT(config)
```

```python
import torch
from gpt_from_scratch.model import GPT, create_model

# Create a model
model = create_model('small')

# Forward pass
input_ids = torch.randint(0, 8192, (1, 128))
logits, loss = model(input_ids)

# Generation
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
)
```

The training script outputs Model FLOPS Utilization (MFU), which estimates how efficiently you're using your hardware:
- MFU < 10%: Bottleneck in data loading or system
- MFU 10-30%: Normal for small models or CPU training
- MFU 30-50%: Good GPU utilization
- MFU 50%+: Excellent (difficult to achieve)
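As a rough sketch of how such a number can be estimated (using the common ~6 × n_params FLOPs-per-token approximation, which ignores the attention term and is therefore only approximate):

```python
# MFU sketch: achieved training FLOPs/sec divided by the hardware's peak FLOPs/sec.
def estimate_mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    achieved = 6.0 * n_params * tokens_per_sec  # ~6N FLOPs per trained token
    return achieved / peak_flops

# Hypothetical example: 25M params at 100k tokens/sec on a GPU with 71 TFLOPs peak
print(f"MFU ≈ {estimate_mfu(25e6, 1e5, 71e12):.1%}")
```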
Load a saved checkpoint and its vocabulary like this:

```python
import torch
from gpt_from_scratch.model import GPT

# Load checkpoint
checkpoint = torch.load('out/ckpt.pt')
model_config = checkpoint['model_config']
model = GPT(model_config)
model.load_state_dict(checkpoint['model'])

# Access vocabulary
vocab = checkpoint['vocab']
stoi = vocab['stoi']  # string to int
itos = vocab['itos']  # int to string
```

Test the model implementation:

```bash
python -m gpt_from_scratch.model.transformer
```

This runs a forward pass and generation test to verify everything works.
With the default Shakespeare dataset:
- Training loss: Should drop from ~4.0 to ~1.0-1.5 after 5000 iterations
- Validation loss: Should be similar to training loss (little overfitting)
- Generation quality: After 2000-3000 iterations, should generate recognizable Shakespeare-like text
- Training time (small model on RTX 3090): ~10 minutes for 5000 iterations
Contributions are welcome! Please feel free to submit a Pull Request.
Areas for improvement:
- BPE tokenizer integration
- Multi-GPU training support
- Weights & Biases logging
- More efficient data loading
- Additional sampling strategies
This project is licensed under the MIT License - see the LICENSE file for details.
This implementation incorporates techniques from:
- Attention Is All You Need (Vaswani et al., 2017) - Original Transformer
- RoFormer (Su et al., 2021) - Rotary Position Embeddings
- PaLM (Chowdhery et al., 2022) - SwiGLU activation
- LLaMA (Touvron et al., 2023) - RMSNorm, GQA architecture
- GPT-2 (Radford et al., 2019) - Language model pretraining
- Flash Attention (Dao et al., 2022) - Efficient attention
- Attention Is All You Need
- GPT-2 Paper
- LLaMA Paper
- Karpathy's nanoGPT - Inspiration for this project
- The Illustrated Transformer