Welcome to the Primus documentation! This guide will help you get started with training large-scale foundation models on AMD GPUs.
Start here if you're new to Primus:
- Quick Start Guide - Get up and running in 5 minutes
- CLI User Guide - Complete command-line reference
- CLI Architecture - Design philosophy and deep dive
Guides for common workflows and features:
- Configuration Guide - YAML/TOML configuration, recommended patterns, and examples
- Slurm & Container Usage - Distributed training and containerization workflows
- Experiment Management - Organizing and tracking your training runs
In-depth technical documentation:
- Post-Training Guide - Fine-tuning with SFT and LoRA using Primus CLI
- Performance Projection - Project training performance to multi-node configurations
- Preflight - Cluster diagnostics (host, GPU, and network information, plus performance tests)
- Benchmark Suite - GEMM, RCCL, and end-to-end benchmarks, plus profiling
- Supported Models - Supported LLM architectures and feature compatibility matrix
- Advanced Features - Mixed precision, parallelism strategies, and optimization techniques
- Backend Patch Notes - Primus-specific arguments for Megatron, TorchTitan, etc.
- Backend Extension Guide - How to add a new backend using the current adapter/trainer architecture
- Megatron Model Extension Guide - How to add a new Megatron model config
- TorchTitan Model Extension Guide - How to add a new TorchTitan model config
Get help and find answers:
- FAQ - Frequently asked questions and troubleshooting
- Examples - Real-world training examples and templates
- Preflight Tool - Cluster sanity checker to verify environment readiness
Find the right guide for your task:
- Train a model locally → Quick Start + CLI User Guide
- Run distributed training on Slurm → Slurm & Container Usage
- Configure my training run → Configuration Guide
- Project performance to multi-node → Performance Projection
- Benchmark performance → Benchmark Suite
- Understand the CLI design → CLI Architecture
- Troubleshoot issues → FAQ
Related projects and external resources:
- Primus-Turbo - High-performance operators & modules
- Primus-SaFE - Stability & platform layer
- AMD ROCm Documentation
- TorchTitan Documentation