"Premature optimization is the root of all evil - but we should not pass up opportunities in the critical 3%." - Donald Knuth
From Cloud to Edge — A holistic reference covering hardware fundamentals, GPU/accelerator metrics, the roofline model, AI/ML training & inference KPIs, LLM serving, distributed systems, quantization, cloud-native & edge deployment, networking, compiler optimizations, and practical tooling.
Table of Contents
- 1. Core Execution Metrics (CPU)
- 2. GPU & Accelerator Metrics
- 3. Time, Throughput & Latency
- 4. Roofline Model
- 5. Parallel Performance
- 6. Memory & I/O Metrics
- 7. System-Level & Queueing Metrics
- 8. AI/ML Training Metrics
- 9. AI/ML Inference & Serving Metrics
- 10. LLM-Specific Performance Metrics
- 11. Distributed Training & Multi-Device Scaling
- 12. Quantization & Numerical Precision
- 13. Network & Interconnect Performance
- 14. Cloud Performance Engineering
- 15. Edge & Embedded AI Performance
- 16. Compiler & Runtime Optimizations
- 17. Energy, Cost & Sustainability
- 18. Tooling & Profiling Reference
- 19. Quick Practical Summary
- 20. References
- Symbol: $f_{\text{clock}}$
- Unit: Hz (cycles per second), often GHz
- Meaning: How many clock cycles the processor completes per second.
- Example: 3.2 GHz => $f_{\text{clock}} = 3.2 \times 10^{9}\ \text{cycles/s}$
- Formula: $$ \text{IPC} = \frac{\text{Number of retired instructions}}{\text{Number of clock cycles}} $$
- Derived: $$ \text{Inst/s} = \text{IPC} \times f_{\text{clock}} $$
- Interpretation: Average completed instructions per clock cycle.
- Formula: $$ \text{CPI} = \frac{\text{Number of clock cycles}}{\text{Number of retired instructions}} $$
- Relation to IPC: $$ \text{IPC} = \frac{1}{\text{CPI}}, \qquad \text{CPI} = \frac{1}{\text{IPC}} $$
- Formula: $$ \text{Peak FLOPS} = N_{\text{cores}} \times f_{\text{clock}} \times \text{FLOPs per cycle per core} $$
Example (scalar):
- 1 core, 3 GHz, 1 FLOP/cycle => $3\,\text{GFLOPS}$.
Example (vector FMA):
- 8-wide FMA, 2 FP units/core, 3 GHz: $$ \text{FLOPs/cycle/core} = 8 \times 2 \times 2 = 32 $$ $$ \text{Peak} = 3 \times 10^{9} \times 32 = 96\,\text{GFLOPS} $$
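As a sanity check, the two examples above can be reproduced with a small helper (values are the illustrative ones from the text, not tied to any specific CPU):

```python
def peak_flops(n_cores: int, clock_hz: float, flops_per_cycle_per_core: float) -> float:
    """Theoretical peak = cores x clock x FLOPs/cycle/core."""
    return n_cores * clock_hz * flops_per_cycle_per_core

# Scalar example: 1 core, 3 GHz, 1 FLOP/cycle
print(peak_flops(1, 3e9, 1) / 1e9)          # 3.0 (GFLOPS)

# Vector FMA example: 8-wide FMA (2 FLOPs each) x 2 FP units = 32 FLOPs/cycle
print(peak_flops(1, 3e9, 8 * 2 * 2) / 1e9)  # 96.0 (GFLOPS)
```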
- Formula: $$ \text{Delivered FLOPs/s} = \frac{\text{Floating-point operations actually performed}}{\text{Execution time}} $$
- Efficiency: $$ \text{FLOP efficiency} = \frac{\text{Delivered FLOPs/s}}{\text{Peak FLOPs/s}} \times 100\% $$
- Definition: $$ a \gets a + (b \times c) $$
- MACs per second (per unit): $$ \text{MAC/s} = N_{\text{MAC/cycle}} \times f_{\text{clock}} $$
- GMAC: $$ \text{GMAC} = \frac{\text{Number of MAC operations}}{10^{9}} $$
- Relation to FLOPs (when one MAC = 2 FLOPs): $$ \text{FLOPs} = 2 \times \text{MACs} $$
- FP instruction fraction: $$ \text{FP fraction} = \frac{\text{FP instructions}}{\text{Total instructions}} $$
- Memory instruction fraction: $$ \text{Memory inst fraction} = \frac{\text{Load/Store instructions}}{\text{Total instructions}} $$
- Higher occupancy can hide memory latency through warp switching.
- Limited by register usage, shared memory allocation, and block size.
- Divergent branches reduce efficiency (threads in a warp take different paths).
- Tensor Cores perform mixed-precision matrix multiply-accumulate (e.g., FP16 inputs -> FP32 accumulator).
- Peak throughput example (NVIDIA H100 SXM): ~990 TFLOPS (FP16 Tensor), ~1,979 TFLOPS (FP8 Tensor).
- HBM3 example (H100): 3.35 TB/s peak bandwidth.
- HBM3e example (B200): 8 TB/s peak bandwidth.
- Launch overhead: 5-20 us typical on CUDA. Critical for small kernels.
- Kernel fusion eliminates intermediate launches and memory round-trips.
| Indicator | Compute Bound | Memory Bound |
|---|---|---|
| SM utilization | High | Low-Medium |
| Memory BW utilization | Low-Medium | High (near peak) |
| Arithmetic Intensity | High | Low |
| Fix strategy | Reduce FLOPs, use lower precision | Improve data reuse, fuse kernels |
- Systolic array utilization: fraction of PEs (Processing Elements) active per cycle.
- On-chip SRAM bandwidth often exceeds off-chip by 10-100x; data tiling is critical.
- TPU MXU utilization: $$ \text{MXU util} = \frac{\text{Performed matmuls}}{\text{Peak matmul capacity}} \times 100\% $$
- Using CPI: $$ T_{\text{exec}} = \frac{\text{Instruction count} \times \text{CPI}}{f_{\text{clock}}} $$
- Using IPC: $$ T_{\text{exec}} = \frac{\text{Instruction count}}{\text{IPC} \times f_{\text{clock}}} $$
Generic definition: $$ \text{Throughput} = \frac{\text{Work done}}{\text{Time}} $$
Special cases:
- Instructions: $\text{Inst/s} = \frac{\text{Instruction count}}{T_{\text{exec}}}$
- FLOPs: $\text{FLOPs/s} = \frac{\text{Floating-point operations}}{T_{\text{exec}}}$
- Requests: $\text{Req/s} = \frac{\text{Number of completed requests}}{T_{\text{obs}}}$
For a simple fully pipelined system: $$ \text{Throughput} \approx \frac{1}{\text{Average latency}} $$
- P50 (median), P95, P99, P99.9 latencies.
- Critical for production SLAs. A system can have good average latency but unacceptable tail latency.
- Rule of thumb: At scale with fan-out $N$, the probability of hitting a slow node grows as $1 - (1 - p)^N$, where $p$ is the per-node probability of a slow response.
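The fan-out rule of thumb is easy to evaluate; the 1% slow-rate and fan-out of 100 below are illustrative numbers, not from the text:

```python
def p_any_slow(p_node: float, fanout: int) -> float:
    """Probability that at least one of N fanned-out calls hits a slow node:
    1 - (1 - p)^N."""
    return 1 - (1 - p_node) ** fanout

# Even a 1% per-node slow-request rate becomes dominant at fan-out 100:
print(f"{p_any_slow(0.01, 100):.1%}")  # ~63.4%
```

This is why tail latency (P99 and beyond) matters far more than averages for high-fan-out services.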
Given baseline time $T_{\text{before}}$ and optimized time $T_{\text{after}}$: $$ \text{Speedup} = \frac{T_{\text{before}}}{T_{\text{after}}} $$
The roofline model provides a visual upper bound on performance as a function of arithmetic intensity: $$ \text{Attainable FLOPs/s} = \min\left(\text{Peak FLOPs/s},\ \text{AI} \times \text{Peak BW}\right) $$
Where:
- $\text{AI}$ = Arithmetic Intensity (FLOPs / Byte)
- $\text{Peak BW}$ = Peak memory bandwidth (Bytes/s)
The crossover where the model shifts from memory-bound to compute-bound: $$ \text{AI}_{\text{ridge}} = \frac{\text{Peak FLOPs/s}}{\text{Peak BW}} $$
- If $\text{AI} < \text{AI}_{\text{ridge}}$ -> memory-bound (optimize data movement).
- If $\text{AI} > \text{AI}_{\text{ridge}}$ -> compute-bound (optimize arithmetic).
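A minimal sketch of the roofline bound and ridge point, using a hypothetical accelerator (100 TFLOPS peak, 2 TB/s HBM; these numbers are assumptions for illustration):

```python
def attainable_flops(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline bound: min(compute roof, AI x bandwidth roof)."""
    return min(peak_flops, ai * peak_bw)

def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """AI at which the kernel transitions from memory- to compute-bound."""
    return peak_flops / peak_bw

PEAK, BW = 100e12, 2e12
print(ridge_point(PEAK, BW))                  # 50.0 FLOPs/Byte
print(attainable_flops(1, PEAK, BW) / 1e12)   # 2.0 TFLOPS  (memory-bound)
print(attainable_flops(200, PEAK, BW) / 1e12) # 100.0 TFLOPS (compute-bound)
```

Comparing a kernel's measured AI against the ridge point tells you which axis of the table above to optimize.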
| Operation | Typical AI (FLOPs/Byte) | Bound |
|---|---|---|
| Element-wise (ReLU, add) | 0.25 - 1 | Memory |
| Batch Norm | 1 - 5 | Memory |
| Convolution (large batch) | 10 - 100 | Compute |
| Matrix Multiply (large) | 50 - 200+ | Compute |
| Attention (long seq) | 5 - 50 | Varies |
| Embedding lookup | < 1 | Memory |
Extend the roofline with multiple ceilings for different memory levels:
- L1 cache roofline -> highest bandwidth, smallest capacity
- L2 cache roofline
- HBM / DRAM roofline -> lowest bandwidth, largest capacity
Each level adds a bandwidth ceiling; the kernel's performance is bounded by the ceiling corresponding to the memory level it saturates.
Amdahl's Law: Let $p$ be the fraction of execution that can be accelerated, and $S$ the speedup of that fraction: $$ \text{Speedup} = \frac{1}{(1 - p) + \frac{p}{S}} $$
Special case, parallelization across $N$ workers: $$ \text{Speedup}(N) = \frac{1}{(1 - p) + \frac{p}{N}} $$
Gustafson's Law: For scaled-size problems on $N$ workers: $$ \text{Speedup}(N) = (1 - p) + p \times N $$
For $N \to \infty$, Amdahl's speedup is bounded by $\frac{1}{1 - p}$, while Gustafson's grows without bound as the problem scales.
Scaling efficiency: Let $T_1$ be the runtime on one worker and $T_N$ the runtime on $N$ workers.
- Strong scaling: Fixed total problem size, increase $N$. $$ E_{\text{strong}} = \frac{T_1}{N \cdot T_N} $$
- Weak scaling: Problem size grows with $N$ (constant work per worker). $$ E_{\text{weak}} = \frac{T_1}{T_N} $$
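The two laws give very different answers for the same parallel fraction; a quick sketch (the 95% / 64-worker figures are illustrative assumptions):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl: fixed problem size, fraction p parallelized over n workers."""
    return 1 / ((1 - p) + p / n)

def gustafson_speedup(p: float, n: int) -> float:
    """Gustafson: problem size scaled with n workers."""
    return (1 - p) + p * n

def strong_scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """E_strong = T1 / (N * T_N)."""
    return t1 / (n * tn)

# 95% parallelizable code on 64 workers:
print(round(amdahl_speedup(0.95, 64), 1))    # 15.4x (capped at 20x as N -> inf)
print(round(gustafson_speedup(0.95, 64), 1)) # 60.9x
```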
Definition: $$ \text{Bandwidth} = \frac{\text{Bytes transferred}}{\text{Second}} $$
Theoretical example (64-bit bus, 3.2 GT/s): $$ \text{Peak BW} = 8\,\text{bytes} \times 3.2 \times 10^{9}\ \text{transfers/s} = 25.6\ \text{GB/s} $$
Conversion between time and cycles: $$ \text{Latency (cycles)} = \text{Latency (seconds)} \times f_{\text{clock}} $$
- Hit ratio: $$ \text{Hit ratio} = \frac{\text{Cache hits}}{\text{Cache accesses}} $$
- Miss ratio: $$ \text{Miss ratio} = 1 - \text{Hit ratio} = \frac{\text{Cache misses}}{\text{Cache accesses}} $$
| Level | Latency | Bandwidth (approx.) |
|---|---|---|
| L1 Cache | ~1 ns (3-4 cycles) | ~1-4 TB/s |
| L2 Cache | ~3-10 ns | ~500 GB/s - 1 TB/s |
| L3 Cache | ~10-30 ns | ~200-500 GB/s |
| DRAM (DDR5) | ~50-100 ns | ~50-100 GB/s |
| HBM3 | ~80-120 ns | ~2-4 TB/s |
| NVMe SSD | ~10-100 us | ~5-14 GB/s |
| Network (RDMA) | ~1-5 us | ~25-400 Gb/s |
For AI workloads, the training loop is often bottlenecked by data loading: $$ \text{Data starvation ratio} = \frac{\text{Time the accelerator waits for data}}{\text{Total step time}} $$
- Aim for $\text{Data starvation ratio} \approx 0$ via prefetching, multi-worker data loaders, and on-GPU augmentation.
Time-based: $$ \text{Utilization} = \frac{\text{Time device is executing useful work}}{\text{Total observed time}} \times 100\% $$
For a stable system: $$ L = \lambda \times W $$
Where:
- $L$ = average number of items in the system (concurrency)
- $\lambda$ = arrival rate (items/s)
- $W$ = average time an item spends in the system (s)
Rearrangements: $$ \lambda = \frac{L}{W}, \qquad W = \frac{L}{\lambda} $$
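Little's Law gives a direct capacity-planning formula; the 500 req/s and 200 ms figures below are illustrative assumptions:

```python
def required_concurrency(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda x W."""
    return arrival_rate_per_s * avg_latency_s

# To sustain 500 req/s at 200 ms average latency, the system must
# hold ~100 requests in flight (e.g., batch slots x replicas >= 100):
print(required_concurrency(500, 0.2))  # 100.0
```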
Applied to inference serving: If you want to sustain $\lambda$ requests/s at an average end-to-end latency of $W$ seconds, the system must hold $L = \lambda \times W$ requests in flight.
For a Transformer model with $P$ parameters:
- Forward pass per token (approx.): $$ \text{FLOPs}_{\text{fwd}} \approx 2 \times P $$
- Backward pass is approximately 2x forward, so total per-token training: $$ \text{FLOPs}_{\text{train/token}} \approx 6 \times P $$
- Model FLOPs Utilization (MFU): only counts "useful" model math: $$ \text{MFU} = \frac{\text{Model FLOPs per step} / T_{\text{step}}}{\text{Peak device FLOPs/s}} \times 100\% $$
- Hardware FLOPs Utilization (HFU): includes all FLOPs the hardware performs (rematerialization, etc.): $$ \text{HFU} = \frac{\text{All FLOPs per step} / T_{\text{step}}}{\text{Peak device FLOPs/s}} \times 100\% $$
- Good MFU values: 40-60% (typical), 60%+ (excellent).
- Used by MLPerf Training benchmarks.
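Combining the ~6P rule with the MFU definition gives a back-of-envelope MFU estimator. The model size, tokens/step, step time, and peak figure below are illustrative assumptions (990 TFLOPS roughly matches H100 FP16 Tensor peak from the earlier section):

```python
def train_flops_per_token(params: float) -> float:
    """~6 FLOPs per parameter per token (forward ~2P, backward ~4P)."""
    return 6 * params

def mfu(model_flops_per_step: float, step_time_s: float, peak_flops: float) -> float:
    """MFU = (model FLOPs per step / step time) / peak device FLOPs/s."""
    return model_flops_per_step / step_time_s / peak_flops

# Hypothetical: 7B-parameter model, 4M tokens per step,
# 400 s per step on a single 990 TFLOPS device:
step_flops = train_flops_per_token(7e9) * 4e6
print(f"MFU = {mfu(step_flops, 400.0, 990e12):.1%}")  # ~42.4%
```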
| Component | Approximate Memory |
|---|---|
| Model parameters | 2P bytes (FP16) or 4P bytes (FP32) |
| Gradients | Same as parameters |
| Optimizer state (Adam) | 8P bytes (two FP32 moments) + 4P bytes FP32 master weights |
| Activations | proportional to batch x seq_len x hidden x layers |
| Total (FP16 + Adam) | ~16P-20P bytes |
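The table's totals can be folded into a quick estimator; this sketch uses the standard mixed-precision Adam accounting (2P weights + 2P grads + 12P optimizer state = 16P bytes) and ignores activations, which vary with batch and sequence length:

```python
def training_memory_bytes(params: float, mixed_precision_adam: bool = True) -> float:
    """Rough training memory, excluding activations.
    Mixed precision + Adam: 2P (FP16 weights) + 2P (FP16 grads)
    + 4P (FP32 master) + 8P (FP32 m, v moments) = 16P bytes."""
    if mixed_precision_adam:
        return 16 * params
    # Full FP32: 4P weights + 4P grads + 8P moments
    return 16 * params  # coincidentally also 16P, but without a master copy

# 7B parameters -> ~112 GB before activations (won't fit on one 80 GB GPU):
print(training_memory_bytes(7e9) / 1e9)  # 112.0
```

This is why ZeRO-style partitioning (next section) or gradient checkpointing is required even for mid-sized models.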
Increasing batch size typically:
- Increases throughput (better hardware utilization, amortized overhead)
- Increases per-request latency (queuing delay)
Optimal batch size maximizes throughput while keeping tail latency (e.g., P99) within the SLA.
| Metric | Definition |
|---|---|
| P50 / P99 latency | 50th / 99th percentile end-to-end latency |
| Throughput (QPS) | Queries per second at target latency |
| Availability | Successful requests / Total requests x 100% |
| Goodput | Throughput of requests meeting SLA |
TTFT (Time To First Token): Includes prompt/prefill processing. Critical for perceived responsiveness.
TPOT (Time Per Output Token): Determines the streaming speed experienced by the user.
- Per-request: $$ \text{Tokens/s (per request)} = \frac{\text{Output tokens}}{T_{\text{generation}}} $$
- System-wide: $$ \text{Tokens/s (system)} = \frac{\text{Total tokens generated across all requests}}{T_{\text{observation}}} $$
| Phase | Characteristic | Bottleneck |
|---|---|---|
| Prefill | Process all input tokens in parallel | Compute-bound (large matmuls) |
| Decode | Generate tokens one at a time (autoregressive) | Memory-bound (KV cache reads, low arithmetic intensity) |
$$ \text{KV cache per token} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{bytes per element} $$
Where:
- the factor 2 accounts for storing both keys and values;
- $n_{\text{kv\_heads}}$ is the number of key/value heads (fewer than query heads under GQA/MQA).
- For Llama 3 70B (FP16): ~1.3 MB per token -> 1K tokens = ~1.3 GB per request.
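A sketch of the standard per-token KV-cache accounting (2 x layers x kv_heads x head_dim x bytes); the 32-layer model configuration below is hypothetical, chosen only to show the scale:

```python
def kv_cache_per_token_bytes(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys + values stored for every layer (factor 2 = K and V), FP16 default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 32-layer model, 32 KV heads, head_dim 128, FP16:
per_tok = kv_cache_per_token_bytes(32, 32, 128)
print(per_tok / 1e6, "MB/token")              # ~0.52 MB
print(per_tok * 4096 / 1e9, "GB per 4K-context request")
```

Because this grows linearly with context length and concurrent requests, KV-cache memory, not weights, often caps serving concurrency.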
| Technique | Effect |
|---|---|
| Continuous batching | Dynamically add/remove requests from batch -> higher GPU utilization |
| PagedAttention (vLLM) | Non-contiguous KV cache pages -> eliminates memory fragmentation, +2-4x throughput |
| Speculative decoding | Draft model proposes tokens, target model verifies in parallel -> lower latency |
| Prefix caching | Reuse KV cache for shared prompt prefixes -> faster TTFT for repeated prefixes |
| Quantized KV cache | INT8/FP8 KV cache -> 2x more concurrent requests |
| Flash Attention | IO-aware exact attention -> reduced memory, faster computation |
| Benchmark | What it measures |
|---|---|
| MMLU / MMLU-Pro | Multi-domain knowledge accuracy |
| HumanEval / MBPP | Code generation pass@k |
| MT-Bench | Multi-turn conversation quality |
| Chatbot Arena (ELO) | Human preference ranking |
| MLPerf Inference | Standardized latency / throughput |
Data Parallelism (DP): Each worker gets a full model copy and a shard of the data: $$ \text{Comm per step (DP)} = \text{AllReduce}(\text{gradient size}) $$
Where gradient size $\approx P \times$ bytes per element for a model with $P$ parameters.
For ring AllReduce with $N$ workers: $$ \text{Bytes sent per worker} = 2 \times \frac{N - 1}{N} \times \text{Message size} $$
Where the factor $2\frac{N-1}{N}$ covers the reduce-scatter and all-gather phases and approaches $2\times$ the message size as $N$ grows.
Ideal: overlap ratio -> 100% (communication fully hidden).
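The ring-AllReduce formula gives a bandwidth-only lower bound on gradient sync time; the 7B/FP16/8-worker/400 GB/s numbers below are illustrative assumptions:

```python
def ring_allreduce_bytes(message_bytes: float, n_workers: int) -> float:
    """Per-worker traffic: 2(N-1)/N x message (reduce-scatter + all-gather)."""
    return 2 * (n_workers - 1) / n_workers * message_bytes

def allreduce_time_s(message_bytes: float, n_workers: int,
                     link_bw_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound; ignores per-hop latency terms."""
    return ring_allreduce_bytes(message_bytes, n_workers) / link_bw_bytes_per_s

# 7B FP16 gradients (14 GB) over 8 workers at 400 GB/s effective bandwidth:
print(allreduce_time_s(14e9, 8, 400e9) * 1e3, "ms")  # the time to hide via overlap
```

If this time exceeds the backward-pass compute time, communication cannot be fully overlapped and scaling efficiency drops.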
Tensor Parallelism (TP): Split individual layers across devices. $$ \text{Comm per layer (TP)} = 2 \times \text{AllReduce}(\text{activation size}) $$
Pipeline Parallelism (PP): Assign different layers to different devices. $$ \text{Pipeline bubble fraction} = \frac{P - 1}{\text{Micro-batches} + P - 1} $$
Where $P$ is the number of pipeline stages.
Mixture of Experts (MoE): each token is routed to a subset of experts:
- $k$ = top-k experts per token
- Load imbalance across experts degrades efficiency; auxiliary losses help.
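The pipeline bubble fraction above shrinks as the micro-batch count grows relative to the stage count; a quick sketch with illustrative stage/micro-batch counts:

```python
def pipeline_bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Fraction of pipeline time spent idle: (P - 1) / (M + P - 1)."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

# 8 stages, 32 micro-batches -> ~18% of time is bubble:
print(f"{pipeline_bubble_fraction(8, 32):.1%}")   # 17.9%
# Quadrupling micro-batches shrinks it substantially:
print(f"{pipeline_bubble_fraction(8, 128):.1%}")  # 5.2%
```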
| ZeRO Stage | What is partitioned | Memory saving (per GPU) |
|---|---|---|
| Stage 1 | Optimizer states | ~4x reduction |
| Stage 2 | + Gradients | ~8x reduction |
| Stage 3 | + Parameters | ~Nx reduction (N = GPUs) |
| Type | Bits | Range (approx.) | AI Use Case |
|---|---|---|---|
| FP32 | 32 | +/-3.4x10^38 | Baseline / master weights |
| TF32 | 19 | Same as FP32 (10-bit mantissa) | NVIDIA Ampere+ matmuls |
| BF16 | 16 | +/-3.4x10^38 (7-bit mantissa) | Training (wide range) |
| FP16 | 16 | +/-65,504 (10-bit mantissa) | Training / inference |
| FP8 (E4M3) | 8 | +/-448 | Inference & training (Hopper+) |
| FP8 (E5M2) | 8 | +/-57,344 | Gradients |
| INT8 | 8 | -128 to 127 | Post-training quantization |
| INT4 | 4 | -8 to 7 | Weight-only quantization (GPTQ, AWQ) |
| Binary / Ternary | 1-2 | {-1, 0, 1} | Ultra-low-power edge |
In practice, the realized speedup is limited by dequantization overhead, memory alignment, and kernel support.
| Approach | Accuracy | Cost | When to use |
|---|---|---|---|
| PTQ | Good for INT8, degrades at INT4 | Cheap (minutes) | Large models, quick deployment |
| GPTQ / AWQ | Good at INT4 weight-only | Moderate (hours) | LLM inference |
| QAT | Best accuracy preservation | Expensive (full retrain) | Edge deployment, strict accuracy |
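To make the PTQ row concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python (real PTQ toolchains add per-channel scales, calibration, and fused kernels):

```python
def quantize_int8_symmetric(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: scale = max|w| / 127,
    q = clamp(round(w / scale), -128, 127)."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8_symmetric(w)
print(q)  # [50, -127, 2, 100]
# Max round-trip error is bounded by scale/2 (~0.005 here):
print(max(abs(a - b) for a, b in zip(w, dequantize(q, s))))
```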
- Forward: FP16/BF16
- Loss scaling: dynamic to prevent underflow
- Master weights + optimizer: FP32
- Backward: FP16/BF16 gradients
Memory saving: approximately 2x for activations and gradients.
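The "dynamic" part of the loss-scaling recipe can be illustrated in pure Python. This is a hand-rolled sketch of the logic (grow the scale while gradients stay finite, halve it on overflow and skip the step), mirroring what framework utilities such as PyTorch's GradScaler do internally; the class and its defaults are illustrative, not a real API:

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaling for FP16 training."""

    def __init__(self, init_scale: float = 2.0 ** 16, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads_finite: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if not grads_finite:       # overflow detected: halve scale, skip step
            self.scale /= 2
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2        # gradients stable: try a larger scale
            self._good_steps = 0
        return True

scaler = DynamicLossScaler()
scaler.update(False)   # simulated FP16 gradient overflow
print(scaler.scale)    # 32768.0
```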
Bisection bandwidth (the minimum bandwidth across any cut dividing the network in half) is critical for all-to-all communication patterns.
| Interconnect | Bandwidth (per link) | Latency | Topology |
|---|---|---|---|
| PCIe Gen5 x16 | 64 GB/s | ~100 ns | Point-to-point |
| NVLink 4 (H100) | 900 GB/s (total) | ~1 us | Fully connected (8 GPU) |
| NVLink 5 (B200) | 1.8 TB/s (total) | <1 us | NVLink Switch |
| InfiniBand NDR | 400 Gb/s (50 GB/s) | ~1 us | Fat tree |
| InfiniBand XDR | 800 Gb/s (100 GB/s) | ~1 us | Fat tree / Dragonfly |
| RoCE v2 | 100-400 Gb/s | ~2-5 us | Ethernet fabric |
| Intel Gaudi3 Scale-up | 300 GB/s (per chip) | ~1 us | Mesh |
- Fat-tree: uniform bisection bandwidth, good for AllReduce.
- Dragonfly / Torus: lower cost, but traffic-pattern dependent.
- GPUDirect RDMA: GPU memory <-> network without CPU staging -> eliminates copy latency.
- GPUDirect Storage: GPU <-> NVMe without CPU -> faster checkpointing.
- NCCL: NVIDIA's collective communication library optimized for multi-GPU/multi-node.
| Cloud Instance | GPU | GPU Memory | Interconnect | USD/hr (approx.) |
|---|---|---|---|---|
| AWS p5.48xlarge | 8x H100 | 640 GB HBM3 | NVSwitch + EFA | ~60-98 |
| AWS p5e.48xlarge | 8x H200 | 1.13 TB HBM3e | NVSwitch + EFA | ~80-120 |
| GCP a3-megagpu-8g | 8x H100 | 640 GB HBM3 | NVSwitch + GPUDirect | ~60-100 |
| Azure ND H100 v5 | 8x H100 | 640 GB HBM3 | NVSwitch + InfiniBand | ~60-100 |
| AWS inf2.48xlarge | 12x Inferentia2 | 384 GB | NeuronLink | ~12 |
Prices vary by region, commitment, and spot availability.
- Use checkpointing to survive preemptions.
- Checkpoint overhead: aim for < 5% of step time.
- Typical cold start: 10s-5min (depending on model size and framework).
- Mitigation: keep-warm policies, pre-built containers, model caching.
- Geo-routing minimizes $T_{\text{client-to-edge}}$.
- Model tiering: small model at edge, large model fallback in cloud.
| Constraint | Typical Range |
|---|---|
| Power budget | 1-30 W (mobile SoC: ~5 W, edge server: ~300 W) |
| Memory | 2-16 GB (shared CPU+GPU) |
| Storage | 32-256 GB eMMC / NVMe |
| Latency SLA | 1-50 ms (real-time) |
| Connectivity | Intermittent or bandwidth-constrained |
| Edge Chip | TOPS (INT8) | TOPS/W | TDP |
|---|---|---|---|
| NVIDIA Jetson Orin NX 16GB | 100 | ~5 | 25 W |
| Apple M4 Neural Engine | 38 | ~19 | ~5 W (NE only) |
| Google Coral Edge TPU | 4 | ~2 | 2 W |
| Qualcomm Snapdragon 8 Gen 3 (Hexagon NPU) | 73 | ~15 | ~5 W |
| Intel Meteor Lake NPU | 11 | ~5 | ~10 W |
| Hailo-8L | 13 | ~13 | 1.5 W |
```
Application
    |
Model Optimization (pruning, quantization, distillation)
    |
Model Format (ONNX, TFLite, Core ML, TensorRT, OpenVINO)
    |
Runtime (ONNX Runtime, TFLite, SNPE, QNN, TensorRT)
    |
Hardware (CPU / GPU / NPU / DSP)
```
At 30 FPS: $$ \text{Per-frame budget} = \frac{1000\ \text{ms}}{30} \approx 33\ \text{ms} $$
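A feasibility check for real-time edge inference follows directly from the frame budget; the 80% headroom factor is an illustrative assumption leaving time for capture, pre/post-processing, and display:

```python
def frame_budget_ms(fps: float) -> float:
    """Wall-clock budget per frame at a given frame rate."""
    return 1000 / fps

def is_realtime_feasible(model_latency_ms: float, fps: float,
                         headroom: float = 0.8) -> bool:
    """Model must fit in the frame budget with headroom for the
    rest of the pipeline (capture, pre/post-processing, display)."""
    return model_latency_ms <= frame_budget_ms(fps) * headroom

print(round(frame_budget_ms(30), 1))   # 33.3 ms
print(is_realtime_feasible(20, 30))    # True  (fits in ~26.7 ms usable budget)
print(is_realtime_feasible(30, 30))    # False
```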
| Technique | Model Size Reduction | Accuracy Impact | Latency Impact |
|---|---|---|---|
| INT8 quantization | 4x | < 1% loss (typically) | 2-4x speedup |
| INT4 quantization | 8x | 1-3% loss | 3-6x speedup |
| Structured pruning | 2-10x | 0.5-3% loss | 2-5x speedup |
| Knowledge distillation | N/A (smaller arch) | < 2% loss | Depends on student |
| Neural Architecture Search | Model-dependent | Often better | Optimized for target |
| Optimization | Description | Tools |
|---|---|---|
| Operator fusion | Merge consecutive ops into one kernel (e.g., Conv+BN+ReLU) | TensorRT, XLA, TVM, torch.compile |
| Constant folding | Pre-compute static expressions | All compilers |
| Dead code elimination | Remove unused nodes | All compilers |
| Layout optimization | NCHW <-> NHWC for target hardware | TensorRT, OneDNN, XNNPACK |
| Common subexpression elimination | Reuse identical computations | XLA, Glow |
| Optimization | Description |
|---|---|
| Tiling / Loop blocking | Fit working set into cache/SRAM |
| Vectorization | Use SIMD/SIMT instructions |
| Loop unrolling | Reduce loop overhead, enable ILP |
| Memory coalescing | Align GPU memory accesses for warp-wide efficiency |
| Register blocking | Maximize register reuse in matmuls |
| Tool | Scope | Key Feature |
|---|---|---|
| torch.compile (Dynamo + Inductor) | PyTorch graphs | Python-level tracing + Triton codegen |
| XLA | TensorFlow / JAX | Whole-program optimization, TPU support |
| TensorRT | NVIDIA inference | INT8/FP16 calibration, layer fusion, engine building |
| TVM / Apache TVM | Cross-platform | Auto-tuning, BYOC, edge targets |
| ONNX Runtime | Cross-platform inference | Execution providers (CUDA, TensorRT, OpenVINO, CoreML) |
| OpenVINO | Intel CPUs / GPUs / VPUs | INT8 quantization, model optimizer |
| Core ML | Apple Silicon | Neural Engine dispatch, ANE optimization |
| Triton (OpenAI) | GPU kernel authoring | Python -> optimized GPU kernels |
| MLIR | Compiler infrastructure | Multi-level IR for heterogeneous compilation |
| Aspect | JIT (Just-In-Time) | AOT (Ahead-Of-Time) |
|---|---|---|
| Compilation time | Runtime (first run penalty) | Build time |
| Optimization scope | Dynamic shapes, runtime info | Static shapes only |
| Deployment | Requires compiler at runtime | Self-contained binary |
| Use case | Research, dynamic models | Production, edge devices |
- Power: $$ P = \frac{\text{Energy}}{\text{Time}} $$
- Energy per operation: $$ E_{\text{op}} = \frac{\text{Energy consumed}}{\text{Number of operations}} $$
- Energy efficiency: $$ \text{FLOPs/J} = \frac{\text{FLOPs}}{\text{Energy}} $$
Examples: GFLOPS/W, Images/s/W, Tokens/s/W
$$ \text{TCO} = \text{CapEx} + \text{OpEx} \times \text{Lifetime} $$
Where:
- CapEx: Hardware, networking, facility build-out
- OpEx: Electricity, cooling, maintenance, staffing, cloud fees
Compare Perf/TCO across options.
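A sketch of such a comparison; every dollar figure below is hypothetical (the cloud hourly rate loosely echoes the 8x H100 instance pricing listed earlier), and equal delivered throughput is assumed so only cost differs:

```python
def tco(capex: float, opex_per_year: float, years: int) -> float:
    """Simple TCO model: upfront cost + operating cost over the lifetime."""
    return capex + opex_per_year * years

def perf_per_tco(throughput: float, capex: float,
                 opex_per_year: float, years: int) -> float:
    return throughput / tco(capex, opex_per_year, years)

# Hypothetical 3-year horizon: on-prem 8-GPU server vs 24/7 cloud rental at $70/hr
onprem = perf_per_tco(1.0, capex=300_000, opex_per_year=40_000, years=3)
cloud = perf_per_tco(1.0, capex=0, opex_per_year=70 * 24 * 365, years=3)
print(onprem > cloud)  # True: 24/7 utilization favors owning in this scenario
```

Real comparisons must also weigh utilization (idle owned hardware still costs money), spot discounts, and depreciation.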
- Energy = Power x Time
- Carbon intensity varies by region (50-800 g CO2/kWh).
- PUE (Power Usage Effectiveness) of datacenter: 1.1-1.8 typical.
| Tool | Platform | Purpose |
|---|---|---|
| `nvidia-smi` | NVIDIA | Real-time GPU utilization, memory, temp, power |
| `nvtop` / `gpustat` | NVIDIA | Interactive GPU monitoring |
| NVIDIA Nsight Systems | NVIDIA | System-wide timeline (CPU+GPU+network) |
| NVIDIA Nsight Compute | NVIDIA | Kernel-level GPU profiling (occupancy, roofline) |
| NVIDIA DCGM | NVIDIA | Datacenter GPU health and metrics |
| `rocm-smi` / `rocprof` | AMD | ROCm GPU profiling |
| Intel VTune | Intel | CPU + GPU profiling |
| AMD Omniperf | AMD | MI-series kernel profiling |
| Tool | Framework | Purpose |
|---|---|---|
| `torch.profiler` | PyTorch | Op-level + trace export (with TensorBoard) |
| `torch.cuda.Event` | PyTorch | Precise CUDA timing |
| `jax.profiler` | JAX | XLA HLO trace |
| TensorBoard Profiler | TF / PyTorch / JAX | Visual timeline, op stats, memory |
| Weights & Biases | Any | Experiment tracking + system metrics |
| MLflow | Any | Experiment tracking + model registry |
| DeepSpeed Flops Profiler | DeepSpeed | FLOPs counting + communication profiling |
| Tool | Purpose |
|---|---|
| `perf` | CPU performance counters, call graphs |
| `top` / `htop` / `btop` | Process-level CPU, memory overview |
| `vmstat` / `iostat` / `sar` | Memory, I/O, CPU stats |
| `mpstat` | Per-CPU utilization |
| `pidstat` | Per-process stats |
| `strace` / `ltrace` | Syscall / library call tracing |
| `bpftrace` / BCC tools | eBPF-based dynamic tracing |
| `numactl` / `lstopo` | NUMA topology awareness |
| `turbostat` | CPU frequency, C-states, power |
| `pcm` | Intel Performance Counter Monitor |
Brendan Gregg's 60-second checklist:
```shell
uptime             # load averages
dmesg | tail       # kernel errors
vmstat 1           # CPU, memory, I/O
mpstat -P ALL 1    # per-CPU balance
pidstat 1          # per-process CPU
iostat -xz 1       # disk I/O
free -m            # memory usage
sar -n DEV 1       # network I/O
sar -n TCP,ETCP 1  # TCP stats
top                # overview
```

| Tool | Purpose |
|---|---|
| TensorRT | NVIDIA GPU inference optimization (INT8, FP16 calibration, fusion) |
| ONNX Runtime | Cross-platform optimized inference |
| OpenVINO | Intel hardware inference optimization |
| Core ML Tools | Apple Silicon optimization |
| TFLite | Mobile / embedded inference |
| vLLM | High-throughput LLM serving (PagedAttention) |
| TGI (Text Generation Inference) | HuggingFace LLM serving |
| SGLang | Fast LLM serving with RadixAttention |
| llama.cpp | CPU/GPU LLM inference (GGUF quantization) |
| MLC LLM | Universal LLM deployment (phone, browser, GPU) |
| Tool | Purpose |
|---|---|
| MLPerf (Training & Inference) | Industry-standard AI benchmarks |
| `sysbench` | CPU, memory, I/O microbenchmarks |
| `fio` | Storage I/O benchmarking |
| `iperf3` | Network bandwidth testing |
| `likwid` | Hardware performance counter toolkit |
| STREAM | Memory bandwidth benchmark |
| HPL / HPL-MxP | LINPACK for HPC / mixed precision |
| LM Evaluation Harness | LLM accuracy benchmarks |
| LLMPerf (Anyscale) | LLM serving throughput & latency |
| Goal | Key Metrics & Tools |
|---|---|
| Model core performance | IPC/CPI, clock frequency, instruction count |
| Compute throughput | FLOPs/s, MACs/s, arithmetic intensity, roofline model |
| GPU utilization | SM occupancy, tensor core utilization, memory BW utilization |
| Training efficiency | MFU, samples/s, time-to-accuracy, scaling efficiency |
| Inference latency | P50/P95/P99 latency, TTFT, TPOT, batching strategy |
| LLM serving | Tokens/s, KV cache memory, prefill vs decode bottleneck |
| Parallelism & scaling | Amdahl/Gustafson, strong/weak scaling, communication overlap |
| Memory bottlenecks | Cache hit ratio, bandwidth utilization, data loading starvation |
| Quantization | INT8/INT4 speedup, accuracy retention, compression ratio |
| Cloud cost efficiency | Perf/$, spot strategies, auto-scaling, cold start latency |
| Edge deployment | TOPS/W, real-time feasibility, on-device optimization stack |
| System behavior | Little's Law (concurrency), tail latency, utilization |
| Energy & sustainability | FLOPs/J, Perf/W, TCO, CO2e footprint |
- MIT 6.172 - Performance Engineering of Software Systems (YouTube)
- Stanford CS149 - Parallel Computing
- CMU 15-418/618 - Parallel Computer Architecture and Programming
- Clock rate - Wikipedia
- Speedup - Wikipedia
- Transistor count - Wikipedia
- Multiply-accumulate operation (MAC) - Wikipedia
- Floating point operations per second - Wikipedia
- Processor (computing) - Wikipedia
- Computer performance - Wikipedia
- Computer performance by orders of magnitude - Wikipedia
- Hardware acceleration - Wikipedia
- NVIDIA H100 Datasheet
- NVIDIA CUDA C++ Programming Guide
- NVIDIA Nsight Systems Documentation
- NVIDIA Nsight Compute Documentation
- Google TPU Documentation
- MLPerf - ML Commons
- Efficient Processing of Deep Neural Networks (Sze et al.) - Book
- FlashAttention - Dao et al.
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- Megatron-LM - NVIDIA
- DeepSpeed - Microsoft
- ZeRO: Memory Optimizations for Training Billion Parameter Models (Rajbhandari et al.)
- A Survey of Quantization Methods for Efficient Neural Network Inference (Gholami et al.)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- AWQ: Activation-aware Weight Quantization
- Profiling (computer programming) - Wikipedia
- Benchmark (computing) - Wikipedia
- Software performance testing - Wikipedia
- Analysis of algorithms - Wikipedia
- Best, worst and average case - Wikipedia
- Algorithmic efficiency - Wikipedia
- Performance per watt - Wikipedia
- Green computing - Wikipedia
- Environmental impact of artificial intelligence - Wikipedia
- Program optimization - Wikipedia
- Optimizing compiler - Wikipedia
- torch.compile Documentation
- XLA - Accelerated Linear Algebra
- Apache TVM
- Triton Language (OpenAI)
- MLIR - Multi-Level IR Compiler Framework
- Linux Performance Analysis in 60,000 Milliseconds - Brendan Gregg @ Netflix
- Linux Systems Performance - Brendan Gregg
- Top 10 ways to monitor Linux in the console - Jeff Geerling
- AI Performance Engineering (2025-2026 Edition): Latency, Throughput, Cost Optimization & Real-World Benchmarking - Robi Kumar Tomar
- Efficiently Scaling Transformer Inference - Pope et al. (Google)
- LLM Inference Performance Engineering: Best Practices - NVIDIA Technical Blog
- High Performance Computing - cs-books collection
- Computer Architecture: A Quantitative Approach - Hennessy & Patterson
- Programming Massively Parallel Processors - Kirk & Hwu
"It's hardware that makes a machine fast. It's software that makes a fast machine slow." - Craig Bruce