"Premature optimization is the root of all evil - but we should not pass up opportunities in the critical 3%." - Donald Knuth
From Cloud to Edge — A holistic reference covering hardware fundamentals, GPU/accelerator metrics, the roofline model, AI/ML training & inference KPIs, LLM serving, distributed systems, quantization, cloud-native & edge deployment, networking, compiler optimizations, and practical tooling.
Table of Contents
- 1. Core Execution Metrics (CPU)
- 2. GPU & Accelerator Metrics
- 3. Time, Throughput & Latency
- 4. Roofline Model
- 5. Parallel Performance
- 6. Memory & I/O Metrics
- 7. System-Level & Queueing Metrics
- 8. AI/ML Training Metrics
- 9. AI/ML Inference & Serving Metrics
- 10. LLM-Specific Performance Metrics
- 11. Distributed Training & Multi-Device Scaling
- 12. Quantization & Numerical Precision
- 13. Network & Interconnect Performance
- 14. Cloud Performance Engineering
- 15. Edge & Embedded AI Performance
- 16. Compiler & Runtime Optimizations
- 17. Energy, Cost & Sustainability
- 18. Tooling & Profiling Reference
- 19. Quick Practical Summary
- 20. References
- Symbol: $f_{\text{clock}}$
- Unit: Hz (cycles per second), often GHz
- Meaning: How many clock cycles the processor completes per second.
- Example: 3.2 GHz => $f_{\text{clock}} = 3.2 \times 10^{9}\ \text{cycles/s}$
- Formula: $$ \text{IPC} = \frac{\text{Number of retired instructions}}{\text{Number of clock cycles}} $$
- Derived: $$ \text{Inst/s} = \text{IPC} \times f_{\text{clock}} $$
- Interpretation: Average completed instructions per clock cycle.
- Formula: $$ \text{CPI} = \frac{\text{Number of clock cycles}}{\text{Number of retired instructions}} $$
- Relation to IPC: $$ \text{IPC} = \frac{1}{\text{CPI}}, \qquad \text{CPI} = \frac{1}{\text{IPC}} $$
- Formula: $$ \text{Peak FLOPS} = N_{\text{cores}} \times f_{\text{clock}} \times \text{FLOPs per cycle per core} $$
Example (scalar):
- 1 core, 3 GHz, 1 FLOP/cycle => $3\,\text{GFLOPS}$.
Example (vector FMA):
- 8-wide FMA, 2 FP units/core, 3 GHz: $$ \text{FLOPs/cycle/core} = 8 \times 2 \times 2 = 32 $$ $$ \text{Peak} = 3 \times 10^{9} \times 32 = 96\,\text{GFLOPS} $$
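As a sanity check, the two examples above can be reproduced with a small helper (values are the illustrative ones from the text, not tied to any specific CPU):

```python
def peak_flops(n_cores: int, clock_hz: float, flops_per_cycle_per_core: float) -> float:
    """Theoretical peak = cores x clock x FLOPs/cycle/core."""
    return n_cores * clock_hz * flops_per_cycle_per_core

# Scalar example: 1 core, 3 GHz, 1 FLOP/cycle
print(peak_flops(1, 3e9, 1) / 1e9)          # 3.0 (GFLOPS)

# Vector FMA example: 8-wide FMA (2 FLOPs each) x 2 FP units = 32 FLOPs/cycle
print(peak_flops(1, 3e9, 8 * 2 * 2) / 1e9)  # 96.0 (GFLOPS)
```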
- Formula: $$ \text{Delivered FLOPs/s} = \frac{\text{Floating-point operations actually performed}}{\text{Execution time}} $$
- Efficiency: $$ \text{FLOP efficiency} = \frac{\text{Delivered FLOPs/s}}{\text{Peak FLOPs/s}} \times 100\% $$
- Definition: $$ a \gets a + (b \times c) $$
- MACs per second (per unit): $$ \text{MAC/s} = N_{\text{MAC/cycle}} \times f_{\text{clock}} $$
- GMAC: $$ \text{GMAC} = \frac{\text{Number of MAC operations}}{10^{9}} $$
- Relation to FLOPs (when one MAC = 2 FLOPs): $$ \text{FLOPs} = 2 \times \text{MACs} $$
- FP instruction fraction: $$ \text{FP fraction} = \frac{\text{FP instructions}}{\text{Total instructions}} $$
- Memory instruction fraction: $$ \text{Memory inst fraction} = \frac{\text{Load/Store instructions}}{\text{Total instructions}} $$
- Higher occupancy can hide memory latency through warp switching.
- Limited by register usage, shared memory allocation, and block size.
- Divergent branches reduce efficiency (threads in a warp take different paths).
- Tensor Cores perform mixed-precision matrix multiply-accumulate (e.g., FP16 inputs -> FP32 accumulator).
- Peak throughput example (NVIDIA H100 SXM): ~990 TFLOPS (FP16 Tensor), ~1,979 TFLOPS (FP8 Tensor).
- HBM3 example (H100): 3.35 TB/s peak bandwidth.
- HBM3e example (B200): 8 TB/s peak bandwidth.
- Launch overhead: 5-20 us typical on CUDA. Critical for small kernels.
- Kernel fusion eliminates intermediate launches and memory round-trips.
| Indicator | Compute Bound | Memory Bound |
|---|---|---|
| SM utilization | High | Low-Medium |
| Memory BW utilization | Low-Medium | High (near peak) |
| Arithmetic Intensity | High | Low |
| Fix strategy | Reduce FLOPs, use lower precision | Improve data reuse, fuse kernels |
- Systolic array utilization: fraction of PEs (Processing Elements) active per cycle.
- On-chip SRAM bandwidth often exceeds off-chip by 10-100x; data tiling is critical.
- TPU MXU utilization: $$ \text{MXU util} = \frac{\text{Performed matmuls}}{\text{Peak matmul capacity}} \times 100\% $$
- Using CPI: $$ T_{\text{exec}} = \frac{\text{Instruction count} \times \text{CPI}}{f_{\text{clock}}} $$
- Using IPC: $$ T_{\text{exec}} = \frac{\text{Instruction count}}{\text{IPC} \times f_{\text{clock}}} $$
Generic definition: $$ \text{Throughput} = \frac{\text{Work done}}{\text{Time}} $$
Special cases:
- Instructions: $\text{Inst/s} = \frac{\text{Instruction count}}{T_{\text{exec}}}$
- FLOPs: $\text{FLOPs/s} = \frac{\text{Floating-point operations}}{T_{\text{exec}}}$
- Requests: $\text{Req/s} = \frac{\text{Number of completed requests}}{T_{\text{obs}}}$
For a simple fully pipelined system: $$ \text{Throughput} \approx \frac{1}{\text{Average latency}} $$
- P50 (median), P95, P99, P99.9 latencies.
- Critical for production SLAs. A system can have good average latency but unacceptable tail latency.
- Rule of thumb: At scale with fan-out $N$, the probability of hitting a slow node grows as $1 - (1 - p)^N$, where $p$ is the per-node probability of a slow response.
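The fan-out rule of thumb is easy to evaluate; the 1% slow-rate and fan-out of 100 below are illustrative numbers, not from the text:

```python
def p_any_slow(p_node: float, fanout: int) -> float:
    """Probability that at least one of N fanned-out calls hits a slow node:
    1 - (1 - p)^N."""
    return 1 - (1 - p_node) ** fanout

# Even a 1% per-node slow-request rate becomes dominant at fan-out 100:
print(f"{p_any_slow(0.01, 100):.1%}")  # ~63.4%
```

This is why tail latency (P99 and beyond) matters far more than averages for high-fan-out services.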
Given baseline time $T_{\text{before}}$ and optimized time $T_{\text{after}}$: $$ \text{Speedup} = \frac{T_{\text{before}}}{T_{\text{after}}} $$
The roofline model provides a visual upper bound on performance as a function of arithmetic intensity: $$ \text{Attainable FLOPs/s} = \min\left(\text{Peak FLOPs/s},\ \text{AI} \times \text{Peak BW}\right) $$
Where:
- $\text{AI}$ = Arithmetic Intensity (FLOPs / Byte)
- $\text{Peak BW}$ = Peak memory bandwidth (Bytes/s)
The crossover where the model shifts from memory-bound to compute-bound: $$ \text{AI}_{\text{ridge}} = \frac{\text{Peak FLOPs/s}}{\text{Peak BW}} $$
- If $\text{AI} < \text{AI}_{\text{ridge}}$ -> memory-bound (optimize data movement).
- If $\text{AI} > \text{AI}_{\text{ridge}}$ -> compute-bound (optimize arithmetic).
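A minimal sketch of the roofline bound and ridge point, using a hypothetical accelerator (100 TFLOPS peak, 2 TB/s HBM; these numbers are assumptions for illustration):

```python
def attainable_flops(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline bound: min(compute roof, AI x bandwidth roof)."""
    return min(peak_flops, ai * peak_bw)

def ridge_point(peak_flops: float, peak_bw: float) -> float:
    """AI at which the kernel transitions from memory- to compute-bound."""
    return peak_flops / peak_bw

PEAK, BW = 100e12, 2e12
print(ridge_point(PEAK, BW))                  # 50.0 FLOPs/Byte
print(attainable_flops(1, PEAK, BW) / 1e12)   # 2.0 TFLOPS  (memory-bound)
print(attainable_flops(200, PEAK, BW) / 1e12) # 100.0 TFLOPS (compute-bound)
```

Comparing a kernel's measured AI against the ridge point tells you which axis of the table above to optimize.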
| Operation | Typical AI (FLOPs/Byte) | Bound |
|---|---|---|
| Element-wise (ReLU, add) | 0.25 - 1 | Memory |
| Batch Norm | 1 - 5 | Memory |
| Convolution (large batch) | 10 - 100 | Compute |
| Matrix Multiply (large) | 50 - 200+ | Compute |
| Attention (long seq) | 5 - 50 | Varies |
| Embedding lookup | < 1 | Memory |
Extend the roofline with multiple ceilings for different memory levels:
- L1 cache roofline -> highest bandwidth, smallest capacity
- L2 cache roofline
- HBM / DRAM roofline -> lowest bandwidth, largest capacity
Each level adds a bandwidth ceiling; the kernel's performance is bounded by the ceiling corresponding to the memory level it saturates.
Amdahl's Law: Let $p$ be the fraction of execution that can be accelerated, and $S$ the speedup of that fraction: $$ \text{Speedup} = \frac{1}{(1 - p) + \frac{p}{S}} $$
Special case, parallelization across $N$ workers: $$ \text{Speedup}(N) = \frac{1}{(1 - p) + \frac{p}{N}} $$
Gustafson's Law: For scaled-size problems on $N$ workers: $$ \text{Speedup}(N) = (1 - p) + p \times N $$
For $N \to \infty$, Amdahl's speedup is bounded by $\frac{1}{1 - p}$, while Gustafson's grows without bound as the problem scales.
Scaling efficiency: Let $T_1$ be the runtime on one worker and $T_N$ the runtime on $N$ workers.
- Strong scaling: Fixed total problem size, increase $N$. $$ E_{\text{strong}} = \frac{T_1}{N \cdot T_N} $$
- Weak scaling: Problem size grows with $N$ (constant work per worker). $$ E_{\text{weak}} = \frac{T_1}{T_N} $$
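The two laws give very different answers for the same parallel fraction; a quick sketch (the 95% / 64-worker figures are illustrative assumptions):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl: fixed problem size, fraction p parallelized over n workers."""
    return 1 / ((1 - p) + p / n)

def gustafson_speedup(p: float, n: int) -> float:
    """Gustafson: problem size scaled with n workers."""
    return (1 - p) + p * n

def strong_scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """E_strong = T1 / (N * T_N)."""
    return t1 / (n * tn)

# 95% parallelizable code on 64 workers:
print(round(amdahl_speedup(0.95, 64), 1))    # 15.4x (capped at 20x as N -> inf)
print(round(gustafson_speedup(0.95, 64), 1)) # 60.9x
```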
Definition: $$ \text{Bandwidth} = \frac{\text{Bytes transferred}}{\text{Second}} $$
Theoretical example (64-bit bus, 3.2 GT/s): $$ \text{Peak BW} = 8\,\text{bytes} \times 3.2 \times 10^{9}\ \text{transfers/s} = 25.6\ \text{GB/s} $$
Conversion between time and cycles: $$ \text{Latency (cycles)} = \text{Latency (seconds)} \times f_{\text{clock}} $$
- Hit ratio: $$ \text{Hit ratio} = \frac{\text{Cache hits}}{\text{Cache accesses}} $$
- Miss ratio: $$ \text{Miss ratio} = 1 - \text{Hit ratio} = \frac{\text{Cache misses}}{\text{Cache accesses}} $$
| Level | Latency | Bandwidth (approx.) |
|---|---|---|
| L1 Cache | ~1 ns (3-4 cycles) | ~1-4 TB/s |
| L2 Cache | ~3-10 ns | ~500 GB/s - 1 TB/s |
| L3 Cache | ~10-30 ns | ~200-500 GB/s |
| DRAM (DDR5) | ~50-100 ns | ~50-100 GB/s |
| HBM3 | ~80-120 ns | ~2-4 TB/s |
| NVMe SSD | ~10-100 us | ~5-14 GB/s |
| Network (RDMA) | ~1-5 us | ~25-400 Gb/s |
For AI workloads, the training loop is often bottlenecked by data loading: $$ \text{Data starvation ratio} = \frac{\text{Time the accelerator waits for data}}{\text{Total step time}} $$
- Aim for $\text{Data starvation ratio} \approx 0$ via prefetching, multi-worker data loaders, and on-GPU augmentation.
Time-based: $$ \text{Utilization} = \frac{\text{Time device is executing useful work}}{\text{Total observed time}} \times 100\% $$
For a stable system: $$ L = \lambda \times W $$
Where:
- $L$ = average number of items in the system (concurrency)
- $\lambda$ = arrival rate (items/s)
- $W$ = average time an item spends in the system (s)
Rearrangements: $$ \lambda = \frac{L}{W}, \qquad W = \frac{L}{\lambda} $$
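Little's Law gives a direct capacity-planning formula; the 500 req/s and 200 ms figures below are illustrative assumptions:

```python
def required_concurrency(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda x W."""
    return arrival_rate_per_s * avg_latency_s

# To sustain 500 req/s at 200 ms average latency, the system must
# hold ~100 requests in flight (e.g., batch slots x replicas >= 100):
print(required_concurrency(500, 0.2))  # 100.0
```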
Applied to inference serving: If you want to sustain $\lambda$ requests/s at an average end-to-end latency of $W$ seconds, the system must hold $L = \lambda \times W$ requests in flight.
For a Transformer model with $P$ parameters:
- Forward pass per token (approx.): $$ \text{FLOPs}_{\text{fwd}} \approx 2 \times P $$
- Backward pass is approximately 2x forward, so total per-token training: $$ \text{FLOPs}_{\text{train/token}} \approx 6 \times P $$
- Model FLOPs Utilization (MFU): only counts "useful" model math: $$ \text{MFU} = \frac{\text{Model FLOPs per step} / T_{\text{step}}}{\text{Peak device FLOPs/s}} \times 100\% $$
- Hardware FLOPs Utilization (HFU): includes all FLOPs the hardware performs (rematerialization, etc.): $$ \text{HFU} = \frac{\text{All FLOPs per step} / T_{\text{step}}}{\text{Peak device FLOPs/s}} \times 100\% $$
- Good MFU values: 40-60% (typical), 60%+ (excellent).
- Used by MLPerf Training benchmarks.
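Combining the ~6P rule with the MFU definition gives a back-of-envelope MFU estimator. The model size, tokens/step, step time, and peak figure below are illustrative assumptions (990 TFLOPS roughly matches H100 FP16 Tensor peak from the earlier section):

```python
def train_flops_per_token(params: float) -> float:
    """~6 FLOPs per parameter per token (forward ~2P, backward ~4P)."""
    return 6 * params

def mfu(model_flops_per_step: float, step_time_s: float, peak_flops: float) -> float:
    """MFU = (model FLOPs per step / step time) / peak device FLOPs/s."""
    return model_flops_per_step / step_time_s / peak_flops

# Hypothetical: 7B-parameter model, 4M tokens per step,
# 400 s per step on a single 990 TFLOPS device:
step_flops = train_flops_per_token(7e9) * 4e6
print(f"MFU = {mfu(step_flops, 400.0, 990e12):.1%}")  # ~42.4%
```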
| Component | Approximate Memory |
|---|---|
| Model parameters | 2P bytes (FP16) or 4P bytes (FP32) |
| Gradients | Same as parameters |
| Optimizer state (Adam) | 8P bytes (two FP32 moments) + 4P bytes FP32 master weights |
| Activations | proportional to batch x seq_len x hidden x layers |
| Total (FP16 + Adam) | ~16P-20P bytes |
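The table's totals can be folded into a quick estimator; this sketch uses the standard mixed-precision Adam accounting (2P weights + 2P grads + 12P optimizer state = 16P bytes) and ignores activations, which vary with batch and sequence length:

```python
def training_memory_bytes(params: float, mixed_precision_adam: bool = True) -> float:
    """Rough training memory, excluding activations.
    Mixed precision + Adam: 2P (FP16 weights) + 2P (FP16 grads)
    + 4P (FP32 master) + 8P (FP32 m, v moments) = 16P bytes."""
    if mixed_precision_adam:
        return 16 * params
    # Full FP32: 4P weights + 4P grads + 8P moments
    return 16 * params  # coincidentally also 16P, but without a master copy

# 7B parameters -> ~112 GB before activations (won't fit on one 80 GB GPU):
print(training_memory_bytes(7e9) / 1e9)  # 112.0
```

This is why ZeRO-style partitioning (next section) or gradient checkpointing is required even for mid-sized models.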
Increasing batch size typically:
- Increases throughput (better hardware utilization, amortized overhead)
- Increases per-request latency (queuing delay)
Optimal batch size maximizes throughput while keeping tail latency (e.g., P99) within the SLA.
| Metric | Definition |
|---|---|
| P50 / P99 latency | 50th / 99th percentile end-to-end latency |
| Throughput (QPS) | Queries per second at target latency |
| Availability | Successful requests / Total requests x 100% |
| Goodput | Throughput of requests meeting SLA |
TTFT (Time To First Token): Includes prompt/prefill processing. Critical for perceived responsiveness.
TPOT (Time Per Output Token): Determines the streaming speed experienced by the user.
- Per-request: $$ \text{Tokens/s (per request)} = \frac{\text{Output tokens}}{T_{\text{generation}}} $$
- System-wide: $$ \text{Tokens/s (system)} = \frac{\text{Total tokens generated across all requests}}{T_{\text{observation}}} $$
| Phase | Characteristic | Bottleneck |
|---|---|---|
| Prefill | Process all input tokens in parallel | Compute-bound (large matmuls) |
| Decode | Generate tokens one at a time (autoregressive) | Memory-bound (KV cache reads, low arithmetic intensity) |
$$ \text{KV cache per token} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{bytes per element} $$
Where:
- the factor 2 accounts for storing both keys and values;
- $n_{\text{kv\_heads}}$ is the number of key/value heads (fewer than query heads under GQA/MQA).
- For Llama 3 70B (FP16): ~1.3 MB per token -> 1K tokens = ~1.3 GB per request.
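A sketch of the standard per-token KV-cache accounting (2 x layers x kv_heads x head_dim x bytes); the 32-layer model configuration below is hypothetical, chosen only to show the scale:

```python
def kv_cache_per_token_bytes(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys + values stored for every layer (factor 2 = K and V), FP16 default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 32-layer model, 32 KV heads, head_dim 128, FP16:
per_tok = kv_cache_per_token_bytes(32, 32, 128)
print(per_tok / 1e6, "MB/token")              # ~0.52 MB
print(per_tok * 4096 / 1e9, "GB per 4K-context request")
```

Because this grows linearly with context length and concurrent requests, KV-cache memory, not weights, often caps serving concurrency.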
| Technique | Effect |
|---|---|
| Continuous batching | Dynamically add/remove requests from batch -> higher GPU utilization |
| PagedAttention (vLLM) | Non-contiguous KV cache pages -> eliminates memory fragmentation, +2-4x throughput |
| Speculative decoding | Draft model proposes tokens, target model verifies in parallel -> lower latency |
| Prefix caching | Reuse KV cache for shared prompt prefixes -> faster TTFT for repeated prefixes |
| Quantized KV cache | INT8/FP8 KV cache -> 2x more concurrent requests |
| Flash Attention | IO-aware exact attention -> reduced memory, faster computation |
| Benchmark | What it measures |
|---|---|
| MMLU / MMLU-Pro | Multi-domain knowledge accuracy |
| HumanEval / MBPP | Code generation pass@k |
| MT-Bench | Multi-turn conversation quality |
| Chatbot Arena (ELO) | Human preference ranking |
| MLPerf Inference | Standardized latency / throughput |
Data Parallelism (DP): Each worker gets a full model copy and a shard of the data: $$ \text{Comm per step (DP)} = \text{AllReduce}(\text{gradient size}) $$
Where gradient size $\approx P \times$ bytes per element for a model with $P$ parameters.
For ring AllReduce with $N$ workers: $$ \text{Bytes sent per worker} = 2 \times \frac{N - 1}{N} \times \text{Message size} $$
Where the factor $2\frac{N-1}{N}$ covers the reduce-scatter and all-gather phases and approaches $2\times$ the message size as $N$ grows.
Ideal: overlap ratio -> 100% (communication fully hidden).
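The ring-AllReduce formula gives a bandwidth-only lower bound on gradient sync time; the 7B/FP16/8-worker/400 GB/s numbers below are illustrative assumptions:

```python
def ring_allreduce_bytes(message_bytes: float, n_workers: int) -> float:
    """Per-worker traffic: 2(N-1)/N x message (reduce-scatter + all-gather)."""
    return 2 * (n_workers - 1) / n_workers * message_bytes

def allreduce_time_s(message_bytes: float, n_workers: int,
                     link_bw_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound; ignores per-hop latency terms."""
    return ring_allreduce_bytes(message_bytes, n_workers) / link_bw_bytes_per_s

# 7B FP16 gradients (14 GB) over 8 workers at 400 GB/s effective bandwidth:
print(allreduce_time_s(14e9, 8, 400e9) * 1e3, "ms")  # the time to hide via overlap
```

If this time exceeds the backward-pass compute time, communication cannot be fully overlapped and scaling efficiency drops.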
Tensor Parallelism (TP): Split individual layers across devices. $$ \text{Comm per layer (TP)} = 2 \times \text{AllReduce}(\text{activation size}) $$
Pipeline Parallelism (PP): Assign different layers to different devices. $$ \text{Pipeline bubble fraction} = \frac{P - 1}{\text{Micro-batches} + P - 1} $$
Where $P$ is the number of pipeline stages.
Mixture of Experts (MoE): each token is routed to a subset of experts:
- $k$ = top-k experts per token
- Load imbalance across experts degrades efficiency; auxiliary losses help.
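The pipeline bubble fraction above shrinks as the micro-batch count grows relative to the stage count; a quick sketch with illustrative stage/micro-batch counts:

```python
def pipeline_bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Fraction of pipeline time spent idle: (P - 1) / (M + P - 1)."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

# 8 stages, 32 micro-batches -> ~18% of time is bubble:
print(f"{pipeline_bubble_fraction(8, 32):.1%}")   # 17.9%
# Quadrupling micro-batches shrinks it substantially:
print(f"{pipeline_bubble_fraction(8, 128):.1%}")  # 5.2%
```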
| ZeRO Stage | What is partitioned | Memory saving (per GPU) |
|---|---|---|
| Stage 1 | Optimizer states | ~4x reduction |
| Stage 2 | + Gradients | ~8x reduction |
| Stage 3 | + Parameters | ~Nx reduction (N = GPUs) |
| Type | Bits | Range (approx.) | AI Use Case |
|---|---|---|---|
| FP32 | 32 | +/-3.4x10^38 | Baseline / master weights |
| TF32 | 19 | Same as FP32 (10-bit mantissa) | NVIDIA Ampere+ matmuls |
| BF16 | 16 | +/-3.4x10^38 (7-bit mantissa) | Training (wide range) |
| FP16 | 16 | +/-65,504 (10-bit mantissa) | Training / inference |
| FP8 (E4M3) | 8 | +/-448 | Inference & training (Hopper+) |
| FP8 (E5M2) | 8 | +/-57,344 | Gradients |
| INT8 | 8 | -128 to 127 | Post-training quantization |
| INT4 | 4 | -8 to 7 | Weight-only quantization (GPTQ, AWQ) |
| Binary / Ternary | 1-2 | {-1, 0, 1} | Ultra-low-power edge |
In practice, the realized speedup is limited by dequantization overhead, memory alignment, and kernel support.
| Approach | Accuracy | Cost | When to use |
|---|---|---|---|
| PTQ | Good for INT8, degrades at INT4 | Cheap (minutes) | Large models, quick deployment |
| GPTQ / AWQ | Good at INT4 weight-only | Moderate (hours) | LLM inference |
| QAT | Best accuracy preservation | Expensive (full retrain) | Edge deployment, strict accuracy |
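To make the PTQ row concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python (real PTQ toolchains add per-channel scales, calibration, and fused kernels):

```python
def quantize_int8_symmetric(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: scale = max|w| / 127,
    q = clamp(round(w / scale), -128, 127)."""
    scale = max(abs(w) for w in weights) / 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8_symmetric(w)
print(q)  # [50, -127, 2, 100]
# Max round-trip error is bounded by scale/2 (~0.005 here):
print(max(abs(a - b) for a, b in zip(w, dequantize(q, s))))
```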
- Forward: FP16/BF16
- Loss scaling: dynamic to prevent underflow
- Master weights + optimizer: FP32
- Backward: FP16/BF16 gradients
Memory saving: approximately 2x for activations and gradients.
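The "dynamic" part of the loss-scaling recipe can be illustrated in pure Python. This is a hand-rolled sketch of the logic (grow the scale while gradients stay finite, halve it on overflow and skip the step), mirroring what framework utilities such as PyTorch's GradScaler do internally; the class and its defaults are illustrative, not a real API:

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaling for FP16 training."""

    def __init__(self, init_scale: float = 2.0 ** 16, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads_finite: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if not grads_finite:       # overflow detected: halve scale, skip step
            self.scale /= 2
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2        # gradients stable: try a larger scale
            self._good_steps = 0
        return True

scaler = DynamicLossScaler()
scaler.update(False)   # simulated FP16 gradient overflow
print(scaler.scale)    # 32768.0
```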
Bisection bandwidth (the minimum bandwidth across any cut dividing the network in half) is critical for all-to-all communication patterns.
| Interconnect | Bandwidth (per link) | Latency | Topology |
|---|---|---|---|
| PCIe Gen5 x16 | 64 GB/s | ~100 ns | Point-to-point |
| NVLink 4 (H100) | 900 GB/s (total) | ~1 us | Fully connected (8 GPU) |
| NVLink 5 (B200) | 1.8 TB/s (total) | <1 us | NVLink Switch |
| InfiniBand NDR | 400 Gb/s (50 GB/s) | ~1 us | Fat tree |
| InfiniBand XDR | 800 Gb/s (100 GB/s) | ~1 us | Fat tree / Dragonfly |
| RoCE v2 | 100-400 Gb/s | ~2-5 us | Ethernet fabric |
| Intel Gaudi3 Scale-up | 300 GB/s (per chip) | ~1 us | Mesh |
- Fat-tree: uniform bisection bandwidth, good for AllReduce.
- Dragonfly / Torus: lower cost, but traffic-pattern dependent.
- GPUDirect RDMA: GPU memory <-> network without CPU staging -> eliminates copy latency.
- GPUDirect Storage: GPU <-> NVMe without CPU -> faster checkpointing.
- NCCL: NVIDIA's collective communication library optimized for multi-GPU/multi-node.
| Cloud Instance | GPU | GPU Memory | Interconnect | USD/hr (approx.) |
|---|---|---|---|---|
| AWS p5.48xlarge | 8x H100 | 640 GB HBM3 | NVSwitch + EFA | ~60-98 |
| AWS p5e.48xlarge | 8x H200 | 1.13 TB HBM3e | NVSwitch + EFA | ~80-120 |
| GCP a3-megagpu-8g | 8x H100 | 640 GB HBM3 | NVSwitch + GPUDirect | ~60-100 |
| Azure ND H100 v5 | 8x H100 | 640 GB HBM3 | NVSwitch + InfiniBand | ~60-100 |
| AWS inf2.48xlarge | 12x Inferentia2 | 384 GB | NeuronLink | ~12 |
Prices vary by region, commitment, and spot availability.
- Use checkpointing to survive preemptions.
- Checkpoint overhead: aim for < 5% of step time.
- Typical cold start: 10s-5min (depending on model size and framework).
- Mitigation: keep-warm policies, pre-built containers, model caching.
- Geo-routing minimizes $T_{\text{client-to-edge}}$.
- Model tiering: small model at edge, large model fallback in cloud.
| Constraint | Typical Range |
|---|---|
| Power budget | 1-30 W (mobile SoC: ~5 W, edge server: ~300 W) |
| Memory | 2-16 GB (shared CPU+GPU) |
| Storage | 32-256 GB eMMC / NVMe |
| Latency SLA | 1-50 ms (real-time) |
| Connectivity | Intermittent or bandwidth-constrained |
| Edge Chip | TOPS (INT8) | TOPS/W | TDP |
|---|---|---|---|
| NVIDIA Jetson Orin NX 16GB | 100 | ~5 | 25 W |
| Apple M4 Neural Engine | 38 | ~19 | ~5 W (NE only) |
| Google Coral Edge TPU | 4 | ~2 | 2 W |
| Qualcomm Snapdragon 8 Gen 3 (Hexagon NPU) | 73 | ~15 | ~5 W |
| Intel Meteor Lake NPU | 11 | ~5 | ~10 W |
| Hailo-8L | 13 | ~13 | 1.5 W |
```
Application
    |
Model Optimization (pruning, quantization, distillation)
    |
Model Format (ONNX, TFLite, Core ML, TensorRT, OpenVINO)
    |
Runtime (ONNX Runtime, TFLite, SNPE, QNN, TensorRT)
    |
Hardware (CPU / GPU / NPU / DSP)
```
At 30 FPS: $$ \text{Per-frame budget} = \frac{1000\ \text{ms}}{30} \approx 33\ \text{ms} $$
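A feasibility check for real-time edge inference follows directly from the frame budget; the 80% headroom factor is an illustrative assumption leaving time for capture, pre/post-processing, and display:

```python
def frame_budget_ms(fps: float) -> float:
    """Wall-clock budget per frame at a given frame rate."""
    return 1000 / fps

def is_realtime_feasible(model_latency_ms: float, fps: float,
                         headroom: float = 0.8) -> bool:
    """Model must fit in the frame budget with headroom for the
    rest of the pipeline (capture, pre/post-processing, display)."""
    return model_latency_ms <= frame_budget_ms(fps) * headroom

print(round(frame_budget_ms(30), 1))   # 33.3 ms
print(is_realtime_feasible(20, 30))    # True  (fits in ~26.7 ms usable budget)
print(is_realtime_feasible(30, 30))    # False
```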
| Technique | Model Size Reduction | Accuracy Impact | Latency Impact |
|---|---|---|---|
| INT8 quantization | 4x | < 1% loss (typically) | 2-4x speedup |
| INT4 quantization | 8x | 1-3% loss | 3-6x speedup |
| Structured pruning | 2-10x | 0.5-3% loss | 2-5x speedup |
| Knowledge distillation | N/A (smaller arch) | < 2% loss | Depends on student |
| Neural Architecture Search | Model-dependent | Often better | Optimized for target |
| Optimization | Description | Tools |
|---|---|---|
| Operator fusion | Merge consecutive ops into one kernel (e.g., Conv+BN+ReLU) | TensorRT, XLA, TVM, torch.compile |
| Constant folding | Pre-compute static expressions | All compilers |
| Dead code elimination | Remove unused nodes | All compilers |
| Layout optimization | NCHW <-> NHWC for target hardware | TensorRT, OneDNN, XNNPACK |
| Common subexpression elimination | Reuse identical computations | XLA, Glow |
| Optimization | Description |
|---|---|
| Tiling / Loop blocking | Fit working set into cache/SRAM |
| Vectorization | Use SIMD/SIMT instructions |
| Loop unrolling | Reduce loop overhead, enable ILP |
| Memory coalescing | Align GPU memory accesses for warp-wide efficiency |
| Register blocking | Maximize register reuse in matmuls |
| Tool | Scope | Key Feature |
|---|---|---|
| torch.compile (Dynamo + Inductor) | PyTorch graphs | Python-level tracing + Triton codegen |
| XLA | TensorFlow / JAX | Whole-program optimization, TPU support |
| TensorRT | NVIDIA inference | INT8/FP16 calibration, layer fusion, engine building |
| TVM / Apache TVM | Cross-platform | Auto-tuning, BYOC, edge targets |
| ONNX Runtime | Cross-platform inference | Execution providers (CUDA, TensorRT, OpenVINO, CoreML) |
| OpenVINO | Intel CPUs / GPUs / VPUs | INT8 quantization, model optimizer |
| Core ML | Apple Silicon | Neural Engine dispatch, ANE optimization |
| Triton (OpenAI) | GPU kernel authoring | Python -> optimized GPU kernels |
| MLIR | Compiler infrastructure | Multi-level IR for heterogeneous compilation |
| Aspect | JIT (Just-In-Time) | AOT (Ahead-Of-Time) |
|---|---|---|
| Compilation time | Runtime (first run penalty) | Build time |
| Optimization scope | Dynamic shapes, runtime info | Static shapes only |
| Deployment | Requires compiler at runtime | Self-contained binary |
| Use case | Research, dynamic models | Production, edge devices |
- Power: $$ P = \frac{\text{Energy}}{\text{Time}} $$
- Energy per operation: $$ E_{\text{op}} = \frac{\text{Energy consumed}}{\text{Number of operations}} $$
- Energy efficiency: $$ \text{FLOPs/J} = \frac{\text{FLOPs}}{\text{Energy}} $$
Examples: GFLOPS/W, Images/s/W, Tokens/s/W
$$ \text{TCO} = \text{CapEx} + \text{OpEx} \times \text{Lifetime} $$
Where:
- CapEx: Hardware, networking, facility build-out
- OpEx: Electricity, cooling, maintenance, staffing, cloud fees
Compare Perf/TCO across options.
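A sketch of such a comparison; every dollar figure below is hypothetical (the cloud hourly rate loosely echoes the 8x H100 instance pricing listed earlier), and equal delivered throughput is assumed so only cost differs:

```python
def tco(capex: float, opex_per_year: float, years: int) -> float:
    """Simple TCO model: upfront cost + operating cost over the lifetime."""
    return capex + opex_per_year * years

def perf_per_tco(throughput: float, capex: float,
                 opex_per_year: float, years: int) -> float:
    return throughput / tco(capex, opex_per_year, years)

# Hypothetical 3-year horizon: on-prem 8-GPU server vs 24/7 cloud rental at $70/hr
onprem = perf_per_tco(1.0, capex=300_000, opex_per_year=40_000, years=3)
cloud = perf_per_tco(1.0, capex=0, opex_per_year=70 * 24 * 365, years=3)
print(onprem > cloud)  # True: 24/7 utilization favors owning in this scenario
```

Real comparisons must also weigh utilization (idle owned hardware still costs money), spot discounts, and depreciation.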
- Energy = Power x Time
- Carbon intensity varies by region (50-800 g CO2/kWh).
- PUE (Power Usage Effectiveness) of datacenter: 1.1-1.8 typical.
| Tool | Platform | Purpose |
|---|---|---|
| `nvidia-smi` | NVIDIA | Real-time GPU utilization, memory, temp, power |
| `nvtop` / `gpustat` | NVIDIA | Interactive GPU monitoring |
| NVIDIA Nsight Systems | NVIDIA | System-wide timeline (CPU+GPU+network) |
| NVIDIA Nsight Compute | NVIDIA | Kernel-level GPU profiling (occupancy, roofline) |
| NVIDIA DCGM | NVIDIA | Datacenter GPU health and metrics |
| `rocm-smi` / `rocprof` | AMD | ROCm GPU profiling |
| Intel VTune | Intel | CPU + GPU profiling |
| AMD Omniperf | AMD | MI-series kernel profiling |
| Tool | Framework | Purpose |
|---|---|---|
| `torch.profiler` | PyTorch | Op-level + trace export (with TensorBoard) |
| `torch.cuda.Event` | PyTorch | Precise CUDA timing |
| `jax.profiler` | JAX | XLA HLO trace |
| TensorBoard Profiler | TF / PyTorch / JAX | Visual timeline, op stats, memory |
| Weights & Biases | Any | Experiment tracking + system metrics |
| MLflow | Any | Experiment tracking + model registry |
| DeepSpeed Flops Profiler | DeepSpeed | FLOPs counting + communication profiling |
| Tool | Purpose |
|---|---|
| `perf` | CPU performance counters, call graphs |
| `top` / `htop` / `btop` | Process-level CPU, memory overview |
| `vmstat` / `iostat` / `sar` | Memory, I/O, CPU stats |
| `mpstat` | Per-CPU utilization |
| `pidstat` | Per-process stats |
| `strace` / `ltrace` | Syscall / library call tracing |
| `bpftrace` / BCC tools | eBPF-based dynamic tracing |
| `numactl` / `lstopo` | NUMA topology awareness |
| `turbostat` | CPU frequency, C-states, power |
| `pcm` | Intel Performance Counter Monitor |
Brendan Gregg's 60-second checklist:
```shell
uptime             # load averages
dmesg | tail       # kernel errors
vmstat 1           # CPU, memory, I/O
mpstat -P ALL 1    # per-CPU balance
pidstat 1          # per-process CPU
iostat -xz 1       # disk I/O
free -m            # memory usage
sar -n DEV 1       # network I/O
sar -n TCP,ETCP 1  # TCP stats
top                # overview
```

| Tool | Purpose |
|---|---|
| TensorRT | NVIDIA GPU inference optimization (INT8, FP16 calibration, fusion) |
| ONNX Runtime | Cross-platform optimized inference |
| OpenVINO | Intel hardware inference optimization |
| Core ML Tools | Apple Silicon optimization |
| TFLite | Mobile / embedded inference |
| vLLM | High-throughput LLM serving (PagedAttention) |
| TGI (Text Generation Inference) | HuggingFace LLM serving |
| SGLang | Fast LLM serving with RadixAttention |
| llama.cpp | CPU/GPU LLM inference (GGUF quantization) |
| MLC LLM | Universal LLM deployment (phone, browser, GPU) |
| Tool | Purpose |
|---|---|
| MLPerf (Training & Inference) | Industry-standard AI benchmarks |
| `sysbench` | CPU, memory, I/O microbenchmarks |
| `fio` | Storage I/O benchmarking |
| `iperf3` | Network bandwidth testing |
| `likwid` | Hardware performance counter toolkit |
| STREAM | Memory bandwidth benchmark |
| HPL / HPL-MxP | LINPACK for HPC / mixed precision |
| LM Evaluation Harness | LLM accuracy benchmarks |
| LLMPerf (Anyscale) | LLM serving throughput & latency |
| Goal | Key Metrics & Tools |
|---|---|
| Model core performance | IPC/CPI, clock frequency, instruction count |
| Compute throughput | FLOPs/s, MACs/s, arithmetic intensity, roofline model |
| GPU utilization | SM occupancy, tensor core utilization, memory BW utilization |
| Training efficiency | MFU, samples/s, time-to-accuracy, scaling efficiency |
| Inference latency | P50/P95/P99 latency, TTFT, TPOT, batching strategy |
| LLM serving | Tokens/s, KV cache memory, prefill vs decode bottleneck |
| Parallelism & scaling | Amdahl/Gustafson, strong/weak scaling, communication overlap |
| Memory bottlenecks | Cache hit ratio, bandwidth utilization, data loading starvation |
| Quantization | INT8/INT4 speedup, accuracy retention, compression ratio |
| Cloud cost efficiency | Perf/$, spot strategies, auto-scaling, cold start latency |
| Edge deployment | TOPS/W, real-time feasibility, on-device optimization stack |
| System behavior | Little's Law (concurrency), tail latency, utilization |
| Energy & sustainability | FLOPs/J, Perf/W, TCO, CO2e footprint |
- MIT 6.172 - Performance Engineering of Software Systems (YouTube)
- Stanford CS149 - Parallel Computing
- CMU 15-418/618 - Parallel Computer Architecture and Programming
- Clock rate - Wikipedia
- Speedup - Wikipedia
- Transistor count - Wikipedia
- Multiply-accumulate operation (MAC) - Wikipedia
- Floating point operations per second - Wikipedia
- Processor (computing) - Wikipedia
- Computer performance - Wikipedia
- Computer performance by orders of magnitude - Wikipedia
- Hardware acceleration - Wikipedia
- NVIDIA H100 Datasheet
- NVIDIA CUDA C++ Programming Guide
- NVIDIA Nsight Systems Documentation
- NVIDIA Nsight Compute Documentation
- Google TPU Documentation
- MLPerf - ML Commons
- Efficient Processing of Deep Neural Networks (Sze et al.) - Book
- FlashAttention - Dao et al.
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- Megatron-LM - NVIDIA
- DeepSpeed - Microsoft
- ZeRO: Memory Optimizations for Training Billion Parameter Models (Rajbhandari et al.)
- A Survey of Quantization Methods for Efficient Neural Network Inference (Gholami et al.)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- AWQ: Activation-aware Weight Quantization
- Profiling (computer programming) - Wikipedia
- Benchmark (computing) - Wikipedia
- Software performance testing - Wikipedia
- Analysis of algorithms - Wikipedia
- Best, worst and average case - Wikipedia
- Algorithmic efficiency - Wikipedia
- Performance per watt - Wikipedia
- Green computing - Wikipedia
- Environmental impact of artificial intelligence - Wikipedia
- Program optimization - Wikipedia
- Optimizing compiler - Wikipedia
- torch.compile Documentation
- XLA - Accelerated Linear Algebra
- Apache TVM
- Triton Language (OpenAI)
- MLIR - Multi-Level IR Compiler Framework
- Linux Performance Analysis in 60,000 Milliseconds - Brendan Gregg @ Netflix
- Linux Systems Performance - Brendan Gregg
- Top 10 ways to monitor Linux in the console - Jeff Geerling
- AI Performance Engineering (2025-2026 Edition): Latency, Throughput, Cost Optimization & Real-World Benchmarking - Robi Kumar Tomar
- Efficiently Scaling Transformer Inference - Pope et al. (Google)
- LLM Inference Performance Engineering: Best Practices - NVIDIA Technical Blog
- High Performance Computing - cs-books collection
- Computer Architecture: A Quantitative Approach - Hennessy & Patterson
- Programming Massively Parallel Processors - Kirk & Hwu
"It's hardware that makes a machine fast. It's software that makes a fast machine slow." - Craig Bruce