Closed

28 commits
All commits authored by anthony-maio.

- `63a3024` Record: SOTA recipe (PR #162, 1.1483 bpb) + TTT LoRA eval (Mar 21, 2026)
- `decccb9` Match FarnsworthEngine: 11L + full-weight SGD TTT + tuned hyperparams (Mar 21, 2026)
- `b363dda` Reduce warmdown from 3000 to 1500 steps (Mar 21, 2026)
- `f3ec371` Fix TTT eval: use args.eval_stride instead of undefined local variable (Mar 21, 2026)
- `29ce894` Record: 9L MLP3x full SOTA stack, val_bpb=1.1401 (Mar 21, 2026)
- `5fba7a8` Integrate fused ReLU² MLP Triton kernel (1.26x eval speedup) (Mar 21, 2026)
- `2aa218d` Add FlashAttention 3 support + fused ReLU² MLP kernel (Mar 21, 2026)
- `b2f81b0` Add mixed int5/int6 quantization for 11L under 16MB (Mar 22, 2026)
- `c262d1f` Int5 for ALL large weights (not just MLP) to fit 11L under 16MB (Mar 22, 2026)
- `34d0a92` Mixed int4/int5: int4 for MLP, int5 for attention to fit 11L (Mar 22, 2026)
- `6596aed` 10L + int5 all weights: sweet spot for artifact size (Mar 22, 2026)
- `333843d` Fix: default to int6 quant (QUANT_BITS=6) and 9 layers (Mar 22, 2026)
- `8b26d2a` Warmdown-as-compression: WARMDOWN_ITERS=20000 (Mar 22, 2026)
- `ea25505` Revert warmdown to 3000 (20000 breaks SWA averaging) (Mar 22, 2026)
- `9d0e9ce` Add XSA (Exclusive Self Attention) on last 4 layers (Mar 22, 2026)
- `9cd4f9e` Switch to int5 quant for 11L under 16MB, QAT reduces int5 penalty (Mar 22, 2026)
- `6a8a656` Update: 11L next-gen stack, val_bpb=1.1460, artifact 15.79MB VALID (Mar 22, 2026)
- `6102464` Update: val_bpb=1.1399, 15.79MB valid, 11L next-gen stack on fast pod (Mar 22, 2026)
- `e0cdc67` Fix Copilot review issues: README, submission.json schema, log strings (Mar 22, 2026)
- `4359d78` Integrate autograd Triton kernels for training speedup (Mar 22, 2026)
- `f788912` Disable both custom kernels: NaN in training - debugging (Mar 22, 2026)
- `fad7dfa` Fix 2 critical kernel bugs causing NaN: (Mar 22, 2026)
- `1e7839d` Disable custom training kernels: torch.compile is faster (Mar 22, 2026)
- `70fa63f` Add train log (seed=1337, val_bpb=1.1435, 8xH100 SXM) (Mar 22, 2026)
- `c0b1fb9` Packed int6 serialization: 25% smaller artifacts, enables int6 for 11L (Mar 22, 2026)
- `c9b6583` Revert QUANT_BITS default to 5 (int6 artifacts don't fit under 16MB) (Mar 22, 2026)
- `9624193` Remove broken TTT code from PR #376 to pass review (Mar 23, 2026)
- `10556ae` Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed) (Mar 26, 2026)
204 changes: 204 additions & 0 deletions .agent/skills/runpodctl/SKILL.md
@@ -0,0 +1,204 @@
---
name: runpodctl
description: Runpod CLI to manage your GPU workloads.
allowed-tools: Bash(runpodctl:*)
compatibility: Linux, macOS
metadata:
author: runpod
version: "2.1"
license: Apache-2.0
---

# Runpodctl

Manage GPU pods, serverless endpoints, templates, volumes, and models.

> **Spelling:** "Runpod" (capital R). Command is `runpodctl` (lowercase).

## Install

```bash
# Any platform (official installer)
curl -sSL https://cli.runpod.net | bash

# macOS (Homebrew)
brew install runpod/runpodctl/runpodctl

# macOS (manual — universal binary)
mkdir -p ~/.local/bin && curl -sL https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-darwin-all.tar.gz | tar xz -C ~/.local/bin

# Linux
mkdir -p ~/.local/bin && curl -sL https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-linux-amd64.tar.gz | tar xz -C ~/.local/bin

# Windows (PowerShell)
Invoke-WebRequest -Uri https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-windows-amd64.zip -OutFile runpodctl.zip; Expand-Archive runpodctl.zip -DestinationPath $env:LOCALAPPDATA\runpodctl; [Environment]::SetEnvironmentVariable('Path', $env:Path + ";$env:LOCALAPPDATA\runpodctl", 'User')
```

Ensure `~/.local/bin` is on your `PATH` (add `export PATH="$HOME/.local/bin:$PATH"` to `~/.bashrc` or `~/.zshrc`).

## Quick start

```bash
runpodctl doctor # First time setup (API key + SSH)
runpodctl gpu list # See available GPUs
runpodctl template search pytorch # Find a template
runpodctl pod create --template-id runpod-torch-v21 --gpu-id "NVIDIA RTX 4090" # Create from template
runpodctl pod list # List your pods
```

API key: https://runpod.io/console/user/settings

## Commands

### Pods

```bash
runpodctl pod list # List running pods (default, like docker ps)
runpodctl pod list --all # List all pods including exited
runpodctl pod list --status exited # Filter by status (RUNNING, EXITED, etc.)
runpodctl pod list --since 24h # Pods created within last 24 hours
runpodctl pod list --created-after 2025-01-15 # Pods created after date
runpodctl pod get <pod-id> # Get pod details (includes SSH info)
runpodctl pod create --template-id runpod-torch-v21 --gpu-id "NVIDIA RTX 4090" # Create from template
runpodctl pod create --image "runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04" --gpu-id "NVIDIA RTX 4090" # Create with image
runpodctl pod create --compute-type cpu --image ubuntu:22.04 # Create CPU pod
runpodctl pod start <pod-id> # Start stopped pod
runpodctl pod stop <pod-id> # Stop running pod
runpodctl pod restart <pod-id> # Restart pod
runpodctl pod reset <pod-id> # Reset pod
runpodctl pod update <pod-id> --name "new" # Update pod
runpodctl pod delete <pod-id> # Delete pod (aliases: rm, remove)
```

**List flags:** `--all` / `-a`, `--status`, `--since`, `--created-after`, `--name`, `--compute-type`
**Get flags:** `--include-machine`, `--include-network-volume`

**Create flags:** `--template-id` (required if no `--image`), `--image` (required if no `--template-id`), `--name`, `--gpu-id`, `--gpu-count`, `--compute-type`, `--ssh` (default true), `--container-disk-in-gb`, `--volume-in-gb`, `--volume-mount-path`, `--ports`, `--env`, `--cloud-type`, `--data-center-ids`, `--global-networking`, `--public-ip`
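
Putting several of these flags together in one call (a hypothetical invocation; the pod name, disk sizes, ports, and env values are illustrative, while the image and GPU names come from the examples above):

```shell
runpodctl pod create \
  --name "train-run" \
  --image "runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04" \
  --gpu-id "NVIDIA RTX 4090" \
  --gpu-count 2 \
  --container-disk-in-gb 50 \
  --volume-in-gb 100 \
  --volume-mount-path /workspace \
  --ports "8888/http,22/tcp" \
  --env "HF_HOME=/workspace/hf"
```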

### Serverless (alias: sls)

```bash
runpodctl serverless list # List all endpoints
runpodctl serverless get <endpoint-id> # Get endpoint details
runpodctl serverless create --name "x" --template-id "tpl_abc" # Create endpoint
runpodctl serverless update <endpoint-id> --workers-max 5 # Update endpoint
runpodctl serverless delete <endpoint-id> # Delete endpoint
```

**List flags:** `--include-template`, `--include-workers`
**Update flags:** `--name`, `--workers-min`, `--workers-max`, `--idle-timeout`, `--scaler-type` (QUEUE_DELAY or REQUEST_COUNT), `--scaler-value`
**Create flags:** `--name`, `--template-id`, `--gpu-id`, `--gpu-count`, `--compute-type`, `--workers-min`, `--workers-max`, `--data-center-ids`

### Templates (alias: tpl)

```bash
runpodctl template list # Official + community (first 10)
runpodctl template list --type official # All official templates
runpodctl template list --type community # Community templates (first 10)
runpodctl template list --type user # Your own templates
runpodctl template list --all # Everything including user
runpodctl template list --limit 50 # Show 50 templates
runpodctl template search pytorch # Search for "pytorch" templates
runpodctl template search comfyui --limit 5 # Search, limit to 5 results
runpodctl template search vllm --type official # Search only official
runpodctl template get <template-id> # Get template details (includes README, env, ports)
runpodctl template create --name "x" --image "img" # Create template
runpodctl template create --name "x" --image "img" --serverless # Create serverless template
runpodctl template update <template-id> --name "new" # Update template
runpodctl template delete <template-id> # Delete template
```

**List flags:** `--type` (official, community, user), `--limit`, `--offset`, `--all`
**Create flags:** `--name`, `--image`, `--container-disk-in-gb`, `--volume-in-gb`, `--volume-mount-path`, `--ports`, `--env`, `--docker-start-cmd`, `--docker-entrypoint`, `--serverless`, `--readme`

### Network Volumes (alias: nv)

```bash
runpodctl network-volume list # List all volumes
runpodctl network-volume get <volume-id> # Get volume details
runpodctl network-volume create --name "x" --size 100 --data-center-id "US-GA-1" # Create volume
runpodctl network-volume update <volume-id> --name "new" # Update volume
runpodctl network-volume delete <volume-id> # Delete volume
```

**Create flags:** `--name`, `--size`, `--data-center-id`

### Models

```bash
runpodctl model list # List your models
runpodctl model list --all # List all models
runpodctl model list --name "llama" # Filter by name
runpodctl model list --provider "meta" # Filter by provider
runpodctl model add --name "my-model" --model-path ./model # Add model
runpodctl model remove --name "my-model" # Remove model
```

### Registry (alias: reg)

```bash
runpodctl registry list # List registry auths
runpodctl registry get <registry-id> # Get registry auth
runpodctl registry create --name "x" --username "u" --password "p" # Create registry auth
runpodctl registry delete <registry-id> # Delete registry auth
```

### Info

```bash
runpodctl user # Account info and balance (alias: me)
runpodctl gpu list # List available GPUs
runpodctl gpu list --include-unavailable # Include unavailable GPUs
runpodctl datacenter list # List datacenters (alias: dc)
runpodctl billing pods # Pod billing history
runpodctl billing serverless # Serverless billing history
runpodctl billing network-volume # Volume billing history
```

### SSH

```bash
runpodctl ssh info <pod-id> # Get SSH info (command + key, does not connect)
runpodctl ssh list-keys # List SSH keys
runpodctl ssh add-key # Add SSH key
```

**Agent note:** `ssh info` returns connection details, not an interactive session. If interactive SSH is not available, execute commands remotely via `ssh user@host "command"`.

### File Transfer

```bash
runpodctl send <path> # Send files (outputs code)
runpodctl receive <code> # Receive files using code
```

### Utilities

```bash
runpodctl doctor # Diagnose and fix CLI issues
runpodctl update # Update CLI
runpodctl version # Show version
runpodctl completion bash >> ~/.bashrc # Install bash completion
runpodctl completion zsh >> ~/.zshrc # Install zsh completion
```

## URLs

### Pod URLs

Access exposed ports on your pod:

```
https://<pod-id>-<port>.proxy.runpod.net
```

Example: `https://abc123xyz-8888.proxy.runpod.net`

### Serverless URLs

```
https://api.runpod.ai/v2/<endpoint-id>/run # Async request
https://api.runpod.ai/v2/<endpoint-id>/runsync # Sync request
https://api.runpod.ai/v2/<endpoint-id>/health # Health check
https://api.runpod.ai/v2/<endpoint-id>/status/<job-id> # Job status
```
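
A sketch of calling the sync endpoint with `curl`, assuming Runpod's documented Bearer-token auth and `input`-wrapped JSON body (the endpoint id, `$RUNPOD_API_KEY`, and the payload are placeholders):

```shell
curl -s -X POST "https://api.runpod.ai/v2/<endpoint-id>/runsync" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "hello"}}'
```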
82 changes: 82 additions & 0 deletions .agent/skills/triton-kernels/SKILL.md
@@ -0,0 +1,82 @@
---
name: triton-kernels
description: Write optimized Triton GPU kernels for deep learning operations. Covers the full spectrum from basic vector ops to Flash Attention, persistent matmul, fused normalization, quantized GEMM, and memory-efficient patterns.
---

# Writing Optimized Triton GPU Kernels

> **Targets:** Triton >= 2.1, any GPU with `tl.dot` support (SM70+/CDNA2+)

## Core Patterns (always apply)

**Kernel structure:** Use `@triton.jit` decorator. Get block ID with `tl.program_id(axis)`. Compute element offsets with `tl.arange(0, BLOCK_SIZE)`. Build `mask = offsets < n_elements` for all loads/stores.

**Block sizes:** Use powers of two. `tl.arange(0, BLOCK_SIZE)` requires a power-of-two range; other block dimensions may tolerate non-powers of two but usually run slower. Declare block sizes as `tl.constexpr` parameters, and use `@triton.autotune` to sweep `BLOCK_SIZE_M/N/K` configs per hardware.

**Memory hierarchy:** Keep intermediates in SRAM via block-level reductions (`tl.sum`, `tl.max`) before writing to global memory. Fuse multiple pointwise ops into one kernel to avoid DRAM round-trips.
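
A back-of-envelope model of why fusion pays off (a sketch; the `dram_bytes` helper and its one-read-plus-one-write-per-pass assumption are illustrative, not profiler output):

```python
def dram_bytes(n_elems, n_pointwise_ops, dtype_bytes=2, fused=False):
    """Estimate DRAM traffic for a chain of pointwise ops over one tensor."""
    # Each pass through DRAM reads the full tensor once and writes it once.
    per_pass = 2 * n_elems * dtype_bytes
    # Unfused: every op round-trips to DRAM. Fused: one read in, one write out.
    return per_pass if fused else n_pointwise_ops * per_pass

# Three chained FP16 pointwise ops over 1M elements:
# unfused moves 12 MB, fused moves 4 MB, i.e. 3x less DRAM traffic.
```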

**Matmul:** Use `tl.dot(a, b)` for tensor core operations. Always accumulate in `tl.float32` when inputs are FP16. For L2 cache locality, use grouped tile ordering via `group_id = pid // GROUP_SIZE`.
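
The grouped tile ordering can be sketched in plain Python (a hypothetical `grouped_pid` helper mirroring the index math in the official matmul tutorial; inside a kernel the same arithmetic runs on `tl.program_id(0)`):

```python
def grouped_pid(pid, num_pid_m, num_pid_n, GROUP_SIZE_M):
    """Map a linear program id to (pid_m, pid_n) in grouped launch order.

    Consecutive pids stay within a group of GROUP_SIZE_M output-tile rows,
    so the tiles they touch share B-columns and hit in L2.
    """
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    # Last group may be shorter when num_pid_m is not a multiple of GROUP_SIZE_M.
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    local = pid % num_pid_in_group
    pid_m = first_pid_m + (local % group_size_m)
    pid_n = local // group_size_m
    return pid_m, pid_n
```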

**Grid launching:** Size grid dynamically: `grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)`.
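
What `triton.cdiv` computes is plain ceiling division; a minimal sketch:

```python
def cdiv(n, block):
    # Ceiling division without floats: number of BLOCK-sized programs covering n.
    return (n + block - 1) // block

# A 1-D launch over 10_000 elements with BLOCK_SIZE=1024 needs 10 programs.
```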

**Masking:** ALWAYS mask boundary loads/stores: `tl.load(ptr + offs, mask=offs < dim, other=0.0)`. Missing masks corrupt memory silently.

**Benchmarking:** Use `triton.testing.Benchmark` with `x_names`, `x_vals`, `line_arg`, `line_vals` to compare against PyTorch baselines.

## Quick Reference Examples

Fused row-wise softmax — verified, based on official Triton tutorial:
```python
@triton.jit
def fused_softmax(x_ptr, out_ptr, cols, BLOCK: tl.constexpr):
    # One program per row; assumes row-major contiguous input (row stride == cols).
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < cols
    x = tl.load(x_ptr + row * cols + offs, mask=mask, other=-1e9)
    x_max = tl.max(x, axis=0)
    ex = tl.exp(x - x_max)
    out = ex / tl.sum(ex, axis=0)
    tl.store(out_ptr + row * cols + offs, out, mask=mask)
```

Seed-based dropout — verified, based on official Triton tutorial:
```python
@triton.jit
def dropout(x_ptr, out_ptr, seed, p, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    r = tl.rand(seed, offs)  # Philox PRNG, deterministic for a given (seed, offset)
    keep = r > p
    # Scale survivors by 1/(1-p), zero the rest; tl.where avoids bool arithmetic.
    out = tl.where(keep, x / (1.0 - p), 0.0)
    tl.store(out_ptr + offs, out, mask=mask)
```

## Performance Bottleneck Quick-Reference

When optimizing an existing kernel, classify the bottleneck first (profile with `ncu`):

| Bottleneck | Diagnosis | Fix |
|------------|-----------|-----|
| **Memory-bound** | DRAM throughput > 60% of peak, compute < 30% | PID swizzle, TMA, fuse ops to reduce loads |
| **Compute-bound** | Tensor core utilization > 60%, DRAM < 40% | Persistent kernels, increase `num_stages`, warp specialization |
| **Underutilized** | Both < 60%, high stall metrics | Reduce register pressure, increase `num_warps`, autotune |

See `triton-gpu-kernel-optimization.md` for specific NCU metric names and detailed strategies.

## Specialized Topics

Read these files for detailed guidance when the task involves these areas:

| Task | File to read |
|------|-------------|
| Flash Attention / fused self-attention | `triton-flash-attention-v2.md` |
| Persistent kernels, warp specialization, TMA | `triton-persistent-warp-matmul.md` |
| LayerNorm, RMSNorm, GroupNorm (fwd + bwd) | `triton-fused-normalizations.md` |
| FP4/FP8 quantized matmul, block scaling | `triton-quantized-block-scaled-gemm.md` |
| Kernel fusion, Philox dropout, recomputation | `triton-memory-efficient-patterns.md` |
| General tiled GEMM, autotune, benchmarking | `triton-gpu-kernel-optimization.md` |
| Fusing normalization/gating/residual into attention or matmul epilogue | `triton-fused-epilogue-kernels.md` |
| Sequential stateful processing (LRU routing, mutable register state) | `triton-sequential-stateful-blocks.md` |
| Launcher tile selection, num_stages/num_warps heuristics | `triton-dynamic-launcher-tiling.md` |

**When to read specialized files:** Only read the relevant file when the user's task specifically involves that topic. The core patterns above are sufficient for basic kernels (vector ops, elementwise fusion, simple reductions).