Closed

28 commits
All commits authored by anthony-maio.

- `63a3024` Record: SOTA recipe (PR #162, 1.1483 bpb) + TTT LoRA eval (Mar 21, 2026)
- `decccb9` Match FarnsworthEngine: 11L + full-weight SGD TTT + tuned hyperparams (Mar 21, 2026)
- `b363dda` Reduce warmdown from 3000 to 1500 steps (Mar 21, 2026)
- `f3ec371` Fix TTT eval: use args.eval_stride instead of undefined local variable (Mar 21, 2026)
- `29ce894` Record: 9L MLP3x full SOTA stack, val_bpb=1.1401 (Mar 21, 2026)
- `5fba7a8` Integrate fused ReLU² MLP Triton kernel (1.26x eval speedup) (Mar 21, 2026)
- `2aa218d` Add FlashAttention 3 support + fused ReLU² MLP kernel (Mar 21, 2026)
- `b2f81b0` Add mixed int5/int6 quantization for 11L under 16MB (Mar 22, 2026)
- `c262d1f` Int5 for ALL large weights (not just MLP) to fit 11L under 16MB (Mar 22, 2026)
- `34d0a92` Mixed int4/int5: int4 for MLP, int5 for attention to fit 11L (Mar 22, 2026)
- `6596aed` 10L + int5 all weights: sweet spot for artifact size (Mar 22, 2026)
- `333843d` Fix: default to int6 quant (QUANT_BITS=6) and 9 layers (Mar 22, 2026)
- `8b26d2a` Warmdown-as-compression: WARMDOWN_ITERS=20000 (Mar 22, 2026)
- `ea25505` Revert warmdown to 3000 (20000 breaks SWA averaging) (Mar 22, 2026)
- `9d0e9ce` Add XSA (Exclusive Self Attention) on last 4 layers (Mar 22, 2026)
- `9cd4f9e` Switch to int5 quant for 11L under 16MB, QAT reduces int5 penalty (Mar 22, 2026)
- `6a8a656` Update: 11L next-gen stack, val_bpb=1.1460, artifact 15.79MB VALID (Mar 22, 2026)
- `6102464` Update: val_bpb=1.1399, 15.79MB valid, 11L next-gen stack on fast pod (Mar 22, 2026)
- `e0cdc67` Fix Copilot review issues: README, submission.json schema, log strings (Mar 22, 2026)
- `4359d78` Integrate autograd Triton kernels for training speedup (Mar 22, 2026)
- `f788912` Disable both custom kernels: NaN in training - debugging (Mar 22, 2026)
- `fad7dfa` Fix 2 critical kernel bugs causing NaN: (Mar 22, 2026)
- `1e7839d` Disable custom training kernels: torch.compile is faster (Mar 22, 2026)
- `70fa63f` Add train log (seed=1337, val_bpb=1.1435, 8xH100 SXM) (Mar 22, 2026)
- `c0b1fb9` Packed int6 serialization: 25% smaller artifacts, enables int6 for 11L (Mar 22, 2026)
- `c9b6583` Revert QUANT_BITS default to 5 (int6 artifacts don't fit under 16MB) (Mar 22, 2026)
- `9624193` Remove broken TTT code from PR #376 to pass review (Mar 23, 2026)
- `10556ae` Record: N-gram Backoff + VRL + LeakyReLU² — val_bpb 0.9642 (3-seed) (Mar 26, 2026)
204 changes: 204 additions & 0 deletions .agent/skills/runpodctl/SKILL.md
@@ -0,0 +1,204 @@
---
name: runpodctl
description: Runpod CLI to manage your GPU workloads.
allowed-tools: Bash(runpodctl:*)
compatibility: Linux, macOS
metadata:
author: runpod
version: "2.1"
license: Apache-2.0
---

# Runpodctl

Manage GPU pods, serverless endpoints, templates, volumes, and models.

> **Spelling:** "Runpod" (capital R). Command is `runpodctl` (lowercase).

## Install

```bash
# Any platform (official installer)
curl -sSL https://cli.runpod.net | bash

# macOS (Homebrew)
brew install runpod/runpodctl/runpodctl

# macOS (manual — universal binary)
mkdir -p ~/.local/bin && curl -sL https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-darwin-all.tar.gz | tar xz -C ~/.local/bin

# Linux
mkdir -p ~/.local/bin && curl -sL https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-linux-amd64.tar.gz | tar xz -C ~/.local/bin

# Windows (PowerShell)
Invoke-WebRequest -Uri https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-windows-amd64.zip -OutFile runpodctl.zip; Expand-Archive runpodctl.zip -DestinationPath $env:LOCALAPPDATA\runpodctl; [Environment]::SetEnvironmentVariable('Path', $env:Path + ";$env:LOCALAPPDATA\runpodctl", 'User')
```

Ensure `~/.local/bin` is on your `PATH` (add `export PATH="$HOME/.local/bin:$PATH"` to `~/.bashrc` or `~/.zshrc`).

## Quick start

```bash
runpodctl doctor # First time setup (API key + SSH)
runpodctl gpu list # See available GPUs
runpodctl template search pytorch # Find a template
runpodctl pod create --template-id runpod-torch-v21 --gpu-id "NVIDIA RTX 4090" # Create from template
runpodctl pod list # List your pods
```

API key: https://runpod.io/console/user/settings

## Commands

### Pods

```bash
runpodctl pod list # List running pods (default, like docker ps)
runpodctl pod list --all # List all pods including exited
runpodctl pod list --status exited # Filter by status (RUNNING, EXITED, etc.)
runpodctl pod list --since 24h # Pods created within last 24 hours
runpodctl pod list --created-after 2025-01-15 # Pods created after date
runpodctl pod get <pod-id> # Get pod details (includes SSH info)
runpodctl pod create --template-id runpod-torch-v21 --gpu-id "NVIDIA RTX 4090" # Create from template
runpodctl pod create --image "runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04" --gpu-id "NVIDIA RTX 4090" # Create with image
runpodctl pod create --compute-type cpu --image ubuntu:22.04 # Create CPU pod
runpodctl pod start <pod-id> # Start stopped pod
runpodctl pod stop <pod-id> # Stop running pod
runpodctl pod restart <pod-id> # Restart pod
runpodctl pod reset <pod-id> # Reset pod
runpodctl pod update <pod-id> --name "new" # Update pod
runpodctl pod delete <pod-id> # Delete pod (aliases: rm, remove)
```

**List flags:** `--all` / `-a`, `--status`, `--since`, `--created-after`, `--name`, `--compute-type`
**Get flags:** `--include-machine`, `--include-network-volume`

**Create flags:** `--template-id` (required if no `--image`), `--image` (required if no `--template-id`), `--name`, `--gpu-id`, `--gpu-count`, `--compute-type`, `--ssh` (default true), `--container-disk-in-gb`, `--volume-in-gb`, `--volume-mount-path`, `--ports`, `--env`, `--cloud-type`, `--data-center-ids`, `--global-networking`, `--public-ip`
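
Putting several of these flags together in one call (a hypothetical invocation; the pod name, disk sizes, ports, and env values are illustrative, while the image and GPU names come from the examples above):

```shell
runpodctl pod create \
  --name "train-run" \
  --image "runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04" \
  --gpu-id "NVIDIA RTX 4090" \
  --gpu-count 2 \
  --container-disk-in-gb 50 \
  --volume-in-gb 100 \
  --volume-mount-path /workspace \
  --ports "8888/http,22/tcp" \
  --env "HF_HOME=/workspace/hf"
```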

### Serverless (alias: sls)

```bash
runpodctl serverless list # List all endpoints
runpodctl serverless get <endpoint-id> # Get endpoint details
runpodctl serverless create --name "x" --template-id "tpl_abc" # Create endpoint
runpodctl serverless update <endpoint-id> --workers-max 5 # Update endpoint
runpodctl serverless delete <endpoint-id> # Delete endpoint
```

**List flags:** `--include-template`, `--include-workers`
**Update flags:** `--name`, `--workers-min`, `--workers-max`, `--idle-timeout`, `--scaler-type` (QUEUE_DELAY or REQUEST_COUNT), `--scaler-value`
**Create flags:** `--name`, `--template-id`, `--gpu-id`, `--gpu-count`, `--compute-type`, `--workers-min`, `--workers-max`, `--data-center-ids`

### Templates (alias: tpl)

```bash
runpodctl template list # Official + community (first 10)
runpodctl template list --type official # All official templates
runpodctl template list --type community # Community templates (first 10)
runpodctl template list --type user # Your own templates
runpodctl template list --all # Everything including user
runpodctl template list --limit 50 # Show 50 templates
runpodctl template search pytorch # Search for "pytorch" templates
runpodctl template search comfyui --limit 5 # Search, limit to 5 results
runpodctl template search vllm --type official # Search only official
runpodctl template get <template-id> # Get template details (includes README, env, ports)
runpodctl template create --name "x" --image "img" # Create template
runpodctl template create --name "x" --image "img" --serverless # Create serverless template
runpodctl template update <template-id> --name "new" # Update template
runpodctl template delete <template-id> # Delete template
```

**List flags:** `--type` (official, community, user), `--limit`, `--offset`, `--all`
**Create flags:** `--name`, `--image`, `--container-disk-in-gb`, `--volume-in-gb`, `--volume-mount-path`, `--ports`, `--env`, `--docker-start-cmd`, `--docker-entrypoint`, `--serverless`, `--readme`

### Network Volumes (alias: nv)

```bash
runpodctl network-volume list # List all volumes
runpodctl network-volume get <volume-id> # Get volume details
runpodctl network-volume create --name "x" --size 100 --data-center-id "US-GA-1" # Create volume
runpodctl network-volume update <volume-id> --name "new" # Update volume
runpodctl network-volume delete <volume-id> # Delete volume
```

**Create flags:** `--name`, `--size`, `--data-center-id`

### Models

```bash
runpodctl model list # List your models
runpodctl model list --all # List all models
runpodctl model list --name "llama" # Filter by name
runpodctl model list --provider "meta" # Filter by provider
runpodctl model add --name "my-model" --model-path ./model # Add model
runpodctl model remove --name "my-model" # Remove model
```

### Registry (alias: reg)

```bash
runpodctl registry list # List registry auths
runpodctl registry get <registry-id> # Get registry auth
runpodctl registry create --name "x" --username "u" --password "p" # Create registry auth
runpodctl registry delete <registry-id> # Delete registry auth
```

### Info

```bash
runpodctl user # Account info and balance (alias: me)
runpodctl gpu list # List available GPUs
runpodctl gpu list --include-unavailable # Include unavailable GPUs
runpodctl datacenter list # List datacenters (alias: dc)
runpodctl billing pods # Pod billing history
runpodctl billing serverless # Serverless billing history
runpodctl billing network-volume # Volume billing history
```

### SSH

```bash
runpodctl ssh info <pod-id> # Get SSH info (command + key, does not connect)
runpodctl ssh list-keys # List SSH keys
runpodctl ssh add-key # Add SSH key
```

**Agent note:** `ssh info` returns connection details, not an interactive session. If interactive SSH is not available, execute commands remotely via `ssh user@host "command"`.

### File Transfer

```bash
runpodctl send <path> # Send files (outputs code)
runpodctl receive <code> # Receive files using code
```

### Utilities

```bash
runpodctl doctor # Diagnose and fix CLI issues
runpodctl update # Update CLI
runpodctl version # Show version
runpodctl completion bash >> ~/.bashrc # Install bash completion
runpodctl completion zsh >> ~/.zshrc # Install zsh completion
```

## URLs

### Pod URLs

Access exposed ports on your pod:

```
https://<pod-id>-<port>.proxy.runpod.net
```

Example: `https://abc123xyz-8888.proxy.runpod.net`

### Serverless URLs

```
https://api.runpod.ai/v2/<endpoint-id>/run # Async request
https://api.runpod.ai/v2/<endpoint-id>/runsync # Sync request
https://api.runpod.ai/v2/<endpoint-id>/health # Health check
https://api.runpod.ai/v2/<endpoint-id>/status/<job-id> # Job status
```
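
A sketch of calling the sync endpoint with `curl`, assuming Runpod's documented Bearer-token auth and `input`-wrapped JSON body (the endpoint id, `$RUNPOD_API_KEY`, and the payload are placeholders):

```shell
curl -s -X POST "https://api.runpod.ai/v2/<endpoint-id>/runsync" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "hello"}}'
```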
82 changes: 82 additions & 0 deletions .agent/skills/triton-kernels/SKILL.md
@@ -0,0 +1,82 @@
---
name: triton-kernels
description: Write optimized Triton GPU kernels for deep learning operations. Covers the full spectrum from basic vector ops to Flash Attention, persistent matmul, fused normalization, quantized GEMM, and memory-efficient patterns.
---

# Writing Optimized Triton GPU Kernels

> **Targets:** Triton >= 2.1, any GPU with `tl.dot` support (SM70+/CDNA2+)

## Core Patterns (always apply)

**Kernel structure:** Use `@triton.jit` decorator. Get block ID with `tl.program_id(axis)`. Compute element offsets with `tl.arange(0, BLOCK_SIZE)`. Build `mask = offsets < n_elements` for all loads/stores.

**Block sizes:** Use powers of two. `tl.arange(0, BLOCK_SIZE)` requires a power-of-two range; other block dimensions may tolerate non-powers of two but usually run slower. Declare block sizes as `tl.constexpr` parameters, and use `@triton.autotune` to sweep `BLOCK_SIZE_M/N/K` configs per hardware.

**Memory hierarchy:** Keep intermediates in SRAM via block-level reductions (`tl.sum`, `tl.max`) before writing to global memory. Fuse multiple pointwise ops into one kernel to avoid DRAM round-trips.
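
A back-of-envelope model of why fusion pays off (a sketch; the `dram_bytes` helper and its one-read-plus-one-write-per-pass assumption are illustrative, not profiler output):

```python
def dram_bytes(n_elems, n_pointwise_ops, dtype_bytes=2, fused=False):
    """Estimate DRAM traffic for a chain of pointwise ops over one tensor."""
    # Each pass through DRAM reads the full tensor once and writes it once.
    per_pass = 2 * n_elems * dtype_bytes
    # Unfused: every op round-trips to DRAM. Fused: one read in, one write out.
    return per_pass if fused else n_pointwise_ops * per_pass

# Three chained FP16 pointwise ops over 1M elements:
# unfused moves 12 MB, fused moves 4 MB, i.e. 3x less DRAM traffic.
```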

**Matmul:** Use `tl.dot(a, b)` for tensor core operations. Always accumulate in `tl.float32` when inputs are FP16. For L2 cache locality, use grouped tile ordering via `group_id = pid // GROUP_SIZE`.
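
The grouped tile ordering can be sketched in plain Python (a hypothetical `grouped_pid` helper mirroring the index math in the official matmul tutorial; inside a kernel the same arithmetic runs on `tl.program_id(0)`):

```python
def grouped_pid(pid, num_pid_m, num_pid_n, GROUP_SIZE_M):
    """Map a linear program id to (pid_m, pid_n) in grouped launch order.

    Consecutive pids stay within a group of GROUP_SIZE_M output-tile rows,
    so the tiles they touch share B-columns and hit in L2.
    """
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    # Last group may be shorter when num_pid_m is not a multiple of GROUP_SIZE_M.
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    local = pid % num_pid_in_group
    pid_m = first_pid_m + (local % group_size_m)
    pid_n = local // group_size_m
    return pid_m, pid_n
```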

**Grid launching:** Size grid dynamically: `grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)`.
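
What `triton.cdiv` computes is plain ceiling division; a minimal sketch:

```python
def cdiv(n, block):
    # Ceiling division without floats: number of BLOCK-sized programs covering n.
    return (n + block - 1) // block

# A 1-D launch over 10_000 elements with BLOCK_SIZE=1024 needs 10 programs.
```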

**Masking:** ALWAYS mask boundary loads/stores: `tl.load(ptr + offs, mask=offs < dim, other=0.0)`. Missing masks corrupt memory silently.

**Benchmarking:** Use `triton.testing.Benchmark` with `x_names`, `x_vals`, `line_arg`, `line_vals` to compare against PyTorch baselines.

## Quick Reference Examples

Fused row-wise softmax — verified, based on official Triton tutorial:
```python
@triton.jit
def fused_softmax(x_ptr, out_ptr, cols, BLOCK: tl.constexpr):
    # One program per row; assumes row-major contiguous input (row stride == cols).
    row = tl.program_id(0)
    offs = tl.arange(0, BLOCK)
    mask = offs < cols
    x = tl.load(x_ptr + row * cols + offs, mask=mask, other=-1e9)
    x_max = tl.max(x, axis=0)
    ex = tl.exp(x - x_max)
    out = ex / tl.sum(ex, axis=0)
    tl.store(out_ptr + row * cols + offs, out, mask=mask)
```

Seed-based dropout — verified, based on official Triton tutorial:
```python
@triton.jit
def dropout(x_ptr, out_ptr, seed, p, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    r = tl.rand(seed, offs)  # Philox PRNG, deterministic for a given (seed, offset)
    keep = r > p
    # Scale survivors by 1/(1-p), zero the rest; tl.where avoids bool arithmetic.
    out = tl.where(keep, x / (1.0 - p), 0.0)
    tl.store(out_ptr + offs, out, mask=mask)
```

## Performance Bottleneck Quick-Reference

When optimizing an existing kernel, classify the bottleneck first (profile with `ncu`):

| Bottleneck | Diagnosis | Fix |
|------------|-----------|-----|
| **Memory-bound** | DRAM throughput > 60% of peak, compute < 30% | PID swizzle, TMA, fuse ops to reduce loads |
| **Compute-bound** | Tensor core utilization > 60%, DRAM < 40% | Persistent kernels, increase `num_stages`, warp specialization |
| **Underutilized** | Both < 60%, high stall metrics | Reduce register pressure, increase `num_warps`, autotune |

See `triton-gpu-kernel-optimization.md` for specific NCU metric names and detailed strategies.

## Specialized Topics

Read these files for detailed guidance when the task involves these areas:

| Task | File to read |
|------|-------------|
| Flash Attention / fused self-attention | `triton-flash-attention-v2.md` |
| Persistent kernels, warp specialization, TMA | `triton-persistent-warp-matmul.md` |
| LayerNorm, RMSNorm, GroupNorm (fwd + bwd) | `triton-fused-normalizations.md` |
| FP4/FP8 quantized matmul, block scaling | `triton-quantized-block-scaled-gemm.md` |
| Kernel fusion, Philox dropout, recomputation | `triton-memory-efficient-patterns.md` |
| General tiled GEMM, autotune, benchmarking | `triton-gpu-kernel-optimization.md` |
| Fusing normalization/gating/residual into attention or matmul epilogue | `triton-fused-epilogue-kernels.md` |
| Sequential stateful processing (LRU routing, mutable register state) | `triton-sequential-stateful-blocks.md` |
| Launcher tile selection, num_stages/num_warps heuristics | `triton-dynamic-launcher-tiling.md` |

**When to read specialized files:** Only read the relevant file when the user's task specifically involves that topic. The core patterns above are sufficient for basic kernels (vector ops, elementwise fusion, simple reductions).