29 changes: 29 additions & 0 deletions concepts/cubric_garage/HYPOTHESES.md
@@ -0,0 +1,29 @@
# Cubric Garage — Test Hypotheses

All tests use copies of the SOTA. The original is NEVER modified.

## Test A: Baseline (no cubric)
- **File:** train_gpt_baseline.py (unmodified SOTA copy)
- **Script:** run_baseline.sh
- **Hypothesis:** Establishes the control number. Should reproduce 0.9625 BPB.

## Test B: Cubric Cadence 4 (aggressive)
- **File:** train_gpt_cadence4.py (SOTA + cubric C-step)
- **Script:** run_cadence4.sh
- **Env:** CUBRIC_CADENCE=4
- **Hypothesis:** Frequent C-steps catch fast-changing n-gram patterns. Decay stale counts, boost confirmed, prune collisions, reweight orders.
- **Expected:** 0.003–0.010 BPB improvement over baseline
- **Risk:** Too aggressive, may corrupt good counts.

## Test C: Cubric Cadence 10 (balanced)
- **File:** train_gpt_cadence10.py (SOTA + cubric C-step)
- **Script:** run_cadence10.sh
- **Env:** CUBRIC_CADENCE=10
- **Hypothesis:** More data per C-step = better decisions, less disruption.
- **Expected:** 0.002–0.008 BPB improvement over baseline
- **Risk:** Slower adaptation.

## Rules
1. NEVER modify the original SOTA file
2. Each test is a separate copy with its own run script
3. One variable per test
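
The four C-step operations named in Tests B and C (decay stale counts, boost confirmed, prune collisions, reweight orders) can be sketched as follows. This is a minimal illustration under assumed semantics, not the actual implementation in the `train_gpt_cadence*.py` copies; the function name, data layout, and thresholds here are all hypothetical.

```python
def cubric_c_step(counts, confirmed, collisions, order_scores,
                  decay=0.02, boost=1.5, prune_threshold=0.5):
    """Hypothetical cubric C-step over a hashed n-gram count table.

    counts:       per-bucket counts
    confirmed:    buckets whose predictions were confirmed since the last C-step
    collisions:   buckets suspected of hash collisions
    order_scores: per-order reliability scores used to reweight backoff
    """
    new = []
    for c, hit, col in zip(counts, confirmed, collisions):
        c *= (1.0 - decay)            # 1. decay stale counts (CUBRIC_COUNT_DECAY)
        if hit:
            c *= boost                # 2. boost confirmed buckets
        if col and c < prune_threshold:
            c = 0.0                   # 3. prune suspected collisions
        new.append(c)
    total = sum(order_scores)
    weights = [s / total for s in order_scores]  # 4. reweight orders
    return new, weights
```

A cadence of 4 would apply this every 4 steps (Test B), a cadence of 10 every 10 (Test C); the trade-off the hypotheses describe is decay/boost frequency versus evidence accumulated between C-steps.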
7 changes: 7 additions & 0 deletions concepts/cubric_garage/run_baseline.sh
@@ -0,0 +1,7 @@
#!/bin/bash
set -euo pipefail
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../.." && pwd)"
cd "${REPO_ROOT}"
export PYTHONPATH="${REPO_ROOT}/flash-attention/hopper:${PYTHONPATH:-}"
env SEED="${SEED:-1337}" MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536 ROPE_DIMS=24 TTT_EVAL_ENABLED=0 COMPILE_ENABLED=1 COMPILE_FULLGRAPH=1 NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 NGRAM_EVAL_ALPHA=0.30 NGRAM_EVAL_MIN_COUNT=2 NGRAM_EVAL_BUCKETS=4194304 NGRAM_EVAL_ALPHA_MIN=0.05 NGRAM_EVAL_ALPHA_MAX=0.60 torchrun --standalone --nproc_per_node="${NPROC_PER_NODE:-8}" "${SCRIPT_DIR}/train_gpt_baseline.py"
7 changes: 7 additions & 0 deletions concepts/cubric_garage/run_cadence10.sh
@@ -0,0 +1,7 @@
#!/bin/bash
set -euo pipefail
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../.." && pwd)"
cd "${REPO_ROOT}"
export PYTHONPATH="${REPO_ROOT}/flash-attention/hopper:${PYTHONPATH:-}"
env SEED="${SEED:-1337}" MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536 ROPE_DIMS=24 TTT_EVAL_ENABLED=0 COMPILE_ENABLED=1 COMPILE_FULLGRAPH=1 NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 NGRAM_EVAL_ALPHA=0.30 NGRAM_EVAL_MIN_COUNT=2 NGRAM_EVAL_BUCKETS=4194304 NGRAM_EVAL_ALPHA_MIN=0.05 NGRAM_EVAL_ALPHA_MAX=0.60 CUBRIC_CADENCE=10 CUBRIC_COUNT_DECAY=0.02 CUBRIC_BOOST_CONFIDENT=1 CUBRIC_PRUNE_NOISY=1 CUBRIC_REWEIGHT_ORDERS=1 torchrun --standalone --nproc_per_node="${NPROC_PER_NODE:-8}" "${SCRIPT_DIR}/train_gpt_cadence10.py"
7 changes: 7 additions & 0 deletions concepts/cubric_garage/run_cadence4.sh
@@ -0,0 +1,7 @@
#!/bin/bash
set -euo pipefail
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../.." && pwd)"
cd "${REPO_ROOT}"
export PYTHONPATH="${REPO_ROOT}/flash-attention/hopper:${PYTHONPATH:-}"
env SEED="${SEED:-1337}" MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5 XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536 ROPE_DIMS=24 TTT_EVAL_ENABLED=0 COMPILE_ENABLED=1 COMPILE_FULLGRAPH=1 NGRAM_EVAL_ORDER=7 NGRAM_EVAL_ADAPTIVE=1 NGRAM_EVAL_ALPHA=0.30 NGRAM_EVAL_MIN_COUNT=2 NGRAM_EVAL_BUCKETS=4194304 NGRAM_EVAL_ALPHA_MIN=0.05 NGRAM_EVAL_ALPHA_MAX=0.60 CUBRIC_CADENCE=4 CUBRIC_COUNT_DECAY=0.02 CUBRIC_BOOST_CONFIDENT=1 CUBRIC_PRUNE_NOISY=1 CUBRIC_REWEIGHT_ORDERS=1 torchrun --standalone --nproc_per_node="${NPROC_PER_NODE:-8}" "${SCRIPT_DIR}/train_gpt_cadence4.py"
2,216 changes: 2,216 additions & 0 deletions concepts/cubric_garage/train_gpt_baseline.py


2,266 changes: 2,266 additions & 0 deletions concepts/cubric_garage/train_gpt_cadence10.py


2,266 changes: 2,266 additions & 0 deletions concepts/cubric_garage/train_gpt_cadence4.py


130 changes: 130 additions & 0 deletions concepts/cubric_ngram/run_cubric_1gpu.sh
@@ -0,0 +1,130 @@
#!/bin/bash
set -euo pipefail
# Cubric n-gram eval-only A/B — single GPU (Vast.ai $1/hr)
# Each arm trains the SAME model, only eval-time n-gram blending differs.
# All arms share one training recipe, then eval runs six ways (arms A-F).
#
# ══════════════════════════════════════════════════════════════
# HYPOTHESES
# ══════════════════════════════════════════════════════════════
#
# ARM A (control): Entropy-adaptive alpha with fixed bounds is already
# near-optimal. Baseline for comparison.
#
# ARM B (basic cubric): Documents vary in n-gram predictability. Shifting
# alpha bounds based on accumulated reliability will improve BPB on
# heterogeneous eval data by ~0.001-0.003 vs fixed bounds.
# RISK: The entropy formula already captures most of this signal.
#
# ARM C (per-order): Different documents favor different n-gram orders.
# Code is trigram-heavy, prose is 5-gram-heavy. Per-order reliability
# tracking that reranks backoff preference will improve BPB by 0.002-0.005.
# RISK: With only 2 min_count, higher orders are sparse and noisy.
#
# ARM D (agreement): When model and n-gram both assign high probability
# to the same token, the blend should be aggressive. Agreement weighting
# captures a signal entropy alone misses: model confidence + n-gram
# confidence = strong evidence. Expected: 0.001-0.003.
# RISK: agreement_scale is hand-tuned, may overshoot.
#
# ARM E (entropy adapt): The sigmoid mapping assumes fixed entropy
# distribution. Real documents have different entropy profiles (code ~2-3
# bits, prose ~4-6 bits). Shifting the sigmoid to match running entropy
# stats will improve calibration. Expected: 0.001-0.002.
# RISK: Smallest expected effect. May be noise-level.
#
# ARM F (all combined): If mechanisms are orthogonal, gains should stack.
# Expected: sum of individual gains minus some overlap (~60-80% of sum).
# RISK: Interactions could cancel. More complexity = more noise.
#
# ══════════════════════════════════════════════════════════════

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../.." && pwd)"
cd "${REPO_ROOT}"

if [ -d "flash-attention/hopper" ]; then
export PYTHONPATH="${REPO_ROOT}/flash-attention/hopper:${PYTHONPATH:-}"
elif [ -d "local_shims" ]; then
export PYTHONPATH="${REPO_ROOT}/local_shims:${PYTHONPATH:-}"
fi

SEED="${SEED:-1337}"
NPROC="${NPROC_PER_NODE:-1}"
TRAIN_SCRIPT="${SCRIPT_DIR}/train_gpt_evalonly.py"

COMMON_ENV=(
SEED="${SEED}"
MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5
XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536
ROPE_DIMS=24
COMPILE_ENABLED=1 COMPILE_FULLGRAPH=0
NGRAM_EVAL_ORDER=5
NGRAM_EVAL_ALPHA=0.30
NGRAM_EVAL_MIN_COUNT=2
NGRAM_EVAL_BUCKETS=4194304
NGRAM_EVAL_ADAPTIVE=1
NGRAM_EVAL_ALPHA_MIN=0.05
NGRAM_EVAL_ALPHA_MAX=0.60
)

run_arm() {
local arm_id="$1"
local hyp="$2"
shift 2
local run_id="cubric_${arm_id}_s${SEED}_$(date +%Y%m%d_%H%M%S)"
echo ""
echo "═══════════════════════════════════════"
echo " [${arm_id}] ${hyp}"
echo " RUN_ID: ${run_id}"
echo "═══════════════════════════════════════"
env "${COMMON_ENV[@]}" "$@" \
RUN_ID="$run_id" \
torchrun --standalone --nproc_per_node="$NPROC" \
"$TRAIN_SCRIPT" \
2>&1 | tee "logs/${run_id}.log"
echo "── [${arm_id}] result ──"
grep -E "final_int6_sliding_window_ngram.*exact|final_int6_sliding_window_exact|cubric_rel=" \
"logs/${run_id}.log" 2>/dev/null | tail -3
echo ""
}

mkdir -p logs

echo "══════════════════════════════════════════════════"
echo " CUBRIC N-GRAM — 1-GPU A/B (eval-only variants)"
echo " NPROC=${NPROC} SEED=${SEED}"
echo "══════════════════════════════════════════════════"

run_arm "A" "CONTROL: entropy-adaptive alpha, no cubric" \
CUBRIC_ENABLED=0

run_arm "B" "H: alpha bounds shift from accumulated reliability" \
CUBRIC_ENABLED=1 CUBRIC_DECAY=0.95 CUBRIC_BOOST_SCALE=0.15

run_arm "C" "H: per-order reliability reranks backoff preference" \
CUBRIC_ENABLED=1 CUBRIC_DECAY=0.95 CUBRIC_BOOST_SCALE=0.15 \
CUBRIC_PER_ORDER=1

run_arm "D" "H: agreement weighting boosts alpha when model+ngram agree" \
CUBRIC_ENABLED=1 CUBRIC_DECAY=0.95 CUBRIC_BOOST_SCALE=0.15 \
CUBRIC_AGREEMENT=1 CUBRIC_AGREEMENT_SCALE=2.0

run_arm "E" "H: entropy sigmoid adapts to document entropy profile" \
CUBRIC_ENABLED=1 CUBRIC_DECAY=0.95 CUBRIC_BOOST_SCALE=0.15 \
CUBRIC_ENTROPY_ADAPT=1

run_arm "F" "H: all mechanisms combined — gains should stack if orthogonal" \
CUBRIC_ENABLED=1 CUBRIC_DECAY=0.95 CUBRIC_BOOST_SCALE=0.15 \
CUBRIC_PER_ORDER=1 CUBRIC_AGREEMENT=1 CUBRIC_AGREEMENT_SCALE=2.0 \
CUBRIC_ENTROPY_ADAPT=1

echo "══════════════════════════════════════════════════"
echo " SUMMARY"
echo "══════════════════════════════════════════════════"
for f in logs/cubric_*_s${SEED}_*.log; do
arm=$(basename "$f" | sed 's/cubric_\([A-F]\)_.*/\1/')
bpb=$(grep "final_int6_sliding_window_ngram.*exact" "$f" 2>/dev/null | grep -oP 'val_bpb:\K[0-9.]+' || echo "N/A")
echo " [$arm] sliding_ngram_bpb = $bpb"
done
echo "══════════════════════════════════════════════════"
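
ARM D's agreement-weighting hypothesis can be sketched as a small function. This is a hedged illustration only: the script does not show how `CUBRIC_AGREEMENT_SCALE` enters the formula, so scaling alpha by the product of the two probabilities is an assumption, and the function name is invented.

```python
def blended_prob(p_model, p_ngram, base_alpha, agreement_scale=2.0,
                 alpha_min=0.05, alpha_max=0.60):
    """Hypothetical agreement-weighted blend (ARM D sketch).

    When the model and the n-gram table both put high probability on the
    same token, agreement = p_model * p_ngram is large and alpha is pushed
    up, clipped to the configured NGRAM_EVAL_ALPHA_MIN/MAX bounds."""
    agreement = p_model * p_ngram
    alpha = base_alpha * (1.0 + agreement_scale * agreement)
    alpha = max(alpha_min, min(alpha_max, alpha))
    return (1.0 - alpha) * p_model + alpha * p_ngram, alpha
```

With strong agreement (e.g. 0.9 vs 0.8 on the same token) alpha saturates at the upper bound; with weak agreement it stays near the base value, which is the "entropy alone misses this signal" claim in the ARM D comment.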
35 changes: 35 additions & 0 deletions concepts/f1_sota_garage/car02_speed_lane/run_cubric_test.sh
@@ -0,0 +1,35 @@
#!/bin/bash
set -euo pipefail

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../../.." && pwd)"
cd "${REPO_ROOT}"
export PYTHONPATH="${REPO_ROOT}/flash-attention/hopper:${PYTHONPATH:-}"

SEED="${SEED:-1337}"
NPROC="${NPROC_PER_NODE:-8}"

env \
SEED="${SEED}" \
MLP_ACT=leaky_relu_sq \
MLP_LEAKY_SLOPE=0.5 \
XSA_LAST_N=4 \
BIGRAM_VOCAB_SIZE=1536 \
ROPE_DIMS=24 \
TTT_EVAL_ENABLED=0 \
COMPILE_ENABLED=1 \
COMPILE_FULLGRAPH=1 \
NGRAM_EVAL_ORDER=7 \
NGRAM_EVAL_ADAPTIVE=1 \
NGRAM_EVAL_ALPHA=0.30 \
NGRAM_EVAL_MIN_COUNT=2 \
NGRAM_EVAL_BUCKETS=4194304 \
NGRAM_EVAL_ALPHA_MIN=0.05 \
NGRAM_EVAL_ALPHA_MAX=0.60 \
CUBRIC_CADENCE="${CUBRIC_CADENCE:-4}" \
CUBRIC_COUNT_DECAY=0.02 \
CUBRIC_BOOST_CONFIDENT=1 \
CUBRIC_PRUNE_NOISY=1 \
CUBRIC_REWEIGHT_ORDERS=1 \
torchrun --standalone --nproc_per_node="${NPROC}" \
"${SCRIPT_DIR}/train_gpt.py"
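
`CUBRIC_REWEIGHT_ORDERS=1` implies per-order reliability influences backoff preference. A hypothetical sketch of reliability-ranked backoff lookup (the real data structures in `train_gpt.py` are not shown; the dict-based tables and function name here are purely illustrative):

```python
def backoff_lookup(context, tables, order_reliability, min_count=2):
    """Try n-gram orders ranked by tracked reliability instead of always
    preferring the highest order; fall back until a bucket has at least
    min_count evidence (cf. NGRAM_EVAL_MIN_COUNT)."""
    for order in sorted(tables, key=lambda o: order_reliability.get(o, 0.0),
                        reverse=True):
        key = tuple(context[-order:])
        count = tables[order].get(key, 0)
        if count >= min_count:
            return order, count
    return None, 0  # no order has enough evidence; skip the blend
```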
85 changes: 85 additions & 0 deletions concepts/f1_sota_garage/car02_speed_lane/sweep_ngram_params.sh
@@ -0,0 +1,85 @@
#!/bin/bash
set -euo pipefail
# N-gram parameter sweep — find optimal settings for the 0.96 regime
# Each arm changes ONE variable from the baseline.
# Single GPU, COMPILE_ENABLED=0 for Vast compat.
#
# HYPOTHESES:
# 1. Higher n-gram order (8,9) captures longer patterns the 7-gram misses
# 2. More buckets (8M,16M) reduces collisions — cleaner data = better blend
# 3. Min count 1 catches more patterns at cost of noise
# 4. Alpha range may be suboptimal — the 0.96 model is more confident
# 5. Entropy center/scale tuned for 1.12 model, not 0.96

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../../.." && pwd)"
cd "${REPO_ROOT}"

export PYTHONPATH="${REPO_ROOT}/local_shims:${PYTHONPATH:-}"

SEED="${SEED:-1337}"
NPROC=1
SCRIPT="${SCRIPT_DIR}/train_gpt.py"

# Baseline settings (from PR #753)
BASE=(
SEED="${SEED}" COMPILE_ENABLED=0
MLP_ACT=leaky_relu_sq MLP_LEAKY_SLOPE=0.5
XSA_LAST_N=4 BIGRAM_VOCAB_SIZE=1536 ROPE_DIMS=24
NGRAM_EVAL_ADAPTIVE=1
NGRAM_EVAL_ALPHA=0.30 NGRAM_EVAL_ALPHA_MIN=0.05 NGRAM_EVAL_ALPHA_MAX=0.60
NGRAM_EVAL_ENTROPY_CENTER=4.0 NGRAM_EVAL_ENTROPY_SCALE=2.0
NGRAM_EVAL_MIN_COUNT=2 NGRAM_EVAL_BUCKETS=4194304
NGRAM_EVAL_ORDER=7
)

run_arm() {
local arm_id="$1"; local hyp="$2"; shift 2
local run_id="sweep_${arm_id}_s${SEED}_$(date +%Y%m%d_%H%M%S)"
echo ""
echo "═══════════════════════════════════════"
echo " [${arm_id}] ${hyp}"
echo " RUN_ID: ${run_id}"
echo "═══════════════════════════════════════"
env "${BASE[@]}" "$@" RUN_ID="$run_id" \
torchrun --standalone --nproc_per_node="$NPROC" "$SCRIPT" \
2>&1 | tee "logs/${run_id}.log"
echo "── [${arm_id}] ──"
grep -E "final_int6_sliding_window_ngram.*exact" "logs/${run_id}.log" 2>/dev/null | tail -1
echo ""
}

mkdir -p logs
echo "══════════════════════════════════════════"
echo " N-GRAM PARAMETER SWEEP (1-GPU)"
echo "══════════════════════════════════════════"

# ── Order sweep ──
run_arm "ord8" "H: 8-gram captures longer patterns" NGRAM_EVAL_ORDER=8
run_arm "ord9" "H: 9-gram even longer context" NGRAM_EVAL_ORDER=9

# ── Bucket sweep ──
run_arm "bkt8M" "H: 8M buckets = fewer collisions" NGRAM_EVAL_BUCKETS=8388608
run_arm "bkt16M" "H: 16M buckets = minimal collisions" NGRAM_EVAL_BUCKETS=16777216

# ── Min count ──
run_arm "mc1" "H: min_count=1 catches more patterns" NGRAM_EVAL_MIN_COUNT=1
run_arm "mc3" "H: min_count=3 cleaner matches" NGRAM_EVAL_MIN_COUNT=3

# ── Alpha range ──
run_arm "alpha_tight" "H: tighter alpha for confident model" NGRAM_EVAL_ALPHA_MIN=0.10 NGRAM_EVAL_ALPHA_MAX=0.45
run_arm "alpha_wide" "H: wider alpha for aggressive blend" NGRAM_EVAL_ALPHA_MIN=0.02 NGRAM_EVAL_ALPHA_MAX=0.75

# ── Entropy sigmoid ──
run_arm "ent_low" "H: lower entropy center (model is more confident at 0.96)" NGRAM_EVAL_ENTROPY_CENTER=3.0
run_arm "ent_steep" "H: steeper sigmoid = sharper alpha transitions" NGRAM_EVAL_ENTROPY_SCALE=3.5

echo "══════════════════════════════════════════"
echo " SUMMARY"
echo "══════════════════════════════════════════"
for f in logs/sweep_*_s${SEED}_*.log; do
arm=$(basename "$f" | sed "s/sweep_\(.*\)_s${SEED}.*/\1/")
bpb=$(grep "final_int6_sliding_window_ngram.*exact" "$f" 2>/dev/null | grep -oP 'val_bpb:\K[0-9.]+' || echo "N/A")
echo " [$arm] = $bpb"
done
echo "══════════════════════════════════════════"
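
The entropy-sigmoid arms (`ent_low`, `ent_steep`) assume a mapping from model entropy to blend weight alpha. One plausible form consistent with the `NGRAM_EVAL_ENTROPY_CENTER` / `NGRAM_EVAL_ENTROPY_SCALE` and alpha-bound variables above; the actual formula in `train_gpt.py` is not shown, so this is an assumption:

```python
import math

def entropy_adaptive_alpha(entropy_bits, center=4.0, scale=2.0,
                           alpha_min=0.05, alpha_max=0.60):
    """Map model entropy to a blend weight alpha via a sigmoid.

    High entropy (uncertain model) -> alpha near alpha_max, leaning on the
    n-gram table; low entropy (confident model) -> alpha near alpha_min.
    Lowering center (ent_low arm) shifts the transition toward confident
    models; raising scale... note the ent_steep arm raises
    NGRAM_EVAL_ENTROPY_SCALE, so under this form scale would act as a
    steepness divisor only if inverted -- another reason to treat the
    exact parameterization as unknown."""
    gate = 1.0 / (1.0 + math.exp(-(entropy_bits - center) / scale))
    return alpha_min + (alpha_max - alpha_min) * gate
```

At the center entropy the gate is 0.5, so alpha sits at the midpoint of its bounds; the mapping is monotonic in entropy and clamped to [alpha_min, alpha_max] by construction.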