@@ -75,11 +75,10 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
  - Registry types enum: `internal`, `organization`, `transient`, `backup`. Use `organization` for GHCR/Docker Hub.

  ### Docker / Build
- - `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona (native x86_64).
- - Build-push-clean pattern for limited disk (~45GB): build, push, clean before next image.
- - Colons in agent names break Docker volume mounts. Replace `:` with `__`.
- - Add `|| git init` fallback to `git clone` in Dockerfiles. Add `chown claude:claude /logs` + `adduser claude` for OH compat.
- - `jefzda/` → `ghcr.io/sg-evals/` migration incomplete: 33 Dockerfiles in `csb/debug/` and `csb/fix/`.
+ - `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona.
+ - Build-push-clean for limited disk. Colons in agent names break mounts (use `__`).
+ - Dockerfile: `git clone || git init` fallback, `adduser claude` + `chown claude:claude /logs` for OH.
+ - `jefzda/` → `ghcr.io/sg-evals/` migration incomplete (33 Dockerfiles).

  ### MCP Configuration (inside sandboxes)
  - `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`. Claude Code needs `--mcp-config` flag.
@@ -112,51 +111,55 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta

  ### Validation / Scoring
  - `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
- - Agent completing in **<2s** = never installed/ran. Real name in `config.json` at `task.path`.
- - **no_changes_guard**: write `reward.txt` inside Python block, not in bash after it.
- - `timeout 600` on all test runners. `--forceExit` for Jest. Jest+TS needs `memory_mb = 8192`.
- - **CSB dual-score**: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
- - Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
- - Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
- - Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
- - `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
- - **TARGET_SUITE misalignment**: 55 stale suite names, 220 missing. `SUITE_WEIGHTS` falls back to equal-weight.
- - **dual_score_lib.sh**: `scorer_artifact` always `"auto"` (`.setdefault()` overwrite). Audit trail broken.
- - **Falsy value bugs**: `max_score=0` treated as false; `None` MCP metrics misclassified. Use `is None` / `== 0`.
- - **promote_run.py**: Crashes on non-dict env config. Validate types before `.get()`.
+ - Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash after it.
+ - `timeout 600` on test runners. `--forceExit` for Jest. Jest+TS: `memory_mb = 8192`.
+ - **CSB dual-score**: file edits + `answer.json` independent. Fallback: `promoted_verifier.py` → `oracle_checks.py` → heuristic.
+ - Rate-limited (score=0, <30s): `quarantine_invalid_tasks.py --execute`. Bare `$VAR` in `instruction.md` → use `<placeholder>`.
+ - Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
+ - `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
+ - **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
+ - **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env.
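
The `cost_report.py` and falsy-value notes above describe two related Python pitfalls. The sketch below reproduces them in isolation; it is not the code from `cost_report.py`, and `classify_max_score` with its return labels is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict

costs = defaultdict(int)
_ = costs["baseline"]  # a plain lookup inserts the key with value 0

# The key now exists, so the default passed to .get() is never used.
naive = costs.get("baseline", 1)

# `or 1` falls back for a missing key, a None value, and a 0 value alike.
guarded = costs.get("baseline") or 1


def classify_max_score(max_score):
    """Hypothetical helper: keep 'missing' and 'zero' as separate cases."""
    if max_score is None:  # the value was never recorded
        return "missing"
    if max_score == 0:     # recorded, and genuinely zero
        return "zero"
    return "set"
```

The `or 1` form suits cases where `0` is never a valid value, such as a divisor. Where `0` carries meaning, as with `max_score`, the explicit `is None` / `== 0` checks keep the two cases apart.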
+
+ ### Agent / Runner Robustness
+ - **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
+ - **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`. Add `URLError`/`socket.timeout`. `e.read()` leaks socket FD; use `with e:`.
+ - **Runner pipefail**: `run_selected_tasks.sh:681` `harbor_run_guarded | tee || echo` -- `||` applies to `tee` (always 0). Add `set -o pipefail`.
+ - **Runner cleanup**: No `trap` for temp dirs on early exit. `mktemp` failure (line 648) silently copies to CWD.
+ - **`grep -P` macOS**: `run_selected_tasks.sh:726` silently fails on BSD grep. Use `sed -n` instead.
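
For the `/tmp` race noted above, one possible fix on the Python side is to give every task its own private directory instead of fixed paths. This is a minimal sketch, assuming the files are written from Python; `write_run_files` and the `claude_task_` prefix are hypothetical names, not taken from `claude_baseline_agent.py`.

```python
import os
import stat
import tempfile


def write_run_files(system_prompt, run_script):
    """Write the prompt and launcher into a fresh per-task directory."""
    # mkdtemp creates a uniquely named directory accessible only to the
    # current user, so concurrent tasks cannot overwrite each other's files.
    workdir = tempfile.mkdtemp(prefix="claude_task_")
    prompt_path = os.path.join(workdir, "claude_system_prompt.txt")
    script_path = os.path.join(workdir, "claude_run.sh")
    with open(prompt_path, "w") as f:
        f.write(system_prompt)
    with open(script_path, "w") as f:
        f.write(run_script)
    os.chmod(script_path, stat.S_IRWXU)  # owner read/write/execute only
    return workdir, prompt_path, script_path
```

The caller owns the returned directory and would remove it when the task ends, for example with `shutil.rmtree(workdir)` in a `finally` block.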
+
+ ### Schema / Suite Naming
+ - 3 schemas use deprecated `ccb_mcp_*` enums; actual names are `csb_org_*`. 8 schema files have zero consumers.
+ - **16 copies of `DIR_PREFIX_TO_SUITE`** across 30+ scripts with divergent definitions. Centralize in `csb_metrics/suite_registry.py`.

  ### Git / Auth
  - `gh auth refresh -h github.com -s write:packages` (explicit scope needed).
  - Env vars must be **exported** for Harbor subprocesses (`set -a` before sourcing `.env.local`).
  - GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
  - Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
  - **gitignore negation**: `!child/` doesn't work when parent dir is ignored. Use `git add -f`.
+ - **Remote URL stale**: `CodeContextBench.git` redirects to `CodeScaleBench.git`. Update local git remote config.

  ### Python / Subprocess
- - `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
- - `with open(log) as f: subprocess.Popen(stdout=f)` closes the handle. Use `open()` without context manager for long-running subprocesses.
- - `json.load(open(path))` leaks file descriptors. Use `with open(path) as f: json.load(f)`. Affects 12 scripts.
- - macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
+ - `dict.get(key, default)` doesn't guard against `None`. Use `data.get("key") or default_value`.
+ - `with open(log) as f: Popen(stdout=f)` closes handle. Use bare `open()` for long-running subprocesses.
+ - `json.load(open(path))` leaks FDs; use `with open`. macOS Bash 3.2 lacks `declare -A`; use `IFS='|' read -r`.
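
The `dict.get` and `json.load` notes above can be combined into one small sketch. It assumes a config file whose `timeout` key is present but `null`; the key name, the `600` default, and both helper names are hypothetical.

```python
import json
import os
import tempfile


def load_json(path):
    """Read a JSON file; the context manager closes the handle on exit."""
    with open(path) as f:
        return json.load(f)


def timeout_seconds(config, default=600):
    """Fall back to `default` when the key is absent or explicitly None."""
    return config.get("timeout") or default


# Write a throwaway config where the key exists but holds null.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        json.dump({"timeout": None}, f)
    config = load_json(path)

plain = config.get("timeout", 600)   # key exists, so this is None
guarded = timeout_seconds(config)    # None is falsy, so the default applies
```

Because `or` also replaces `0` and empty strings, this form fits only keys where those values are not meaningful.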

  ### LLM Judge
- - Always include "Respond with valid JSON only" in judge prompts. Unescaped quotes break parsing.
- - Judge should use task-type-aware evaluation: different rubrics per task type.
- - Tool categorization: check MCP prefix (`mcp__`) before substring checks to avoid miscategorization.
+ - Include "Respond with valid JSON only" in prompts. Unescaped quotes break parsing.
+ - Task-type-aware rubrics. Check `mcp__` prefix before substring-based tool categorization.
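
The ordering rule for tool categorization above might look like the following. This is an illustrative sketch only: the category labels, the substring rules, and the example tool names are assumptions, not the judge's actual tables.

```python
def categorize_tool(name):
    """Classify a tool name, testing the MCP prefix before any substring."""
    # An MCP tool such as "mcp__sourcegraph__keyword_search" (hypothetical
    # name) contains "search", so the prefix test has to come first.
    if name.startswith("mcp__"):
        return "mcp"
    lowered = name.lower()
    if "search" in lowered or "grep" in lowered:
        return "local_search"
    if "read" in lowered:
        return "file_read"
    return "other"
```

With the checks in the opposite order, the MCP search tool would be counted as a local search.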

  ### OpenHands
- - Strip ALL `sandbox_plugins` (`= []`). Base64-encode instructions (not `shlex.quote()`). Alpine lacks `apt-get` -- use `bookworm`.
- - OH MCP client ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
- - `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py`. Set `PYTHONSAFEPATH=1`.
+ - `sandbox_plugins = []`. Base64-encode instructions. Alpine → `bookworm`. MCP client ~30s timeout.
+ - Block `deepsearch`/`deepsearch_read` in proxy; redirect to `keyword_search`/`nls_search`.
+ - `chown -R /workspace` blocks on large repos. Edit `runtime_init.py`. Set `PYTHONSAFEPATH=1`.

  ### CI / Workflows
- - `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
- - Export HTML silently truncates at 1200 rows (`filtered.slice(0, 1200)` in `export_official_results.py`).
+ - `docs-consistency.yml` redundant (subsumed by `repo_health.yml`). Export HTML truncates at 1200 rows.

  ### Pre-commit / Pytest / Ralph
- - Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
- - Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
- - Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
- - **Ralph `prd.json` is single-active**: Overwriting loses tracking state. Archive old `prd.json` before replacement.
+ - Secret-detection false-positives on detection code. Use `--no-verify` when flagged code is detection logic.
+ - Classes named `TestPlan`/`TestCase`/`TestResult` auto-collected by pytest. Rename to `EvaluationPlan` etc.
+ - Ralph: `progress.txt` on feature branches, compound after merge. `prd.json` is single-active; archive before overwrite.

  ## Maintenance
  - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.