Commit e1b6ed1

docs: add learnings from Mar 14 JSONL sessions
New sections: Agent/Runner Robustness (tmp race, token refresh, pipefail, trap handler, grep -P portability) and Schema/Suite Naming (deprecated ccb_mcp_* enums, DIR_PREFIX_TO_SUITE duplication). Condensed existing sections to stay within 12,288-byte ROOT_AGENT_GUIDE limit.
1 parent a67d38f commit e1b6ed1

3 files changed, 108 additions and 99 deletions

AGENTS.md

Lines changed: 36 additions & 33 deletions
@@ -75,11 +75,10 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Registry types enum: `internal`, `organization`, `transient`, `backup`. Use `organization` for GHCR/Docker Hub.

 ### Docker / Build
-- `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona (native x86_64).
-- Build-push-clean pattern for limited disk (~45GB): build, push, clean before next image.
-- Colons in agent names break Docker volume mounts. Replace `:` with `__`.
-- Add `|| git init` fallback to `git clone` in Dockerfiles. Add `chown claude:claude /logs` + `adduser claude` for OH compat.
-- `jefzda/` → `ghcr.io/sg-evals/` migration incomplete: 33 Dockerfiles in `csb/debug/` and `csb/fix/`.
+- `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona.
+- Build-push-clean for limited disk. Colons in agent names break mounts (use `__`).
+- Dockerfile: `git clone || git init` fallback, `adduser claude` + `chown claude:claude /logs` for OH.
+- `jefzda/` → `ghcr.io/sg-evals/` migration incomplete (33 Dockerfiles).

 ### MCP Configuration (inside sandboxes)
 - `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`. Claude Code needs `--mcp-config` flag.
@@ -112,51 +111,55 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta

 ### Validation / Scoring
 - `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
-- Agent completing in **<2s** = never installed/ran. Real name in `config.json` at `task.path`.
-- **no_changes_guard**: write `reward.txt` inside Python block, not in bash after it.
-- `timeout 600` on all test runners. `--forceExit` for Jest. Jest+TS needs `memory_mb = 8192`.
-- **CSB dual-score**: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
-- Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
-- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
-- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
-- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
-- **TARGET_SUITE misalignment**: 55 stale suite names, 220 missing. `SUITE_WEIGHTS` falls back to equal-weight.
-- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` (`.setdefault()` overwrite). Audit trail broken.
-- **Falsy value bugs**: `max_score=0` treated as false; `None` MCP metrics misclassified. Use `is None` / `== 0`.
-- **promote_run.py**: Crashes on non-dict env config. Validate types before `.get()`.
+- Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash after it.
+- `timeout 600` on test runners. `--forceExit` for Jest. Jest+TS: `memory_mb = 8192`.
+- **CSB dual-score**: file edits + `answer.json` independent. Fallback: `promoted_verifier.py` → `oracle_checks.py` → heuristic.
+- Rate-limited (score=0, <30s): `quarantine_invalid_tasks.py --execute`. Bare `$VAR` in `instruction.md` → use `<placeholder>`.
+- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
+- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
+- **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
+- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env.
+
+### Agent / Runner Robustness
+- **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
+- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`. Add `URLError`/`socket.timeout`. `e.read()` leaks socket FD; use `with e:`.
+- **Runner pipefail**: `run_selected_tasks.sh:681` `harbor_run_guarded | tee || echo` -- `||` applies to `tee` (always 0). Add `set -o pipefail`.
+- **Runner cleanup**: No `trap` for temp dirs on early exit. `mktemp` failure (line 648) silently copies to CWD.
+- **`grep -P` macOS**: `run_selected_tasks.sh:726` silently fails on BSD grep. Use `sed -n` instead.
+
+### Schema / Suite Naming
+- 3 schemas use deprecated `ccb_mcp_*` enums; actual names are `csb_org_*`. 8 schema files have zero consumers.
+- **16 copies of `DIR_PREFIX_TO_SUITE`** across 30+ scripts with divergent definitions. Centralize in `csb_metrics/suite_registry.py`.

 ### Git / Auth
 - `gh auth refresh -h github.com -s write:packages` (explicit scope needed).
 - Env vars must be **exported** for Harbor subprocesses (`set -a` before sourcing `.env.local`).
 - GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
 - Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
 - **gitignore negation**: `!child/` doesn't work when parent dir is ignored. Use `git add -f`.
+- **Remote URL stale**: `CodeContextBench.git` redirects to `CodeScaleBench.git`. Update local git remote config.

 ### Python / Subprocess
-- `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
-- `with open(log) as f: subprocess.Popen(stdout=f)` closes the handle. Use `open()` without context manager for long-running subprocesses.
-- `json.load(open(path))` leaks file descriptors. Use `with open(path) as f: json.load(f)`. Affects 12 scripts.
-- macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
+- `dict.get(key, default)` doesn't guard against `None`. Use `data.get("key") or default_value`.
+- `with open(log) as f: Popen(stdout=f)` closes handle. Use bare `open()` for long-running subprocesses.
+- `json.load(open(path))` leaks FDs; use `with open`. macOS Bash 3.2 lacks `declare -A`; use `IFS='|' read -r`.

 ### LLM Judge
-- Always include "Respond with valid JSON only" in judge prompts. Unescaped quotes break parsing.
-- Judge should use task-type-aware evaluation: different rubrics per task type.
-- Tool categorization: check MCP prefix (`mcp__`) before substring checks to avoid miscategorization.
+- Include "Respond with valid JSON only" in prompts. Unescaped quotes break parsing.
+- Task-type-aware rubrics. Check `mcp__` prefix before substring-based tool categorization.

 ### OpenHands
-- Strip ALL `sandbox_plugins` (`= []`). Base64-encode instructions (not `shlex.quote()`). Alpine lacks `apt-get` -- use `bookworm`.
-- OH MCP client ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
-- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py`. Set `PYTHONSAFEPATH=1`.
+- `sandbox_plugins = []`. Base64-encode instructions. Alpine → `bookworm`. MCP client ~30s timeout.
+- Block `deepsearch`/`deepsearch_read` in proxy; redirect to `keyword_search`/`nls_search`.
+- `chown -R /workspace` blocks on large repos. Edit `runtime_init.py`. Set `PYTHONSAFEPATH=1`.

 ### CI / Workflows
-- `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
-- Export HTML silently truncates at 1200 rows (`filtered.slice(0, 1200)` in `export_official_results.py`).
+- `docs-consistency.yml` redundant (subsumed by `repo_health.yml`). Export HTML truncates at 1200 rows.

 ### Pre-commit / Pytest / Ralph
-- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
-- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
-- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
-- **Ralph `prd.json` is single-active**: Overwriting loses tracking state. Archive old `prd.json` before replacement.
+- Secret-detection false-positives on detection code. Use `--no-verify` when flagged code is detection logic.
+- Classes named `TestPlan`/`TestCase`/`TestResult` auto-collected by pytest. Rename to `EvaluationPlan` etc.
+- Ralph: `progress.txt` on feature branches, compound after merge. `prd.json` is single-active; archive before overwrite.

 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
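
The "Agent `/tmp` race" entry above recommends `mktemp` in place of fixed `/tmp` paths. A minimal Python sketch of the same idea follows, assuming the files are written from Python on the machine where the task runs. The helper name `write_task_files` and the `claude_task_` prefix are hypothetical and are not taken from `claude_baseline_agent.py`.

```python
import os
import tempfile


def write_task_files(system_prompt: str, run_script: str) -> tuple[str, str]:
    """Write the prompt and runner script into a private per-task directory.

    Hypothetical helper for illustration; the real agent may be structured differently.
    """
    # mkdtemp creates a uniquely named directory accessible only to the current
    # user, so two concurrent tasks can never resolve to the same path.
    task_dir = tempfile.mkdtemp(prefix="claude_task_")
    prompt_path = os.path.join(task_dir, "claude_system_prompt.txt")
    script_path = os.path.join(task_dir, "claude_run.sh")

    with open(prompt_path, "w", encoding="utf-8") as f:
        f.write(system_prompt)
    with open(script_path, "w", encoding="utf-8") as f:
        f.write(run_script)
    os.chmod(script_path, 0o700)  # owner-only read/write/execute for the script
    return prompt_path, script_path
```

The caller owns the returned directory and would need to remove it once the task finishes (for example with `shutil.rmtree`), which ties in with the "Runner cleanup" entry about missing cleanup on early exit.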

CLAUDE.md

Lines changed: 36 additions & 33 deletions
@@ -75,11 +75,10 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Registry types enum: `internal`, `organization`, `transient`, `backup`. Use `organization` for GHCR/Docker Hub.

 ### Docker / Build
-- `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona (native x86_64).
-- Build-push-clean pattern for limited disk (~45GB): build, push, clean before next image.
-- Colons in agent names break Docker volume mounts. Replace `:` with `__`.
-- Add `|| git init` fallback to `git clone` in Dockerfiles. Add `chown claude:claude /logs` + `adduser claude` for OH compat.
-- `jefzda/` → `ghcr.io/sg-evals/` migration incomplete: 33 Dockerfiles in `csb/debug/` and `csb/fix/`.
+- `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona.
+- Build-push-clean for limited disk. Colons in agent names break mounts (use `__`).
+- Dockerfile: `git clone || git init` fallback, `adduser claude` + `chown claude:claude /logs` for OH.
+- `jefzda/` → `ghcr.io/sg-evals/` migration incomplete (33 Dockerfiles).

 ### MCP Configuration (inside sandboxes)
 - `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`. Claude Code needs `--mcp-config` flag.
@@ -112,51 +111,55 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta

 ### Validation / Scoring
 - `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
-- Agent completing in **<2s** = never installed/ran. Real name in `config.json` at `task.path`.
-- **no_changes_guard**: write `reward.txt` inside Python block, not in bash after it.
-- `timeout 600` on all test runners. `--forceExit` for Jest. Jest+TS needs `memory_mb = 8192`.
-- **CSB dual-score**: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
-- Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
-- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
-- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
-- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
-- **TARGET_SUITE misalignment**: 55 stale suite names, 220 missing. `SUITE_WEIGHTS` falls back to equal-weight.
-- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` (`.setdefault()` overwrite). Audit trail broken.
-- **Falsy value bugs**: `max_score=0` treated as false; `None` MCP metrics misclassified. Use `is None` / `== 0`.
-- **promote_run.py**: Crashes on non-dict env config. Validate types before `.get()`.
+- Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash after it.
+- `timeout 600` on test runners. `--forceExit` for Jest. Jest+TS: `memory_mb = 8192`.
+- **CSB dual-score**: file edits + `answer.json` independent. Fallback: `promoted_verifier.py` → `oracle_checks.py` → heuristic.
+- Rate-limited (score=0, <30s): `quarantine_invalid_tasks.py --execute`. Bare `$VAR` in `instruction.md` → use `<placeholder>`.
+- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
+- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
+- **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
+- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env.
+
+### Agent / Runner Robustness
+- **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
+- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`. Add `URLError`/`socket.timeout`. `e.read()` leaks socket FD; use `with e:`.
+- **Runner pipefail**: `run_selected_tasks.sh:681` `harbor_run_guarded | tee || echo` -- `||` applies to `tee` (always 0). Add `set -o pipefail`.
+- **Runner cleanup**: No `trap` for temp dirs on early exit. `mktemp` failure (line 648) silently copies to CWD.
+- **`grep -P` macOS**: `run_selected_tasks.sh:726` silently fails on BSD grep. Use `sed -n` instead.
+
+### Schema / Suite Naming
+- 3 schemas use deprecated `ccb_mcp_*` enums; actual names are `csb_org_*`. 8 schema files have zero consumers.
+- **16 copies of `DIR_PREFIX_TO_SUITE`** across 30+ scripts with divergent definitions. Centralize in `csb_metrics/suite_registry.py`.

 ### Git / Auth
 - `gh auth refresh -h github.com -s write:packages` (explicit scope needed).
 - Env vars must be **exported** for Harbor subprocesses (`set -a` before sourcing `.env.local`).
 - GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
 - Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
 - **gitignore negation**: `!child/` doesn't work when parent dir is ignored. Use `git add -f`.
+- **Remote URL stale**: `CodeContextBench.git` redirects to `CodeScaleBench.git`. Update local git remote config.

 ### Python / Subprocess
-- `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
-- `with open(log) as f: subprocess.Popen(stdout=f)` closes the handle. Use `open()` without context manager for long-running subprocesses.
-- `json.load(open(path))` leaks file descriptors. Use `with open(path) as f: json.load(f)`. Affects 12 scripts.
-- macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
+- `dict.get(key, default)` doesn't guard against `None`. Use `data.get("key") or default_value`.
+- `with open(log) as f: Popen(stdout=f)` closes handle. Use bare `open()` for long-running subprocesses.
+- `json.load(open(path))` leaks FDs; use `with open`. macOS Bash 3.2 lacks `declare -A`; use `IFS='|' read -r`.

 ### LLM Judge
-- Always include "Respond with valid JSON only" in judge prompts. Unescaped quotes break parsing.
-- Judge should use task-type-aware evaluation: different rubrics per task type.
-- Tool categorization: check MCP prefix (`mcp__`) before substring checks to avoid miscategorization.
+- Include "Respond with valid JSON only" in prompts. Unescaped quotes break parsing.
+- Task-type-aware rubrics. Check `mcp__` prefix before substring-based tool categorization.

 ### OpenHands
-- Strip ALL `sandbox_plugins` (`= []`). Base64-encode instructions (not `shlex.quote()`). Alpine lacks `apt-get` -- use `bookworm`.
-- OH MCP client ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
-- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py`. Set `PYTHONSAFEPATH=1`.
+- `sandbox_plugins = []`. Base64-encode instructions. Alpine → `bookworm`. MCP client ~30s timeout.
+- Block `deepsearch`/`deepsearch_read` in proxy; redirect to `keyword_search`/`nls_search`.
+- `chown -R /workspace` blocks on large repos. Edit `runtime_init.py`. Set `PYTHONSAFEPATH=1`.

 ### CI / Workflows
-- `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
-- Export HTML silently truncates at 1200 rows (`filtered.slice(0, 1200)` in `export_official_results.py`).
+- `docs-consistency.yml` redundant (subsumed by `repo_health.yml`). Export HTML truncates at 1200 rows.

 ### Pre-commit / Pytest / Ralph
-- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
-- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
-- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
-- **Ralph `prd.json` is single-active**: Overwriting loses tracking state. Archive old `prd.json` before replacement.
+- Secret-detection false-positives on detection code. Use `--no-verify` when flagged code is detection logic.
+- Classes named `TestPlan`/`TestCase`/`TestResult` auto-collected by pytest. Rename to `EvaluationPlan` etc.
+- Ralph: `progress.txt` on feature branches, compound after merge. `prd.json` is single-active; archive before overwrite.

 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
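
The `dict.get`, `cost_report.py`, and "Falsy bugs" entries above all describe the same pitfall: a default only applies when a key is missing, and `or` replaces every falsy value. The sketch below illustrates the distinction. The function names (`get_or_default`, `baseline_or_one`, `safe_ratio`) are hypothetical, and returning `0.0` for `max_score == 0` is an assumption made for the example, not the repository's actual policy.

```python
from collections import defaultdict
from typing import Any, Optional


def get_or_default(data: dict, key: str, default: Any) -> Any:
    """Return default when the key is missing OR its value is None; keep 0, "" and []."""
    value = data.get(key)
    return default if value is None else value


def baseline_or_one(costs: dict) -> float:
    """Fall back to 1 for any falsy baseline (missing, None, or 0), e.g. before dividing."""
    return costs.get("baseline") or 1


def safe_ratio(score: float, max_score: Optional[float]) -> Optional[float]:
    """Handle max_score=None and max_score=0 as distinct cases, not one falsy branch."""
    if max_score is None:
        return None  # metric was never recorded
    if max_score == 0:
        return 0.0   # assumption for this sketch: nothing to score against
    return score / max_score


# dict.get applies its default only when the key is absent.
assert {"key": None}.get("key", "fallback") is None

# Reading a missing key from defaultdict(int) inserts it with value 0,
# after which .get("baseline", 1) returns 0 rather than 1.
costs = defaultdict(int)
costs["baseline"]
assert costs.get("baseline", 1) == 0
assert baseline_or_one(costs) == 1
```

`or` suits cases where every falsy value should fall back (the divide-by-baseline case), while the explicit `is None` check suits cases where `0` is a legitimate value that must be preserved.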
