Commit dea437f

docs: add session learnings — security, harness-agnostic verifiers, Daytona CLI
New gotchas from 4 archive sessions + 59 post-archive commits (Mar 9-12):
- Security: file-based credential injection, sanitize_secrets.py
- Harness-agnostic verifiers: origin ref check, path fallback chains, GOWORK=off
- Docker: git clone fallbacks, adduser claude for OH compat
- Daytona: snapshot positional args, CLI/API version sync, registry type enum
- Condensed existing entries to stay within 12KB agent_guide_root size limit
1 parent 3d13dcd commit dea437f

5 files changed: +155 -182 lines changed

AGENTS.md

Lines changed: 51 additions & 48 deletions
@@ -40,21 +40,15 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
4040
- `configs/AGENTS.md` - run launcher wrappers and confirmation gate policy
4141
- `docs/AGENTS.md` - documentation IA and canonical vs archive guidance
4242

43-
## Compaction / Handoff Checkpoints
44-
- Compact after exploration, before multi-file edits.
45-
- Compact after launching a benchmark batch.
46-
- Compact after completing a triage batch or report generation pass.
47-
- When handing work to a new session, use the generic `/handoff` skill to generate an inline copy/paste handoff prompt.
48-
- Do not create a markdown handoff file unless the user explicitly asks for one.
49-
- Use `docs/ops/HANDOFF_TEMPLATE.md` as a checklist for what the handoff should include.
50-
51-
## Landing the Plane (Session Completion)
52-
- Track remaining follow-up in issues or beads.
53-
- Run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
54-
- Update issue/task status.
55-
- `git pull --rebase && git push && git status` and confirm `main` is up to date with `origin/main`.
56-
- Clean up and hand off using `/handoff` plus `docs/ops/HANDOFF_TEMPLATE.md`.
57-
- Work is not complete until push succeeds.
43+
## Compaction / Handoff
44+
- Compact after exploration, after launching a batch, and after triage/report passes.
45+
- Use `/handoff` skill for session handoffs (inline prompt, not a markdown file unless asked).
46+
- Use `docs/ops/HANDOFF_TEMPLATE.md` as checklist.
47+
48+
## Landing the Plane
49+
- Run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only).
50+
- `git pull --rebase && git push && git status` -- work is not done until push succeeds.
51+
- Track follow-ups in issues or beads. Update status.
5852

5953
## Canonical Maps
6054
- `docs/START_HERE_BY_TASK.md` - task-based read order
@@ -71,54 +65,63 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
7165
- After removing directories from the repo, also clean references from `scripts/sync_agent_guides.py` (`LOCAL_SOURCES`) and `scripts/docs_consistency_check.py` (`LOCAL_AGENT_TARGET_DIRS`).
7266

7367
### Daytona / Harbor
74-
- Daytona builds images from Dockerfiles at sandbox creation time (`Image.from_dockerfile()`). Dockerfile fixes pushed to `main` take effect on the next run -- **no manual image rebuild needed**. Exception: pre-built GHCR base images must be rebuilt separately.
75-
- Harbor+Daytona (`harbor run --environment-type daytona`) is the recommended production approach. The standalone `scripts/daytona_runner.py` is for quick validation only.
76-
- Use `BASELINE_MCP_TYPE` env var to control MCP configuration: `none`, `sourcegraph`, `deepsearch`.
77-
- Daytona SDK (`daytona_sdk`) over CLI for sandbox interaction -- the CLI is interactive-only for SSH.
78-
- GHCR packages default to **private** for personal accounts and visibility cannot be changed via API. Use the GitHub web UI or push to an org.
68+
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (exception: pre-built GHCR base images need separate rebuild).
69+
- Harbor+Daytona (`harbor run --environment-type daytona`) is recommended. `scripts/daytona_runner.py` is for quick validation only.
70+
- `BASELINE_MCP_TYPE` env var: `none`, `sourcegraph`, `deepsearch`.
71+
- Use Daytona SDK (`daytona_sdk`) over CLI (CLI is interactive-only for SSH).
72+
- GHCR packages default **private** for personal accounts; visibility change requires GitHub web UI.
73+
- Snapshot names are **positional**: `daytona snapshot create ccb-name`, NOT `--name`.
74+
- CLI/API version mismatch causes "Forbidden" errors. Keep CLI version in sync.
75+
- Registry types enum: `internal`, `organization`, `transient`, `backup`. Use `organization` for GHCR/Docker Hub.
7976

8077
### Docker / Build
8178
- `uv tool install` segfaults on ARM64/QEMU emulation. Use `pip install` instead, or switch to Daytona (native x86_64).
8279
- Build-push-clean pattern when building Docker images with limited disk (~45GB): build one image, push, then clean locally before the next.
8380
- Colons in agent names (e.g., `module:ClassName`) break Docker volume mounts. Sanitize paths: replace `:` with `__`.
81+
- Add `|| git init` fallback to all `git clone` commands in Dockerfiles for network resilience. Applied to 269 Dockerfiles.
82+
- Add `chown claude:claude /logs` and `adduser claude` to Dockerfiles for cross-harness (OH) permission compatibility.
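
The colon gotcha above can be sketched as a small helper. This is only an illustration: the function name `sanitize_agent_name` is hypothetical, and the only behavior taken from the guide is replacing `:` with `__`.

```python
def sanitize_agent_name(name: str) -> str:
    """Return a name that is safe to embed in a Docker volume mount path.

    Docker uses ':' to separate host path, container path, and options,
    so a literal colon in the path is misparsed. Replace it with '__'.
    """
    return name.replace(":", "__")
```

For example, `sanitize_agent_name("module:ClassName")` yields `module__ClassName`, which can then be used when building the host-side mount path.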
8483

8584
### MCP Configuration (inside sandboxes)
86-
- `.mcp.json` must be placed at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`.
87-
- Claude Code requires the `--mcp-config` CLI flag to load MCP config -- it does not auto-detect.
88-
- Inject MCP usage instructions into the task prompt. Agents won't use MCP tools just because they're available.
89-
- Set `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in Docker containers (curl working does not mean Node.js fetch will work).
90-
- Sourcegraph MCP uses **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP. HTTP 405 = correct endpoint, wrong protocol.
91-
- Sourcegraph skills show empty in headless mode. Embed skill prompt content in CLAUDE.md directly.
85+
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`.
86+
- Claude Code needs `--mcp-config` flag; it does not auto-detect. Inject MCP usage instructions into the task prompt.
87+
- `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in containers.
88+
- Sourcegraph: **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP. HTTP 405 = wrong protocol.
89+
- Sourcegraph skills show empty in headless mode. Embed prompt content in CLAUDE.md.
9290
- Sourcegraph env vars: `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN` (NOT `_ENDPOINT` or `_TOKEN`).
9391

9492
### Harbor Result Format
95-
- Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
96-
- `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
97-
- SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
98-
- Token usage data lives in `trajectory.json`; plain transcript parsers do not see it.
99-
- Harbor task contract requires writing `/logs/verifier/reward.txt`.
93+
- Timing fields (`started_at`, `finished_at`) at **top level** of `result.json`, not nested under `timing`.
94+
- `trajectory.json` generated by Harbor's `_convert_events_to_trajectory()`, not by Claude Code CLI.
95+
- SWE-bench `test.sh` redirects stdout to temp file; Harbor never sees `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers.
96+
- Token usage in `trajectory.json`; transcript parsers don't see it. Contract: write `/logs/verifier/reward.txt`.
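
A minimal sketch of reading the top-level timing fields described above. The timestamp format is an assumption (ISO 8601, optionally with a trailing `Z`), and `run_duration_seconds` is a hypothetical helper name; only the field names and their top-level placement come from the guide.

```python
from datetime import datetime


def _parse_timestamp(value: str) -> datetime:
    # Assumption: timestamps are ISO 8601. Older Python versions reject a
    # trailing 'Z', so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def run_duration_seconds(result: dict) -> float:
    """Compute run duration from a parsed result.json payload.

    The fields are read from the top level of the document,
    not from a nested "timing" object.
    """
    started = _parse_timestamp(result["started_at"])
    finished = _parse_timestamp(result["finished_at"])
    return (finished - started).total_seconds()
```

A caller would typically pass `json.load(...)` of `result.json`; a lookup such as `result["timing"]["started_at"]` would raise `KeyError` under the layout described above.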
97+
98+
### Security / Credentials
99+
- **Never pass credentials via Docker `-e` flags.** They leak into trajectory HTML when an agent runs `env`. Use file-based injection: write to `/logs/agent/.credentials.json` with `chmod 600`.
100+
- `scripts/sanitize_secrets.py` redacts real API keys (Anthropic, OpenAI, Sourcegraph, GitHub, Daytona) at result generation time. Maintains allowlist for known fake benchmark fixtures.
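
A hedged sketch of the file-based injection described above. The helper name `write_credentials` and the JSON layout of the file are assumptions; the guide only specifies the target path (`/logs/agent/.credentials.json`) and the `chmod 600` requirement.

```python
import json
import os


def write_credentials(path: str, credentials: dict) -> None:
    """Write credentials to a file readable only by its owner (mode 600)."""
    # Create the file with 0o600 from the start so it is never
    # world-readable, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(credentials, handle)
    # Re-apply the mode in case the file already existed with looser
    # permissions (os.open only applies the mode on creation).
    os.chmod(path, 0o600)
```

Because the secret never enters the process environment, an agent running `env` has nothing to print; the consumer reads the file instead.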
101+
102+
### Harness-Agnostic Verifiers
103+
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for agents that auto-commit (e.g., OpenHands). Otherwise the guard falsely penalizes normal OH behavior.
104+
- Verifier path fallback chains: use `${TASK_WORKDIR:-/workspace}` for working directory and `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root. Enables same verifier across Harbor and OpenHands.
105+
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo. The go.work file may require a newer Go version than the container provides.
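
The path fallback chains above are shell parameter expansions; for a verifier written in Python, the same resolution order could be mirrored roughly as follows. The function names are hypothetical. The variable names and the `/workspace` default come from the guide.

```python
import os
from typing import Mapping, Optional

DEFAULT_DIR = "/workspace"


def resolve_workdir(env: Optional[Mapping[str, str]] = None) -> str:
    """Mirror ${TASK_WORKDIR:-/workspace}."""
    env = os.environ if env is None else env
    # `or` treats an empty string like an unset variable, matching ':-'.
    return env.get("TASK_WORKDIR") or DEFAULT_DIR


def resolve_repo_root(env: Optional[Mapping[str, str]] = None) -> str:
    """Mirror ${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}."""
    env = os.environ if env is None else env
    return env.get("TASK_REPO_ROOT") or env.get("VERIFY_REPO") or DEFAULT_DIR
```

Passing an explicit mapping makes the chain easy to exercise without touching the real environment, for example `resolve_repo_root({"VERIFY_REPO": "/repo"})`.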
100106

101107
### Validation / Scoring
102-
- `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
103-
- Install scripts that print "INSTALL_SUCCESS" regardless of actual outcome are common. Always verify the binary exists and is executable.
104-
- Agent completing in **<2 seconds** = agent never installed/ran (smoke test heuristic).
105-
- Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
106-
- LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
107-
- **no_changes_guard**: Python sets `reward = 0.0` but bash `echo "$score"` uses the original variable. Write `reward.txt` inside the Python block, not after it.
108-
- Wrap all test runners with `timeout 600`. Add `--forceExit` to Jest. Indefinite hangs (>2h) observed without timeout.
109-
- Jest + TypeScript needs 4-6GB RAM. Set `memory_mb = 8192` in `task.toml` for front-end test suites (default 2GB causes OOM).
110-
- **CSB dual-score**: agents produce file edits + `answer.json`; scored independently. Fallback: `promoted_verifier.py` → `oracle_checks.py` → heuristic.
111-
- Rate-limited results (score=0, duration <30s): quarantine with `scripts/quarantine_invalid_tasks.py --execute`.
108+
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (verify with `sha256sum`).
109+
- Install scripts printing "INSTALL_SUCCESS" regardless of outcome are common. Verify binary exists.
110+
- Agent completing in **<2s** = never installed/ran. Trial dir names truncated with hash; real name in `config.json` at `task.path`.
111+
- LoCoBench task IDs have multi-word fields. Use 3-digit task number as positional anchor.
112+
- **no_changes_guard**: write `reward.txt` inside Python block, not in bash after it.
113+
- `timeout 600` on all test runners. `--forceExit` for Jest. Jest+TS needs `memory_mb = 8192`.
114+
- **CSB dual-score**: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
115+
- Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
112116
- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
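
The `sha256sum` check on duplicated `validators.py` copies can also be scripted. This is a sketch under assumptions: the helper names are hypothetical, and the caller supplies the list of copies because the guide does not state where they live.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def group_by_digest(paths: Iterable[Path]) -> Dict[str, List[str]]:
    """Group file paths by content digest.

    One group means every copy is identical; more than one group
    means at least one copy has drifted.
    """
    groups: Dict[str, List[str]] = {}
    for path in paths:
        groups.setdefault(file_sha256(path), []).append(str(path))
    return groups
```

If `len(group_by_digest(copies)) > 1`, the smaller groups point at the copies that missed the change.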
113117

114118
### Git / Auth
115-
- `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
116-
- Environment variables must be **explicitly exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
117-
- Account readiness tracked in `runs/state/account_health.json`. Launchers source `configs/_common.sh` and filter unsafe accounts.
118-
- GitHub push protection blocks synthetic API keys. Squash with `git reset --soft origin/main`.
119-
- Shallow clones (`--depth 1`) fail on push. Always use full clones for repos that will be pushed.
120-
- Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD`.
121-
- GitHub secret scanning blocks embedded secrets. Unblock via the `/security/secret-scanning/unblock-secret/` URL.
119+
- `gh auth refresh` needs explicit `-s <scope>`: `gh auth refresh -h github.com -s write:packages`.
120+
- Env vars must be **exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
121+
- Account readiness: `runs/state/account_health.json`. Launchers source `configs/_common.sh`.
122+
- GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
123+
- Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
124+
- GitHub secret scanning: unblock via `/security/secret-scanning/unblock-secret/` URL.
122125

123126
### Python / Subprocess
124127
- `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
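
A minimal illustration of the pitfall above, using a hypothetical `data` payload: `dict.get` returns the default only when the key is absent, so a key that is present but set to `None` slips through.

```python
# Hypothetical payload: "timeout" is present but explicitly null.
data = {"timeout": None}

# The default is ignored because the key exists; this is None.
unsafe = data.get("timeout", 600)

# `or` falls back whenever the stored value is falsy (None, 0, "", [], ...).
safe = data.get("timeout") or 600
```

Note that `or` also replaces legitimate falsy values such as `0` or an empty string; where those are valid, compare against `None` explicitly instead.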
