Commit c0c381b

Harden benchmark task contracts and add registry smoke coverage
1 parent 1b3644d commit c0c381b

514 files changed (+4326, -877 lines)

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+name: Task smoke matrix
+
+on:
+  workflow_dispatch:
+  pull_request:
+    branches: [main]
+    paths:
+      - "benchmarks/**"
+      - "configs/validate_one_per_benchmark.sh"
+      - "configs/registry_smoke_matrix.json"
+      - "scripts/validate_tasks_preflight.py"
+      - "docs/reference/TASK_CONTRACT.md"
+      - ".github/workflows/task_smoke_matrix.yml"
+  push:
+    branches: [main]
+    paths:
+      - "benchmarks/**"
+      - "configs/validate_one_per_benchmark.sh"
+      - "configs/registry_smoke_matrix.json"
+      - "scripts/validate_tasks_preflight.py"
+      - "docs/reference/TASK_CONTRACT.md"
+      - ".github/workflows/task_smoke_matrix.yml"
+
+jobs:
+  contract-audit:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Full contract audit
+        run: |
+          python3 scripts/validate_tasks_preflight.py --all --contract-only --summary-by-check --format json > contract_audit.json
+          python3 - <<'PY'
+          import json
+          with open("contract_audit.json") as f:
+              data = json.load(f)
+          allowed = {"daytona_storage_over_10g"}
+          disallowed = [issue for issue in data["issues"] if issue["check"] not in allowed]
+          if disallowed:
+              for issue in disallowed[:50]:
+                  print(issue)
+              raise SystemExit(f"{len(disallowed)} disallowed contract issues found")
+          print(data["summary_by_check"])
+          PY
+
+  smoke-runtime:
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    strategy:
+      fail-fast: false
+      matrix:
+        variant:
+          - baseline
+          - sg-only
+          - artifact-only
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Smoke curated registry matrix
+        run: |
+          set -euo pipefail
+          extra_args=()
+          case "${{ matrix.variant }}" in
+            baseline) ;;
+            sg-only) extra_args+=(--sg-only) ;;
+            artifact-only) extra_args+=(--artifact-only) ;;
+          esac
+          bash configs/validate_one_per_benchmark.sh \
+            --selection-file configs/registry_smoke_matrix.json \
+            --smoke-runtime \
+            --smoke-timeout-sec 300 \
+            --smoke-timeout-overrides "csb_sdlc_design=450,csb_sdlc_document=450,csb_sdlc_feature=600,csb_sdlc_fix=600,csb_sdlc_refactor=600,csb_sdlc_test=450,csb_sdlc_understand=450" \
+            --max-concurrent 2 \
+            "${extra_args[@]}"

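Reviewer note: the `--smoke-timeout-overrides` value in the workflow is a comma-separated list of `suite=seconds` pairs, presumably layered over the 300-second default from `--smoke-timeout-sec`. How `configs/validate_one_per_benchmark.sh` actually parses it is not shown in this diff, so the sketch below only illustrates the assumed semantics; `parse_timeout_overrides` and `timeout_for` are hypothetical names.

```python
def parse_timeout_overrides(spec: str) -> dict[str, int]:
    """Parse 'suite=seconds,suite=seconds' into a dict (hypothetical helper)."""
    overrides: dict[str, int] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate an empty spec or a trailing comma
        suite, sep, seconds = pair.partition("=")
        if not sep or not suite or not seconds.isdigit():
            raise ValueError(f"malformed override: {pair!r}")
        overrides[suite] = int(seconds)
    return overrides


def timeout_for(suite: str, overrides: dict[str, int], default_sec: int = 300) -> int:
    """Return the per-suite timeout, falling back to the default (assumed behavior)."""
    return overrides.get(suite, default_sec)


# The override string is copied from the workflow above.
SPEC = (
    "csb_sdlc_design=450,csb_sdlc_document=450,csb_sdlc_feature=600,"
    "csb_sdlc_fix=600,csb_sdlc_refactor=600,csb_sdlc_test=450,csb_sdlc_understand=450"
)
overrides = parse_timeout_overrides(SPEC)
print(timeout_for("csb_sdlc_fix", overrides))      # 600
print(timeout_for("some_other_suite", overrides))  # 300 (default)
```

Rejecting malformed pairs up front is a design choice in this sketch: a typo in the override string would otherwise fall back silently to the default timeout.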
AGENTS.md

Lines changed: 18 additions & 16 deletions
@@ -10,14 +10,17 @@ full operations manual.
 - Before commit/push, run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
 - Prefer a **remote execution environment** (e.g., Daytona) for large benchmark runs; use local Docker only when a task’s image or registry is incompatible with your cloud environment. See `docs/DAYTONA.md`.
 - Set **parallelism based on your own account and model limits**. Avoid exceeding documented concurrency or rate caps for your environment or provider.
+- Before launching any benchmark batch, check account readiness with `python3 scripts/check_infra.py` or `python3 scripts/account_health.py status`. Do not assume OAuth accounts are usable just because credentials exist.
 
-## Beads Prerequisite
+## Beads Prerequisite and Usage
 - Keep the Beads CLI (`bd`, alias `beads`) up to date before running agent workflows that rely on task graphs.
 - Install or update with the official installer:
 ```bash
 curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/install.sh | bash
 ```
 - Verify install/version with `bd --version` (or `beads --version`).
+- Do not use `bd edit`; use non-interactive `bd create/update/close --json` or stdin-based `--description=-`.
+- Typical flow: `bd ready --json`, `bd create ... --json`, `bd update <id> --claim`, `bd close <id> --reason "Done"`.
 
 ## Minimal Loading Policy
 - Default load order: this file + one relevant skill + one relevant doc.
@@ -41,7 +44,17 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Compact after exploration, before multi-file edits.
 - Compact after launching a benchmark batch.
 - Compact after completing a triage batch or report generation pass.
-- Use `docs/ops/HANDOFF_TEMPLATE.md` when handing work to a new session.
+- When handing work to a new session, use the generic `/handoff` skill to generate an inline copy/paste handoff prompt.
+- Do not create a markdown handoff file unless the user explicitly asks for one.
+- Use `docs/ops/HANDOFF_TEMPLATE.md` as a checklist for what the handoff should include.
+
+## Landing the Plane (Session Completion)
+- Track remaining follow-up in issues or beads.
+- Run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
+- Update issue/task status.
+- `git pull --rebase && git push && git status` and confirm `main` is up to date with `origin/main`.
+- Clean up and hand off using `/handoff` plus `docs/ops/HANDOFF_TEMPLATE.md`.
+- Work is not complete until push succeeds.
 
 ## Canonical Maps
 - `docs/START_HERE_BY_TASK.md` - task-based read order
@@ -82,8 +95,8 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
-- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
-- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
+- Token usage data lives in `trajectory.json`; plain transcript parsers do not see it.
+- Harbor task contract requires writing `/logs/verifier/reward.txt`.
 
 ### Validation / Scoring
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
@@ -92,12 +105,10 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
 - LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
 
-### Gitignore
-- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
-
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
 - Environment variables must be **explicitly exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
+- Account readiness is tracked in `runs/state/account_health.json`. Launchers source `configs/_common.sh`, filter out unsafe accounts before launch, and record recent runtime rate-limit observations there for operator context.
 - GitHub push protection blocks synthetic/fake API keys in test data. Use `git reset --soft origin/main` to squash intermediate commits that contained fake credentials.
 - Shallow clones (`--depth 1`) fail on push to GitHub with `remote: fatal: did not receive expected object`. Always use full clones for repos that will be pushed.
 - Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD` and remap to `main` if needed.
@@ -108,15 +119,6 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `with open(log) as f: subprocess.Popen(stdout=f)` closes the file handle immediately after `Popen()` returns. Use `open()` without context manager for long-running subprocesses.
 - macOS ships Bash 3.2 which lacks associative arrays (`declare -A`). Use pipe-delimited string arrays with `IFS='|' read -r` for compatibility.
 
-### Dashboard / Streamlit
-- Streamlit widget keys in loops must include an index or unique ID to avoid `DuplicateElementKey` errors (e.g., `key=f"nav_{idx}_{page}"` not `key=f"nav_{page}"`).
-- `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
-- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
-- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
-- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
-- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
-- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
-
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
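Reviewer note: the context line about truncated trial directory names says the real task name lives in `config.json` at `task.path`. A minimal sketch of that lookup follows, assuming the task name is the last component of `task.path`; `real_task_name` is a hypothetical helper and the path written in the demo is made up.

```python
import json
import tempfile
from pathlib import Path


def real_task_name(trial_dir: Path) -> str:
    """Read config.json in a trial directory and return the last component of task.path."""
    config = json.loads((trial_dir / "config.json").read_text())
    return Path(config["task"]["path"]).name


# Demo with a throwaway trial directory; the task path below is illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    trial = Path(tmp) / "c_api_graphql_expert_079_archite__pm9xcPn"
    trial.mkdir()
    fake_config = {
        "task": {"path": "benchmarks/example/c_api_graphql_expert_079_architectural_understanding"}
    }
    (trial / "config.json").write_text(json.dumps(fake_config))
    resolved = real_task_name(trial)
    print(resolved)
```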

CLAUDE.md

Lines changed: 18 additions & 16 deletions
@@ -10,14 +10,17 @@ full operations manual.
 - Before commit/push, run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
 - Prefer a **remote execution environment** (e.g., Daytona) for large benchmark runs; use local Docker only when a task’s image or registry is incompatible with your cloud environment. See `docs/DAYTONA.md`.
 - Set **parallelism based on your own account and model limits**. Avoid exceeding documented concurrency or rate caps for your environment or provider.
+- Before launching any benchmark batch, check account readiness with `python3 scripts/check_infra.py` or `python3 scripts/account_health.py status`. Do not assume OAuth accounts are usable just because credentials exist.
 
-## Beads Prerequisite
+## Beads Prerequisite and Usage
 - Keep the Beads CLI (`bd`, alias `beads`) up to date before running agent workflows that rely on task graphs.
 - Install or update with the official installer:
 ```bash
 curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/install.sh | bash
 ```
 - Verify install/version with `bd --version` (or `beads --version`).
+- Do not use `bd edit`; use non-interactive `bd create/update/close --json` or stdin-based `--description=-`.
+- Typical flow: `bd ready --json`, `bd create ... --json`, `bd update <id> --claim`, `bd close <id> --reason "Done"`.
 
 ## Minimal Loading Policy
 - Default load order: this file + one relevant skill + one relevant doc.
@@ -41,7 +44,17 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Compact after exploration, before multi-file edits.
 - Compact after launching a benchmark batch.
 - Compact after completing a triage batch or report generation pass.
-- Use `docs/ops/HANDOFF_TEMPLATE.md` when handing work to a new session.
+- When handing work to a new session, use the generic `/handoff` skill to generate an inline copy/paste handoff prompt.
+- Do not create a markdown handoff file unless the user explicitly asks for one.
+- Use `docs/ops/HANDOFF_TEMPLATE.md` as a checklist for what the handoff should include.
+
+## Landing the Plane (Session Completion)
+- Track remaining follow-up in issues or beads.
+- Run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
+- Update issue/task status.
+- `git pull --rebase && git push && git status` and confirm `main` is up to date with `origin/main`.
+- Clean up and hand off using `/handoff` plus `docs/ops/HANDOFF_TEMPLATE.md`.
+- Work is not complete until push succeeds.
 
 ## Canonical Maps
 - `docs/START_HERE_BY_TASK.md` - task-based read order
@@ -82,8 +95,8 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Timing fields (`started_at`, `finished_at`) live at the **top level** of `result.json`, not nested under `timing`.
 - `trajectory.json` is generated by Harbor's `_convert_events_to_trajectory()` post-processing, NOT by Claude Code CLI directly.
 - SWE-bench `test.sh` redirects stdout to a temp file -- Harbor never sees the parser's `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers via its normal capture.
-- Token usage data lives in `trajectory.json` per-step metrics with tool attribution. `TranscriptParser` only parses plain text transcripts and ignores trajectory.json.
-- Harbor task contract requires writing to `/logs/verifier/reward.txt`. MCP integration happens at the agent runner level, not the individual task level.
+- Token usage data lives in `trajectory.json`; plain transcript parsers do not see it.
+- Harbor task contract requires writing `/logs/verifier/reward.txt`.
 
 ### Validation / Scoring
 - `validators.py` is duplicated across `ccb_build` tasks. Changes must be applied to **all copies** (verify with `sha256sum`).
@@ -92,12 +105,10 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Trial directory names are truncated with hash suffixes (e.g., `c_api_graphql_expert_079_archite__pm9xcPn`). The real task name lives in `config.json` at `task.path`.
 - LoCoBench task IDs contain multi-word fields (e.g., `game_engine`, `cross_file_refactoring`). Use the 3-digit task number as a positional anchor for parsing instead of rigid regexes that assume single-word fields.
 
-### Gitignore
-- Unanchored `.gitignore` patterns like `dirname/` match at **any directory level**. Use `/dirname/` to anchor to root only. (e.g., `10figure/` inadvertently blocked `benchmarks/10figure/` from being committed.)
-
 ### Git / Auth
 - `gh auth refresh` without `-s <scope>` is a no-op for adding scopes. Must use `gh auth refresh -h github.com -s write:packages` explicitly.
 - Environment variables must be **explicitly exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
+- Account readiness is tracked in `runs/state/account_health.json`. Launchers source `configs/_common.sh`, filter out unsafe accounts before launch, and record recent runtime rate-limit observations there for operator context.
 - GitHub push protection blocks synthetic/fake API keys in test data. Use `git reset --soft origin/main` to squash intermediate commits that contained fake credentials.
 - Shallow clones (`--depth 1`) fail on push to GitHub with `remote: fatal: did not receive expected object`. Always use full clones for repos that will be pushed.
 - Some repos use `master` as default branch. Detect with `git symbolic-ref refs/remotes/origin/HEAD` and remap to `main` if needed.
@@ -108,15 +119,6 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - `with open(log) as f: subprocess.Popen(stdout=f)` closes the file handle immediately after `Popen()` returns. Use `open()` without context manager for long-running subprocesses.
 - macOS ships Bash 3.2 which lacks associative arrays (`declare -A`). Use pipe-delimited string arrays with `IFS='|' read -r` for compatibility.
 
-### Dashboard / Streamlit
-- Streamlit widget keys in loops must include an index or unique ID to avoid `DuplicateElementKey` errors (e.g., `key=f"nav_{idx}_{page}"` not `key=f"nav_{page}"`).
-- `st.session_state` cannot be modified after widget instantiation. Use `on_click` callback pattern that sets state before widget rerender.
-- Sidebar config below navigation menu is invisible without scrolling. Put critical UI controls in the main content area using `st.columns()`.
-- Always check actual dataclass field names before writing view code. Common mismatches: `agent_results` vs `agent_metrics`, `anomalies` vs `total_anomalies`, dict access vs object attributes.
-- Process handles stored in `st.session_state` are lost on browser refresh. For long-running background processes, use file-based persistent tracking (e.g., `.dashboard_runs/` JSON files) instead.
-- Prefer `st.dataframe` over `st.columns()` with buttons for tabular data -- column layouts squash buttons at narrow viewports.
-- Metric precision matters: use 4+ decimal places for reward/duration comparisons. Rounding to 2 decimals silently loses information needed for meaningful comparison.
-
 ### LLM Judge
 - Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
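Reviewer note: the LoCoBench bullet in this file recommends anchoring on the 3-digit task number rather than assuming single-word fields. One way to sketch that is below. The exact field layout of LoCoBench IDs is not given in this diff, so the sample ID and the three-way split are assumptions, and `split_task_id` is a hypothetical name.

```python
import re

# The first "_NNN_" segment is the anchor: everything before it is the prefix,
# everything after it is the (possibly multi-word) trailing field.
_ANCHOR = re.compile(r"^(?P<prefix>.+?)_(?P<number>\d{3})_(?P<suffix>.+)$")


def split_task_id(task_id: str) -> tuple[str, str, str]:
    """Split a task ID around its 3-digit task number (hypothetical helper)."""
    match = _ANCHOR.match(task_id)
    if match is None:
        raise ValueError(f"no 3-digit task number found in {task_id!r}")
    return match["prefix"], match["number"], match["suffix"]


# Made-up ID that reuses the multi-word fields named in the bullet.
parts = split_task_id("python_game_engine_expert_042_cross_file_refactoring")
print(parts)
```

Because the prefix and suffix are captured whole, underscores inside fields such as `game_engine` do not shift the split point.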

agents/claude_baseline_agent.py

Lines changed: 15 additions & 2 deletions
@@ -941,6 +941,18 @@ def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
             if cmd.command and "claude " in cmd.command:
                 # Start with the base command
                 modified_command = cmd.command
+
+                # Harbor's upstream Claude command pipes stream-json output through
+                # `tee` to write /logs/agent/claude-code.txt. In long-running Claude
+                # sessions this can hang after the final answer if descendant
+                # processes inherit the pipeline FDs, leaving Harbor waiting for EOF
+                # long after the agent has logically finished. Redirect directly to
+                # the transcript file instead; Harbor reads the downloaded file for
+                # post-run analysis, so live tee output is unnecessary here.
+                modified_command = modified_command.replace(
+                    "2>&1 </dev/null | stdbuf -oL tee /logs/agent/claude-code.txt",
+                    "</dev/null >/logs/agent/claude-code.txt 2>&1",
+                )
 
                 # Insert flags after "claude " - build them up incrementally
                 flags_to_insert = []
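Reviewer note: the hunk above swaps the `tee` pipeline for a direct redirect with a plain string replacement. The sketch below isolates that step. The two fragments are copied from the diff; the sample command and the `drop_tee` name are hypothetical, since the full upstream Harbor command is not shown here.

```python
# Fragments copied from the diff above.
TEE_PIPELINE = "2>&1 </dev/null | stdbuf -oL tee /logs/agent/claude-code.txt"
DIRECT_REDIRECT = "</dev/null >/logs/agent/claude-code.txt 2>&1"


def drop_tee(command: str) -> str:
    """Swap the tee pipeline for a direct redirect; unchanged if the fragment is absent."""
    return command.replace(TEE_PIPELINE, DIRECT_REDIRECT)


# Hypothetical upstream command, for illustration only.
sample = "claude -p 'hello' " + TEE_PIPELINE
rewritten = drop_tee(sample)
print(rewritten)

# str.replace is silent when nothing matches, so comparing input and output
# is one way to notice that the upstream command has drifted.
unchanged = drop_tee("echo hi")
print(unchanged == "echo hi")
```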
@@ -1123,8 +1135,9 @@ def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
 
                 modified_command = (
                     f"{setup_cmds}{file_cmds} && "
-                    f"{run_cmd} ; "
-                    "chmod -R a+rX /logs 2>/dev/null || true"
+                    f"{run_cmd}; _run_status=$?; "
+                    "chmod -R a+rX /logs 2>/dev/null || true; "
+                    "exit $_run_status"
                 )
 
                 # CRITICAL: Add autonomous environment variables
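Reviewer note: the second hunk changes which exit status the composed shell command reports. In the old shape the trailing `chmod ... || true` is the last command, so the overall status is always 0. The new shape saves `$?` before cleanup and exits with it afterwards. The sketch below reproduces both shapes under a POSIX `sh`; the failing stand-in command, the `/nonexistent-logs` path, and `run_sh` are assumptions made for the demo.

```python
import subprocess


def run_sh(script: str) -> int:
    """Run a script under sh and return its exit status."""
    return subprocess.run(["sh", "-c", script]).returncode


FAILING_RUN = "sh -c 'exit 7'"  # stand-in for the agent command
CLEANUP = "chmod -R a+rX /nonexistent-logs 2>/dev/null || true"

# Old shape: cleanup runs last, so its `|| true` masks the agent failure.
old_status = run_sh(f"{FAILING_RUN} ; {CLEANUP}")

# New shape: capture $? first, run cleanup, then exit with the saved status.
new_status = run_sh(f"{FAILING_RUN}; _run_status=$?; {CLEANUP}; exit $_run_status")

print(old_status, new_status)  # 0 7
```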
