Commit d8348cc

docs: nightly research report 2026-03-19 (report #14)
New findings from a fresh code audit pass targeting previously unreviewed scripts:

- Hardcoded-path epidemic expanded: fix_memory_mb.py:8, extract_build_diary.py:121, plot_build_diary_supplementary.py:121+ all use /home/stephanie_jarmak/CodeScaleBench
- Deprecated claude-opus-4-5-20251101 in shell scripts (rerun_fixed_tasks.sh:34, rerun_zero_mcp_tasks.sh:29); missed when Python/skill files were updated
- configs/run_selected_tasks.sh:648,699,711: mktemp+mv race condition (mv failure swallowed by subshell, subsequent cp targets missing directory)
- csb_metrics/extractors.py:669: FD leak via pathlib.Path.open() form (missed by SIM115 grep sweep that only caught json.load(open(...)) pattern)

Recommended next feature: scripts/fix_common_patterns.py, an automated mass-fix for all known safe mechanical bugs (~60% backlog reduction in one pass).

Updates ROOT_AGENT_GUIDE.md with new learnings; condensed Validation/Scoring, Scripts/Code Quality, and Skills/Automation sections to stay under 12,288-byte limit.
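
One possible shape for the hardcoded-path fix is sketched below. It is an illustration only, not code from the repository: `find_repo_root` is a hypothetical helper, and it walks up parent directories looking for `.git` rather than shelling out to `git rev-parse --show-toplevel`, so it also works where `git` is not on `PATH`.

```python
from pathlib import Path


def find_repo_root(start: Path) -> Path:
    """Walk upward from `start` until a directory containing `.git` is found.

    Hypothetical helper; the scripts named in the commit message do not
    necessarily have anything like it today.
    """
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / ".git").exists():
            return candidate
    raise FileNotFoundError(f"no .git directory found at or above {start}")


# Assumed usage inside a script: resolve paths relative to the checkout
# instead of a fixed /home/<user>/... location.
# REPO_ROOT = find_repo_root(Path(__file__).parent)
```

Resolving the root at runtime keeps the scripts working on any machine and under either checkout name.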
1 parent 7a60bf6 commit d8348cc

4 files changed, +371 -66 lines changed

AGENTS.md

Lines changed: 26 additions & 22 deletions
@@ -96,23 +96,23 @@ full operations manual.
 - **Verifier lib duplication**: 401 copies of `answer_json_verifier_lib.sh` (13 suites; task copies diverged with extra funcs). 275 copies of `dual_score_lib.sh` (csb/ only). `benchmarks/_shared/` missing; every fix requires touching 401+ files.
 
 ### Scripts / Code Quality
-- **abc_audit.py**: 4 functions defined twice (`check_oa_*`, `check_ob_*`, `check_og_*`, `check_t10_*`); Python silently uses last definition.
-- **`rerun_failed.py`**: `shell=True` + dynamic commands (injection risk); `sourcegraph_fulldeepsearch` mapping wrong (invalid MCP type); contains deprecated model ID.
-- **ir_metrics.py `tt_all_r` bug**: Line 749 set comparison may report time-to-first-relevant instead of time-to-all-relevant.
-- **`--skip-completed`**: requires both `result.json` + `task_metrics.json`. Fix: check only `result.json`.
-- **Task registry header stale**: claims 436, actual 274. `sync_task_metadata.py --fix` doesn't update it.
-- **`verification_modes`/`use_case_category` missing from all 274 tasks**: breaks auto-detection + `--use-case-category` filter (silently filters all).
+- `abc_audit.py`: 4 functions defined twice; Python silently uses last definition.
+- `rerun_failed.py`: `shell=True` injection; wrong `sourcegraph_fulldeepsearch` mapping; deprecated model.
+- `ir_metrics.py:749`: `tt_all_r` set comparison bug (first-relevant, not all-relevant).
+- `--skip-completed` requires result.json + task_metrics.json; fix: check only result.json.
+- Task registry header: claims 436, actual 274 (`sync_task_metadata.py --fix` doesn't update it).
+- `verification_modes`/`use_case_category` missing from all 274 tasks; `--use-case-category` silently returns 0.
 
 ### Validation / Scoring
-- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
-- Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash. `timeout 600` on runners; `--forceExit` for Jest; Jest+TS: `memory_mb = 8192`.
-- **CSB dual-score**: file edits + `answer.json` independent. Fallback: `promoted_verifier.py` → `oracle_checks.py` → heuristic.
-- Rate-limited (score=0, <30s): `quarantine_invalid_tasks.py --execute`. Bare `$VAR` in `instruction.md` → use `<placeholder>`.
-- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
-- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
-- **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
-- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147,1005` `mcp_mode or config_name` falls through on empty string (both sites).
-- `models.py` `from_dict()` mutates caller's dict via `.pop()`.
+- `validators.py` duplicated in `ccb_build`; update all copies (`sha256sum`).
+- Agent <2s = never ran. `reward.txt` in Python. `timeout 600`; Jest `--forceExit`; `memory_mb=8192`.
+- CSB dual-score: edits + `answer.json` independent. Fallback: promoted_verifier → oracle_checks → heuristic.
+- Rate-limited (score=0, <30s): `quarantine_invalid_tasks.py --execute`. Bare `$VAR` → `<placeholder>`.
+- Pass rate logic duplicated: `generate_eval_report.py` + `csb_metrics/models.py`.
+- `cost_report.py`: use `or 1` guard (not `.get(..., 1)` which returns 0 for key=0).
+- TARGET_SUITE: 55 stale, 220 missing. `dual_score_lib.sh scorer_artifact` always `"auto"`.
+- Falsy bugs: `max_score=0`, None MCP, promote_run.py non-dict env, eval_report.py:147,1005.
+- `models.py from_dict()` mutates caller dict via `.pop()`.
 
 ### Agent / Runner Robustness
 - **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
@@ -128,9 +128,9 @@ full operations manual.
 - **16 copies of `DIR_PREFIX_TO_SUITE`** across 30+ scripts with divergent definitions. Centralize in `csb_metrics/suite_registry.py`.
 
 ### Skills / Automation
-- **54 stale paths**: 25 skill files hardcode `~/CodeScaleBench` (actual `~/CodeContextBench`). Use `$(git rev-parse --show-toplevel)`.
-- **21 stale `sourcegraph_full` refs**: 14 skill files + 5 schemas. Invalid `BASELINE_MCP_TYPE` value (accepts `none`/`sourcegraph`/`deepsearch`).
-- **3 deprecated model IDs**: `claude-opus-4-5-20251101` → `claude-opus-4-6` in skills.
+- 25 skill files hardcode `~/CodeScaleBench` (use `git rev-parse --show-toplevel`).
+- 14 skill files + 5 schemas have stale `sourcegraph_full` (valid: `none`/`sourcegraph`/`deepsearch`).
+- 3 deprecated model IDs in skills: `claude-opus-4-5-20251101` → `claude-opus-4-6`.
 
 ### Git / Auth
 - `gh auth refresh -h github.com -s write:packages`. Env vars must be **exported** (`set -a` before sourcing `.env.local`).
@@ -150,13 +150,17 @@ full operations manual.
 - Secret-detection false-positives: use `--no-verify` when flagged code is detection logic.
 - Ralph: `prd.json` single-active; archive before overwrite. `prd-archive/` and `prd.json` not gitignored.
 
-### Scripts / Code Quality (Mar 17-18 additions)
-- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench` path; crash on other machines.
+### Scripts / Code Quality (Mar 17-20 additions)
+- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench`; crash on other machines.
 - `context_retrieval_agent.py:432+` `shell=True` without allowlist; injection risk.
 - Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103+`; use temp+rename.
 - Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244+`, `extract_v2_report_data.py:144+`.
-- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE` → old org `CodeScaleBench`; links 404.
-- **Ruff** S603/S604, SIM115, BLE001; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)`. `sanitize_secrets.py`: S105/S106 per-file ignores.
+- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE` → stale org; links 404.
+- Ruff S603/S604, SIM115, BLE001; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)`. `sanitize_secrets.py`: S105/S106 per-file ignores.
+- Hardcoded-path epidemic: `fix_memory_mb.py:8`, `extract_build_diary.py:121`, `plot_build_diary_supplementary.py:121+` also use `/home/stephanie_jarmak/CodeScaleBench`; 5 scripts total.
+- Shell scripts `rerun_fixed_tasks.sh:34`, `rerun_zero_mcp_tasks.sh:29` use deprecated `claude-opus-4-5-20251101`; Ruff misses `.sh`, so add `grep -rn "claude-opus-4-5" scripts/` to CI.
+- `run_selected_tasks.sh:648,699,711`: mktemp+mv race: `mv` failure swallowed by subshell, `cp` targets missing dir.
+- `csb_metrics/extractors.py:669`: FD leak via `tp.open()` (pathlib form; missed by SIM115 grep sweep).
 
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
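
The `cost_report.py` entry in the Validation / Scoring hunk can be reproduced in isolation. The snippet below is a minimal sketch, assuming the zero arrives because the key was read with `[]` earlier; `costs` is a hypothetical stand-in, and the real code in `cost_report.py` is not shown in this diff.

```python
from collections import defaultdict

# Hypothetical stand-in for the per-config cost table.
costs = defaultdict(int)

# Reading a missing key with [] inserts the default (0) as a side effect.
_ = costs["baseline"]

# .get() now finds the stored 0, so its fallback of 1 is never used.
unsafe_divisor = costs.get("baseline", 1)

# `or 1` replaces any falsy value (0 or None) with 1.
safe_divisor = costs.get("baseline") or 1

print(unsafe_divisor, safe_divisor)  # prints: 0 1
```

The `or 1` form trades precision for safety: it also rewrites a legitimate zero to one, which is acceptable for a divisor but not for a value that is reported directly.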

CLAUDE.md

Lines changed: 26 additions & 22 deletions
@@ -96,23 +96,23 @@ full operations manual.
 - **Verifier lib duplication**: 401 copies of `answer_json_verifier_lib.sh` (13 suites; task copies diverged with extra funcs). 275 copies of `dual_score_lib.sh` (csb/ only). `benchmarks/_shared/` missing; every fix requires touching 401+ files.
 
 ### Scripts / Code Quality
-- **abc_audit.py**: 4 functions defined twice (`check_oa_*`, `check_ob_*`, `check_og_*`, `check_t10_*`); Python silently uses last definition.
-- **`rerun_failed.py`**: `shell=True` + dynamic commands (injection risk); `sourcegraph_fulldeepsearch` mapping wrong (invalid MCP type); contains deprecated model ID.
-- **ir_metrics.py `tt_all_r` bug**: Line 749 set comparison may report time-to-first-relevant instead of time-to-all-relevant.
-- **`--skip-completed`**: requires both `result.json` + `task_metrics.json`. Fix: check only `result.json`.
-- **Task registry header stale**: claims 436, actual 274. `sync_task_metadata.py --fix` doesn't update it.
-- **`verification_modes`/`use_case_category` missing from all 274 tasks**: breaks auto-detection + `--use-case-category` filter (silently filters all).
+- `abc_audit.py`: 4 functions defined twice; Python silently uses last definition.
+- `rerun_failed.py`: `shell=True` injection; wrong `sourcegraph_fulldeepsearch` mapping; deprecated model.
+- `ir_metrics.py:749`: `tt_all_r` set comparison bug (first-relevant, not all-relevant).
+- `--skip-completed` requires result.json + task_metrics.json; fix: check only result.json.
+- Task registry header: claims 436, actual 274 (`sync_task_metadata.py --fix` doesn't update it).
+- `verification_modes`/`use_case_category` missing from all 274 tasks; `--use-case-category` silently returns 0.
 
 ### Validation / Scoring
-- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
-- Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash. `timeout 600` on runners; `--forceExit` for Jest; Jest+TS: `memory_mb = 8192`.
-- **CSB dual-score**: file edits + `answer.json` independent. Fallback: `promoted_verifier.py` → `oracle_checks.py` → heuristic.
-- Rate-limited (score=0, <30s): `quarantine_invalid_tasks.py --execute`. Bare `$VAR` in `instruction.md` → use `<placeholder>`.
-- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
-- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
-- **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
-- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147,1005` `mcp_mode or config_name` falls through on empty string (both sites).
-- `models.py` `from_dict()` mutates caller's dict via `.pop()`.
+- `validators.py` duplicated in `ccb_build`; update all copies (`sha256sum`).
+- Agent <2s = never ran. `reward.txt` in Python. `timeout 600`; Jest `--forceExit`; `memory_mb=8192`.
+- CSB dual-score: edits + `answer.json` independent. Fallback: promoted_verifier → oracle_checks → heuristic.
+- Rate-limited (score=0, <30s): `quarantine_invalid_tasks.py --execute`. Bare `$VAR` → `<placeholder>`.
+- Pass rate logic duplicated: `generate_eval_report.py` + `csb_metrics/models.py`.
+- `cost_report.py`: use `or 1` guard (not `.get(..., 1)` which returns 0 for key=0).
+- TARGET_SUITE: 55 stale, 220 missing. `dual_score_lib.sh scorer_artifact` always `"auto"`.
+- Falsy bugs: `max_score=0`, None MCP, promote_run.py non-dict env, eval_report.py:147,1005.
+- `models.py from_dict()` mutates caller dict via `.pop()`.
 
 ### Agent / Runner Robustness
 - **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
@@ -128,9 +128,9 @@ full operations manual.
 - **16 copies of `DIR_PREFIX_TO_SUITE`** across 30+ scripts with divergent definitions. Centralize in `csb_metrics/suite_registry.py`.
 
 ### Skills / Automation
-- **54 stale paths**: 25 skill files hardcode `~/CodeScaleBench` (actual `~/CodeContextBench`). Use `$(git rev-parse --show-toplevel)`.
-- **21 stale `sourcegraph_full` refs**: 14 skill files + 5 schemas. Invalid `BASELINE_MCP_TYPE` value (accepts `none`/`sourcegraph`/`deepsearch`).
-- **3 deprecated model IDs**: `claude-opus-4-5-20251101` → `claude-opus-4-6` in skills.
+- 25 skill files hardcode `~/CodeScaleBench` (use `git rev-parse --show-toplevel`).
+- 14 skill files + 5 schemas have stale `sourcegraph_full` (valid: `none`/`sourcegraph`/`deepsearch`).
+- 3 deprecated model IDs in skills: `claude-opus-4-5-20251101` → `claude-opus-4-6`.
 
 ### Git / Auth
 - `gh auth refresh -h github.com -s write:packages`. Env vars must be **exported** (`set -a` before sourcing `.env.local`).
@@ -150,13 +150,17 @@ full operations manual.
 - Secret-detection false-positives: use `--no-verify` when flagged code is detection logic.
 - Ralph: `prd.json` single-active; archive before overwrite. `prd-archive/` and `prd.json` not gitignored.
 
-### Scripts / Code Quality (Mar 17-18 additions)
-- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench` path; crash on other machines.
+### Scripts / Code Quality (Mar 17-20 additions)
+- `apply_verifier_fixes.py:9` hardcodes `~/CodeScaleBench`; crash on other machines.
 - `context_retrieval_agent.py:432+` `shell=True` without allowlist; injection risk.
 - Non-atomic writes: `aggregate_status.py:669`, `apply_verifier_fixes.py:103+`; use temp+rename.
 - Bare `except:`: `audit_v2_report_data.py:104`, `ds_audit.py:244+`, `extract_v2_report_data.py:144+`.
-- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE` → old org `CodeScaleBench`; links 404.
-- **Ruff** S603/S604, SIM115, BLE001; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)`. `sanitize_secrets.py`: S105/S106 per-file ignores.
+- FD leaks: 17+ sites; use `with open()`. `export_official_results.py:45` `DEFAULT_REPO_BLOB_BASE` → stale org; links 404.
+- Ruff S603/S604, SIM115, BLE001; add `pyproject.toml`. SIM115 skips `Popen(stdout=f)`. `sanitize_secrets.py`: S105/S106 per-file ignores.
+- Hardcoded-path epidemic: `fix_memory_mb.py:8`, `extract_build_diary.py:121`, `plot_build_diary_supplementary.py:121+` also use `/home/stephanie_jarmak/CodeScaleBench`; 5 scripts total.
+- Shell scripts `rerun_fixed_tasks.sh:34`, `rerun_zero_mcp_tasks.sh:29` use deprecated `claude-opus-4-5-20251101`; Ruff misses `.sh`, so add `grep -rn "claude-opus-4-5" scripts/` to CI.
+- `run_selected_tasks.sh:648,699,711`: mktemp+mv race: `mv` failure swallowed by subshell, `cp` targets missing dir.
+- `csb_metrics/extractors.py:669`: FD leak via `tp.open()` (pathlib form; missed by SIM115 grep sweep).
 
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
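
The "use temp+rename" and "use `with open()`" advice in the hunks above could look roughly like the following in Python. This is a sketch under assumptions, not the repository's code: `atomic_write_text` and `read_text_closed` are hypothetical names, and the atomicity of the rename is the POSIX same-filesystem guarantee, which is why the temp file is created next to the target.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str) -> None:
    """Write `text` to a temp file beside `target`, then rename it into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        # os.fdopen takes ownership of fd; the with-block closes it.
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file if the write or the rename failed, then re-raise
        # so the failure is not swallowed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_text_closed(path: Path) -> str:
    """Read via Path.open() inside a with-block so the descriptor is released."""
    with path.open("r", encoding="utf-8") as handle:
        return handle.read()
```

Re-raising on failure matters for the `run_selected_tasks.sh` finding as well: the reported problem there is a failed `mv` going unnoticed, so later steps act on a directory that was never created.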
