perf(refactor,feature-dev): harness design overhaul + Codex review fixes by zircote · Pull Request #23 · zircote/refactor

zircote · 2026-04-01T15:16:48Z

Summary

Apply Anthropic harness design principles (separated evaluation, context compaction, checkpoint-resume) to /refactor and /feature-dev skills
Fix 3 bugs found by Codex review: detect-project CLI entry point, Vitest parser, TypeScript coverage line mapping
Add regression tests locking in all fixes

Changes

Harness Design Overhaul (both skills)

Separated evaluator: code-reviewer grades implementation against priorities BEFORE tests run — catches self-evaluation blindness
Sprint contracts: architect and evaluator negotiate "done" criteria before each autonomous iteration
Context compaction: phase summaries written to blackboard; raw agent reports discarded from orchestrator context
Checkpoint-resume: orchestrator state saved to blackboard after each phase; resume from prior checkpoint on crash
Blackboard write validation: read-back after critical writes with retry and inline fallback
Structured inter-iteration feedback: iteration_N_weaknesses written after EVALUATE; next MODIFY reads specific targets
Intra-iteration progress reporting: status messages within each iteration for visibility
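
The checkpoint-resume and write-validation items above could look roughly like the sketch below. It assumes the blackboard is a directory of JSON files, which this PR does not state. The names `Blackboard`, `write_validated`, `save_checkpoint`, and `resume_phase` are hypothetical and not taken from the skills.

```python
import json
from pathlib import Path
from typing import Optional


class Blackboard:
    """Hypothetical file-backed blackboard: one JSON file per key."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, key: str, value: dict) -> None:
        (self.root / f"{key}.json").write_text(json.dumps(value))

    def read(self, key: str) -> Optional[dict]:
        path = self.root / f"{key}.json"
        if not path.exists():
            return None
        try:
            return json.loads(path.read_text())
        except json.JSONDecodeError:
            # A truncated or corrupt write is treated as "nothing stored".
            return None


def write_validated(board: Blackboard, key: str, value: dict, retries: int = 2) -> bool:
    """Write, read back, and retry on mismatch.

    Returns False when every attempt fails, so the caller can fall back
    to keeping the value inline in its own context.
    """
    for _ in range(retries + 1):
        board.write(key, value)
        if board.read(key) == value:
            return True
    return False


def save_checkpoint(board: Blackboard, phase: int, summary: str) -> bool:
    """Persist orchestrator state after a phase completes."""
    state = {"last_completed_phase": phase, "summary": summary}
    return write_validated(board, "checkpoint", state)


def resume_phase(board: Blackboard) -> int:
    """Return 0 on a fresh run, else one past the last completed phase."""
    state = board.read("checkpoint")
    return 0 if state is None else state["last_completed_phase"] + 1
```

Storing only the phase summary in the checkpoint, rather than raw agent reports, is what lets the same record serve both compaction and resume in this sketch.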

Refactor-specific

Deferred agent spawning: agents spawn on-demand per phase, not all upfront
Testing-focus fix: test-architect agents spawn in Step 0.9 when testing focus is active (not deferred to Phase 2)
Zero-change gate: standard mode exits early when no files modified in an iteration

Feature-dev-specific

Configurable autonomous thresholds: autonomousMinimumRigorScore and autonomousMinimumCoverage in config
--context-reset flag: full context resets between phases for long autonomous runs
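
As an illustration of the configurable thresholds, the sketch below reads the two field names from a config mapping and applies them as a quality gate. The flat config shape, the helper names, and the default values are assumptions made for the example; the PR does not state where the fields live or what the real defaults are.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutonomousThresholds:
    minimum_rigor_score: float
    minimum_coverage: float


# Placeholder defaults for illustration only; not the skill's actual defaults.
PLACEHOLDER_DEFAULTS = AutonomousThresholds(minimum_rigor_score=0.7, minimum_coverage=80.0)


def load_thresholds(
    config: dict, defaults: AutonomousThresholds = PLACEHOLDER_DEFAULTS
) -> AutonomousThresholds:
    """Read thresholds from config, falling back to defaults per field."""
    return AutonomousThresholds(
        minimum_rigor_score=float(
            config.get("autonomousMinimumRigorScore", defaults.minimum_rigor_score)
        ),
        minimum_coverage=float(
            config.get("autonomousMinimumCoverage", defaults.minimum_coverage)
        ),
    )


def passes_gate(rigor_score: float, coverage: float, thresholds: AutonomousThresholds) -> bool:
    """Both measurements must meet their configured minimum."""
    return (
        rigor_score >= thresholds.minimum_rigor_score
        and coverage >= thresholds.minimum_coverage
    )
```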

Codex Review Fixes

P1: detect_project.py — added __main__ block; updated SKILL.md to use python -m scripts.detect_project
P2: Vitest parser — now matches Tests summary line, not Test Files (was reporting passed=1 instead of passed=42)
P2: TypeScript coverage — maps statement IDs through statementMap to real source line numbers
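
The Vitest fix can be pictured with the sketch below. It is not the code in `scripts/run_tests.py`; the function name and regular expressions are assumptions, and the sample assumes plain-text output without ANSI color codes.

```python
import re

# Matches the "Tests" summary line but not "Test Files", because the
# literal "Tests" must be followed directly by whitespace.
_TESTS_LINE = re.compile(r"^\s*Tests\s+(?P<body>.+)$", re.MULTILINE)
_COUNT = re.compile(r"(\d+)\s+(passed|failed|skipped)")


def parse_vitest_summary(output: str) -> dict:
    """Extract per-test counts, preferring the Tests summary line."""
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    match = _TESTS_LINE.search(output)
    # Fallback: with no Tests line, scan the whole output instead.
    text = match.group("body") if match else output
    for number, label in _COUNT.findall(text):
        counts[label] = int(number)
    return counts


sample = " Test Files  1 passed (1)\n      Tests  42 passed (42)\n"
print(parse_vitest_summary(sample))
```

A parser that takes the first `N passed` in the output would read the `Test Files` line of this sample and report one passing test instead of 42.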

Test Plan

Full test suite passes (95.64% coverage)
Vitest parser regression: test_vitest_prefers_tests_line_over_test_files asserts passed==42
Coverage mapping regression: test_typescript_uncovered_lines_are_source_lines_not_statement_ids asserts [10, 42] not [1, 2]
detect_project CLI: test_cli_entry_point_produces_json verifies python -m scripts.detect_project works
Autoresearch: refactor skill scored 1.0 (up from 0.983), feature-dev scored 1.0 baseline
All pre-commit hooks pass (ruff, mypy, bandit)

…ator, context compaction, checkpoint-resume

Apply Anthropic harness design principles to both skills:

1. Deferred agent spawning (refactor): spawn agents on-demand per phase, not all upfront; avoids wasting resources on early exit
2. Checkpoint-resume protocol (both): write orchestrator state to blackboard after each phase; detect and resume from prior checkpoints
3. Context compaction + phase summaries (both): summarize and discard raw agent reports after each phase; add --context-reset flag for full context resets on long autonomous runs
4. Blackboard write validation (both): read-back after critical writes with retry and inline fallback on failure
5. Separated evaluator pass (both): code-reviewer grades implementation against priorities/architecture BEFORE tests run; catches issues self-evaluation misses. Sprint contracts in autonomous mode.
6. Structured inter-iteration feedback (both): write iteration_N_weaknesses to blackboard after EVALUATE; next MODIFY reads specific targets
7. Lightweight convergence in standard mode (refactor): skip remaining iterations when refactor-code reports zero files modified
8. Configurable autonomous thresholds (feature-dev): move hardcoded rigor/coverage thresholds to config fields
9. Intra-iteration progress reporting (both): brief status messages within each iteration for visibility during long runs

Based on: anthropic.com/engineering/harness-design-long-running-apps

Deferred spawning moved test-planner, test-writer, test-rigor-reviewer, and coverage-analyst to Phase 2, but they are assigned tasks in Phase 1 Steps 1.1 and 1.3 for testing-focus runs. Now spawned in Step 0.9 when testing is in focus areas, preventing task assignment to non-existent agents. Non-testing runs still defer to Phase 2.
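
The spawning rule described above can be sketched as a small decision function. The agent names come from the commit message; the stage labels (`"step-0.9"`, `"phase-2"`), the `"testing"` focus key, and the function itself are hypothetical.

```python
TEST_AGENTS = ("test-planner", "test-writer", "test-rigor-reviewer", "coverage-analyst")


def agents_to_spawn(stage: str, focus_areas: set, already_spawned: set) -> list:
    """Decide which test-architect agents to spawn at a given stage."""
    testing_in_focus = "testing" in focus_areas
    if stage == "step-0.9":
        # Testing-focus runs assign tasks to these agents in Phase 1,
        # so they must exist before Phase 1 starts.
        wanted = TEST_AGENTS if testing_in_focus else ()
    elif stage == "phase-2":
        # Non-testing runs keep the deferred behaviour.
        wanted = TEST_AGENTS
    else:
        wanted = ()
    # Never spawn the same agent twice.
    return [agent for agent in wanted if agent not in already_spawned]
```

Filtering against `already_spawned` keeps the Phase 2 step from re-creating agents that a testing-focus run started earlier.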

…r, coverage line mapping

P1: detect_project.py uses relative imports and had no CLI entry point. Added __main__ block and updated SKILL.md to use python -m scripts.detect_project instead of direct script execution.

P2: Vitest parser matched first 'N passed' from 'Test Files' summary instead of 'Tests' summary, reporting passed=1 for a file with 42 tests. Now explicitly parses the 'Tests' summary line.

P2: TypeScript coverage report returned statement IDs (0, 1, 2...) as uncovered_lines instead of actual source line numbers from statementMap. Now maps statement IDs through statementMap to get real line numbers.
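
The coverage mapping fix can be sketched as follows. It assumes the Istanbul per-file layout, where `s` maps statement IDs to hit counts and `statementMap` maps the same IDs to source locations. The function name is hypothetical and this is not the code in `scripts/coverage_report.py`.

```python
def uncovered_source_lines(file_coverage: dict) -> list:
    """Map uncovered statement IDs to the source lines where they start."""
    statement_map = file_coverage.get("statementMap", {})
    hits = file_coverage.get("s", {})
    lines = set()
    for statement_id, count in hits.items():
        if count > 0:
            continue  # statement was executed at least once
        location = statement_map.get(statement_id)
        if location is None:
            continue  # no location recorded; skip rather than guess
        lines.add(location["start"]["line"])
    return sorted(lines)


sample = {
    "statementMap": {
        "0": {"start": {"line": 3, "column": 0}, "end": {"line": 3, "column": 20}},
        "1": {"start": {"line": 10, "column": 2}, "end": {"line": 10, "column": 30}},
        "2": {"start": {"line": 42, "column": 2}, "end": {"line": 44, "column": 3}},
    },
    "s": {"0": 5, "1": 0, "2": 0},
}
print(uncovered_source_lines(sample))  # → [10, 42]
```

Returning the keys of `s` directly would give `[1, 2]` for this sample, which is the statement-ID output the regression test guards against.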

- Vitest parser: assert passed==42 from Tests line, not passed==1 from Test Files line. Covers mixed pass/fail and fallback cases.
- TypeScript coverage: assert uncovered_lines are source lines [10, 42], not statement IDs [1, 2]. Verifies statementMap lookup.
- detect_project CLI: assert python -m scripts.detect_project produces valid JSON and exits non-zero on bad paths.
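
The CLI regression case can be illustrated with a self-contained sketch. It builds a throwaway `scripts` package whose module uses a relative import, then runs it with `python -m` through `sys.executable`. The module body, the `helpers` file, and the single path argument are stand-ins invented for the example; the real `detect_project.py` output and arguments are not shown in this PR.

```python
import json
import subprocess
import sys
import textwrap
from pathlib import Path

# Stand-in module: the relative import only resolves when the file is
# run as part of its package, which is why `python -m` is required.
MODULE_SOURCE = textwrap.dedent('''
    import json
    import sys
    from pathlib import Path

    from . import helpers


    def main(argv):
        root = Path(argv[0]) if argv else Path.cwd()
        if not root.is_dir():
            print(f"not a directory: {root}", file=sys.stderr)
            return 1
        print(json.dumps(helpers.describe(root)))
        return 0


    if __name__ == "__main__":
        sys.exit(main(sys.argv[1:]))
''')

HELPERS_SOURCE = 'def describe(root):\n    return {"root": str(root)}\n'


def build_demo_package(base: Path) -> None:
    """Create a minimal `scripts` package under base."""
    package = base / "scripts"
    package.mkdir()
    (package / "__init__.py").write_text("")
    (package / "helpers.py").write_text(HELPERS_SOURCE)
    (package / "detect_project.py").write_text(MODULE_SOURCE)


def run_module(base: Path, target: str) -> subprocess.CompletedProcess:
    """Run the module with the current interpreter, not a hardcoded python3."""
    return subprocess.run(
        [sys.executable, "-m", "scripts.detect_project", target],
        cwd=base,
        capture_output=True,
        text=True,
        check=False,
    )
```

A test would then call `build_demo_package` on a temporary directory, assert that `run_module` returns exit code 0 with parseable JSON on stdout for a valid directory, and assert a non-zero exit code for a missing path.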

Copilot

Pull request overview

This PR updates the /refactor and /feature-dev skill harness documentation to incorporate checkpoint/resume, context compaction, and separated evaluator guidance, and it fixes three concrete runtime/parsing issues in the Python support scripts (Vitest test-count parsing, TypeScript coverage line mapping, and a detect_project module entry point) with regression tests.

Changes:

Updated Vitest/TypeScript test output parsing to prefer the Tests summary line over Test Files, with new parser-focused regression tests.
Fixed TypeScript coverage normalization to map uncovered statement IDs through statementMap into real source line numbers, with a regression test.
Added a __main__ entry point for scripts.detect_project and updated skill docs/tests to use python -m scripts.detect_project.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `scripts/run_tests.py` | Improves Vitest parsing by extracting counts from the `Tests` summary line when present. |
| `tests/test_run_tests.py` | Adds/strengthens Vitest parsing regression tests to lock in correct pass/fail counts. |
| `scripts/coverage_report.py` | Maps Istanbul `s` statement IDs to real source line numbers via `statementMap`. |
| `tests/test_coverage_report.py` | Adds regression test ensuring uncovered lines are source lines, not statement IDs. |
| `scripts/detect_project.py` | Adds module entry point behavior for `python -m scripts.detect_project`. |
| `tests/test_detect_project.py` | Adds subprocess-based regression tests verifying module execution behavior. |
| `skills/test-architect/SKILL.md` | Updates documentation to use the module invocation for project detection. |
| `skills/refactor/SKILL.md` | Documents harness design overhaul details (checkpointing, context compaction, evaluator pass, deferred spawning, etc.). |
| `skills/feature-dev/SKILL.md` | Documents harness changes plus configurable autonomous thresholds and checkpointing guidance. |

Comments suppressed due to low confidence (1)

skills/feature-dev/SKILL.md:106

There’s an extra standalone triple-backtick fence here, which will break markdown rendering for the rest of the document. Remove this fence (or replace it with the intended code block content).

After loading config, set: ta_config = config.featureDev.testArchitect ?? { enabled: true, minimumRigorScore: 0.7, minimumCoverage: 80 }. All quality gate comparisons use ta_config.* — never hardcoded values.


### Step 0.0.5: Pre-flight Workspace Cleanup

tests/test_detect_project.py

tests/test_run_tests.py

skills/feature-dev/SKILL.md

zircote added 4 commits April 1, 2026 09:35

Copilot AI review requested due to automatic review settings April 1, 2026 15:16

Copilot started reviewing on behalf of zircote April 1, 2026 15:17

Copilot AI reviewed Apr 1, 2026

fix: address Copilot review feedback on PR #23

a84f81b

- Use sys.executable instead of hardcoded 'python3' in CLI subprocess tests
- Wrap long vitest test string to respect 100-char line-length limit
- Change JSON code block with // comments to jsonc fence language

zircote merged commit d4644e8 into develop Apr 1, 2026
4 checks passed

zircote deleted the feature/harness-overhaul branch April 1, 2026 15:52

zircote merged 5 commits into develop from feature/harness-overhaul

2 participants
