perf(refactor,feature-dev): harness design overhaul + Codex review fixes #23
…ator, context compaction, checkpoint-resume

Apply Anthropic harness design principles to both skills:

1. Deferred agent spawning (refactor): spawn agents on demand per phase rather than all upfront, which avoids wasting resources on early exit.
2. Checkpoint-resume protocol (both): write orchestrator state to the blackboard after each phase; detect and resume from prior checkpoints.
3. Context compaction + phase summaries (both): summarize and discard raw agent reports after each phase; add a --context-reset flag for full context resets on long autonomous runs.
4. Blackboard write validation (both): read back after critical writes, with retry and inline fallback on failure.
5. Separated evaluator pass (both): code-reviewer grades the implementation against priorities/architecture BEFORE tests run, catching issues that self-evaluation misses. Sprint contracts in autonomous mode.
6. Structured inter-iteration feedback (both): write iteration_N_weaknesses to the blackboard after EVALUATE; the next MODIFY reads specific targets.
7. Lightweight convergence in standard mode (refactor): skip remaining iterations when refactor-code reports zero files modified.
8. Configurable autonomous thresholds (feature-dev): move hardcoded rigor/coverage thresholds to config fields.
9. Intra-iteration progress reporting (both): brief status messages within each iteration for visibility during long runs.

Based on: anthropic.com/engineering/harness-design-long-running-apps
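
Items 2 and 4 above (checkpoint-resume and blackboard write validation) can be illustrated with a minimal sketch. It assumes the blackboard is a JSON file on disk, which the commit message does not state; `write_validated`, `load_checkpoint`, and the file name are hypothetical names chosen for illustration, not the skills' actual protocol.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


def write_validated(path: Path, state: dict, retries: int = 2, delay: float = 0.0) -> bool:
    """Write state as JSON, read it back, and retry on mismatch.

    Returns True when the read-back matches. False tells the caller to
    fall back to carrying the state inline (hypothetical convention).
    """
    payload = json.dumps(state, sort_keys=True)
    for _attempt in range(retries + 1):
        try:
            path.write_text(payload, encoding="utf-8")
            if path.read_text(encoding="utf-8") == payload:
                return True
        except OSError:
            pass  # treat I/O errors the same as a failed read-back
        time.sleep(delay)
    return False


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the prior checkpoint, or None if it is absent or unreadable."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "orchestrator_state.json"  # hypothetical file name
    written = write_validated(checkpoint, {"phase": 2, "iteration": 1})
    resumed = load_checkpoint(checkpoint)  # state to resume from on the next run
```

A caller would check `load_checkpoint` at startup and skip completed phases when it returns a state; when `write_validated` returns False, the state would be kept inline instead of on the blackboard.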
Deferred spawning moved test-planner, test-writer, test-rigor-reviewer, and coverage-analyst to Phase 2, but they are assigned tasks in Phase 1 Steps 1.1 and 1.3 for testing-focus runs. They are now spawned in Step 0.9 when testing is in the focus areas, which prevents task assignment to non-existent agents. Non-testing runs still defer spawning to Phase 2.
…r, coverage line mapping

- P1: detect_project.py uses relative imports and had no CLI entry point. Added a __main__ block and updated SKILL.md to use python -m scripts.detect_project instead of direct script execution.
- P2: The Vitest parser matched the first 'N passed' from the 'Test Files' summary instead of the 'Tests' summary, reporting passed=1 for a file with 42 tests. It now explicitly parses the 'Tests' summary line.
- P2: The TypeScript coverage report returned statement IDs (0, 1, 2...) as uncovered_lines instead of actual source line numbers from statementMap. It now maps statement IDs through statementMap to get real line numbers.
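
The Vitest fix can be sketched as follows. This is not the code in scripts/run_tests.py; `parse_vitest_counts` is a hypothetical name, the sample summary text is an assumed shape of Vitest's default output with ANSI colour codes already stripped, and the fallback behaviour is a guess at what a "fallback case" might mean.

```python
import re

# Matches the "Tests" summary line only; "Test Files" does not match because
# the literal "Tests" must be followed by whitespace.
TESTS_LINE = re.compile(r"^\s*Tests\s+(.*)$", re.MULTILINE)
COUNT = re.compile(r"(\d+)\s+(passed|failed|skipped)")


def parse_vitest_counts(output: str) -> dict:
    """Extract pass/fail/skip counts, preferring the 'Tests' summary line."""
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    match = TESTS_LINE.search(output)
    # Assumed fallback: scan the whole output when no "Tests" line exists.
    scope = match.group(1) if match else output
    for number, label in COUNT.findall(scope):
        counts[label] = int(number)
    return counts


sample = " Test Files  1 passed (1)\n      Tests  42 passed (42)\n"
counts = parse_vitest_counts(sample)  # passed is 42, not 1
```

Anchoring on the line label rather than on the first `N passed` occurrence is what keeps the file-level count from shadowing the test-level count.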
- Vitest parser: assert passed==42 from the Tests line, not passed==1 from the Test Files line. Covers mixed pass/fail and fallback cases.
- TypeScript coverage: assert uncovered_lines are source lines [10, 42], not statement IDs [1, 2]. Verifies the statementMap lookup.
- detect_project CLI: assert python -m scripts.detect_project produces valid JSON and exits non-zero on bad paths.
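
The statementMap lookup that the TypeScript coverage test checks can be sketched like this. It assumes the Istanbul per-file coverage shape, where `s` maps statement IDs to hit counts and `statementMap` maps the same IDs to source locations; `uncovered_source_lines` is a hypothetical name, and reporting only each statement's start line is a simplification of this sketch rather than a claim about scripts/coverage_report.py.

```python
def uncovered_source_lines(file_coverage: dict) -> list:
    """Map statement IDs with zero hits to their starting source lines."""
    statement_map = file_coverage.get("statementMap", {})
    lines = set()
    for statement_id, hits in file_coverage.get("s", {}).items():
        # Skip IDs with no location entry rather than reporting the raw ID.
        if hits == 0 and statement_id in statement_map:
            lines.add(statement_map[statement_id]["start"]["line"])
    return sorted(lines)


# Illustrative data: statements "1" and "2" are never executed.
sample = {
    "statementMap": {
        "0": {"start": {"line": 3, "column": 0}, "end": {"line": 3, "column": 20}},
        "1": {"start": {"line": 10, "column": 2}, "end": {"line": 10, "column": 30}},
        "2": {"start": {"line": 42, "column": 2}, "end": {"line": 44, "column": 3}},
    },
    "s": {"0": 5, "1": 0, "2": 0},
}
uncovered = uncovered_source_lines(sample)  # [10, 42] rather than the IDs [1, 2]
```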
Pull request overview
This PR updates the /refactor and /feature-dev skill harness documentation to incorporate checkpoint/resume, context compaction, and separated evaluator guidance, and it fixes three concrete runtime/parsing issues in the Python support scripts (Vitest test-count parsing, TypeScript coverage line mapping, and a detect_project module entry point) with regression tests.
Changes:
- Updated Vitest/TypeScript test output parsing to prefer the `Tests` summary line over `Test Files`, with new parser-focused regression tests.
- Fixed TypeScript coverage normalization to map uncovered statement IDs through `statementMap` into real source line numbers, with a regression test.
- Added a `__main__` entry point for `scripts.detect_project` and updated skill docs/tests to use `python -m scripts.detect_project`.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `scripts/run_tests.py` | Improves Vitest parsing by extracting counts from the `Tests` summary line when present. |
| `tests/test_run_tests.py` | Adds/strengthens Vitest parsing regression tests to lock in correct pass/fail counts. |
| `scripts/coverage_report.py` | Maps Istanbul `s` statement IDs to real source line numbers via `statementMap`. |
| `tests/test_coverage_report.py` | Adds regression test ensuring uncovered lines are source lines, not statement IDs. |
| `scripts/detect_project.py` | Adds module entry point behavior for `python -m scripts.detect_project`. |
| `tests/test_detect_project.py` | Adds subprocess-based regression tests verifying module execution behavior. |
| `skills/test-architect/SKILL.md` | Updates documentation to use the module invocation for project detection. |
| `skills/refactor/SKILL.md` | Documents harness design overhaul details (checkpointing, context compaction, evaluator pass, deferred spawning, etc.). |
| `skills/feature-dev/SKILL.md` | Documents harness changes plus configurable autonomous thresholds and checkpointing guidance. |
Comments suppressed due to low confidence (1)
skills/feature-dev/SKILL.md:106
- There’s an extra standalone triple-backtick fence here, which will break markdown rendering for the rest of the document. Remove this fence (or replace it with the intended code block content).
After loading config, set: ta_config = config.featureDev.testArchitect ?? { enabled: true, minimumRigorScore: 0.7, minimumCoverage: 80 }. All quality gate comparisons use ta_config.* — never hardcoded values.
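
The fallback described in the line above can be sketched in Python. The default values come from the quoted text; the function names and the use of `>=` for the gate comparisons are assumptions made for illustration.

```python
DEFAULT_TA_CONFIG = {"enabled": True, "minimumRigorScore": 0.7, "minimumCoverage": 80}


def resolve_ta_config(config: dict) -> dict:
    """Mirror `config.featureDev.testArchitect ?? {...}`: default only when absent or null."""
    ta_config = (config.get("featureDev") or {}).get("testArchitect")
    return dict(DEFAULT_TA_CONFIG) if ta_config is None else ta_config


def passes_quality_gate(ta_config: dict, rigor_score: float, coverage: float) -> bool:
    """Compare against configured thresholds only (assumed >= semantics)."""
    return (
        rigor_score >= ta_config["minimumRigorScore"]
        and coverage >= ta_config["minimumCoverage"]
    )


ta_config = resolve_ta_config({"featureDev": {}})  # falls back to the defaults
gate_ok = passes_quality_gate(ta_config, rigor_score=0.75, coverage=82)
```

Reading every threshold through `ta_config` keeps a project-level override from being silently ignored by a stray literal in a later step.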
### Step 0.0.5: Pre-flight Workspace Cleanup
- Use sys.executable instead of hardcoded 'python3' in CLI subprocess tests
- Wrap long vitest test string to respect 100-char line-length limit
- Change JSON code block with // comments to jsonc fence language
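
The first point can be illustrated with a small sketch of a subprocess helper. The standard-library module `json.tool` is used as a stand-in so that the snippet runs anywhere; the real tests target `scripts.detect_project`, and `run_module` is a hypothetical helper name.

```python
import json
import subprocess
import sys


def run_module(module: str, stdin_text: str = "") -> subprocess.CompletedProcess:
    """Run `python -m <module>` with the interpreter that is running the tests."""
    return subprocess.run(
        [sys.executable, "-m", module],  # avoids assuming a 'python3' binary on PATH
        input=stdin_text,
        capture_output=True,
        text=True,
        check=False,
        timeout=30,
    )


# Stand-in module: json.tool echoes valid JSON and exits non-zero on bad input.
good = run_module("json.tool", '{"ok": true}')
bad = run_module("json.tool", "not json")
parsed = json.loads(good.stdout)
```

Using `sys.executable` keeps the child process in the same virtual environment as the test runner, which matters on systems where `python3` is missing or points at a different installation.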
Summary
- `/refactor` and `/feature-dev` skills

Changes

Harness Design Overhaul (both skills)

- `iteration_N_weaknesses` written after EVALUATE; next MODIFY reads specific targets

Refactor-specific

Feature-dev-specific

- `autonomousMinimumRigorScore` and `autonomousMinimumCoverage` in config
- `--context-reset` flag: full context resets between phases for long autonomous runs

Codex Review Fixes

- `detect_project.py`: added `__main__` block; updated SKILL.md to use `python -m scripts.detect_project`
- `Tests` summary line, not `Test Files` (was reporting `passed=1` instead of `passed=42`)
- `statementMap` to real source line numbers

Test Plan

- `test_vitest_prefers_tests_line_over_test_files` asserts `passed==42`
- `test_typescript_uncovered_lines_are_source_lines_not_statement_ids` asserts `[10, 42]` not `[1, 2]`
- `test_cli_entry_point_produces_json` verifies `python -m scripts.detect_project` works