perf(refactor,feature-dev): harness design overhaul + Codex review fixes #23

Merged
zircote merged 5 commits into develop from feature/harness-overhaul
Apr 1, 2026
Conversation

@zircote commented Apr 1, 2026

Summary

  • Apply Anthropic harness design principles (separated evaluation, context compaction, checkpoint-resume) to /refactor and /feature-dev skills
  • Fix 3 bugs found by Codex review: detect-project CLI entry point, Vitest parser, TypeScript coverage line mapping
  • Add regression tests locking in all fixes

Changes

Harness Design Overhaul (both skills)

  • Separated evaluator: code-reviewer grades implementation against priorities BEFORE tests run — catches self-evaluation blindness
  • Sprint contracts: architect and evaluator negotiate "done" criteria before each autonomous iteration
  • Context compaction: phase summaries written to blackboard; raw agent reports discarded from orchestrator context
  • Checkpoint-resume: orchestrator state saved to blackboard after each phase; resume from prior checkpoint on crash
  • Blackboard write validation: read-back after critical writes with retry and inline fallback
  • Structured inter-iteration feedback: iteration_N_weaknesses written after EVALUATE; next MODIFY reads specific targets
  • Intra-iteration progress reporting: status messages within each iteration for visibility
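
The checkpoint-resume and write-validation bullets above could be sketched roughly as follows. This is only an illustration: the `Blackboard` class, the `orchestrator_checkpoint` key, and the retry count are hypothetical stand-ins, not the skills' actual blackboard interface.

```python
import json
from typing import Any, Dict, Optional


class Blackboard:
    """Hypothetical in-memory stand-in for the shared blackboard."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self._store[key] = value

    def read(self, key: str) -> Optional[str]:
        return self._store.get(key)


def validated_write(board: Blackboard, key: str, payload: Any, retries: int = 2) -> bool:
    """Write, read back, and retry on mismatch; False means every attempt failed."""
    encoded = json.dumps(payload, sort_keys=True)
    for _ in range(retries + 1):
        board.write(key, encoded)
        if board.read(key) == encoded:
            return True
    return False


def save_checkpoint(board: Blackboard, phase: int, summary: str) -> bool:
    """Persist the compacted phase summary as the orchestrator checkpoint."""
    return validated_write(board, "orchestrator_checkpoint", {"phase": phase, "summary": summary})


def resume_phase(board: Blackboard) -> int:
    """Return the phase to start from: 0 on a fresh run, else the one after the checkpoint."""
    raw = board.read("orchestrator_checkpoint")
    if raw is None:
        return 0
    return json.loads(raw)["phase"] + 1
```

When `validated_write` returns `False`, the caller would keep the payload inline in its own context, which is one way to read the "inline fallback" mentioned above.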

Refactor-specific

  • Deferred agent spawning: agents spawn on-demand per phase, not all upfront
  • Testing-focus fix: test-architect agents spawn in Step 0.9 when testing focus is active (not deferred to Phase 2)
  • Zero-change gate: standard mode exits early when no files modified in an iteration
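
A minimal sketch of the zero-change gate, assuming each iteration can report the list of files it modified. The `refactor_step` callback is hypothetical; in the skill this information comes from the refactor-code agent's report.

```python
from typing import Callable, List


def run_iterations(max_iterations: int, refactor_step: Callable[[int], List[str]]) -> int:
    """Run up to max_iterations and stop early once a step modifies no files.

    Returns the number of iterations that actually ran.
    """
    completed = 0
    for iteration in range(1, max_iterations + 1):
        modified_files = refactor_step(iteration)
        completed = iteration
        if not modified_files:
            # Nothing changed, so further iterations would repeat the same work.
            break
    return completed
```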

Feature-dev-specific

  • Configurable autonomous thresholds: autonomousMinimumRigorScore and autonomousMinimumCoverage in config
  • --context-reset flag: full context resets between phases for long autonomous runs
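
One way the configurable thresholds might be consumed is sketched below. The field names come from this PR; the flat config dictionary and the default values (borrowed from the testArchitect defaults quoted later in the review) are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AutonomousThresholds:
    # Field names mirror the config keys; the defaults are assumed placeholders.
    autonomousMinimumRigorScore: float = 0.7
    autonomousMinimumCoverage: float = 80.0


def load_thresholds(config: Dict[str, Any]) -> AutonomousThresholds:
    """Read thresholds from config, falling back to the defaults for missing keys."""
    defaults = AutonomousThresholds()
    return AutonomousThresholds(
        autonomousMinimumRigorScore=config.get(
            "autonomousMinimumRigorScore", defaults.autonomousMinimumRigorScore
        ),
        autonomousMinimumCoverage=config.get(
            "autonomousMinimumCoverage", defaults.autonomousMinimumCoverage
        ),
    )


def passes_gate(rigor_score: float, coverage: float, thresholds: AutonomousThresholds) -> bool:
    """Both metrics must meet their configured minimum."""
    return (
        rigor_score >= thresholds.autonomousMinimumRigorScore
        and coverage >= thresholds.autonomousMinimumCoverage
    )
```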

Codex Review Fixes

  • P1: detect_project.py — added __main__ block; updated SKILL.md to use python -m scripts.detect_project
  • P2: Vitest parser — now matches Tests summary line, not Test Files (was reporting passed=1 instead of passed=42)
  • P2: TypeScript coverage — maps statement IDs through statementMap to real source line numbers

Test Plan

  • Full test suite passes (95.64% coverage)
  • Vitest parser regression: test_vitest_prefers_tests_line_over_test_files asserts passed==42
  • Coverage mapping regression: test_typescript_uncovered_lines_are_source_lines_not_statement_ids asserts [10, 42] not [1, 2]
  • detect_project CLI: test_cli_entry_point_produces_json verifies python -m scripts.detect_project works
  • Autoresearch: refactor skill scored 1.0 (up from 0.983), feature-dev scored 1.0 baseline
  • All pre-commit hooks pass (ruff, mypy, bandit)

zircote added 4 commits April 1, 2026 09:35
…ator, context compaction, checkpoint-resume

Apply Anthropic harness design principles to both skills:

1. Deferred agent spawning (refactor): spawn agents on-demand per phase,
   not all upfront — avoids wasting resources on early exit
2. Checkpoint-resume protocol (both): write orchestrator state to
   blackboard after each phase; detect and resume from prior checkpoints
3. Context compaction + phase summaries (both): summarize and discard
   raw agent reports after each phase; add --context-reset flag for
   full context resets on long autonomous runs
4. Blackboard write validation (both): read-back after critical writes
   with retry and inline fallback on failure
5. Separated evaluator pass (both): code-reviewer grades implementation
   against priorities/architecture BEFORE tests run — catches issues
   self-evaluation misses. Sprint contracts in autonomous mode.
6. Structured inter-iteration feedback (both): write iteration_N_weaknesses
   to blackboard after EVALUATE; next MODIFY reads specific targets
7. Lightweight convergence in standard mode (refactor): skip remaining
   iterations when refactor-code reports zero files modified
8. Configurable autonomous thresholds (feature-dev): move hardcoded
   rigor/coverage thresholds to config fields
9. Intra-iteration progress reporting (both): brief status messages
   within each iteration for visibility during long runs
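
Item 6 could look something like this sketch, which models the blackboard as a plain dictionary (an assumption) and uses the `iteration_N_weaknesses` key pattern from the commit message. The helper names are hypothetical.

```python
import json
from typing import Dict, List


def weaknesses_key(iteration: int) -> str:
    """Build the blackboard key for a given iteration number."""
    return f"iteration_{iteration}_weaknesses"


def record_weaknesses(board: Dict[str, str], iteration: int, weaknesses: List[str]) -> None:
    """Called after EVALUATE: store the specific weaknesses found in this iteration."""
    board[weaknesses_key(iteration)] = json.dumps(weaknesses)


def targets_for_modify(board: Dict[str, str], iteration: int) -> List[str]:
    """Called before MODIFY: read the previous iteration's weaknesses, if any."""
    raw = board.get(weaknesses_key(iteration - 1))
    return json.loads(raw) if raw else []
```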

Based on: anthropic.com/engineering/harness-design-long-running-apps

Deferred spawning moved test-planner, test-writer, test-rigor-reviewer,
and coverage-analyst to Phase 2, but they are assigned tasks in Phase 1
Steps 1.1 and 1.3 for testing-focus runs. Now spawned in Step 0.9 when
testing is in focus areas, preventing task assignment to non-existent
agents. Non-testing runs still defer to Phase 2.

…r, coverage line mapping

P1: detect_project.py uses relative imports and had no CLI entry point.
Added __main__ block and updated SKILL.md to use python -m scripts.detect_project
instead of direct script execution.
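
The failure mode behind P1 can be reproduced with a throwaway package. The sketch below builds a temporary `scripts` package whose `detect_project.py` uses a relative import, then runs it both ways. The `helpers` module and the JSON payload are invented for the demo and do not reflect the real script's contents.

```python
import json
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path
from typing import Any, Dict, Tuple


def run_demo() -> Tuple[int, int, Dict[str, Any]]:
    """Return (direct exit code, module exit code, JSON printed by the -m run)."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "scripts"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "helpers.py").write_text("NAME = 'demo'\n")
        (pkg / "detect_project.py").write_text(textwrap.dedent('''
            import json
            from .helpers import NAME

            if __name__ == "__main__":
                print(json.dumps({"project": NAME}))
        '''))
        # Direct execution has no parent package, so the relative import fails.
        direct = subprocess.run(
            [sys.executable, str(pkg / "detect_project.py")],
            capture_output=True, text=True,
        )
        # Module execution from the package's parent directory resolves it.
        module = subprocess.run(
            [sys.executable, "-m", "scripts.detect_project"],
            cwd=tmp, capture_output=True, text=True,
        )
        payload = json.loads(module.stdout) if module.returncode == 0 else {}
        return direct.returncode, module.returncode, payload
```

Using `sys.executable` rather than a hardcoded interpreter name keeps both child processes on the same Python as the caller.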

P2: Vitest parser matched first 'N passed' from 'Test Files' summary
instead of 'Tests' summary, reporting passed=1 for a file with 42 tests.
Now explicitly parses the 'Tests' summary line.
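
As a rough illustration of the Vitest fix (not the actual code in scripts/run_tests.py), a parser can anchor on the `Tests` summary line so that counts from the `Test Files` line are never picked up. The function name and the fallback behaviour are assumptions, and ANSI colour codes are not handled here.

```python
import re
from typing import Dict

# Matches the "Tests" summary line only; "Test Files" does not start with "Tests".
_TESTS_LINE = re.compile(r"^\s*Tests\s+(?P<body>.+)$", re.MULTILINE)
_COUNT = re.compile(r"(\d+)\s+(passed|failed|skipped)")


def parse_vitest_summary(output: str) -> Dict[str, int]:
    """Extract passed/failed/skipped test counts from Vitest output."""
    counts = {"passed": 0, "failed": 0, "skipped": 0}
    match = _TESTS_LINE.search(output)
    # Assumed fallback: with no "Tests" line, scan the whole output instead
    # (later matches overwrite earlier ones).
    body = match.group("body") if match else output
    for number, status in _COUNT.findall(body):
        counts[status] = int(number)
    return counts
```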

P2: TypeScript coverage report returned statement IDs (0, 1, 2...) as
uncovered_lines instead of actual source line numbers from statementMap.
Now maps statement IDs through statementMap to get real line numbers.
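
The statementMap lookup can be sketched as below, assuming the standard Istanbul per-file layout in which `s` maps statement IDs to hit counts and `statementMap` maps the same IDs to source locations. The function name is hypothetical and this is not the code in scripts/coverage_report.py.

```python
from typing import Any, Dict, List


def uncovered_source_lines(file_coverage: Dict[str, Any]) -> List[int]:
    """Return sorted source line numbers of statements that were never executed."""
    statement_map = file_coverage.get("statementMap", {})
    hits = file_coverage.get("s", {})
    lines = set()
    for statement_id, count in hits.items():
        if count > 0:
            continue
        location = statement_map.get(statement_id)
        if location is None:
            # No location for this ID: skip it rather than report the ID as a line.
            continue
        lines.add(location["start"]["line"])
    return sorted(lines)
```

The uncovered statements are reported by their starting line, so a statement spanning several lines contributes a single entry.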

- Vitest parser: assert passed==42 from Tests line, not passed==1 from
  Test Files line. Covers mixed pass/fail and fallback cases.
- TypeScript coverage: assert uncovered_lines are source lines [10, 42],
  not statement IDs [1, 2]. Verifies statementMap lookup.
- detect_project CLI: assert python -m scripts.detect_project produces
  valid JSON and exits non-zero on bad paths.
Copilot AI review requested due to automatic review settings April 1, 2026 15:16
Copilot AI left a comment

Pull request overview

This PR updates the /refactor and /feature-dev skill harness documentation to incorporate checkpoint/resume, context compaction, and separated evaluator guidance, and it fixes three concrete runtime/parsing issues in the Python support scripts (Vitest test-count parsing, TypeScript coverage line mapping, and a detect_project module entry point) with regression tests.

Changes:

  • Updated Vitest/TypeScript test output parsing to prefer the Tests summary line over Test Files, with new parser-focused regression tests.
  • Fixed TypeScript coverage normalization to map uncovered statement IDs through statementMap into real source line numbers, with a regression test.
  • Added a __main__ entry point for scripts.detect_project and updated skill docs/tests to use python -m scripts.detect_project.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/run_tests.py | Improves Vitest parsing by extracting counts from the Tests summary line when present. |
| tests/test_run_tests.py | Adds/strengthens Vitest parsing regression tests to lock in correct pass/fail counts. |
| scripts/coverage_report.py | Maps Istanbul `s` statement IDs to real source line numbers via statementMap. |
| tests/test_coverage_report.py | Adds regression test ensuring uncovered lines are source lines, not statement IDs. |
| scripts/detect_project.py | Adds module entry point behavior for python -m scripts.detect_project. |
| tests/test_detect_project.py | Adds subprocess-based regression tests verifying module execution behavior. |
| skills/test-architect/SKILL.md | Updates documentation to use the module invocation for project detection. |
| skills/refactor/SKILL.md | Documents harness design overhaul details (checkpointing, context compaction, evaluator pass, deferred spawning, etc.). |
| skills/feature-dev/SKILL.md | Documents harness changes plus configurable autonomous thresholds and checkpointing guidance. |

Comments suppressed due to low confidence (1)

skills/feature-dev/SKILL.md:106

  • There’s an extra standalone triple-backtick fence here, which will break markdown rendering for the rest of the document. Remove this fence (or replace it with the intended code block content).

After loading config, set: ta_config = config.featureDev.testArchitect ?? { enabled: true, minimumRigorScore: 0.7, minimumCoverage: 80 }. All quality gate comparisons use ta_config.* — never hardcoded values.


### Step 0.0.5: Pre-flight Workspace Cleanup

- Use sys.executable instead of hardcoded 'python3' in CLI subprocess tests
- Wrap long vitest test string to respect 100-char line-length limit
- Change JSON code block with // comments to jsonc fence language
@zircote merged commit d4644e8 into develop Apr 1, 2026
4 checks passed
@zircote deleted the feature/harness-overhaul branch April 1, 2026 15:52