refactor: entire codebase — autonomous convergence (3 iterations) by zircote · Pull Request #22 · zircote/refactor

zircote · 2026-03-31T17:07:54Z

Summary

Autonomous refactoring of the entire scripts/ codebase using a 6-agent swarm with convergence loop.

3 iterations, 100% keep rate, composite score 0.9125 → 0.9700 (+6.3%)
12 files changed (911 insertions, 239 deletions)
795 tests passing, 97.95% coverage
1 bug fixed: silent data loss in audit logging (coverage_pct key mismatch)

Quality Scores

Metric	Score
Clean Code	8/10
Architecture	8/10
Security Posture	9/10

Key Changes

Unified LanguageConfig dataclass registry (consolidates 7 dispatch dicts across 3 modules)
7 TypedDicts replace dict[str, Any] returns for type safety
Fixed silent bug: coverage_pct key mismatch caused audit logs to record None
Extracted run_subprocess() helper (single auditable subprocess security surface)
Dict-based dispatch replaces if/elif chains and key-sniffing patterns
Removed dead code (unused path param in detect_test_framework())
Added 398 lines of gap coverage tests

Test Plan

795/795 tests passing throughout all iterations
97.95% code coverage (above 80% minimum)
Zero regressions across all iterations
ruff, mypy, bandit all clean
Security review: PASS (no regressions, subprocess centralized)

Convergence Report

See refactor-result-20260331-130522.md for full details.

… feature-dev Implements Karpathy autoresearch pattern for source code improvement with composite scoring (tests 50% + quality 25% + security 25%), git branch snapshots for keep/discard gating, and automatic convergence detection (perfect score, stuck, plateau, max iterations). New: convergence-reporter agent, code-reviewer Mode 5, scripts/ (git_snapshot.sh, score.sh, results_log.sh), algorithm reference, 6 eval cases, explanation + how-to docs.

- Fix README: version badge 3.1.0→4.0.0, agent count 7→8, add convergence-reporter to agent list, add autonomous mode to quick start/features/docs table - Add frontmatter to use-autonomous-mode.md and autonomous-convergence.md - Restructure use-autonomous-mode.md with overview/prerequisites/steps/ verification/related sections matching how-to pattern - Fix config reference: version default 3.1→4.0, remove duplicate CLI flags table, update stale version strings in examples - Add tutorial-autonomous.md filling the tutorial quadrant gap - Add v4.0.0 section to architecture.md explanation - Add 4 autonomous mode troubleshooting entries - Add cross-references between all autonomous mode docs - Fix tutorial agent count references and version strings - Update use-feature-dev.md version string and prerequisite

- Delete feature-dev-workspace/ iteration results and skill snapshots - Delete refactor-workspace/ iteration results and skill snapshots

- Add test-architect skill with 4 modes (full, plan, eval, coverage) - Add 4 specialist agents: test-planner, test-writer, test-rigor-reviewer, coverage-analyst - Add 3 commands: /test-gen, /test-plan, /test-eval - Add reference materials for property testing, boundary analysis, mutation testing - Add project detection and coverage report scripts - Add test-architect evals and hooks - Update plugin.json, CHANGELOG, and existing skill definitions

New documentation (4 files): - Tutorial: Your First Test Architecture - How-to: Generate and Evaluate Tests - How-to: Evaluate Test Quality - Explanation: Formal Test Design Techniques Updated documentation (8 files): - agents.md: add 4 test-architect agents (8->12 total) - quality-scores.md: add rigor score rubric and coverage verdicts - configuration.md: add test-architect config and --focus=testing - focus-refactoring.md: add testing focus area - troubleshooting.md: add 3 test-architect entries - architecture.md: add v4.1.0 section Structural: - Move tutorials from docs/ root to docs/tutorials/ - Update all cross-references across 18 files

- Create docs/README.md with full Diataxis index, coverage matrix, and directory structure - Update root README: 8->12 agents, add test-architect skill, fix tutorial paths to docs/tutorials/, add 4 new doc entries

…ments, pr-fix) - cp: stage, commit, push with conventional commits - ff: fast-forward merge only - fr: fetch and rebase onto remote - sync: full fetch, rebase, push cycle - prune: clean stale local branches (dry-run default) - pr: create/update/manage PRs (draft default) - review-comments: confidence-scored PR comment review - pr-fix: 10-phase PR remediation workflow - All use gh CLI exclusively - Includes autoresearch-compatible evals and trigger-evals

- cp, ff, fr, sync, prune, pr, review-comments, pr-fix - Positive and negative trigger tests per skill - Cross-skill routing accuracy tests (fr vs ff vs sync, pr vs pr-fix vs review-comments)

- cp: clarify individual file staging (0.975 → 1.00) - fr: add commit count reporting and stash pop warning (0.936 → 1.00) - pr: add natural language intent mapping table (0.98 → 1.00) - prune: structured counting, case handlers, force-mode messaging (0.681 → 1.00) - review-comments: --score-only mode, per-dimension flagging (0.88 → 1.00) - sync: argument parsing, conflict halt, force-push discipline (0.475 → 1.00)

- ff: execution policy, precise commit counting, divergence explanation (0.571 → 1.00) - pr-fix: push before thread resolution ordering (0.96 → 1.00)

- Add MANDATORY SWARM ORCHESTRATION blocks to refactor, feature-dev, test-architect - TeamCreate is now a blocking prerequisite with retry + stop on failure - team_name parameter documented as REQUIRED on every Agent spawn - SendMessage reminder after each spawn to prevent idle teammates - Prevents model from falling back to plain Agent subagents

Add explicit continuation directives to Phase 0.1 steps in SKILL.md to prevent agents from stalling between blackboard_create and TaskCreate. Add regression evals (IDs 7, 8) verifying the full initialization sequence completes without interruption.

Replace refactor-test with the full test-architect pipeline (test-planner → test-writer → test-rigor-reviewer → coverage-analyst) as a mandatory, non-optional part of the feature-dev workflow. Key changes: - New Phase 4.5: Test Architecture Planning — test-planner produces scientifically grounded test plans against chosen architecture - Phase 5: test-writer replaces refactor-test for plan-driven test generation with mutation-aware assertions - Phase 6: test-rigor-reviewer + coverage-analyst now mandatory (not conditional) with configurable quality gates (minimumRigorScore, minimumCoverage) that block feature completion - Autonomous mode: test plan is stable fitness function, not rewritten per iteration - Config: testArchitect section under featureDev with enabled flag and threshold defaults - Fix/Override/Abandon gate with max 2 re-validation loops - Error handling fallbacks for missing test plans or coverage tools

Update 7 documentation files to reflect the Phase 4.5 test architecture planning integration and mandatory quality gates in feature-dev: - tutorial-feature-dev.md: Add Phase 4.5 step, replace refactor-test with test-writer, add quality gate example, update learning goals - use-feature-dev.md: Add Phase 4.5 section, expand quality review with rigor/coverage gates, add testArchitect config documentation - agents.md: Update feature-dev agent list (8 agents), add /feature-dev invocation points for all 4 test-architect agents, fix multi-instance table, update autonomous test freeze behavior - configuration.md: Add testArchitect config section with enabled flag, minimumRigorScore, minimumCoverage fields and quality gate behavior - architecture.md: Add v4.2.0 section explaining the integration rationale, Phase 4.5 timing, stable test_plan contract, and gates - README.md: Cross-reference test-architect docs from Feature-Dev row - troubleshooting.md: Add feature-dev test plan and quality gate troubleshooting entries

Allow agents to inherit the model from the parent session instead of being pinned to sonnet across all 12 agent definitions.

All 5 pushing skills (pr, cp, pr-fix, feature-dev, refactor) now fetch and rebase onto the target branch before pushing or creating PRs, guaranteeing branches are always current with upstream. Key changes: - pr: rebase before first push (no force-with-lease needed) - cp: sync with remote before push, conditional force-with-lease - pr-fix: rebase before remediation (phase reorder) - feature-dev/refactor: fetch/rebase before PR creation - sync: conditional force-with-lease after rebase - git_snapshot.sh: git clean -fd on restore for completeness - Secret exclusion added to feature-dev and refactor staging Autonomous convergence: 4 iterations, score 0.738 → 0.980 Quality: 4.0 → 9.7, Security: 5.5 → 9.5, Tests: 68/68 pass

Addresses issues #2–#11 from /cog-discover assessment: - #2: Bootstrap pytest test suite (741 tests, 88% coverage) - #3: Bridge 27 eval JSON files to parametrized pytest assertions - #4: Configure ruff linter and formatter - #5: Add security scanning (pip-audit + bandit + dependabot pip) - #6: Structured error handling with custom exception hierarchy - #7: Test fixtures and data management (7 fixture files, factory pattern) - #8: Release automation (.github/workflows/release.yml) - #9: Regression tests for 6 past bug fixes - #10: Refactor long functions (extract 4 helpers) - #11: Property-based testing with Hypothesis (14 properties) Also fixes: - Agent reaping: shutdown timeout, guaranteed cleanup, stale detection - Bug: parse_json_output empty dict falsy-or (found by Hypothesis) - Bug: parse_coverage crash on non-dict JSON/NaN (found by Hypothesis) - Workspace cleanup: skills now rm -rf workspace dirs in finalization - COD-010: moved import re to module top in coverage_report.py

When --autonomous is set, both refactor and feature-dev skills now bypass ALL AskUserQuestion prompts and use highest-confidence best practices instead: - refactor: skip config setup (use defaults), skip scope confirmation, auto-fix findings >= 80 confidence, commit without confirmation - feature-dev: skip elicitation (use assumptions), skip clarification (use codebase patterns), auto-select architecture (convention-aligned), skip implementation approval, auto-resolve review findings Previously --autonomous only controlled the convergence loop while still blocking on 5+ interactive gates per run.

- ff: clarify pre-flight step with explicit clean/dirty branching - pr: show existing PR URL/number when duplicate detected (Step C.4) - pr-fix: move dry-run stop to Phase 3, reorder remediate-before-rebase, add Step 3.1 triage summary display - review-comments: fix "let me decide" intent detection to trigger interactive mode instead of score-only

Generated by autoresearch eval-doctor with 60%+ deterministic coverage: - feature-dev: 10 evals, 48 deterministic checks, 33 LLM expectations - refactor: 10 evals, 43 deterministic checks, ~47 LLM expectations - test-architect: 10 evals, 60 deterministic checks, 40 LLM expectations

Autonomous convergence loop assessed 17 domains, disabled 6 N/A domains for CLI plugin profile, and improved quality across 6 iterations (3 kept, 1 reverted, 1 rebase). New files: - CONTRIBUTING.md — dev setup, testing, PR guidelines - Makefile — 11 self-documenting targets (lint, test, format, etc.) - SECURITY.md — security model, incident response, deprecation policy - docs/REQUIREMENTS.md — capabilities, NFRs, edge cases, non-goals - docs/adr/ — 3 ADRs (swarm orchestration, zero deps, hypothesis) - .github/ISSUE_TEMPLATE/ — bug report + feature request forms - .github/PULL_REQUEST_TEMPLATE.md — PR checklist - .vscode/launch.json — Python debug configurations - .cogitations/ — assessment config, results, fallback data Modified: - .github/CODEOWNERS — security-sensitive path annotations - .cogitations/config.yaml — 11 active domains, 8 item suppressions

Add eng-principles ontology covering 17 engineering domains, 6 entity types, scoring traits, and discovery patterns for automatic namespace suggestion during memory capture.

3 iterations (2 kept, 1 reverted): - Suppress N/A items: ARC-004, DEX-012, DEX-013 (+2.0) - Unify error handling: structured exceptions, dead code removed, error dicts replaced with proper raises (+0.2) - Reverted: license scanning + GOV unsuppression (net-negative) 748 tests passing, 88.79% coverage.

Let the orchestrating agent determine the appropriate model per agent based on task complexity at spawn time, rather than pinning all agents to inherit from the parent.

…v skills Add mandatory pre-flight cleanup (Step 0.1.5 / 0.0.5) to remove stale *-autonomous/ and *-workspace/ directories before team creation. Expand shutdown finally-blocks (Step 4.3 / 7.4) to unconditionally remove working directories as a safety net, and verify .gitignore coverage.

…ning Analyzes a project's languages, tooling, CI/CD pipelines, and conventions to recommend and implement tailored git hooks. Detects existing hook managers (husky, pre-commit, lefthook) and works within them. Supports interactive, --auto (bulk provisioning), and --dry-run modes. Includes dormant config detection, tiered recommendations, and /version-guard integration for all versioned artifacts.

Intelligent GitHub workplan manager for issues, discussions, milestones, labels, and project boards. Supports --audit, --auto, --dry-run modes across single and multi-repo scopes. Eval suite: 52 checks (34 deterministic + 18 LLM) covering stale detection, triage, and label audit operations. Evals hardened via 3-iteration autoresearch loop targeting arithmetic accuracy, near- duplicate detection, and output structure verification.

…ation - Update ruff-pre-commit v0.9.10 → v0.15.7 - Update mirrors-mypy v1.14.1 → v1.19.1 - Update bandit 1.8.3 → 1.9.4 - Scope mypy hook to scripts/ only (matches CI) - Exclude tests/ from bandit hook (hardcoded /tmp in fixtures) - Add B404/B603 to bandit skips (intentional subprocess usage) - Fix ruff import sorting in scripts/__init__.py

…nfig - Add pr-review skill: size-scaled PR code review with swarm orchestration for large PRs (500+ lines), hygiene checks, batched GitHub review submission - Add copilot-setup skill: interactive elicitation for Copilot coding agent configuration with --init, --audit, --improve, and --deploy modes - Generate .github/copilot-instructions.md with project-specific conventions - Add .github/instructions/testing.instructions.md for test file patterns - Add copilot-setup-steps.yml for Copilot environment setup (uv + dev deps) - Add copilot-auto-merge.yml with path-based auto-merge policy (docs/evals/tests auto-merge, scripts/agents/skills require review)

…egration - Add references/copilot-agent-mechanics.md with verified agent behavior, limitations, and staleness warnings pointing to online verification - Add references/auto-merge-patterns.md with path-based merge strategies - Add templates/ with placeholder-based skeletons for all generated files: copilot-instructions.md.tmpl, copilot-setup-steps.yml.tmpl, copilot-auto-merge.yml.tmpl - All templates include VERIFY comments for /version-guard - Update SKILL.md to reference bundled resources and mandate version verification before generating workflow files

- 5-phase audit: discovery → spec comprehension → implementation audit → enterprise readiness → synthesis (+ optional issue creation via /gh-work) - Discovers project structure empirically — never hardcodes paths or toolchain - Uses refactor plugin agents (code-reviewer, architect, test-rigor-reviewer) - Supports --discovery-only, --focus=<area>, --skip-issues flags - Add references/discovery-checklist.md with language-specific detection - Add references/enterprise-readiness-criteria.md with scoring rubric - Add 3 eval cases covering discovery-only, focused, and full audit modes - Update copilot-setup dashboard from autoresearch v2 run

When a project is onboarded to cogitations (.cogitations/config.yaml exists), Phase 3 uses cogitations domain assessors, profile weights, and tier scoring instead of the standalone enterprise readiness rubric. Audit findings cross-reference cogitations domains with score impact. Falls back to standalone rubric for non-onboarded projects with a suggestion to run /cog-init.

When cogitations is onboarded, run domain assessors for active domains AND the standalone rubric for disabled dimensions (observability, resilience, performance). Cogitations often disables domains that are structurally N/A (e.g., observability for plugins) — the rubric fills those gaps. Merged scoring in Phase 4 presents both systems unified.

- Establish develop branch as the active development branch - All feature work branches from and merges back to develop - main remains protected, receives merges from develop for releases - Document build/test commands, commit conventions, project structure

Three locations in feature-dev SKILL.md used ambiguous or underspecified inline execution instead of the full TaskCreate → TaskUpdate → SendMessage swarm pattern: - Step 5.4: test verification said "team lead or via feature-code agent" — now always delegates to test-writer with proper task protocol - Step 6.4 autonomous fix cycle: said "create tasks" without specifying the full pattern — now has explicit TaskCreate/TaskUpdate/SendMessage - Step 6.4 interactive fix cycle: same underspecification — now explicit

…oard - SKILL.md: three-phase ideation workflow (elicitation, development, plan) with codebase-grounded research and /feature-dev handoff - evals/evals.json: four evaluation scenarios covering vague ideas, formed plans, multi-idea comparison, and premature spec prevention - ideate-dashboard.html: evaluation results visualization dashboard

- Phase 1: upgrade from suggestion to hard requirement with rationale - Phase 2.3: tighten iteration feedback to mandate AskUserQuestion - Phase 3: require AskUserQuestion for handoff confirmation - Multi-idea comparison: make tool usage explicit for candidate selection - Behavior section: add blanket rule — no inline text questions

- check-test-write.sh now exits early for non-test files - hooks.json restructured to object format with matcher and timeout

- copilot-setup-dashboard.html - gh-work-dashboard.html - ideate-dashboard.html

Port gpm-work from github-project-manager as gh-do with standalone operation. Supports single item, sweep, and swarm modes for processing GitHub issues, PRs, and discussions end-to-end. Includes error handling for missing items, disabled discussions, and explicit PR size label thresholds. Evals cover all 3 modes and 3 item types.

Single-agent skill that reads CLAUDE.md as project constitution, elicits session goals, snapshots board state, generates confidence-scored changesets, and executes via gh CLI/GraphQL. Chrome DevTools MCP for UI-only ops (views, workflows) gated behind --ui-ops flag. - 7 phases: config, constitution, board snapshot, goal elicitation, changeset generation, execution, UI ops, summary - 4 goal strategies: board-hygiene, sprint-plan, prioritize, board-sync - --autonomous, --dry-run, --ui-ops flags - projectPlan config section in refactor.config.json - 11 evals + 25 trigger-evals

…ity Posture 9/10 Autonomous convergence loop: 3 iterations, 100% keep rate, score 0.9125 → 0.9700 (+6.3%). Key changes: - Unified LanguageConfig dataclass registry (consolidates 7 dispatch dicts) - 7 TypedDicts replace dict[str, Any] returns for type safety - Fixed silent bug: coverage_pct key mismatch in audit logging - Extracted run_subprocess() helper (centralizes subprocess security surface) - Replaced key-sniffing and if/elif chains with dict dispatch - Removed dead code (unused path param) - Added 398 lines of gap coverage tests (89% → 100% before refactoring) 12 files changed, 880 insertions, 239 deletions. 795 tests passing, 97.95% coverage.

- Restore backward-compatible detect_test_framework() signature (accepts both single-arg and deprecated two-arg forms) - Fix score.sh auto-normalization to normalize each score independently (prevents mixed-scale corruption) - Upgrade pygments 2.19.2→2.20.0 (CVE-2026-4539) - Upgrade requests 2.32.5→2.33.1 (CVE-2026-25645)

zircote added 30 commits March 19, 2026 16:19

chore: remove eval workspace artifacts

51ee240

- Delete feature-dev-workspace/ iteration results and skill snapshots - Delete refactor-workspace/ iteration results and skill snapshots

docs: add docs/README.md index and update root README

64ed425

- Create docs/README.md with full Diataxis index, coverage matrix, and directory structure - Update root README: 8->12 agents, add test-architect skill, fix tutorial paths to docs/tutorials/, add 4 new doc entries

test: add skill-creator format evals for 8 gh-ported skills

1bb353c

- cp, ff, fr, sync, prune, pr, review-comments, pr-fix - Positive and negative trigger tests per skill - Cross-skill routing accuracy tests (fr vs ff vs sync, pr vs pr-fix vs review-comments)

perf: apply autoresearch improvements to ff and pr-fix skills

520065d

- ff: execution policy, precise commit counting, divergence explanation (0.571 → 1.00) - pr-fix: push before thread resolution ordering (0.96 → 1.00)

refactor: change all agent models from sonnet to inherit

ec979fb

Allow agents to inherit the model from the parent session instead of being pinned to sonnet across all 12 agent definitions.

cogitations: snapshot v1

a922141

cogitations: snapshot v2

b60c846

cogitations: snapshot v0

161a4a5

cogitations: snapshot v1

dd9261f

cogitations: snapshot v2

c2a2dcd

cogitations: snapshot v3

b736268

cogitations: snapshot v5

f819ada

cogitations: snapshot v6

fa5daff

feat: deploy Atlatl engineering ontology for cogitations

a34c513

Add eng-principles ontology covering 17 engineering domains, 6 entity types, scoring traits, and discovery patterns for automatic namespace suggestion during memory capture.

zircote added 29 commits March 22, 2026 06:39

cogitations: snapshot v11

ee4183a

cogitations: snapshot v13

78473c4

cogitations: snapshot v14

a2f5698

cogitations: snapshot v15

66fa913

cogitations: snapshot v16

e79a701

cogitations: snapshot v17

531f723

chore: update results log for iterations 11-17

c007b9d

refactor: remove model: inherit from all agent definitions

d214ed2

Let the orchestrating agent determine the appropriate model per agent based on task complexity at spawn time, rather than pinning all agents to inherit from the parent.

chore: add copilot-setup autoresearch dashboard

111467d

fix(hooks): add test-file guard and migrate to new hooks format

8959171

- check-test-write.sh now exits early for non-test files - hooks.json restructured to object format with matcher and timeout

chore: remove autoresearch dashboard HTML files

944f59b

- copilot-setup-dashboard.html - gh-work-dashboard.html - ideate-dashboard.html

zircote closed this Apr 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: entire codebase — autonomous convergence (3 iterations)#22

refactor: entire codebase — autonomous convergence (3 iterations)#22
zircote wants to merge 62 commits intomainfrom
develop

zircote commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zircote commented Mar 31, 2026

Summary

Quality Scores

Key Changes

Test Plan

Convergence Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant