Retrospective: Crosslink as autonomous agent swarm coordinator — ferrolearn case study #231
Context
On March 4, 2026, crosslink was used as the coordination layer for an autonomous agent swarm that built ferrolearn — a scikit-learn equivalent for Rust — from an empty repository. A single "Phase 0 Coordinator" agent orchestrated 33 subagents (opus and sonnet models) across four phases, producing a 14-crate Cargo workspace with 1,452 passing tests and zero failures. A fifth post-phase effort added PyO3 Python bindings passing 619/619 sklearn check_estimator tests.
The session spanned ~13,500 transcript lines, 1,600+ tool calls, 6+ context window continuations, and roughly 4 hours of wall-clock time.
This retrospective documents what worked, what broke, and what crosslink needs to become a first-class swarm manager.
What went right
1. Issue tracking as persistent memory across context compressions
The coordinator session hit context limits 6+ times and was auto-continued. Each time, crosslink issues + comments survived as the canonical source of truth. The coordinator could re-read `crosslink list` and `crosslink show <id>` after each compression to reconstruct what agents had completed and what remained. This is crosslink's killer feature for swarms — it's the only state that survives context window resets.
2. Typed comments (--kind plan/decision/observation/result) created an auditable build log
Every agent spawn was logged with --kind plan, every completion with --kind result. When debugging merge conflicts or verifying which agents had delivered, the comment trail was the definitive record. The typed taxonomy made it possible to distinguish "what we planned" from "what actually happened."
3. Design-driven development pattern validated
The /design skill produced 5 design documents (116 requirements, 62 acceptance criteria, 0 open questions) that served as direct agent prompts. Agents didn't need to make judgment calls — the design docs contained exact trait signatures, file paths, dependency versions, and acceptance criteria. Crosslink's knowledge repo stored the design docs so they persisted across sessions.
4. crosslink quick streamlined the create-label-work cycle
The coordinator created 10+ issues rapidly. `crosslink quick "title" -p high -l feature` collapses create + label + session work into one command, which was essential for the fast pace of swarm management.
5. Phase gating worked flawlessly across all 4 phases
The coordinator ran `cargo build --workspace && cargo test --workspace` at phase boundaries and only proceeded when all tests passed:
- Phase 1: 230 tests
- Phase 2: 631 tests
- Phase 3: 1,054 tests
- Phase 4: 1,438 tests
- Post-cleanup: 1,452 tests
Zero failures at every gate. Crosslink issues served as the phase transition record.
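That gating loop is simple enough to sketch. The cargo invocations below are the ones from the transcript; the `gate` wrapper and `phase_gate` function are illustrative, not existing tooling:

```shell
# Phase-gate sketch: a phase completes only when the whole workspace is
# green. `gate` wraps any command and names the failing step; the cargo
# lines mirror the invocation used at each phase boundary.
gate() {
  "$@" && return 0
  echo "phase gate FAILED at: $*" >&2
  return 1
}

phase_gate() {
  gate cargo build --workspace && gate cargo test --workspace
}

# The coordinator would run something like: phase_gate && begin_next_phase
gate true && echo "gate passed"
```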
6. Model cost optimization via opus/sonnet allocation
The coordinator deliberately assigned opus to architecturally complex agents (GBM, typed pipeline, backend trait, calibrated classifiers, manifold learning) and sonnet to more mechanical ones (scalers, imputers, additional clustering). This kept costs down while maintaining quality where it mattered.
7. Post-phase test audit caught real quality issues
After Phase 4, a dedicated test audit found: 7/10 sklearn fixture files orphaned (tests referenced wrong paths), many "does it not crash" shape-only tests, and zero cross-crate integration tests. Two cleanup agents (oracle tests + E2E integration) raised the test count from 1,438 to 1,452 while dramatically improving test quality.
What went wrong
1. The coordinator repeatedly guessed wrong crosslink subcommands
Errors encountered in the transcript:
- `crosslink issues list` → should be `crosslink list`
- `crosslink new` → should be `crosslink create`
- `crosslink knowledge update` → should be `crosslink knowledge edit`
- `crosslink knowledge edit --from-doc` → `edit` doesn't accept `--from-doc` (only `add` does)
- `crosslink close --reason` → `--reason` doesn't exist
Impact: Each wrong guess cost a tool call round-trip (~2-3 seconds). Over 1,600+ tool calls, this adds up.
Suggestion: Consider adding common aliases (`new` → `create`, `issues` → `list`) and making the `knowledge edit` subcommand accept `--from-doc` for parity with `knowledge add`.
2. No native swarm/agent coordination primitives
Crosslink tracked issues for individual agents, but the coordinator had to manually:
- Map agent IDs to crosslink issue IDs
- Track which agents were running vs. completed
- Decide merge order
- Detect stuck agents (by polling `TaskOutput` repeatedly)
Suggestion: A `crosslink agent spawn <issue-id>` / `crosslink agent status` / `crosslink agent merge <id>` workflow would make swarm coordination a first-class concept rather than an emergent behavior built on top of issue tracking.
3. Worktree management was fragile
Several problems with agent worktrees:
- Embedded git repository warning when `git add -A` captured `.claude/worktrees/agent-*`
- Worktree branches showed as "not fully merged" because they weren't merged to `origin/main`, requiring `git branch -D` instead of `git branch -d`
- Agent 9 (tree) had its worktree cleaned up before the coordinator could merge it, losing the branch reference
- Some agents committed directly to `dev` while others worked in worktrees, creating an inconsistent merge story
- Worktree cleanup in Phases 3-4 required `--force` due to modified/untracked files left behind by agents
Suggestion: Crosslink should own the worktree lifecycle for agent work — create on spawn, auto-add to .gitignore, merge on completion, clean up after merge verification.
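As a sketch of that lifecycle using plain git in a throwaway repo (a `crosslink`-managed version of this is the suggestion, not an existing command):

```shell
# Worktree lifecycle sketch: create a per-agent worktree on spawn, keep it
# out of `git add -A` via .gitignore, merge on completion, and only then
# remove the worktree and delete the branch.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email ci@example.com
git config user.name ci
git commit -q --allow-empty -m init

echo '.claude/worktrees/' > .gitignore   # keeps agent checkouts out of git add -A
git add .gitignore
git commit -q -m 'ignore agent worktrees'

# spawn: one branch + worktree per agent
git worktree add -b agent-9 .claude/worktrees/agent-9 main >/dev/null

(
  cd .claude/worktrees/agent-9
  echo 'pub mod tree;' > lib.rs
  git add lib.rs
  git commit -q -m 'agent 9: decision trees'
)

# merge on completion; clean up only after the merge is verified
git merge -q --no-ff agent-9 -m 'merge agent 9'
git worktree remove .claude/worktrees/agent-9
git branch -q -d agent-9                 # plain -d now works: branch is merged
```

Because cleanup happens strictly after the merge commit lands, the "worktree removed before merge" failure mode (agent 9) cannot occur.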
4. Hook policy friction during coordinator operation
The `work-check.py` hook blocked `git merge` because it was in `blocked_git_commands`. The user had said "let you claudes go sicko mode" but the config wasn't updated. The coordinator had to:
- Diagnose the hook block
- Ask the user to create a local override
- Wait for the user to respond
- Create `.crosslink/hook-config.local.json`
Suggestion: Consider a `crosslink mode coordinator` that temporarily relaxes git mutation restrictions for the current session, or allow the hook config to specify per-role permissions.
5. Context window exhaustion was the primary scaling bottleneck
The coordinator consumed 6+ full context windows across 4 phases. Each continuation required the system to generate a multi-page summary, and some details were lost in compression. The coordinator had to re-read design docs, re-check branch state, and re-learn the crosslink API after each reset.
Root cause: The coordinator's context was filled by:
- Agent prompts (long, detailed, one per agent)
- TaskOutput polling results (verbose JSON)
- `cargo build`/`cargo test` output
- Git merge conflict resolution
Suggestion: Crosslink could provide a `crosslink swarm status` command that returns a compact summary of all active work, replacing the need for the coordinator to individually poll each agent and reconstruct state from raw tool output.
6. Cargo.lock merge conflicts were the most common merge failure
Every worktree agent that added dependencies produced a Cargo.lock conflict when merging to `dev`. The coordinator resolved these by running `cargo generate-lockfile` after each merge, but this was a recurring tax. In Phase 4, Cargo.lock had to be committed separately before the backend branch could merge cleanly.
Suggestion: This is inherent to Rust workspaces with parallel agents. A `crosslink merge <branch>` command could automate the "merge, detect Cargo.lock conflict, regenerate lockfile, commit" pattern.
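The pattern the coordinator ran by hand can be sketched directly. The `crosslink merge` wrapper is hypothetical, and `cargo generate-lockfile` is stubbed with a `printf` so the sketch runs without a Rust toolchain:

```shell
# Merge-then-regenerate sketch: attempt the merge; if Cargo.lock (and only
# Cargo.lock) conflicts, discard both sides, regenerate, and commit.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b dev
git config user.email ci@example.com
git config user.name ci
printf 'lock v1\n' > Cargo.lock
git add Cargo.lock
git commit -q -m base

git checkout -q -b agent-branch
printf 'lock v2 (agent deps)\n' > Cargo.lock
git commit -qam 'agent adds deps'

git checkout -q dev
printf 'lock v2 (dev deps)\n' > Cargo.lock
git commit -qam 'dev adds deps'

if ! git merge -q agent-branch >/dev/null 2>&1; then
  conflicts=$(git diff --name-only --diff-filter=U)
  if [ "$conflicts" = "Cargo.lock" ]; then
    # Lockfile conflicts are never worth hand-resolving; regenerate instead.
    printf 'regenerated\n' > Cargo.lock   # real workflow: cargo generate-lockfile
    git add Cargo.lock
    git commit -q -m 'merge agent-branch (lockfile regenerated)'
  else
    echo "non-lockfile conflicts need a human: $conflicts" >&2
    exit 1
  fi
fi
```

Anything other than a pure Cargo.lock conflict still bails out to the coordinator, which matches how the ferrolearn-decomp conflicts had to be handled manually.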
7. ferrolearn-decomp was a repeated merge conflict hotspot
In both Phase 3 and Phase 4, agents working on decomposition-adjacent features (NMF/KernelPCA, LDA/FactorAnalysis) modified ferrolearn-decomp's Cargo.toml and lib.rs, causing merge conflicts. The coordinator resolved these manually each time, but this was predictable and avoidable.
Lesson: Crate ownership boundaries should be made explicit in design docs. If two agents must touch the same crate, one should be sequenced after the other rather than running in parallel.
8. ndarray feature flag issue caught agents off guard
Agent 23 (remaining preprocessors) hit `RelativeEq` not implemented for `ArrayBase` — ndarray requires the `approx` feature flag for its approximate-comparison trait impls. This wasn't specified in the design docs and had to be debugged at runtime.
Lesson: Feature flags and conditional compilation dependencies should be enumerated in design documents alongside the main dependency versions.
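Concretely, this is the kind of one-line dependency spec the design docs should have carried. The version numbers below are illustrative (taken from the workspace versions mentioned in this retrospective):

```toml
# ndarray's approximate-comparison traits (RelativeEq etc.) are gated
# behind its `approx` feature; without it, relative-equality assertions
# on arrays fail to compile.
[dependencies]
ndarray = { version = "0.17", features = ["approx"] }
approx = "0.5"
```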
9. No close/changelog integration for agent-generated work
All agent issues were closed with a generic `crosslink close <id>`, producing changelog entries like "Agent 2: ferrolearn-sparse" under "Changed." These are meaningless in a user-facing changelog.
Suggestion: Consider `crosslink close <id> --no-changelog` as the default for agent-internal work, or allow `--changelog-title "Add sparse matrix types"` to override the issue title in the changelog.
10. .gitignore hygiene required post-hoc cleanup
After Phase 4 completion, subcrate `.crosslink/` directories had been committed to the repo. A dedicated cleanup pass was needed to add `**/.crosslink/` to `.gitignore` and remove the tracked files.
Suggestion: `crosslink init` should add `.crosslink/` patterns to `.gitignore` automatically, including for workspace subcrates.
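A single `**/.crosslink/` pattern covers both the workspace root and every subcrate, which can be verified with plain git (the `crosslink init` behavior itself is the suggestion, not something that exists yet):

```shell
# Verify that one `**/.crosslink/` pattern ignores the state directory
# at the workspace root and inside any subcrate. check-ignore exits 0
# when the path is ignored.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
printf '**/.crosslink/\n' > .gitignore
git check-ignore -q .crosslink/state.json
git check-ignore -q crates/some-subcrate/.crosslink/state.json
echo 'both paths ignored'
```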
Phase-by-phase breakdown
Phase 1: Core infrastructure (8 agents)
- Agents 1-8: core types, sparse matrices, metrics, preprocessing, linear models, model selection, fixtures, CI setup
- Model mix: 5 opus, 3 sonnet
- Gate: 230 tests, 0 failures
- Key issues: Worktree `git add -A` captured nested repos; agent 9 worktree cleaned before merge
Phase 2: Algorithm crates (9 agents)
- Agents 9-17: decision trees, k-neighbors, naive Bayes, clustering, decomposition, datasets, I/O, extended metrics, ensemble foundations
- Model mix: 4 opus, 5 sonnet
- Gate: 631 tests, 0 failures
- Key issues: Cargo.lock conflicts on every merge; some agents committed to `dev` directly
Phase 3: Advanced algorithms (8 agents)
- Agents 18-25: GBM/AdaBoost, GMM/Agglomerative, NMF/KernelPCA, Imputers/Selection, remaining preprocessors, backend trait, model-selection/datasets additions, typed pipeline
- Model mix: 4 opus, 4 sonnet
- Gate: 1,054 tests, 0 failures
- Key issues: ferrolearn-decomp merge conflict (two agents modified the same crate); ndarray `approx` feature flag; worktree cleanup required `--force`
Phase 4: Remaining algorithms (8 agents)
- Agents 27-34: PartialFit/SGD, ColumnTransformer, ElasticNet/BayesianRidge/Huber, additional clustering (MeanShift/Spectral/OPTICS), CalibratedClassifierCV/SelfTraining, manifold learning (Isomap/MDS/SpectralEmbedding/LLE), MiniBatchKMeans/IncrementalPCA, LDA/FactorAnalysis/FastICA
- Model mix: 4 opus, 4 sonnet
- Gate: 1,438 tests, 0 failures
- Key issues: Cargo.lock needed a separate commit before the backend merge; ferrolearn-decomp conflict again (two agents modified it in parallel)
Post-Phase 4: Test audit & cleanup
- Test audit findings: 7/10 sklearn fixture files orphaned, many shape-only "does it not crash" tests, zero cross-crate integration tests, zero E2E tests
- Two cleanup agents: oracle test writer (compared Rust output against sklearn fixture values) + E2E integration test writer (cross-crate pipelines)
- Fixture extension: `generate_fixtures.py` expanded to cover all algorithms (RandomForest, KMeans, PCA, GMM, etc.)
- Final gate: 1,452 tests, 0 failures
- PR: ferrolearn#1 — 175 files, ~65,000 lines, 50 commits
Post-Phase 4: PyO3 Python bindings
- Single coordinator session (no subagents): built `ferrolearn-python` crate with PyO3 bindings for all 12 core models
- Python sklearn wrappers: inherit from sklearn `BaseEstimator`, `RegressorMixin`, `ClassifierMixin`, etc.
- check_estimator: 619/619 passed, 0 failed (after 4 rounds of fixes: numpy type coercion, pickle support, classification target validation, error message formatting, `n_iter_` attribute)
- cross_val_score: 9/9 passed
- Key issues: ndarray version mismatch (numpy 0.24 needs ndarray 0.16, workspace uses 0.17 — fixed by upgrading to pyo3 0.28 + numpy 0.28); `_validate_data` removed in sklearn 1.8
- PR: ferrolearn#6 — 26 files, +2,561 lines
By the numbers
| Metric | Phase 1 | Phase 2 | Phase 3 | Phase 4 | Post-Phase | Total |
|---|---|---|---|---|---|---|
| Agents spawned | 8 | 9 | 8 | 8 | 2+1 | 33+1 |
| Tests passing | 230 | 631 | 1,054 | 1,438 | 1,452 | 1,452 |
| Context windows consumed | 2 | 2 | 1 | 1 | 1+ | 6+ |
| Merge conflicts resolved | 2 | 3 | 2 | 2 | 0 | 9 |
| Wrong crosslink commands | 3 | 2 | ~2 | ~1 | 0 | ~8 |
| Crates implemented | 7 | 7 | — | — | 1 | 15 |
Final deliverables:
- 14 Rust crates + 1 Python bindings crate (15 total)
- 175 files, ~65,000 lines of Rust code
- 1,452 Rust tests + 628 Python check_estimator/cross_val_score tests
- 50 commits, 33 subagents across 4 phases
- 2 PRs: #1 (core) + #6 (Python bindings)
Recommendations for crosslink v0.3+
- `crosslink swarm` subcommand group — first-class agent lifecycle management (spawn, status, merge, abort)
- `crosslink knowledge edit --from-doc` — parity with `add`
- Common command aliases — `new` → `create`, `issues` → `list`
- Coordinator mode — session-scoped relaxation of git mutation hooks
- Compact status output — `crosslink swarm status` returning a table, not requiring N individual queries
- Worktree lifecycle ownership — crosslink manages the full create → gitignore → merge → cleanup cycle
- Changelog-aware close — `--no-changelog` default for agent work, or `--changelog-title` override
- Lock auto-release on stale sessions — the stale locks visible in the session context should auto-release after session end
- Crate ownership annotation in design docs — when two agents share a crate, crosslink should warn or enforce sequencing
- Auto-gitignore for `.crosslink/` — `crosslink init` should handle workspace subcrates, not just the root
Conclusion
Crosslink worked remarkably well as an emergent swarm coordinator — the issue tracker, comment system, and knowledge repo provided just enough persistent state to keep a multi-agent build on track across 6+ context window resets and 33 subagents. But the friction points (wrong commands, worktree management, hook conflicts, repeated merge conflicts, no native agent primitives) show that swarm coordination is a natural next step for the tool, not just an incidental use case.
The ferrolearn build proved that design-driven development + crosslink issue tracking + Claude Code agent spawning can produce a substantial, tested codebase (15 crates, 1,452+ tests, ~65,000 lines, plus fully sklearn-compatible Python bindings) from an empty repo in under 4 hours. The bottleneck was not the agents or the code quality — it was the coordination overhead. With first-class swarm primitives, that overhead could be cut in half.