
Retrospective: Crosslink as autonomous agent swarm coordinator — ferrolearn case study #231

@dollspace-gay

Description


Context

On March 4, 2026, crosslink was used as the coordination layer for an autonomous agent swarm that built ferrolearn — a scikit-learn equivalent for Rust — from an empty repository. A single "Phase 0 Coordinator" agent orchestrated 33 subagents (opus and sonnet models) across four phases, producing a 14-crate Cargo workspace with 1,452 passing tests and zero failures. A fifth post-phase effort added PyO3 Python bindings passing 619/619 sklearn check_estimator tests.

The session spanned ~13,500 transcript lines, 1,600+ tool calls, 6+ context window continuations, and roughly 4 hours of wall-clock time.

This retrospective documents what worked, what broke, and what crosslink needs to become a first-class swarm manager.


What went right

1. Issue tracking as persistent memory across context compressions

The coordinator session hit context limits 6+ times and was auto-continued. Each time, crosslink issues + comments survived as the canonical source of truth. The coordinator could re-read crosslink list and crosslink show <id> after each compression to reconstruct what agents had completed and what remained. This is crosslink's killer feature for swarms — it's the only state that survives context window resets.

2. Typed comments (--kind plan/decision/observation/result) created an auditable build log

Every agent spawn was logged with --kind plan, every completion with --kind result. When debugging merge conflicts or verifying which agents had delivered, the comment trail was the definitive record. The typed taxonomy made it possible to distinguish "what we planned" from "what actually happened."

3. Design-driven development pattern validated

The /design skill produced 5 design documents (116 requirements, 62 acceptance criteria, 0 open questions) that served as direct agent prompts. Agents didn't need to make judgment calls — the design docs contained exact trait signatures, file paths, dependency versions, and acceptance criteria. Crosslink's knowledge repo stored the design docs so they persisted across sessions.

4. crosslink quick streamlined the create-label-work cycle

The coordinator created 10+ issues rapidly. The crosslink quick "title" -p high -l feature form, which collapses create + label + session work into one command, was essential to the fast pace of swarm management.

5. Phase gating worked flawlessly across all 4 phases

The coordinator ran cargo build --workspace && cargo test --workspace at phase boundaries and only proceeded when all tests passed:

  • Phase 1: 230 tests
  • Phase 2: 631 tests
  • Phase 3: 1,054 tests
  • Phase 4: 1,438 tests
  • Post-cleanup: 1,452 tests

Zero failures at every gate. Crosslink issues served as the phase transition record.
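The gate check itself is mechanically simple; a minimal sketch, assuming cargo test's standard "test result:" summary line (emitted once per test binary):

```python
import re

def gate_passed(cargo_test_output: str) -> bool:
    """True only if every `test result:` summary reports ok with 0 failures."""
    summaries = re.findall(
        r"test result: (\w+)\. (\d+) passed; (\d+) failed", cargo_test_output
    )
    if not summaries:
        return False  # no summary lines at all: treat as a failed gate
    return all(status == "ok" and int(failed) == 0
               for status, _passed, failed in summaries)
```

The coordinator only advanced a phase when this predicate held for the whole workspace.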

6. Model cost optimization via opus/sonnet allocation

The coordinator deliberately assigned opus to architecturally complex agents (GBM, typed pipeline, backend trait, calibrated classifiers, manifold learning) and sonnet to more mechanical ones (scalers, imputers, additional clustering). This kept costs down while maintaining quality where it mattered.

7. Post-phase test audit caught real quality issues

After Phase 4, a dedicated test audit found: 7/10 sklearn fixture files orphaned (tests referenced wrong paths), many "does it not crash" shape-only tests, and zero cross-crate integration tests. Two cleanup agents (oracle tests + E2E integration) raised the test count from 1,438 to 1,452 while dramatically improving test quality.


What went wrong

1. The coordinator repeatedly guessed wrong crosslink subcommands

Errors encountered in the transcript:

  • crosslink issues list → should be crosslink list
  • crosslink new → should be crosslink create
  • crosslink knowledge update → should be crosslink knowledge edit
  • crosslink knowledge edit --from-doc → edit doesn't accept --from-doc (only add does)
  • crosslink close --reason → the --reason flag doesn't exist

Impact: Each wrong guess cost a tool call round-trip (~2-3 seconds). Over 1,600+ tool calls, this adds up.

Suggestion: Consider adding common aliases (new → create, issues → list) and making the knowledge edit subcommand accept --from-doc for parity with knowledge add.

2. No native swarm/agent coordination primitives

Crosslink tracked issues for individual agents, but the coordinator had to manually:

  • Map agent IDs to crosslink issue IDs
  • Track which agents were running vs. completed
  • Decide merge order
  • Detect stuck agents (by polling TaskOutput repeatedly)

Suggestion: A crosslink agent spawn <issue-id> / crosslink agent status / crosslink agent merge <id> workflow would make swarm coordination a first-class concept rather than an emergent behavior built on top of issue tracking.
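None of that bookkeeping requires much machinery. A sketch of what the coordinator tracked by hand (class and field names are hypothetical, not a crosslink API):

```python
from dataclasses import dataclass

@dataclass
class AgentRecord:
    agent_id: str
    issue_id: str   # crosslink issue tracking this agent's deliverable
    branch: str
    status: str = "running"  # running -> completed -> merged

class SwarmRegistry:
    """Agent/issue mapping, liveness, and merge order (completion order)."""

    def __init__(self):
        self._agents = {}       # agent_id -> AgentRecord
        self._merge_queue = []  # completed agents awaiting merge

    def spawn(self, agent_id, issue_id, branch):
        self._agents[agent_id] = AgentRecord(agent_id, issue_id, branch)

    def complete(self, agent_id):
        self._agents[agent_id].status = "completed"
        self._merge_queue.append(agent_id)

    def next_to_merge(self):
        return self._merge_queue[0] if self._merge_queue else None

    def mark_merged(self, agent_id):
        self._merge_queue.remove(agent_id)
        self._agents[agent_id].status = "merged"

    def running(self):
        return [a.agent_id for a in self._agents.values()
                if a.status == "running"]
```

A native version would persist this in crosslink's own store so it, too, survived context resets.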

3. Worktree management was fragile

Several problems with agent worktrees:

  • Embedded git repository warning when git add -A captured .claude/worktrees/agent-*
  • Worktree branches showed as "not fully merged" because they weren't merged to origin/main, requiring git branch -D instead of git branch -d
  • Agent 9 (tree) had its worktree cleaned up before the coordinator could merge it, losing the branch reference
  • Some agents committed directly to dev while others worked in worktrees, creating an inconsistent merge story
  • Worktree cleanup in Phases 3-4 required --force due to modified/untracked files left behind by agents

Suggestion: Crosslink should own the worktree lifecycle for agent work — create on spawn, auto-add to .gitignore, merge on completion, clean up after merge verification.
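A sketch of that lifecycle, expressed as the git command sequences involved (the paths and branch naming are assumptions based on this session's layout):

```python
def spawn_worktree(agent_id: str, base: str = "dev") -> list:
    """Commands to give one agent an isolated worktree branched from `base`."""
    return [
        ["git", "worktree", "add", "-b", f"agent/{agent_id}",
         f".claude/worktrees/{agent_id}", base],
    ]

def cleanup_worktree(agent_id: str) -> list:
    """Commands to tear down after the merge has been verified.

    --force tolerates stray files agents leave behind; -D is needed because
    branches merged only to dev show as 'not fully merged' against origin/main.
    """
    return [
        ["git", "worktree", "remove", "--force",
         f".claude/worktrees/{agent_id}"],
        ["git", "branch", "-D", f"agent/{agent_id}"],
    ]
```

Running cleanup only after merge verification would have prevented the Agent 9 lost-branch incident.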

4. Hook policy friction during coordinator operation

The work-check.py hook blocked git merge because it was in blocked_git_commands. The user had said "let you claudes go sicko mode" but the config wasn't updated. The coordinator had to:

  1. Diagnose the hook block
  2. Ask the user to create a local override
  3. Wait for the user to respond
  4. Create .crosslink/hook-config.local.json

Suggestion: Consider a crosslink mode coordinator that temporarily relaxes git mutation restrictions for the current session, or allow the hook config to specify per-role permissions.

5. Context window exhaustion was the primary scaling bottleneck

The coordinator consumed 6+ full context windows across 4 phases. Each continuation required the system to generate a multi-page summary, and some details were lost in compression. The coordinator had to re-read design docs, re-check branch state, and re-learn the crosslink API after each reset.

Root cause: The coordinator's context was filled by:

  • Agent prompts (long, detailed, one per agent)
  • TaskOutput polling results (verbose JSON)
  • cargo build/test output
  • Git merge conflict resolution

Suggestion: Crosslink could provide a crosslink swarm status command that returns a compact summary of all active work, replacing the need for the coordinator to individually poll each agent and reconstruct state from raw tool output.
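For illustration, a compact formatter over per-agent payloads (the field names are hypothetical; the real TaskOutput schema may differ):

```python
import json

def swarm_status(task_outputs: list) -> str:
    """Collapse verbose per-agent JSON into one line per agent."""
    rows = []
    for raw in task_outputs:
        d = json.loads(raw)
        rows.append(f"{d['agent']:<12} {d['status']:<10} "
                    f"{d.get('summary', '')[:60]}")
    return "\n".join(rows)
```

A table like this costs a few hundred tokens per poll instead of thousands, which matters when context windows are the scaling bottleneck.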

6. Cargo.lock merge conflicts were the most common merge failure

Every worktree agent that added dependencies produced a Cargo.lock conflict when merging to dev. The coordinator resolved these by running cargo generate-lockfile after each merge, but this was a recurring tax. In Phase 4, Cargo.lock had to be committed separately before the backend branch could merge cleanly.

Suggestion: This is inherent to Rust workspaces with parallel agents. A crosslink merge <branch> command could automate the "merge, detect Cargo.lock conflict, regenerate lockfile, commit" pattern.
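A sketch of that automation, assuming git and cargo on PATH; the conflict detection is a pure function over git status --porcelain output:

```python
import subprocess

def lockfile_conflicted(porcelain: str) -> bool:
    """True if `git status --porcelain` reports Cargo.lock as unmerged."""
    return any(
        line[:2] in ("UU", "AA") and line[3:].strip() == "Cargo.lock"
        for line in porcelain.splitlines()
    )

def merge_with_lockfile_regen(branch: str) -> None:
    """The recurring pattern: merge, detect a Cargo.lock conflict,
    regenerate the lockfile, commit."""
    subprocess.run(["git", "merge", "--no-ff", branch], check=False)
    porcelain = subprocess.run(["git", "status", "--porcelain"],
                               capture_output=True, text=True).stdout
    if lockfile_conflicted(porcelain):
        subprocess.run(["cargo", "generate-lockfile"], check=True)
        subprocess.run(["git", "add", "Cargo.lock"], check=True)
        subprocess.run(["git", "commit", "--no-edit"], check=True)
```

Regenerating rather than hand-merging the lockfile is safe here because the workspace Cargo.toml files, not the lockfile, are the source of truth for dependencies.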

7. ferrolearn-decomp was a repeated merge conflict hotspot

In both Phase 3 and Phase 4, agents working on decomposition-adjacent features (NMF/KernelPCA, LDA/FactorAnalysis) modified ferrolearn-decomp's Cargo.toml and lib.rs, causing merge conflicts. The coordinator resolved these manually each time, but this was predictable and avoidable.

Lesson: Crate ownership boundaries should be made explicit in design docs. If two agents must touch the same crate, one should be sequenced after the other rather than running in parallel.
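That sequencing rule is mechanically checkable. A sketch that plans waves from declared crate ownership (the ownership map itself would come from the design docs):

```python
def schedule_waves(agent_crates: dict) -> list:
    """Greedy wave assignment: no two agents in the same wave share a crate."""
    waves = []  # each wave: {"agents": [...], "crates": set(...)}
    for agent, crates in agent_crates.items():
        crates = set(crates)
        for wave in waves:
            if wave["crates"].isdisjoint(crates):
                wave["agents"].append(agent)
                wave["crates"] |= crates
                break
        else:
            waves.append({"agents": [agent], "crates": crates})
    return [w["agents"] for w in waves]
```

With this, the two ferrolearn-decomp agents would have been placed in separate waves automatically instead of colliding in both Phase 3 and Phase 4.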

8. ndarray feature flag issue caught agents off guard

Agent 23 (remaining preprocessors) hit RelativeEq not implemented for ArrayBase — ndarray requires the approx feature flag for approximate comparison traits. This wasn't specified in the design docs and had to be debugged at runtime.

Lesson: Feature flags and conditional compilation dependencies should be enumerated in design documents alongside the main dependency versions.
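For reference, the shape of the fix in the crate's Cargo.toml (ndarray gates its AbsDiffEq/RelativeEq impls behind its approx feature; the exact versions shown follow this workspace and may differ elsewhere):

```toml
[dependencies]
# Without the `approx` feature, ndarray compiles out the comparison trait
# impls, and assert_relative_eq! on ArrayBase fails to type-check.
ndarray = { version = "0.17", features = ["approx"] }

[dev-dependencies]
approx = "0.5"
```

One line in a design doc would have saved a runtime debugging round-trip.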

9. No close/changelog integration for agent-generated work

All agent issues were closed with generic crosslink close <id>, producing changelog entries like "Agent 2: ferrolearn-sparse" under "Changed." These are meaningless in a user-facing changelog.

Suggestion: Consider crosslink close <id> --no-changelog as the default for agent-internal work, or allow --changelog-title "Add sparse matrix types" to override the issue title in the changelog.

10. .gitignore hygiene required post-hoc cleanup

After Phase 4 completion, subcrate .crosslink/ directories had been committed to the repo. A dedicated cleanup pass was needed to add **/.crosslink/ to .gitignore and remove the tracked files.

Suggestion: crosslink init should add .crosslink/ patterns to .gitignore automatically, including for workspace subcrates.
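Until then, these are the patterns this session ended up needing in the root .gitignore:

```gitignore
# crosslink local state, at the root and in every workspace subcrate
**/.crosslink/

# agent worktrees (avoids the embedded-repository warning on `git add -A`)
.claude/worktrees/
```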


Phase-by-phase breakdown

Phase 1: Core infrastructure (8 agents)

  • Agents 1-8: core types, sparse matrices, metrics, preprocessing, linear models, model selection, fixtures, CI setup
  • Model mix: 5 opus, 3 sonnet
  • Gate: 230 tests, 0 failures
  • Key issues: Worktree git add -A captured nested repos; agent 9 worktree cleaned before merge

Phase 2: Algorithm crates (9 agents)

  • Agents 9-17: decision trees, k-neighbors, naive Bayes, clustering, decomposition, datasets, I/O, extended metrics, ensemble foundations
  • Model mix: 4 opus, 5 sonnet
  • Gate: 631 tests, 0 failures
  • Key issues: Cargo.lock conflicts on every merge; some agents committed to dev directly

Phase 3: Advanced algorithms (8 agents)

  • Agents 18-25: GBM/AdaBoost, GMM/Agglomerative, NMF/KernelPCA, Imputers/Selection, remaining preprocessors, backend trait, model-selection/datasets additions, typed pipeline
  • Model mix: 4 opus, 4 sonnet
  • Gate: 1,054 tests, 0 failures
  • Key issues: ferrolearn-decomp merge conflict (two agents modified same crate); ndarray approx feature flag; worktree cleanup required --force

Phase 4: Remaining algorithms (8 agents)

  • Agents 27-34: PartialFit/SGD, ColumnTransformer, ElasticNet/BayesianRidge/Huber, additional clustering (MeanShift/Spectral/OPTICS), CalibratedClassifierCV/SelfTraining, manifold learning (Isomap/MDS/SpectralEmbedding/LLE), MiniBatchKMeans/IncrementalPCA, LDA/FactorAnalysis/FastICA
  • Model mix: 4 opus, 4 sonnet
  • Gate: 1,438 tests, 0 failures
  • Key issues: Cargo.lock needed separate commit before backend merge; ferrolearn-decomp conflict again (two agents modified it in parallel again)

Post-Phase 4: Test audit & cleanup

  • Test audit findings: 7/10 sklearn fixture files orphaned, many shape-only "does it not crash" tests, zero cross-crate integration tests, zero E2E tests
  • Two cleanup agents: oracle test writer (compared Rust output against sklearn fixture values) + E2E integration test writer (cross-crate pipelines)
  • Fixture extension: generate_fixtures.py expanded to cover all algorithms (RandomForest, KMeans, PCA, GMM, etc.)
  • Final gate: 1,452 tests, 0 failures
  • PR: ferrolearn#1 — 175 files, ~65,000 lines, 50 commits

Post-Phase 4: PyO3 Python bindings

  • Single coordinator session (no subagents): Built ferrolearn-python crate with PyO3 bindings for all 12 core models
  • Python sklearn wrappers: Inherit from sklearn BaseEstimator, RegressorMixin, ClassifierMixin, etc.
  • check_estimator: 619/619 passed, 0 failed (after 4 rounds of fixes: numpy type coercion, pickle support, classification target validation, error message formatting, n_iter_ attribute)
  • cross_val_score: 9/9 passed
  • Key issues: ndarray version mismatch (numpy 0.24 needs ndarray 0.16, workspace uses 0.17 — fixed by upgrading to pyo3 0.28 + numpy 0.28); _validate_data removed in sklearn 1.8
  • PR: ferrolearn#6 — 26 files, +2,561 lines

By the numbers

Metric                      Phase 1   Phase 2   Phase 3   Phase 4   Post-Phase   Total
Agents spawned              8         9         8         8         2+1          33+1
Tests passing               230       631       1,054     1,438     1,452        1,452
Context windows consumed    2         2         1         1         1+           6+
Merge conflicts resolved    2         3         2         2         0            9
Wrong crosslink commands    3         2         ~2        ~1        0            ~8
Crates implemented          7         7         –         –         1            15

Final deliverables:

  • 14 Rust crates + 1 Python bindings crate (15 total)
  • 175 files, ~65,000 lines of Rust code
  • 1,452 Rust tests + 628 Python check_estimator/cross_val_score tests
  • 50 commits, 33 subagents across 4 phases
  • 2 PRs: #1 (core) + #6 (Python bindings)

Recommendations for crosslink v0.3+

  1. crosslink swarm subcommand group — first-class agent lifecycle management (spawn, status, merge, abort)
  2. crosslink knowledge edit --from-doc — parity with add
  3. Common command aliases: new → create, issues → list
  4. Coordinator mode — session-scoped relaxation of git mutation hooks
  5. Compact status output: crosslink swarm status returning a table, not requiring N individual queries
  6. Worktree lifecycle ownership — crosslink manages the full create → gitignore → merge → cleanup cycle
  7. Changelog-aware close: --no-changelog default for agent work, or --changelog-title override
  8. Lock auto-release on stale sessions — the stale locks visible in the session context should auto-release after session end
  9. Crate ownership annotation in design docs — when two agents share a crate, crosslink should warn or enforce sequencing
  10. Auto-gitignore for .crosslink/ directories: crosslink init should handle workspace subcrates, not just the root

Conclusion

Crosslink worked remarkably well as an emergent swarm coordinator — the issue tracker, comment system, and knowledge repo provided just enough persistent state to keep a multi-agent build on track across 6+ context window resets and 33 subagents. But the friction points (wrong commands, worktree management, hook conflicts, repeated merge conflicts, no native agent primitives) show that swarm coordination is a natural next step for the tool, not just an incidental use case.

The ferrolearn build proved that design-driven development + crosslink issue tracking + Claude Code agent spawning can produce a substantial, tested codebase (15 crates, 1,452+ tests, ~65,000 lines, plus fully sklearn-compatible Python bindings) from an empty repo in under 4 hours. The bottleneck was not the agents or the code quality — it was the coordination overhead. With first-class swarm primitives, that overhead could be cut in half.

Metadata

Labels: experiment ("We tried something, and something happened!")