
feat(loop): integrate evolution, memory, and mid-loop critique #9

Closed
electronicBlacksmith wants to merge 7 commits into main from worktree-fix+loop-slack-feedback


@electronicBlacksmith
Owner

@electronicBlacksmith electronicBlacksmith commented Apr 6, 2026

Summary

  • Phase 1: Memory context injection - cached once at loop start, injected into every tick prompt. Cleared on finalize, rebuilt on resume.
  • Phase 2: Post-loop evolution and memory consolidation - bounded transcript accumulation, SessionData synthesis, fire-and-forget pipeline with cost-cap guards matching the interactive session path.
  • Phase 3: Mid-loop critique checkpoints - optional checkpoint_interval triggers Sonnet 4.6 review every N ticks. Guarded by judge availability and cost cap. Awaited before next tick to prevent race conditions.
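
The Phase 1 lifecycle (cache once at loop start, inject on every tick, clear on finalize, rebuild on resume) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `LoopMemoryCache`, `buildTickPrompt`, and the fields shown on `TickPromptOptions` are assumed shapes.

```typescript
// Hypothetical shapes: the real TickPromptOptions and recall API may differ.
interface TickPromptOptions {
  goal: string;
  iteration: number;
  memoryContext?: string;
}

class LoopMemoryCache {
  private contexts = new Map<string, string>();

  // Called once at loop start, and again when a loop is resumed.
  store(loopId: string, context: string): void {
    this.contexts.set(loopId, context);
  }

  // Called on every tick: a map lookup, no recall round trip.
  get(loopId: string): string | undefined {
    return this.contexts.get(loopId);
  }

  // Called from finalize so finished loops do not keep their context around.
  clear(loopId: string): void {
    this.contexts.delete(loopId);
  }
}

function buildTickPrompt(opts: TickPromptOptions): string {
  const parts = [`Goal: ${opts.goal}`, `Tick: ${opts.iteration}`];
  if (opts.memoryContext) {
    parts.push(`Recalled memories:\n${opts.memoryContext}`);
  }
  return parts.join("\n\n");
}
```

Caching by loop id keeps the per-tick cost at a lookup, which is presumably why the recall happens once from the goal rather than on each tick.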

New files: src/loop/critique.ts, src/loop/post-loop.ts, and 3 test files.

Test plan

  • 981 tests pass (4 new Phase 3 guard tests, prompt injection tests, post-loop synthesis tests)
  • Biome lint clean
  • TypeScript strict mode typecheck clean
  • Dual code review (Haiku 4.5 + Sonnet 4.6) - all critical issues addressed
  • Manual verification: start a loop from Slack with Qdrant + Ollama up, verify tick prompts contain recalled memories
  • Manual verification: run loop to completion, verify observations appear in evolution tables

Closes #8

Closes #5. The feedback pipeline in LoopRunner already existed but was
gated on loop.channelId, which was always null because the agent never
plumbed channel_id/conversation_id into the in-process MCP tool call;
that context only lived in the router.

- AsyncLocalStorage<SlackContext> captures the Slack channel/thread/
  trigger-message for the current turn so phantom_loop can auto-fill
  them when the agent omits them. Explicit tool args still win.
- Reaction ladder on the operator's original message: hourglass ->
  cycle -> terminal (check/stop/warning/x). Restart-safe via
  iteration === 1 check, no in-memory flag.
- Inline unicode progress bar in the edited status message.
- New trigger_message_ts column on loops, appended as migration #11.
- Extracted LoopNotifier into src/loop/notifications.ts because
  runner.ts was already at the 300-line cap.
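
The AsyncLocalStorage approach in the first bullet above could look roughly like this. It is a sketch under assumptions: the field names on `SlackContext`, the helper names `runTurn` and `resolveLoopTarget`, and the mapping of `conversation_id` to the thread timestamp are all hypothetical. Only `AsyncLocalStorage` itself (from `node:async_hooks`) is a real API.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical field names for the per-turn Slack context.
interface SlackContext {
  channelId: string;
  threadTs?: string;
  triggerMessageTs?: string;
}

const slackContext = new AsyncLocalStorage<SlackContext>();

// Router side: wrap the whole turn so nested tool calls can read the context.
function runTurn<T>(ctx: SlackContext, turn: () => T): T {
  return slackContext.run(ctx, turn);
}

interface LoopArgs {
  channel_id?: string;
  conversation_id?: string;
  trigger_message_ts?: string;
}

// Tool side: explicit tool args win, the ambient context only fills gaps.
function resolveLoopTarget(args: LoopArgs) {
  const ambient = slackContext.getStore();
  return {
    channelId: args.channel_id ?? ambient?.channelId ?? null,
    // Assumption: conversation_id corresponds to the Slack thread timestamp.
    conversationId: args.conversation_id ?? ambient?.threadTs ?? null,
    triggerMessageTs: args.trigger_message_ts ?? ambient?.triggerMessageTs ?? null,
  };
}
```

Outside of `runTurn`, `getStore()` returns `undefined`, so a loop started from a non-Slack path falls back to `null` and the feedback pipeline stays off.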

34 new tests, 938 pass / 0 fail.
…tion

Two defects surfaced during the first Slack end-to-end test of the loop
feedback fix:

1. Stop button disappeared after the first tick. Slack's chat.update
   replaces the message wholesale and strips any blocks the caller does
   not include. postStartNotice attached the button but postTickUpdate
   called updateMessage without blocks, so the button was wiped on the
   first progress edit. Extract buildStatusBlocks() and re-send it on
   every tick edit. Final notice still omits blocks intentionally so the
   button disappears when the loop is no longer interruptible.

2. No end-of-loop summary. The agent curates the state.md body every
   tick (Goal, Progress, Next Action, Notes), but that content never
   reached the operator. Post it as a threaded reply when the loop
   finalizes. No extra agent cost: we surface content the agent already
   wrote. Frontmatter stripped, truncated at 3500 chars, silently
   skipped if the file is missing or empty.
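
The end-of-loop summary rules in item 2 (strip frontmatter, truncate at 3500 chars, skip when missing or empty) can be expressed as one pure function. A minimal sketch, assuming the frontmatter is a leading block delimited by `---` lines; `buildLoopSummary` is a hypothetical name.

```typescript
const MAX_SUMMARY_CHARS = 3500;

// Returns null when there is nothing worth posting to the thread.
function buildLoopSummary(stateFile: string | null): string | null {
  if (!stateFile) return null; // missing or empty file: silently skip

  // Strip a leading frontmatter block delimited by '---' lines.
  const body = stateFile
    .replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n?/, "")
    .trim();
  if (body.length === 0) return null;

  // Hard cut at the limit; no truncation marker is appended in this sketch.
  return body.length > MAX_SUMMARY_CHARS
    ? body.slice(0, MAX_SUMMARY_CHARS)
    : body;
}
```

Returning `null` instead of throwing keeps the finalize path quiet when the agent never wrote a state file.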

+7 tests covering both regressions. 945 pass / 0 fail.
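
For the Stop-button fix in item 1, the idea is that every tick edit rebuilds and re-sends the same blocks, while the final notice leaves them out. The sketch below only builds payloads and does not call Slack. Apart from the name `buildStatusBlocks`, everything here (the signature, `action_id`, and the payload helpers) is an assumption, and the blocks-stripping behaviour is taken from the commit message rather than verified independently.

```typescript
type SlackBlock = Record<string, unknown>;

// Hypothetical reconstruction: a status section plus a Stop button.
function buildStatusBlocks(statusText: string, loopId: string): SlackBlock[] {
  return [
    { type: "section", text: { type: "mrkdwn", text: statusText } },
    {
      type: "actions",
      elements: [
        {
          type: "button",
          text: { type: "plain_text", text: "Stop" },
          style: "danger",
          action_id: "loop_stop", // assumed action id
          value: loopId,
        },
      ],
    },
  ];
}

interface UpdatePayload {
  channel: string;
  ts: string;
  text: string;
  blocks?: SlackBlock[];
}

// Tick edits always re-send the blocks so the button survives the update.
function tickUpdatePayload(channel: string, ts: string, statusText: string, loopId: string): UpdatePayload {
  return { channel, ts, text: statusText, blocks: buildStatusBlocks(statusText, loopId) };
}

// The final notice omits blocks on purpose so the button goes away.
function finalNoticePayload(channel: string, ts: string, finalText: string): UpdatePayload {
  return { channel, ts, text: finalText };
}
```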
…l message

1. Tick update race: postTickUpdate was fire-and-forget, so a stop on
   tick N+1 could race with tick N's Slack write. If the tick update's
   HTTP response arrived after postFinalNotice, it overwrote the final
   message and re-sent the Stop button blocks. Awaiting postTickUpdate
   serializes Slack writes so finalize always runs after the last tick
   update completes.

2. Final message now includes the progress bar at its halted position,
   visually consistent with tick updates. A stopped loop at 3/10 shows
   the bar frozen at 3/10 with "stopped" instead of a terse one-liner.
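
The frozen progress bar in item 2 can be illustrated with a small renderer. This is a sketch only: the glyphs, the bar width, and the function names are assumptions, not the characters or helpers the PR uses.

```typescript
// Renders a fixed-width unicode bar; the ratio is clamped to [0, 1].
function renderProgressBar(iteration: number, maxIterations: number, width = 10): string {
  const ratio = maxIterations > 0
    ? Math.min(Math.max(iteration / maxIterations, 0), 1)
    : 0;
  const filled = Math.round(ratio * width);
  return "█".repeat(filled) + "░".repeat(width - filled);
}

// The same line shape serves tick updates and the final notice,
// so a stopped loop shows the bar at its halted position.
function renderStatusLine(iteration: number, maxIterations: number, status: string): string {
  return `${renderProgressBar(iteration, maxIterations)} ${iteration}/${maxIterations} ${status}`;
}

console.log(renderStatusLine(3, 10, "stopped"));
// prints "███░░░░░░░ 3/10 stopped"
```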
…oop ticks

Loop ticks now use Phantom's full intelligence stack instead of running blind:

Phase 1 - Memory context injection: cached once at loop start from the goal,
injected into every tick prompt via TickPromptOptions. Cleared on finalize,
rebuilt on resume.

Phase 2 - Post-loop evolution and consolidation: bounded transcript
accumulation (first tick + rolling 10 summaries + last tick), SessionData
synthesis in finalize(), fire-and-forget evolution pipeline and LLM/heuristic
memory consolidation with cost-cap guards matching the interactive path.
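
The bounded accumulation described for Phase 2 (first tick + rolling 10 summaries + last tick) keeps memory flat regardless of loop length. A minimal sketch, assuming a naive character-based summary; `recordTick`, `summarize`, and the `LoopTranscript` shape are hypothetical and are not the `recordTranscript`/`clamp` helpers from the PR.

```typescript
const ROLLING_LIMIT = 10;

interface LoopTranscript {
  firstTick: string | null;
  rollingSummaries: string[];
  lastTick: string | null;
}

function emptyTranscript(): LoopTranscript {
  return { firstTick: null, rollingSummaries: [], lastTick: null };
}

// Assumed summarizer: keep the first maxChars characters of the tick output.
function summarize(text: string, maxChars = 200): string {
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}

function recordTick(transcript: LoopTranscript, tickOutput: string): void {
  // The first tick is kept verbatim and never overwritten.
  if (transcript.firstTick === null) transcript.firstTick = tickOutput;
  // The last tick is always the most recent output.
  transcript.lastTick = tickOutput;
  // The rolling window drops its oldest entry once it exceeds the limit.
  transcript.rollingSummaries.push(summarize(tickOutput));
  if (transcript.rollingSummaries.length > ROLLING_LIMIT) {
    transcript.rollingSummaries.shift();
  }
}
```

After 25 ticks, the transcript in this sketch holds tick 1, summaries for ticks 16 to 25, and tick 25.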

Phase 3 - Mid-loop critique checkpoints: optional checkpoint_interval param
lets the agent request Sonnet 4.6 review every N ticks. Guard requires
evolution enabled, LLM judges active, and cost cap not exceeded. Critique
is awaited before next tick to avoid race conditions.
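
The Phase 3 guard can be read as a single predicate over the loop state. The sketch below is an assumption about shape only: `shouldRunCritique` and the field names are hypothetical, and the cost comparison (strictly below the cap) is a guess at what "not exceeded" means.

```typescript
interface CritiqueGuardInput {
  iteration: number;            // 1-based tick number
  checkpointInterval?: number;  // optional checkpoint_interval param
  evolutionEnabled: boolean;
  llmJudgesActive: boolean;
  costSpentUsd: number;
  costCapUsd: number;
}

function shouldRunCritique(g: CritiqueGuardInput): boolean {
  // No interval requested: critique is opt-in.
  if (!g.checkpointInterval || g.checkpointInterval <= 0) return false;
  // Only fire on every Nth tick.
  if (g.iteration <= 0 || g.iteration % g.checkpointInterval !== 0) return false;
  // All three guards from the description must hold.
  return g.evolutionEnabled && g.llmJudgesActive && g.costSpentUsd < g.costCapUsd;
}
```

In the runner, a call guarded by this predicate would be awaited before scheduling the next tick, so the critique cannot interleave with the tick that follows it.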

Closes #8
- Decouple postLoopDeps so evolution and memory run independently
  (evolution works when memory is down and vice versa)
- Skip mid-loop critique on terminal ticks to avoid wasted Sonnet calls
- Track judge cost on failure paths via JudgeParseError carrying usage data
- Extract recordTranscript/clamp from runner.ts to post-loop.ts (292 < 300 lines)
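
Tracking judge cost on failure paths, as in the third bullet above, relies on the error carrying the usage data the call already incurred. A minimal sketch: the name `JudgeParseError` comes from the commit message, but its fields, `JudgeUsage`, `CostTracker`, and the parsing helpers are all assumed for illustration.

```typescript
interface JudgeUsage {
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Assumed shape: the error keeps the usage of the call that failed to parse.
class JudgeParseError extends Error {
  readonly usage: JudgeUsage;
  constructor(message: string, usage: JudgeUsage) {
    super(message);
    this.name = "JudgeParseError";
    this.usage = usage;
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class CostTracker {
  totalUsd = 0;
  add(costUsd: number): void {
    this.totalUsd += costUsd;
  }
}

function parseVerdict(raw: string, usage: JudgeUsage): string {
  try {
    const parsed = JSON.parse(raw) as { verdict?: unknown };
    if (typeof parsed.verdict !== "string") throw new Error("missing verdict");
    return parsed.verdict;
  } catch (err) {
    // The model call was already paid for, so attach its usage to the error.
    throw new JudgeParseError(`unparseable judge output: ${String(err)}`, usage);
  }
}

function judgeWithCostTracking(raw: string, usage: JudgeUsage, tracker: CostTracker): string | null {
  try {
    const verdict = parseVerdict(raw, usage);
    tracker.add(usage.costUsd);
    return verdict;
  } catch (err) {
    if (err instanceof JudgeParseError) {
      tracker.add(err.usage.costUsd); // failure path still counts toward the cap
      return null;
    }
    throw err;
  }
}
```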
@electronicBlacksmith electronicBlacksmith self-assigned this Apr 6, 2026
electronicBlacksmith added a commit that referenced this pull request Apr 6, 2026
PR #7 was squash-merged into main while PR #9's branch still had the
original commits. Conflicts were all additive - kept PR #9's features
(checkpoint_interval, memory context, critique, post-loop pipeline)
while adopting main's improved error formatting and race condition
comment in the tick update await.
PR #7 was squash-merged into main while this branch still had the
original commits. Kept all PR #9 features (checkpoint_interval,
memory context, critique, post-loop pipeline) while adopting main's
improved error formatting and race condition comment.
Wire setTriggerDeps before startServer so the handler is ready on the
first request. Use server.url.origin instead of manually building the
URL from server.port which can race in CI. Add a health check fetch
to confirm the server is accepting connections before tests run.
electronicBlacksmith added a commit that referenced this pull request Apr 6, 2026
@electronicBlacksmith
Owner Author

Superseded by #14 (consolidated clean branch)

@electronicBlacksmith electronicBlacksmith deleted the worktree-fix+loop-slack-feedback branch April 6, 2026 23:49
electronicBlacksmith added a commit that referenced this pull request Apr 7, 2026
* feat(loop): integrate evolution, memory, and mid-loop critique into loop ticks

Loop ticks now use Phantom's full intelligence stack instead of running blind:

Phase 1 - Memory context injection: cached once at loop start from the goal,
injected into every tick prompt via TickPromptOptions. Cleared on finalize,
rebuilt on resume.

Phase 2 - Post-loop evolution and consolidation: bounded transcript
accumulation (first tick + rolling 10 summaries + last tick), SessionData
synthesis in finalize(), fire-and-forget evolution pipeline and LLM/heuristic
memory consolidation with cost-cap guards matching the interactive path.

Phase 3 - Mid-loop critique checkpoints: optional checkpoint_interval param
lets the agent request Sonnet 4.6 review every N ticks. Guard requires
evolution enabled, LLM judges active, and cost cap not exceeded. Critique
is awaited before next tick to avoid race conditions.

Closes #8

* fix(loop): address code review findings from PR #9

- Decouple postLoopDeps so evolution and memory run independently
  (evolution works when memory is down and vice versa)
- Skip mid-loop critique on terminal ticks to avoid wasted Sonnet calls
- Track judge cost on failure paths via JudgeParseError carrying usage data
- Extract recordTranscript/clamp from runner.ts to post-loop.ts (292 < 300 lines)

* fix(evolution): support OAuth tokens for LLM judge auth

resolveJudgeMode() and judge client now check ANTHROPIC_AUTH_TOKEN and
CLAUDE_CODE_OAUTH_TOKEN in addition to ANTHROPIC_API_KEY. Enables LLM
judges on Max subscription deployments using OAuth bearer tokens.

* docs: add phantom_loop documentation for upstream PR

Covers MCP tool parameters, state file contract, tick lifecycle,
Slack integration, mid-loop critique, post-loop evolution pipeline,
memory context injection, and tips for writing effective goals.

Closes #12

* fix(test): stabilize trigger-auth and judge-activation tests for CI

trigger-auth: use inline Bun.serve instead of startServer to avoid
module-level globals and disk I/O that can race across test files.

judge-activation: save/restore ANTHROPIC_AUTH_TOKEN and
CLAUDE_CODE_OAUTH_TOKEN alongside ANTHROPIC_API_KEY so tests that
expect "no credentials" actually clear all auth env vars.

---------

Co-authored-by: electronicBlacksmith <electronicBlacksmith@users.noreply.github.com>
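
The OAuth-token and test-stabilization commits in the squash message above both turn on the same three environment variables. The sketch below shows the idea against a plain object rather than `process.env`. The variable names come from the commit message; `resolveJudgeModeSketch`, the `"llm"`/`"heuristic"` mode names, and `withClearedJudgeEnv` are assumptions and not the real `resolveJudgeMode()` signature.

```typescript
type JudgeMode = "llm" | "heuristic"; // assumed mode names
type Env = Record<string, string | undefined>;

const JUDGE_CREDENTIAL_VARS = [
  "ANTHROPIC_API_KEY",
  "ANTHROPIC_AUTH_TOKEN",
  "CLAUDE_CODE_OAUTH_TOKEN",
] as const;

// Any one non-empty credential is enough to enable LLM judges.
function resolveJudgeModeSketch(env: Env): JudgeMode {
  const hasCredential = JUDGE_CREDENTIAL_VARS.some(
    (name) => (env[name] ?? "").trim() !== "",
  );
  return hasCredential ? "llm" : "heuristic";
}

// Test helper: clear all three vars for the duration of fn, then restore them,
// so a "no credentials" test is not defeated by an OAuth token in the environment.
function withClearedJudgeEnv<T>(env: Env, fn: () => T): T {
  const saved = JUDGE_CREDENTIAL_VARS.map((name) => [name, env[name]] as const);
  for (const name of JUDGE_CREDENTIAL_VARS) delete env[name];
  try {
    return fn();
  } finally {
    for (const [name, value] of saved) {
      if (value === undefined) delete env[name];
      else env[name] = value;
    }
  }
}
```

In a real test the same helper would be passed `process.env`, which is the save/restore pattern the judge-activation fix describes.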
electronicBlacksmith added a commit that referenced this pull request Apr 8, 2026
Development

Successfully merging this pull request may close these issues.

Loop ticks should use evolution, judges, and memory - not bypass them

1 participant