feat: per-node maintenance mode #26

Merged
TerrifiedBug merged 10 commits into main from
feat/node-maintenance-mode
Mar 7, 2026

Conversation

@TerrifiedBug
Owner

Summary

  • Add per-node maintenance mode that stops all pipelines on a specific agent without affecting other nodes in the environment
  • New maintenanceMode boolean and maintenanceModeAt timestamp on VectorNode, with Prisma migration
  • Server-side only: agent config endpoint returns empty pipelines: [] when node is in maintenance (agent stops pipelines naturally, zero agent changes)
  • New setMaintenanceMode fleet router mutation (ADMIN-only, audit logged)
  • Fleet table: orange "Maintenance" status badge + toggle button with confirmation dialog
  • Agent detail page: toggle button with running pipeline count in confirmation, prominent orange banner
  • Deployment matrix: dimmed columns with "Maintenance" label for nodes in maintenance
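
The two new fields move together: entering maintenance stamps the time, leaving it clears the stamp. A minimal sketch of that rule, assuming the mutation behaves as the sequence diagram in the review describes; `maintenanceUpdate` and `MaintenanceFields` are hypothetical names, not code from this PR:

```typescript
// Hypothetical helper: derives the VectorNode column values for a toggle.
interface MaintenanceFields {
  maintenanceMode: boolean;
  maintenanceModeAt: Date | null;
}

function maintenanceUpdate(
  enabled: boolean,
  now: Date = new Date(),
): MaintenanceFields {
  // Entering maintenance records when it started; exiting resets the timestamp.
  return {
    maintenanceMode: enabled,
    maintenanceModeAt: enabled ? now : null,
  };
}
```

The result could be passed as the `data` argument of a Prisma update, which keeps the timestamp from ever being set while the flag is false.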

Test Plan

  • Apply migration, verify maintenanceMode and maintenanceModeAt columns exist on VectorNode
  • Toggle maintenance mode on a node from the fleet table — verify confirmation dialog appears
  • Confirm the node shows orange "Maintenance" badge in status column
  • Check agent detail page shows the orange maintenance banner
  • Verify deployment matrix dims the maintenance node's column
  • With an agent connected: enter maintenance mode and verify the agent stops its pipelines on next poll
  • Exit maintenance mode and verify the agent restarts pipelines on next poll
  • Verify audit log records the maintenance toggle events
  • Confirm self-update (pendingAction) still works while in maintenance mode

@greptile-apps
Contributor

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR implements per-node maintenance mode for VectorFlow's fleet management. When enabled, the agent config endpoint returns pipelines: [], causing the connected agent to drain and stop all its pipelines on the next poll — with zero changes required to the agent binary. The feature is gated behind ADMIN-only RBAC, fully audit-logged, and surfaces in the fleet table, node detail page, and deployment matrix.

Key changes:

  • Schema/migration: Two new columns on VectorNode: maintenanceMode BOOLEAN NOT NULL DEFAULT false and nullable maintenanceModeAt TIMESTAMP(3)
  • setMaintenanceMode tRPC mutation: Correctly guarded by withTeamAccess("ADMIN") and withAudit; team context is properly resolved from nodeId via the existing middleware path
  • Agent config route: Early return with pipelines: [] for maintenance nodes; pendingAction is preserved so self-updates still work during maintenance — but secretBackendConfig is omitted for non-BUILTIN backends (see inline comment)
  • Fleet table: Per-node isPending guard correctly scoped to the in-flight nodeId; orange badge; the three required query invalidations are all present on both the fleet page and the node detail page mutations
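
The per-node pending guard from the fleet table can be isolated as a pure predicate. This is a sketch only: `ToggleMutationState` is an assumed, simplified stand-in for the real mutation hook result, and `isTogglePendingFor` is a hypothetical name.

```typescript
// Simplified stand-in for the mutation hook state (assumption, not the real type).
interface ToggleMutationState {
  isPending: boolean;
  variables?: { nodeId: string; enabled: boolean };
}

// True only for the row whose toggle request is currently in flight,
// so other rows in the table keep their buttons enabled.
function isTogglePendingFor(
  mutation: ToggleMutationState,
  nodeId: string,
): boolean {
  return mutation.isPending && mutation.variables?.nodeId === nodeId;
}
```

Checking `isPending` alone would disable every row's button while one request is in flight, which is the shared-state bug the review mentions.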

Confidence Score: 4/5

  • Safe to merge — one logic gap around secretBackendConfig for non-BUILTIN backends, but harmless for BUILTIN (the common case) and the overall feature is correct and well-guarded.
  • Authorization is correctly enforced via withTeamAccess("ADMIN") with proper nodeId resolution. Audit logging is present. The previously reported bugs (shared isPending state, missing query invalidations) are both fixed. The only new issue is the omission of secretBackendConfig in the maintenance-mode early return for non-BUILTIN secret backends — this is a correctness gap but has no impact unless the environment is configured with Vault/AWS SM and the agent uses that field to maintain a live connection to the backend.
  • src/app/api/agent/config/route.ts — the maintenance mode early return should include secretBackendConfig for non-BUILTIN backends to match the normal response shape.

Important Files Changed

| Filename | Overview |
| --- | --- |
| prisma/migrations/20260307000000_add_node_maintenance_mode/migration.sql | Adds maintenanceMode BOOLEAN NOT NULL DEFAULT false and nullable maintenanceModeAt TIMESTAMP(3) to VectorNode. Correct and safe: default false means existing rows are unaffected. |
| prisma/schema.prisma | Adds maintenanceMode Boolean @default(false) and maintenanceModeAt DateTime? to the VectorNode model. Matches the migration exactly. |
| src/server/routers/fleet.ts | Adds setMaintenanceMode mutation with correct withTeamAccess("ADMIN") + withAudit middleware, Zod-validated input, proper NOT_FOUND guard, and sets maintenanceModeAt to null on exit. Authorization is fully correct: withTeamAccess resolves team from nodeId at line 242-250 of init.ts. |
| src/app/api/agent/config/route.ts | Adds maintenance mode early return that serves pipelines: [] to halt the agent. One issue: the early return omits secretBackendConfig that the normal path includes for non-BUILTIN backends, which could cause agents to lose external secret-backend initialization data during maintenance. |
| src/app/(dashboard)/fleet/[nodeId]/page.tsx | Adds maintenance toggle button and orange banner. maintenanceMutation.onSuccess now correctly invalidates all three relevant queries (fleet.get, fleet.list, listWithPipelineStatus), addressing the previously reported gap. |
| src/app/(dashboard)/fleet/page.tsx | Fleet table gets orange "Maintenance" badge and toggle button. The per-node pending-state guard (setMaintenance.isPending && setMaintenance.variables?.nodeId === node.id) correctly scopes the disabled state to the in-flight node only, resolving the previously reported shared-state bug. |
| src/components/fleet/deployment-matrix.tsx | Dims maintenance-mode columns with opacity-30 and adds an orange "Maintenance" label under node header. Correct and self-contained change. |

Sequence Diagram

sequenceDiagram
    participant Admin as Admin Browser
    participant tRPC as tRPC (fleet.setMaintenanceMode)
    participant DB as PostgreSQL
    participant Agent as Vector Agent

    Admin->>tRPC: setMaintenanceMode({ nodeId, enabled: true })
    tRPC->>DB: withTeamAccess resolves nodeId → teamId
    tRPC->>DB: UPDATE VectorNode SET maintenanceMode=true, maintenanceModeAt=now()
    DB-->>tRPC: updated node
    tRPC-->>Admin: success → invalidate fleet.list, listWithPipelineStatus, fleet.get

    Note over Agent: next config poll (≤15s)
    Agent->>DB: GET /api/agent/config (bearer token)
    DB-->>Agent: { pipelines: [], pollIntervalMs, secretBackend, pendingAction }
    Note over Agent: stops all running pipelines naturally

    Admin->>tRPC: setMaintenanceMode({ nodeId, enabled: false })
    tRPC->>DB: UPDATE VectorNode SET maintenanceMode=false, maintenanceModeAt=null
    DB-->>tRPC: updated node
    tRPC-->>Admin: success

    Note over Agent: next config poll
    Agent->>DB: GET /api/agent/config
    DB-->>Agent: { pipelines: [...full config...], ... }
    Note over Agent: resumes all pipelines
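
The agent itself is not part of this PR, so the following is only an illustration of why an empty pipeline list is enough to stop everything. It assumes the agent reconciles its running set against whatever the config endpoint returns; `reconcile` and its shapes are hypothetical.

```typescript
// Hypothetical reconcile step: compares running pipeline ids with the desired ones.
interface ReconcilePlan {
  toStop: string[];
  toStart: string[];
}

function reconcile(running: string[], desired: string[]): ReconcilePlan {
  const desiredSet = new Set(desired);
  const runningSet = new Set(running);
  return {
    // Anything running that the server no longer lists gets stopped.
    toStop: running.filter((id) => !desiredSet.has(id)),
    // Anything listed that is not yet running gets started.
    toStart: desired.filter((id) => !runningSet.has(id)),
  };
}
```

Under that assumption, a poll that returns `pipelines: []` puts every running pipeline in `toStop`, and the first poll after maintenance ends puts the full set back in `toStart`.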

Last reviewed commit: b23ff14

Comment on lines +22 to +37
if (node?.maintenanceMode) {
  const environment = await prisma.environment.findUnique({
    where: { id: agent.environmentId },
    select: { secretBackend: true },
  });
  const settings = await prisma.systemSettings.findUnique({
    where: { id: "singleton" },
    select: { fleetPollIntervalMs: true },
  });
  return NextResponse.json({
    pipelines: [],
    pollIntervalMs: settings?.fleetPollIntervalMs ?? 15_000,
    secretBackend: environment?.secretBackend ?? "BUILTIN",
    pendingAction: node.pendingAction ?? undefined,
  });
}
Contributor

secretBackendConfig omitted for non-BUILTIN backends in maintenance response

When the node is in maintenance mode, the early return fetches the environment with only { secretBackend: true } and never includes secretBackendConfig in the response. The normal path includes it for all non-BUILTIN backends:

// normal path
...(environment.secretBackend !== "BUILTIN"
  ? { secretBackendConfig: environment.secretBackendConfig }
  : {}),

If an environment uses Vault, AWS SM, or another external backend and the agent relies on receiving secretBackendConfig to maintain or re-initialize its connection to that backend on each config poll, it would lose that initialization data for the duration of maintenance mode. When maintenance ends, the first poll would restore the full config, but if the agent's secret-backend client has any transient state derived from that field it could fail to reconnect cleanly.

The fix is to select secretBackendConfig alongside secretBackend in the maintenance-mode environment query, and then conditionally include it in the early-return payload to match the normal path:

const environment = await prisma.environment.findUnique({
  where: { id: agent.environmentId },
  select: { secretBackend: true, secretBackendConfig: true },
});
// ...
return NextResponse.json({
  pipelines: [],
  pollIntervalMs: settings?.fleetPollIntervalMs ?? 15_000,
  secretBackend: environment?.secretBackend ?? "BUILTIN",
  ...(environment?.secretBackend !== "BUILTIN" && environment?.secretBackendConfig
    ? { secretBackendConfig: environment.secretBackendConfig }
    : {}),
  pendingAction: node.pendingAction ?? undefined,
});
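
One way to keep the two response paths from drifting again is to build the shared fields in a single place. The sketch below is an assumption-laden illustration, not code from the PR: `buildBasePayload`, `EnvironmentSecrets`, and `AgentConfigBase` are hypothetical names, and the 15_000 ms default is taken from the snippet above.

```typescript
// Hypothetical shapes; the real Prisma types are not shown in this PR.
interface EnvironmentSecrets {
  secretBackend: string;
  secretBackendConfig?: Record<string, unknown> | null;
}

interface AgentConfigBase {
  pollIntervalMs: number;
  secretBackend: string;
  secretBackendConfig?: Record<string, unknown>;
  pendingAction?: unknown;
}

// Builds the fields common to the normal and maintenance responses.
function buildBasePayload(
  environment: EnvironmentSecrets | null,
  pollIntervalMs: number | undefined,
  pendingAction: unknown,
): AgentConfigBase {
  const secretBackend = environment?.secretBackend ?? "BUILTIN";
  const config = environment?.secretBackendConfig;
  return {
    pollIntervalMs: pollIntervalMs ?? 15_000,
    secretBackend,
    // Only external backends carry a config block.
    ...(secretBackend !== "BUILTIN" && config
      ? { secretBackendConfig: config }
      : {}),
    pendingAction: pendingAction ?? undefined,
  };
}
```

The maintenance branch would then return `{ pipelines: [], ...buildBasePayload(...) }` and the normal branch would spread the same base next to the real pipeline list, so a field added to one path appears in the other automatically.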

@TerrifiedBug merged commit c0b2229 into main on Mar 7, 2026
10 checks passed
@TerrifiedBug deleted the feat/node-maintenance-mode branch on March 7, 2026 11:52