feat(M009): Fleet-Wide Observability — KPI dashboard, throughput comparison, data loss detection by TerrifiedBug · Pull Request #110 · TerrifiedBug/vectorflow

TerrifiedBug · 2026-03-26T03:00:26Z

Summary

S01: Fleet KPI Dashboard + Volume Trends. New /fleet/overview page with 4 KPI cards (bytes in/out, events in/out, fleet health %, node count) and a volume trend AreaChart with a time range selector (1h/6h/1d/7d/30d), backed by SQL-level date_trunc aggregation in the fleet-data.ts service. 19 unit tests.
S02: Cross-Node Throughput Comparison + Capacity Utilization. Bar chart comparing throughput (events/sec) across nodes, plus per-node capacity utilization line charts (CPU%, memory%, disk%) with time-range support.
S03: Data Loss Detection + Enhanced Deployment Matrix. Sortable data loss alert table flagging pipelines where the events-out/events-in gap exceeds a configurable threshold (default 5%). Deployment matrix cells gain throughput rate overlays and red highlights on data loss.
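
The data loss check in S03 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `PipelineTotals` shape, the `detectDataLoss` name, and the descending sort are assumptions. Only the 5% default threshold, the 0–1 `lossRate` fraction, and the skipping of zero-throughput pipelines come from the PR and its review.

```typescript
// Hypothetical row shape; the PR does not show the actual types.
interface PipelineTotals {
  pipelineId: string;
  eventsIn: number;
  eventsOut: number;
}

interface DataLossRow extends PipelineTotals {
  lossRate: number; // fraction in the range 0-1, not a percentage
}

// Flag pipelines whose (eventsIn - eventsOut) / eventsIn exceeds the threshold.
function detectDataLoss(
  rows: PipelineTotals[],
  threshold = 0.05, // 5% default, as stated in the summary
): DataLossRow[] {
  return rows
    .filter((r) => r.eventsIn > 0) // skip zero-throughput pipelines (avoids dividing by zero)
    .map((r) => ({ ...r, lossRate: (r.eventsIn - r.eventsOut) / r.eventsIn }))
    .filter((r) => r.lossRate > threshold)
    .sort((a, b) => b.lossRate - a.lossRate); // assumed order: worst offenders first
}

const flagged = detectDataLoss([
  { pipelineId: "a", eventsIn: 1000, eventsOut: 990 }, // 1% gap, below threshold
  { pipelineId: "b", eventsIn: 1000, eventsOut: 800 }, // 20% gap
  { pipelineId: "c", eventsIn: 0, eventsOut: 0 },      // no traffic, skipped
  { pipelineId: "d", eventsIn: 200, eventsOut: 100 },  // 50% gap
]);
console.log(flagged.map((r) => r.pipelineId));
```

Keeping `lossRate` as a fraction makes the threshold comparison straightforward, but it means any display code has to scale by 100 before formatting.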

Key decisions

D013: SQL-level aggregation via Prisma.$queryRaw with date_trunc (not JS-side bucketing)
D014: /fleet/overview as sibling page to /fleet list with tab navigation
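
To make D013 concrete, here is a hedged sketch of the two pieces around the raw query: choosing a date_trunc unit per time range, and converting aggregated rows for the client. The range-to-unit mapping, the row shape, and every name below are assumptions for illustration; the PR does not show the query, and the SQL in the comment uses placeholder table and column names.

```typescript
type TimeRange = "1h" | "6h" | "1d" | "7d" | "30d";
type TruncUnit = "minute" | "hour" | "day";

// Assumed granularity per range; the PR does not list the actual mapping.
const TRUNC_UNIT: Record<TimeRange, TruncUnit> = {
  "1h": "minute",
  "6h": "minute",
  "1d": "hour",
  "7d": "hour",
  "30d": "day",
};

// Rough shape of the aggregation (hypothetical table and column names):
//   SELECT date_trunc(<unit>, "timestamp") AS bucket,
//          SUM("bytesIn") AS bytes_in, SUM("bytesOut") AS bytes_out
//   FROM <metrics table> WHERE ... GROUP BY bucket ORDER BY bucket
interface RawBucketRow {
  bucket: Date;
  bytes_in: bigint | null;
  bytes_out: bigint | null;
}

interface VolumePoint {
  bucket: string;
  bytesIn: number;
  bytesOut: number;
}

// Assumes aggregates arrive as bigint (or null for an empty group).
// Number() is exact only up to Number.MAX_SAFE_INTEGER.
function toVolumePoint(row: RawBucketRow): VolumePoint {
  return {
    bucket: row.bucket.toISOString(),
    bytesIn: Number(row.bytes_in ?? 0),
    bytesOut: Number(row.bytes_out ?? 0),
  };
}

const point = toVolumePoint({
  bucket: new Date("2026-03-26T03:00:00Z"),
  bytes_in: BigInt(2048),
  bytes_out: null,
});
console.log(TRUNC_UNIT["7d"], point);
```

Converting to plain numbers in the service matters because bigint values cannot be passed through `JSON.stringify` without a custom serializer.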

Stats

13 files changed, 1985 insertions, 4 deletions
19 fleet-data service tests (all passing)
0 TypeScript errors, 0 lint warnings

Test plan

Verify /fleet/overview renders KPI cards and volume trend chart
Confirm time range selector (1h/6h/1d/7d/30d) updates chart data
Verify throughput comparison bar chart shows per-node data
Verify capacity utilization charts show CPU/memory/disk trends
Confirm data loss table highlights pipelines exceeding threshold
Verify deployment matrix shows throughput overlays
Run vitest run src/server/services/__tests__/fleet-data.test.ts — 19/19 pass
Run tsc --noEmit — 0 errors

🤖 Generated with Claude Code

…d chart, time range selector

- Fleet overview page with 15s auto-refresh polling and time range selector
- 4 KPI cards: bytes in/out, events in/out, fleet health %, node count
- Volume trend AreaChart with bytesIn/bytesOut series using Recharts
- Added "Fleet Overview" nav link to fleet page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…arts

Add node-level throughput bar chart and per-node capacity utilization trend charts (memory%, disk%, CPU load) to fleet overview page. Includes getNodeThroughput/getNodeCapacity service functions, tRPC procedures with team access control, SSE invalidation, bottleneck highlighting, and 8-color node palette. 6 new tests (505 total).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…verlays

Add data loss detection with configurable threshold (default 5%) showing pipelines where events-out/events-in gap is suspicious. Enhance deployment matrix with per-cell throughput rates and red highlighting on cells with data loss. New tRPC endpoints: fleet.dataLoss, fleet.matrixThroughput.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

greptile-apps · 2026-03-26T03:04:43Z

Greptile Summary

This PR introduces the M009 Fleet-Wide Observability milestone: a new /fleet/overview page with KPI cards, volume trend charts, cross-node throughput comparison, capacity utilization charts, a data-loss detection table, and throughput overlays on the deployment matrix. The backend is a new fleet-data.ts service using SQL-level date_trunc aggregation via Prisma.$queryRaw, wired into 6 new read-only tRPC procedures all protected with withTeamAccess("VIEWER").

Key findings:

- One correctness bug in data-loss-table.tsx: formatPercent(row.lossRate) passes a 0–1 decimal directly to a formatter that appends % without multiplying by 100, so a 20% loss rate renders as "0.2%". Every other call site in this PR (fleet-kpi-cards.tsx) correctly pre-multiplies by 100.
- The backend service is well-structured: all SQL queries use Prisma.sql parameterized templates (no injection risk), all 6 procedures carry withTeamAccess("VIEWER"), and cross-team isolation is enforced via environmentId subquery filters.
- 19 unit tests cover all service functions including null/zero edge cases; the real-time invalidation hook is properly extended.

Confidence Score: 4/5

Safe to merge after fixing the one-line formatPercent call in data-loss-table.tsx; all backend code is correct and secure.

Strong PR overall — secure backend, proper RBAC, parameterized SQL, good test coverage. One concrete P1 display bug in data-loss-table where loss rates will render 100× too small, making the feature misleading in production. A single one-line fix resolves it.

src/components/fleet/data-loss-table.tsx — formatPercent call needs * 100 scaling.

Important Files Changed

Filename	Overview
src/server/services/fleet-data.ts	New service with 6 SQL-level aggregation functions using Prisma.$queryRaw; all parameterized safely with Prisma.sql template tags and properly filtered by environmentId; BigInt→Number conversions handled correctly throughout.
src/server/routers/fleet.ts	Adds 6 new query procedures; all correctly guarded with withTeamAccess("VIEWER") and validated with Zod enum inputs; no mutations so withAudit not required.
src/components/fleet/data-loss-table.tsx	New data loss table component; contains a display bug where formatPercent(row.lossRate) passes a 0–1 fraction directly, causing loss rates to display 100× too small (e.g. 0.2% instead of 20%).
src/app/(dashboard)/fleet/overview/page.tsx	New fleet overview page orchestrating 6 parallel queries with 15s polling; correctly gates all queries on selectedEnvironmentId and handles error/loading states.
src/components/fleet/deployment-matrix.tsx	Augments deployment matrix cells with per-cell throughput rates and red-highlight for data-loss; throughput lookup uses O(1) Map.
src/server/services/__tests__/fleet-data.test.ts	19 unit tests covering all 6 service functions, including null handling, zero-throughput skipping, and sort order.
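
The O(1) Map lookup noted for deployment-matrix.tsx might look something like the sketch below. The `CellThroughput` shape, the composite key format, and the function names are hypothetical; the review only states that the throughput lookup uses a Map.

```typescript
// Hypothetical per-cell row, one per (pipeline, node) pair.
interface CellThroughput {
  pipelineId: string;
  nodeId: string;
  eventsPerSec: number;
}

// Composite key so a single Map can serve every matrix cell.
const cellKey = (pipelineId: string, nodeId: string): string =>
  `${pipelineId}:${nodeId}`;

// Build the index once per render pass instead of scanning the array per cell.
function indexThroughput(rows: CellThroughput[]): Map<string, CellThroughput> {
  const byCell = new Map<string, CellThroughput>();
  for (const row of rows) {
    byCell.set(cellKey(row.pipelineId, row.nodeId), row);
  }
  return byCell;
}

const byCell = indexThroughput([
  { pipelineId: "p1", nodeId: "n1", eventsPerSec: 120 },
  { pipelineId: "p1", nodeId: "n2", eventsPerSec: 45 },
]);

// Each cell lookup is a constant-time Map.get; missing cells yield undefined.
const hit = byCell.get(cellKey("p1", "n2"));
const miss = byCell.get(cellKey("p2", "n1"));
console.log(hit?.eventsPerSec, miss);
```

With P pipelines and N nodes, this replaces a scan per cell with one pass over the rows plus a constant-time lookup per cell.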

Sequence Diagram

sequenceDiagram
    participant Page as FleetOverviewPage
    participant tRPC as tRPC (fleet router)
    participant MW as withTeamAccess("VIEWER")
    participant Svc as fleet-data.ts
    participant DB as PostgreSQL

    Page->>tRPC: fleet.overview(environmentId, range)
    Page->>tRPC: fleet.volumeTrend(environmentId, range)
    Page->>tRPC: fleet.nodeThroughput(environmentId, range)
    Page->>tRPC: fleet.nodeCapacity(environmentId, range)
    Page->>tRPC: fleet.dataLoss(environmentId, range, threshold)
    Page->>tRPC: fleet.matrixThroughput(environmentId, range)

    tRPC->>MW: validate team membership
    MW-->>tRPC: authorized

    tRPC->>Svc: getFleetOverview / getVolumeTrend / etc.
    Svc->>DB: Prisma.$queryRaw (date_trunc aggregation)
    DB-->>Svc: BigInt rows
    Svc-->>tRPC: typed results (BigInt to Number)
    tRPC-->>Page: serialized JSON

    Page->>Page: render KPI cards, charts, data-loss table, matrix

Comments Outside Diff (1)

src/components/fleet/data-loss-table.tsx, lines 349-350

Loss rate displayed 100× too small

lossRate is stored as a fraction (0–1, e.g. 0.20 for 20% loss), but formatPercent simply calls v.toFixed(1) + "%" without multiplying by 100. A 20% loss rate will render as "0.2%" instead of "20.0%".

Every other call in this PR that passes a fractional rate explicitly scales by 100 first — compare fleet-kpi-cards.tsx:

{formatPercent((data?.errorRate ?? 0) * 100)} error rate

The same fix is needed here:

{formatPercent(row.lossRate * 100)}
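
The scaling bug is easy to reproduce in isolation. In this minimal sketch, `formatPercent` is written to match the behaviour the review describes (one decimal place plus "%", no scaling); the real helper's exact signature is an assumption.

```typescript
// Mirrors the formatter as the review describes it: toFixed(1) plus "%", no scaling.
function formatPercent(v: number): string {
  return v.toFixed(1) + "%";
}

const lossRate = 0.2; // fraction in 0-1, i.e. a 20% loss

const buggy = formatPercent(lossRate);       // fraction passed through unscaled
const fixed = formatPercent(lossRate * 100); // scaled to a percentage first

console.log(buggy); // "0.2%"
console.log(fixed); // "20.0%"
```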

Reviews (1): Last reviewed commit: "feat(S03): Data loss detection table + d..."

TerrifiedBug and others added 5 commits March 26, 2026 00:09

test(S01/T01): Created fleet-data.ts service with getFleetOverview (KPI…

4894580

- src/server/services/fleet-data.ts - src/server/services/__tests__/fleet-data.test.ts

perf(S01/T02): Added fleet.overview and fleet.volumeTrend tRPC procedur…

c1f769d

- src/server/routers/fleet.ts - src/hooks/use-realtime-invalidation.ts - src/hooks/__tests__/use-realtime-invalidation.test.ts

github-actions bot added the feature label Mar 26, 2026

TerrifiedBug closed this Mar 26, 2026

TerrifiedBug deleted the milestone/M009 branch March 26, 2026 19:46

TerrifiedBug wants to merge 5 commits into main from milestone/M009
