feat(M009): Fleet-Wide Observability — KPI dashboard, throughput comparison, data loss detection#110

Closed
TerrifiedBug wants to merge 5 commits into main from milestone/M009

@TerrifiedBug
Owner

Summary

  • S01 — Fleet KPI Dashboard + Volume Trends: New /fleet/overview page with 4 KPI cards (bytes in/out, events in/out, fleet health), volume trend AreaChart with time range selector (1h/6h/1d/7d/30d), backed by SQL-level date_trunc aggregation in fleet-data.ts service. 19 unit tests.
  • S02 — Cross-Node Throughput Comparison + Capacity Utilization: Bar chart comparing throughput (events/sec) across nodes, per-node capacity utilization line charts (CPU%, memory%, disk%) with time-range support.
  • S03 — Data Loss Detection + Enhanced Deployment Matrix: Sortable data loss alert table flagging pipelines where events-out/events-in gap exceeds configurable threshold (default 5%). Deployment matrix cells with throughput rate overlays and red highlights on data loss.

Key decisions

  • D013: SQL-level aggregation via Prisma.$queryRaw with date_trunc (not JS-side bucketing)
  • D014: /fleet/overview as sibling page to /fleet list with tab navigation
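
As a rough illustration of D013, the time range chosen in the UI has to be turned into a query window and a `date_trunc` unit before the SQL runs. The sketch below shows one way that mapping could look. The names `BUCKET_PLANS` and `planQuery`, and the bucket width chosen for each range, are assumptions for illustration and are not taken from fleet-data.ts.

```typescript
// Hypothetical mapping from the selector's ranges to a lookback window and
// a date_trunc unit. The unit chosen per range is an assumption.
type TimeRange = "1h" | "6h" | "1d" | "7d" | "30d";
type TruncUnit = "minute" | "hour" | "day";

interface BucketPlan {
  windowMs: number; // how far back the query looks
  unit: TruncUnit;  // first argument passed to date_trunc
}

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

const BUCKET_PLANS: Record<TimeRange, BucketPlan> = {
  "1h": { windowMs: HOUR_MS, unit: "minute" },
  "6h": { windowMs: 6 * HOUR_MS, unit: "minute" },
  "1d": { windowMs: DAY_MS, unit: "hour" },
  "7d": { windowMs: 7 * DAY_MS, unit: "hour" },
  "30d": { windowMs: 30 * DAY_MS, unit: "day" },
};

// Returns the lower time bound and the truncation unit for a given range.
// Both values would then be bound as query parameters, so the database
// returns one row per bucket instead of one row per raw metric sample.
function planQuery(
  range: TimeRange,
  now: Date = new Date(),
): { since: Date; unit: TruncUnit } {
  const plan = BUCKET_PLANS[range];
  return { since: new Date(now.getTime() - plan.windowMs), unit: plan.unit };
}
```

Keeping the bucketing in SQL means the server only receives as many rows as there are points on the chart, which is the trade-off D013 prefers over JS-side bucketing.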

Stats

  • 13 files changed, 1985 insertions, 4 deletions
  • 19 fleet-data service tests (all passing)
  • 0 TypeScript errors, 0 lint warnings

Test plan

  • Verify /fleet/overview renders KPI cards and volume trend chart
  • Confirm time range selector (1h/6h/1d/7d/30d) updates chart data
  • Verify throughput comparison bar chart shows per-node data
  • Verify capacity utilization charts show CPU/memory/disk trends
  • Confirm data loss table highlights pipelines exceeding threshold
  • Verify deployment matrix shows throughput overlays
  • Run vitest run src/server/services/__tests__/fleet-data.test.ts — 19/19 pass
  • Run tsc --noEmit — 0 errors

🤖 Generated with Claude Code

TerrifiedBug and others added 5 commits March 26, 2026 00:09
- src/server/services/fleet-data.ts
- src/server/services/__tests__/fleet-data.test.ts
- src/server/routers/fleet.ts
- src/hooks/use-realtime-invalidation.ts
- src/hooks/__tests__/use-realtime-invalidation.test.ts
…d chart, time range selector

- Fleet overview page with 15s auto-refresh polling and time range selector
- 4 KPI cards: bytes in/out, events in/out, fleet health %, node count
- Volume trend AreaChart with bytesIn/bytesOut series using Recharts
- Added "Fleet Overview" nav link to fleet page

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arts

Add node-level throughput bar chart and per-node capacity utilization
trend charts (memory%, disk%, CPU load) to fleet overview page. Includes
getNodeThroughput/getNodeCapacity service functions, tRPC procedures with
team access control, SSE invalidation, bottleneck highlighting, and
8-color node palette. 6 new tests (505 total).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…verlays

Add data loss detection with configurable threshold (default 5%) showing
pipelines where events-out/events-in gap is suspicious. Enhance deployment
matrix with per-cell throughput rates and red highlighting on cells with
data loss. New tRPC endpoints: fleet.dataLoss, fleet.matrixThroughput.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
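
The data loss check described in this commit can be sketched as a pure function. This is a minimal version under stated assumptions: the loss rate is taken to be `(eventsIn - eventsOut) / eventsIn` expressed as a 0–1 fraction, pipelines with no inbound events are skipped, and results are sorted worst-first. The names `PipelineCounts` and `detectDataLoss` are hypothetical and do not come from the PR.

```typescript
// Hypothetical input shape: per-pipeline event totals over the time range.
interface PipelineCounts {
  pipelineId: string;
  eventsIn: number;
  eventsOut: number;
}

interface DataLossRow extends PipelineCounts {
  lossRate: number; // fraction in the range 0..1, not a percentage
}

// Flags pipelines whose in/out gap exceeds the threshold (default 5%).
function detectDataLoss(
  rows: PipelineCounts[],
  threshold: number = 0.05,
): DataLossRow[] {
  return rows
    // Skip idle pipelines: with zero inbound events the ratio is undefined.
    .filter((r) => r.eventsIn > 0)
    .map((r) => ({ ...r, lossRate: (r.eventsIn - r.eventsOut) / r.eventsIn }))
    .filter((r) => r.lossRate > threshold)
    // Worst offenders first (assumed default sort order).
    .sort((a, b) => b.lossRate - a.lossRate);
}
```

Because `lossRate` is a fraction here, any UI that displays it as a percentage has to multiply by 100 at the call site.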
@greptile-apps
Contributor

greptile-apps bot commented Mar 26, 2026

Greptile Summary

This PR introduces the M009 Fleet-Wide Observability milestone: a new /fleet/overview page with KPI cards, volume trend charts, cross-node throughput comparison, capacity utilization charts, a data-loss detection table, and throughput overlays on the deployment matrix. The backend is a new fleet-data.ts service using SQL-level date_trunc aggregation via Prisma.$queryRaw, wired into 6 new read-only tRPC procedures all protected with withTeamAccess("VIEWER").

Key findings:

  • One correctness bug in data-loss-table.tsx: formatPercent(row.lossRate) passes a 0–1 decimal directly to a formatter that appends % without multiplying by 100, so a 20% loss rate renders as "0.2%". Every other call site in this PR (fleet-kpi-cards.tsx) correctly pre-multiplies by 100.
  • The backend service is well-structured: all SQL queries use Prisma.sql parameterized templates (no injection risk), all 6 procedures carry withTeamAccess("VIEWER"), and cross-team isolation is enforced via environmentId subquery filters.
  • 19 unit tests cover all service functions including null/zero edge cases; the real-time invalidation hook is properly extended.

Confidence Score: 4/5

Safe to merge after fixing the one-line formatPercent call in data-loss-table.tsx; all backend code is correct and secure.

Strong PR overall — secure backend, proper RBAC, parameterized SQL, good test coverage. One concrete P1 display bug in data-loss-table where loss rates will render 100× too small, making the feature misleading in production. A single one-line fix resolves it.

src/components/fleet/data-loss-table.tsx — formatPercent call needs * 100 scaling.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/server/services/fleet-data.ts | New service with 6 SQL-level aggregation functions using Prisma.$queryRaw; all parameterized safely with Prisma.sql template tags and properly filtered by environmentId; BigInt→Number conversions handled correctly throughout. |
| src/server/routers/fleet.ts | Adds 6 new query procedures; all correctly guarded with withTeamAccess("VIEWER") and validated with Zod enum inputs; no mutations so withAudit not required. |
| src/components/fleet/data-loss-table.tsx | New data loss table component; contains a display bug where formatPercent(row.lossRate) passes a 0–1 fraction directly, causing loss rates to display 100× too small (e.g. 0.2% instead of 20%). |
| src/app/(dashboard)/fleet/overview/page.tsx | New fleet overview page orchestrating 6 parallel queries with 15s polling; correctly gates all queries on selectedEnvironmentId and handles error/loading states. |
| src/components/fleet/deployment-matrix.tsx | Augments deployment matrix cells with per-cell throughput rates and red-highlight for data-loss; throughput lookup uses O(1) Map. |
| src/server/services/__tests__/fleet-data.test.ts | 19 unit tests covering all 6 service functions, including null handling, zero-throughput skipping, and sort order. |
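
The O(1) throughput lookup noted for deployment-matrix.tsx can be pictured roughly as follows. This is only a sketch: the `CellThroughput` shape, the `cellKey` helper, and the `pipelineId:nodeId` key format are assumptions, not the component's actual code.

```typescript
// Hypothetical row shape returned by a matrix throughput query.
interface CellThroughput {
  pipelineId: string;
  nodeId: string;
  eventsPerSec: number;
}

// Composite key identifying one matrix cell (assumed format).
const cellKey = (pipelineId: string, nodeId: string): string =>
  `${pipelineId}:${nodeId}`;

// Build the index once per render so each cell lookup is a single Map.get
// instead of a linear scan over all rows.
function indexThroughput(rows: CellThroughput[]): Map<string, CellThroughput> {
  const byCell = new Map<string, CellThroughput>();
  for (const row of rows) {
    byCell.set(cellKey(row.pipelineId, row.nodeId), row);
  }
  return byCell;
}
```

With P pipelines and N nodes, indexing first keeps rendering the matrix at roughly P×N lookups rather than P×N scans of the full result set.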

Sequence Diagram

sequenceDiagram
    participant Page as FleetOverviewPage
    participant tRPC as tRPC (fleet router)
    participant MW as withTeamAccess("VIEWER")
    participant Svc as fleet-data.ts
    participant DB as PostgreSQL

    Page->>tRPC: fleet.overview(environmentId, range)
    Page->>tRPC: fleet.volumeTrend(environmentId, range)
    Page->>tRPC: fleet.nodeThroughput(environmentId, range)
    Page->>tRPC: fleet.nodeCapacity(environmentId, range)
    Page->>tRPC: fleet.dataLoss(environmentId, range, threshold)
    Page->>tRPC: fleet.matrixThroughput(environmentId, range)

    tRPC->>MW: validate team membership
    MW-->>tRPC: authorized

    tRPC->>Svc: getFleetOverview / getVolumeTrend / etc.
    Svc->>DB: Prisma.$queryRaw (date_trunc aggregation)
    DB-->>Svc: BigInt rows
    Svc-->>tRPC: typed results (BigInt to Number)
    tRPC-->>Page: serialized JSON

    Page->>Page: render KPI cards, charts, data-loss table, matrix
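
The "BigInt to Number" step in the diagram matters because `JSON.stringify` throws a `TypeError` on `bigint` values, so aggregated counters have to be converted before they leave the service. A minimal sketch of such a conversion is shown below. The helper name `toSafeNumber`, the treatment of `null` as 0, and the range check are assumptions, not the PR's code.

```typescript
// Converts a bigint aggregate (for example, a SUM over a counter column)
// into a plain number. A null result, as returned by SUM over zero rows,
// is mapped to 0 here; that choice is an assumption.
function toSafeNumber(value: bigint | null): number {
  if (value === null) return 0;
  // Beyond this range, Number cannot represent every integer exactly.
  const max = BigInt(Number.MAX_SAFE_INTEGER);
  if (value > max || value < -max) {
    throw new RangeError(`value ${value.toString()} exceeds the safe integer range`);
  }
  return Number(value);
}
```

Throwing on overflow is one option; clamping or returning a string would also work, depending on how the chart components consume the values.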

Comments Outside Diff (1)

  1. src/components/fleet/data-loss-table.tsx, line 349-350 (link)

    P1 Loss rate displayed 100× too small

    lossRate is stored as a fraction (0–1, e.g. 0.20 for 20% loss), but formatPercent simply calls v.toFixed(1) + "%" without multiplying by 100. A 20% loss rate will render as "0.2%" instead of "20.0%".

    Every other call in this PR that passes a fractional rate explicitly scales by 100 first — compare fleet-kpi-cards.tsx:

    {formatPercent((data?.errorRate ?? 0) * 100)} error rate

    The same fix is needed here:

    {formatPercent(row.lossRate * 100)}
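
The scaling issue described in this comment can be reproduced in isolation. The `formatPercent` below is an assumed reconstruction based on the description above (`v.toFixed(1) + "%"`), not a copy of the project's helper.

```typescript
// Assumed shape of the formatter: it expects a value already on the
// 0..100 scale and only adds one decimal place and a percent sign.
function formatPercent(v: number): string {
  return v.toFixed(1) + "%";
}

const lossRate = 0.2; // fraction on the 0..1 scale, i.e. 20% loss

// Passing the fraction directly under-reports the loss by a factor of 100.
const buggy = formatPercent(lossRate); // "0.2%"

// Scaling at the call site, as fleet-kpi-cards.tsx does, gives the intended text.
const fixed = formatPercent(lossRate * 100); // "20.0%"
```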

Reviews (1): Last reviewed commit: "feat(S03): Data loss detection table + d..."

@TerrifiedBug TerrifiedBug deleted the milestone/M009 branch March 26, 2026 19:46