fix: use idle-based CPU formula in dashboard charts and alert evaluator#54

Merged: TerrifiedBug merged 2 commits into main from fix/api-v1-bigint-serialization on Mar 7, 2026

Conversation

@TerrifiedBug
Owner

Summary

  • The fleet detail page correctly uses (total - idle) / total for CPU%, but three other code paths still used the old cpuDelta / wallClockSeconds formula — producing values like 787% on multi-core machines
  • Aligns all CPU calculations to the same idle-based formula clamped to 0-100%
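A minimal sketch of the two formulas, assuming both counters are cumulative CPU-seconds summed across all cores. The names `CpuSample`, `cpuPercent`, and `legacyCpuPercent` are hypothetical and are not taken from the PR's diff; the legacy function is only a reconstruction of the old `cpuDelta / wallClockSeconds` behaviour described above.

```typescript
// One metrics sample. Both counters are assumed to be cumulative
// CPU-seconds summed across all cores (hypothetical shape).
interface CpuSample {
  cpuSecondsTotal: number;
  cpuSecondsIdle: number;
}

// Idle-based CPU usage between two samples, clamped to 0-100.
// Returns null when the total counter did not advance (for example
// after a counter reset), since no meaningful ratio exists.
function cpuPercent(older: CpuSample, newer: CpuSample): number | null {
  const totalDelta = newer.cpuSecondsTotal - older.cpuSecondsTotal;
  const idleDelta = newer.cpuSecondsIdle - older.cpuSecondsIdle;
  if (totalDelta <= 0) return null;
  const pct = ((totalDelta - idleDelta) / totalDelta) * 100;
  return Math.max(0, Math.min(100, pct));
}

// Reconstruction of the old formula: CPU-seconds divided by wall-clock
// seconds. The result scales with the number of cores, so it can
// exceed 100 on multi-core hosts.
function legacyCpuPercent(
  older: CpuSample,
  newer: CpuSample,
  wallClockSeconds: number,
): number {
  const cpuDelta = newer.cpuSecondsTotal - older.cpuSecondsTotal;
  return (cpuDelta / wallClockSeconds) * 100;
}
```

With illustrative numbers (an 8-core host accumulating 80 CPU-seconds over a 10-second interval, 40 of them idle), the legacy formula yields 800 while the idle-based one yields 50.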

Affected code paths

  • alert-evaluator.ts getCpuUsage() — alert threshold checks (the 787.80% firing alert)
  • dashboard.ts chartMetrics — dashboard CPU chart time-series
  • dashboard.ts nodeCards sparkline — was passing raw cumulative cpuSecondsTotal, now computes percentage

Test plan

  • Dashboard CPU chart shows realistic values (~43% matching top output), not pinned at 100%
  • Node cards sparkline reflects actual CPU usage
  • CPU alert rules fire at correct thresholds (e.g., >50% fires at real 50%, not 400%)

GET /api/v1/pipelines/:id and /api/v1/nodes/:id return 500 because
NodePipelineStatus contains BigInt fields (eventsIn, eventsOut, etc.)
that JSON.stringify cannot serialize. Adds a jsonResponse() helper
using a custom replacer to convert BigInts to numbers.

The fleet detail page correctly uses (total - idle) / total for CPU%,
but the dashboard charts and alert evaluator still used the old
cpuDelta / wallClockSeconds formula which gives per-core percentages
(e.g. 787% on an 8-core machine). This aligns all CPU calculations to
the same idle-based formula clamped to 0-100%.

Affected code paths:
- dashboard.chartMetrics: chart CPU time-series
- dashboard.nodeCards: sparkline CPU values
- alert-evaluator getCpuUsage: alert threshold checks
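For the chart and sparkline paths, the same formula is applied per pair of consecutive rows. The following is a rough sketch under stated assumptions: rows are sorted oldest first, the first point has no predecessor and is reported as 0, and a non-advancing total counter also maps to 0. `MetricRow`, `CpuPoint`, and `toCpuSeries` are hypothetical names, not the identifiers used in dashboard.ts.

```typescript
// Hypothetical row shape; timestamp is epoch milliseconds.
interface MetricRow {
  timestamp: number;
  cpuSecondsTotal: number;
  cpuSecondsIdle: number;
}

interface CpuPoint {
  timestamp: number;
  cpu: number; // percentage in the range 0-100
}

// Converts cumulative counters into a percentage series.
// Assumes `rows` is sorted oldest first.
function toCpuSeries(rows: MetricRow[]): CpuPoint[] {
  return rows.map((row, i) => {
    // The first row has nothing to diff against.
    if (i === 0) return { timestamp: row.timestamp, cpu: 0 };

    const prev = rows[i - 1];
    const totalDelta = row.cpuSecondsTotal - prev.cpuSecondsTotal;
    const idleDelta = row.cpuSecondsIdle - prev.cpuSecondsIdle;

    // A non-positive total delta (counter reset, duplicate row) yields 0.
    const cpu =
      totalDelta > 0
        ? Math.max(0, Math.min(100, ((totalDelta - idleDelta) / totalDelta) * 100))
        : 0;

    return { timestamp: row.timestamp, cpu };
  });
}
```

Reporting 0 for unusable pairs keeps the series the same length as the input, which is convenient for charting; an alert path would more plausibly skip such pairs instead of treating them as 0% usage.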
@github-actions github-actions bot added the fix label Mar 7, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR aligns CPU usage calculations across three previously-inconsistent code paths — alert-evaluator.ts and two sections of dashboard.ts — to use the idle-based formula (total - idle) / total, matching the fleet detail page that was already correct. It also adds a jsonResponse BigInt-safe helper and wires it into the /api/v1/nodes/[id] and /api/v1/pipelines/[id] routes, where NextResponse.json() would throw on BigInt fields from Prisma.

Changes:

  • alert-evaluator.ts: getCpuUsage() now selects cpuSecondsIdle instead of timestamp (wall-clock delta is no longer needed), computes totalDelta/idleDelta, guards against totalDelta ≤ 0, and clamps the result to 0–100%.
  • dashboard.ts nodeCards: Sparkline was incorrectly passing raw cumulative cpuSecondsTotal (could be millions of seconds); now correctly computes a percentage delta between consecutive rows.
  • dashboard.ts chartMetrics: Replaces cpuDelta / dtSec (yields >100% on multi-core hosts) with the idle-based formula; dtSec ≤ 0 guard is preserved as a sanity check on data ordering.
  • api-handler.ts: Introduces jsonResponse() with a JSON.stringify replacer that converts BigInt → Number; the any type is explicitly suppressed with an ESLint comment.
  • Route files: Both public API routes now use jsonResponse() so that BigInt metric fields (eventsIn, eventsOut, errorsTotal) are serialized correctly instead of throwing at runtime.
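The BigInt handling can be illustrated with a `JSON.stringify` replacer. This is a sketch of the technique only: the actual jsonResponse() body is not shown in this conversation, `bigIntReplacer` and `toJsonBody` are hypothetical names, and the sketch types the value as `unknown` rather than the `any` the review mentions. It returns the serialized string; wrapping it in a response object is left out.

```typescript
// Replacer for JSON.stringify: BigInt values become plain numbers,
// everything else passes through unchanged.
function bigIntReplacer(_key: string, value: unknown): unknown {
  return typeof value === "bigint" ? Number(value) : value;
}

// Serializes a payload that may contain BigInt fields.
// Without the replacer, JSON.stringify throws a TypeError on BigInt.
function toJsonBody(data: unknown): string {
  return JSON.stringify(data, bigIntReplacer);
}
```

The replacer runs for nested properties and array elements as well, so BigInt fields inside a list of node statuses would be converted without any per-field handling.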

No correctness bugs were identified. The formulas are mathematically correct, counter-reset edge cases are properly guarded, and the Prisma schema confirms cpuSecondsTotal/cpuSecondsIdle are Float (TypeScript number), so no BigInt coercion is needed for those fields.

Confidence Score: 5/5

  • This PR is safe to merge — the formula changes are mathematically correct, all edge cases (counter resets, division by zero) are properly guarded, and the BigInt serialization fix prevents runtime errors in the public API routes.
  • All three CPU formula changes consistently apply the same idle-based calculation that the fleet detail page already used. The Prisma schema confirms cpuSecondsIdle is a Float with a default of 0, so no type coercion issues exist. The Math.max/Math.min clamp correctly handles any pathological input (negative idleDelta from an unlikely isolated counter reset). The jsonResponse helper is a straightforward fix for BigInt serialization that was previously crashing the API routes silently.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/server/services/alert-evaluator.ts | Switches getCpuUsage() from wall-clock division to idle-based formula; removes unused timestamp field from select; correctly guards against totalDelta ≤ 0; clamps result to 0-100%. |
| src/server/routers/dashboard.ts | Adds cpuSecondsIdle to both the nodeCards sparkline query and chartMetrics query; applies idle-based formula correctly in both places; first sparkline point is always cpu=0 by design (acceptable); no BigInt issues since cpuSeconds fields are Float. |
| src/app/api/v1/_lib/api-handler.ts | Adds jsonResponse() helper that converts BigInt values to Number before JSON serialization; uses any type with explicit ESLint disable; Number(bigInt) could lose precision for values > 2^53 but is acceptable for metrics in practice. |
| src/app/api/v1/nodes/[id]/route.ts | Switches final response from NextResponse.json() to jsonResponse() to handle any BigInt fields in the node payload; no functional logic changes. |
| src/app/api/v1/pipelines/[id]/route.ts | Switches final response to jsonResponse(); needed because nodeStatuses includes eventsIn, eventsOut, errorsTotal BigInt fields that JSON.stringify would throw on. |
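The precision caveat for Number(bigInt) can be made concrete. The variant below is not part of this PR; it is one possible alternative, assuming API consumers can tolerate a string for out-of-range counters. `safeBigIntReplacer` is a hypothetical name.

```typescript
// Alternative replacer (not what the PR does): emit a number when the
// value converts to a safe integer, otherwise fall back to a decimal
// string so that no digits are lost.
function safeBigIntReplacer(_key: string, value: unknown): unknown {
  if (typeof value !== "bigint") return value;
  const asNumber = Number(value);
  return Number.isSafeInteger(asNumber) ? asNumber : value.toString();
}

// Demonstrates the loss: 2^53 + 1 is not representable as a double,
// so the plain conversation to Number rounds it to 2^53.
const big = BigInt("9007199254740993");
const rounded = Number(big); // 9007199254740992
```

The trade-off is a field whose JSON type can vary between number and string, which is why a plain Number conversion is a reasonable choice when counters are expected to stay well below 2^53.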

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Agent Heartbeat] --> B[metrics-ingest: Store NodeMetric with cpuSecondsTotal and cpuSecondsIdle]

    B --> C[alert-evaluator getCpuUsage]
    B --> D[dashboard nodeCards sparkline]
    B --> E[dashboard chartMetrics CPU series]

    C --> F["totalDelta = newer.total - older.total\nif totalDelta le 0 return null\ncpu% = clamp((totalDelta - idleDelta) / totalDelta x 100, 0, 100)"]
    D --> G["per-row delta between arr i and arr i-1\nif totalDelta > 0 compute cpu%\nelse cpu = 0"]
    E --> H["per-row delta between curr and prev\nif cpuTotalDelta > 0 compute cpu%\nelse cpu = 0"]

    F --> I[AlertEvent fire or resolve]
    G --> J[Node card sparkline chart]
    H --> K[Dashboard CPU time-series chart]

    L[Public REST routes nodes and pipelines] --> M[jsonResponse: BigInt to Number replacer]

Last reviewed commit: e18b43d

@TerrifiedBug TerrifiedBug merged commit de45e24 into main Mar 7, 2026
12 checks passed
@TerrifiedBug TerrifiedBug deleted the fix/api-v1-bigint-serialization branch March 7, 2026 19:01