feat: pipeline health SLIs#33

Merged
TerrifiedBug merged 9 commits into main from feat/pipeline-slis
Mar 7, 2026

Conversation

@TerrifiedBug
Owner

Summary

  • Add PipelineSli model and database migration for storing per-pipeline health SLI definitions (error_rate, throughput_floor, discard_rate) with configurable thresholds, conditions, and evaluation windows
  • Add SLI evaluation service that checks pipeline metrics against configured SLI thresholds and returns an overall health status (healthy/degraded/no_data)
  • Add CRUD procedures (listSlis, upsertSli, deleteSli) and health query to the pipeline router, with audit logging and proper access control; update the list query to include healthStatus for deployed pipelines
  • Add Health column with color-coded badges to the pipeline list page, SLI configuration form to pipeline settings, and health indicator dot in the flow toolbar
  • Document pipeline health SLIs in the public docs (available metrics, health badges, configuration steps)

Test plan

  • Run npx prisma migrate dev against a real database and verify the PipelineSli table is created
  • Create SLIs via the pipeline settings UI and verify they persist (listSlis returns them)
  • Upsert an existing metric and confirm it updates rather than duplicating
  • Delete an SLI and confirm it is removed
  • Deploy a pipeline with SLIs configured and verify the health badge appears in the pipeline list
  • Verify the health indicator dot appears next to process status in the pipeline editor toolbar
  • Verify draft pipelines show "--" in the Health column (no badge)
  • Verify pipelines with no SLIs show "No SLIs" gray badge
  • Confirm TypeScript compilation passes with no errors
  • Confirm ESLint passes with no errors

Add PipelineSli table to track per-pipeline health SLI definitions
(error_rate, throughput_floor, discard_rate) with configurable
thresholds, conditions, and evaluation windows.

Evaluate pipeline health by checking configured SLI definitions
(error_rate, discard_rate, throughput_floor) against recent pipeline
metrics within the configured time window.

Add listSlis, upsertSli, deleteSli, and health procedures to the
pipeline router. Update the list query to include healthStatus
(healthy/degraded/no_data) for each deployed pipeline.

- Add Health column to pipeline list with green/yellow/gray badges
- Add SLI configuration form to pipeline settings (metric, condition,
  threshold, window with add/remove)
- Add health indicator dot next to process status in flow toolbar

Document available metrics (error_rate, discard_rate, throughput_floor),
health badges, and step-by-step SLI configuration instructions.
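The evaluation rules described above can be sketched as a pure function. This is an illustrative sketch, not the PR's actual code: the names `evaluateSli`, `SliDef`, and `WindowSums` are invented here, and the `condition` encoding (`"lt"`/`"gt"`) is an assumption; the real service feeds it sums from `prisma.pipelineMetric.aggregate`.

```typescript
// Hypothetical sketch of the SLI decision logic (healthy/degraded/no_data).
type SliMetric = "error_rate" | "discard_rate" | "throughput_floor";
type SliStatus = "healthy" | "degraded" | "no_data";

interface SliDef {
  metric: SliMetric;
  condition: "lt" | "gt"; // assumed encoding: breach when value < or > threshold
  threshold: number;
}

interface WindowSums {
  rowCount: number;       // number of metric rows in the window
  eventsIn: number;
  errorsTotal: number;
  eventsDiscarded: number;
}

function evaluateSli(sli: SliDef, agg: WindowSums): SliStatus {
  // No metric rows at all in the window: treated as breached (Degraded).
  if (agg.rowCount === 0) return "degraded";

  let value: number;
  if (sli.metric === "throughput_floor") {
    value = agg.eventsIn;
  } else {
    // Rate-based SLIs cannot compute a rate with zero ingested events,
    // so they report no_data instead of silently computing 0.
    if (agg.eventsIn === 0) return "no_data";
    const numerator =
      sli.metric === "error_rate" ? agg.errorsTotal : agg.eventsDiscarded;
    value = numerator / agg.eventsIn;
  }

  const breached =
    sli.condition === "lt" ? value < sli.threshold : value > sli.threshold;
  return breached ? "degraded" : "healthy";
}
```

The overall pipeline status would then be the worst result across all configured SLIs, with "No SLIs" shown when none are configured.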
@github-actions github-actions bot added the documentation and feature labels Mar 7, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR adds pipeline health SLIs end-to-end: a PipelineSli Prisma model with a composite unique constraint, an aggregate-based evaluator service, four new tRPC procedures (listSlis, upsertSli, deleteSli, health), a per-pipeline PipelineHealthBadge component with 30-second polling, a health indicator dot in the flow toolbar, and SLI management UI in pipeline settings.

All critical correctness bugs from the initial review have been addressed:

  • deleteSli now includes pipelineId in its input, fixing withTeamAccess resolution
  • upsertSli uses Prisma's native upsert against the @@unique([pipelineId, metric]) constraint, eliminating the TOCTOU race
  • The migration adds the composite unique index
  • The evaluator uses prisma.pipelineMetric.aggregate instead of findMany, avoiding unbounded row fetches
  • Health is fetched per-pipeline from the frontend rather than inline in the list endpoint, eliminating the N+1 server-side query pattern
  • no_data is now a distinct badge ("No Data") separate from "No SLIs"
  • eventsIn === 0 for rate-based metrics now returns no_data instead of silently computing 0

Remaining issue:

  • The documentation hint added in this PR states that all no-data situations show Degraded, but the code returns a "no_data" status (displayed as a "No Data" badge) when metric rows exist but report zero eventsIn for rate-based SLIs. Operators reading the docs would expect "Degraded" in this case and may not understand why the badge says "No Data" instead.

Confidence Score: 4/5

  • PR is safe to merge; all critical correctness bugs have been addressed, with only a minor documentation inaccuracy remaining.
  • All previously identified critical issues (broken deleteSli middleware resolution, TOCTOU duplicate SLI race, unbounded metric row fetch, N+1 queries in list, hardcoded null healthStatus, no_data rendered as "No SLIs") have been fixed. The one remaining issue is a docs/code inconsistency in the newly added warning hint — it describes a single "breached → Degraded" path, but the code has a distinct "no_data" path for rate-based SLIs with zero eventsIn that shows a "No Data" badge instead. This is a documentation error, not a runtime bug, and easily corrected.
  • docs/public/user-guide/pipelines.md — The warning hint contradicts the implementation for rate-based SLIs with zero throughput.

Last reviewed commit: 5cf87e0

- Add @@unique([pipelineId, metric]) constraint to prevent duplicate
  SLIs and use atomic Prisma upsert instead of read-then-write
- Add pipelineId to deleteSli input so withTeamAccess can resolve
  team context, and verify pipelineId matches the SLI being deleted
- Replace unbounded findMany with aggregate in SLI evaluator to
  avoid transferring all metric rows to the application layer
Comment on lines +25 to +33 of src/server/services/sli-evaluator.ts
```typescript
for (const sli of sliDefs) {
  const since = new Date(Date.now() - sli.windowMinutes * 60_000);

  // Use aggregate to avoid transferring all metric rows to the application
  const agg = await prisma.pipelineMetric.aggregate({
    where: { pipelineId, timestamp: { gte: since } },
    _sum: { eventsIn: true, errorsTotal: true, eventsDiscarded: true },
    _count: true,
  });
```

Aggregate window re-fetches the same rows for pipelines with multiple SLIs sharing the same windowMinutes

The loop calls prisma.pipelineMetric.aggregate(...) once per SLI, even when multiple SLIs share the same windowMinutes. For a pipeline with error_rate (5-minute window) and discard_rate (5-minute window), identical aggregate queries are issued twice, and the database performs the same windowed scan twice.

The same aggregated sums (eventsIn, errorsTotal, eventsDiscarded) are needed by all three metric types. Deduplicating by windowMinutes (collecting all SLIs that share a window, running one aggregate per unique window, then distributing results to each SLI) would halve or eliminate the redundant queries in the common case where SLIs share a window. The current design is also fragile: if a fourth metric type is added later, it's easy to forget to add its aggregate field and produce silent incorrect results.
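The deduplication suggested here can be sketched as a small grouping helper. This is an illustrative sketch, not code from the PR; `SliWindowDef` and `groupByWindow` are invented names:

```typescript
// Sketch: group SLI definitions by windowMinutes so the evaluator can run
// one aggregate query per unique window and share the sums across SLIs.
interface SliWindowDef {
  metric: string;
  windowMinutes: number;
}

function groupByWindow(slis: SliWindowDef[]): Map<number, SliWindowDef[]> {
  const groups = new Map<number, SliWindowDef[]>();
  for (const sli of slis) {
    const bucket = groups.get(sli.windowMinutes) ?? [];
    bucket.push(sli);
    groups.set(sli.windowMinutes, bucket);
  }
  return groups;
}
```

Inside the evaluator, the loop would then iterate `groupByWindow(sliDefs)`, issue a single `prisma.pipelineMetric.aggregate` call per unique window, and evaluate every SLI in that group against the shared sums.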


Comment on lines +120 to +122 of docs/public/user-guide/pipelines.md
{% hint style="warning" %}
If no metric data is available for the evaluation window (for example, the pipeline was recently deployed or has no traffic), the SLI is treated as **breached** and the pipeline health will show as **Degraded**.
{% endhint %}
Documentation contradicts implementation for rate-based SLIs with zero throughput

The hint block says:

"If no metric data is available for the evaluation window... the SLI is treated as breached and the pipeline health will show as Degraded."

This is true when agg._count === 0 (no metric rows at all). However, sli-evaluator.ts handles a second distinct case: when metric rows do exist in the window but eventsIn === 0 for error_rate or discard_rate, the SLI returns "no_data" status (not "breached"), and the UI shows a "No Data" badge — not "Degraded".

A user who configures only rate-based SLIs on a pipeline that has stalled (heartbeats arriving but zero events) will see "No Data" rather than "Degraded", which directly contradicts what this hint tells them to expect.

The hint should be split to cover both cases:

Suggested change:

```suggestion
{% hint style="warning" %}
If no metric rows exist for the evaluation window (for example, the pipeline was recently deployed), the SLI is treated as **breached** and the pipeline health will show as **Degraded**.

For rate-based SLIs (`error_rate`, `discard_rate`), if metric rows exist but report zero events ingested, the SLI result is **No Data** — no health determination is made. Pair these SLIs with a `throughput_floor` SLI to catch a stalled pipeline as **Degraded**.
{% endhint %}
```

@TerrifiedBug TerrifiedBug merged commit 451a50d into main Mar 7, 2026
12 checks passed
@TerrifiedBug TerrifiedBug deleted the feat/pipeline-slis branch March 7, 2026 14:34
TerrifiedBug added a commit that referenced this pull request Mar 7, 2026
When SLIs are configured but the pipeline has no traffic yet, show a
gray "No Data" badge instead of hiding the indicator entirely. This was
lost during the merge of PR #33 due to conflict resolution.