feat: pipeline health SLIs#33

Merged
TerrifiedBug merged 9 commits into main from feat/pipeline-slis
Mar 7, 2026

Conversation

@TerrifiedBug
Owner

Summary

  • Add PipelineSli model and database migration for storing per-pipeline health SLI definitions (error_rate, throughput_floor, discard_rate) with configurable thresholds, conditions, and evaluation windows
  • Add SLI evaluation service that checks pipeline metrics against configured SLI thresholds and returns an overall health status (healthy/degraded/no_data)
  • Add CRUD procedures (listSlis, upsertSli, deleteSli) and health query to the pipeline router, with audit logging and proper access control; update the list query to include healthStatus for deployed pipelines
  • Add Health column with color-coded badges to the pipeline list page, SLI configuration form to pipeline settings, and health indicator dot in the flow toolbar
  • Document pipeline health SLIs in the public docs (available metrics, health badges, configuration steps)

Test plan

  • Run npx prisma migrate dev against a real database and verify the PipelineSli table is created
  • Create SLIs via the pipeline settings UI and verify they persist (listSlis returns them)
  • Upsert an existing metric and confirm it updates rather than duplicating
  • Delete an SLI and confirm it is removed
  • Deploy a pipeline with SLIs configured and verify the health badge appears in the pipeline list
  • Verify the health indicator dot appears next to process status in the pipeline editor toolbar
  • Verify draft pipelines show "--" in the Health column (no badge)
  • Verify pipelines with no SLIs show "No SLIs" gray badge
  • Confirm TypeScript compilation passes with no errors
  • Confirm ESLint passes with no errors

Add PipelineSli table to track per-pipeline health SLI definitions
(error_rate, throughput_floor, discard_rate) with configurable
thresholds, conditions, and evaluation windows.

Evaluate pipeline health by checking configured SLI definitions
(error_rate, discard_rate, throughput_floor) against recent pipeline
metrics within the configured time window.

Add listSlis, upsertSli, deleteSli, and health procedures to the
pipeline router. Update the list query to include healthStatus
(healthy/degraded/no_data) for each deployed pipeline.

- Add Health column to pipeline list with green/yellow/gray badges
- Add SLI configuration form to pipeline settings (metric, condition,
  threshold, window with add/remove)
- Add health indicator dot next to process status in flow toolbar

Document available metrics (error_rate, discard_rate, throughput_floor),
health badges, and step-by-step SLI configuration instructions.
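The evaluation rules described above can be sketched as a pure function. This is an illustrative sketch, not the PR's actual code: the names `evaluateSli`, `SliDef`, and `WindowSums` are invented here, and the `condition` encoding (`"lt"`/`"gt"`) is an assumption; the real service feeds it sums from `prisma.pipelineMetric.aggregate`.

```typescript
// Hypothetical sketch of the SLI decision logic (healthy/degraded/no_data).
type SliMetric = "error_rate" | "discard_rate" | "throughput_floor";
type SliStatus = "healthy" | "degraded" | "no_data";

interface SliDef {
  metric: SliMetric;
  condition: "lt" | "gt"; // assumed encoding: breach when value < or > threshold
  threshold: number;
}

interface WindowSums {
  rowCount: number;       // number of metric rows in the window
  eventsIn: number;
  errorsTotal: number;
  eventsDiscarded: number;
}

function evaluateSli(sli: SliDef, agg: WindowSums): SliStatus {
  // No metric rows at all in the window: treated as breached (Degraded).
  if (agg.rowCount === 0) return "degraded";

  let value: number;
  if (sli.metric === "throughput_floor") {
    value = agg.eventsIn;
  } else {
    // Rate-based SLIs cannot compute a rate with zero ingested events,
    // so they report no_data instead of silently computing 0.
    if (agg.eventsIn === 0) return "no_data";
    const numerator =
      sli.metric === "error_rate" ? agg.errorsTotal : agg.eventsDiscarded;
    value = numerator / agg.eventsIn;
  }

  const breached =
    sli.condition === "lt" ? value < sli.threshold : value > sli.threshold;
  return breached ? "degraded" : "healthy";
}
```

The overall pipeline status would then be the worst result across all configured SLIs, with "No SLIs" shown when none are configured.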
@github-actions github-actions bot added the documentation and feature labels Mar 7, 2026
@greptile-apps
Contributor

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR adds pipeline health SLIs end-to-end: a PipelineSli Prisma model with a composite unique constraint, an aggregate-based evaluator service, four new tRPC procedures (listSlis, upsertSli, deleteSli, health), a per-pipeline PipelineHealthBadge component with 30-second polling, a health indicator dot in the flow toolbar, and SLI management UI in pipeline settings.

All critical correctness bugs from the initial review have been addressed:

  • deleteSli now includes pipelineId in its input, fixing withTeamAccess resolution
  • upsertSli uses Prisma's native upsert against the @@unique([pipelineId, metric]) constraint, eliminating the TOCTOU race
  • The migration adds the composite unique index
  • The evaluator uses prisma.pipelineMetric.aggregate instead of findMany, avoiding unbounded row fetches
  • Health is fetched per-pipeline from the frontend rather than inline in the list endpoint, eliminating the N+1 server-side query pattern
  • no_data is now a distinct badge ("No Data") separate from "No SLIs"
  • eventsIn === 0 for rate-based metrics now returns no_data instead of silently computing 0

Remaining issue:

  • The documentation hint added in this PR states that all no-data situations show Degraded, but the code returns a "no_data" status (displayed as a "No Data" badge) when metric rows exist but report zero eventsIn for rate-based SLIs. Operators reading the docs would expect "Degraded" in this case and may not understand why the badge says "No Data" instead.

Confidence Score: 4/5

  • PR is safe to merge; all critical correctness bugs have been addressed, with only a minor documentation inaccuracy remaining.
  • All previously identified critical issues (broken deleteSli middleware resolution, TOCTOU duplicate SLI race, unbounded metric row fetch, N+1 queries in list, hardcoded null healthStatus, no_data rendered as "No SLIs") have been fixed. The one remaining issue is a docs/code inconsistency in the newly added warning hint — it describes a single "breached → Degraded" path, but the code has a distinct "no_data" path for rate-based SLIs with zero eventsIn that shows a "No Data" badge instead. This is a documentation error, not a runtime bug, and easily corrected.
  • docs/public/user-guide/pipelines.md — The warning hint contradicts the implementation for rate-based SLIs with zero throughput.

Last reviewed commit: 5cf87e0

- Add @@unique([pipelineId, metric]) constraint to prevent duplicate
  SLIs and use atomic Prisma upsert instead of read-then-write
- Add pipelineId to deleteSli input so withTeamAccess can resolve
  team context, and verify pipelineId matches the SLI being deleted
- Replace unbounded findMany with aggregate in SLI evaluator to
  avoid transferring all metric rows to the application layer
Comment on lines +25 to +33 of src/server/services/sli-evaluator.ts
```typescript
for (const sli of sliDefs) {
  const since = new Date(Date.now() - sli.windowMinutes * 60_000);

  // Use aggregate to avoid transferring all metric rows to the application
  const agg = await prisma.pipelineMetric.aggregate({
    where: { pipelineId, timestamp: { gte: since } },
    _sum: { eventsIn: true, errorsTotal: true, eventsDiscarded: true },
    _count: true,
  });
```

Aggregate window re-fetches the same rows for pipelines with multiple SLIs sharing the same windowMinutes

The loop calls prisma.pipelineMetric.aggregate(...) once per SLI, even when multiple SLIs share the same windowMinutes. For a pipeline with error_rate (5-minute window) and discard_rate (5-minute window), identical aggregate queries are issued twice, and the database performs the same windowed scan twice.

The same aggregated sums (eventsIn, errorsTotal, eventsDiscarded) are needed by all three metric types. Deduplicating by windowMinutes (collecting all SLIs that share a window, running one aggregate per unique window, then distributing results to each SLI) would halve or eliminate the redundant queries in the common case where SLIs share a window. The current design is also fragile: if a fourth metric type is added later, it's easy to forget to add its aggregate field and produce silent incorrect results.
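The deduplication suggested here can be sketched as a small grouping helper. This is an illustrative sketch, not code from the PR; `SliWindowDef` and `groupByWindow` are invented names:

```typescript
// Sketch: group SLI definitions by windowMinutes so the evaluator can run
// one aggregate query per unique window and share the sums across SLIs.
interface SliWindowDef {
  metric: string;
  windowMinutes: number;
}

function groupByWindow(slis: SliWindowDef[]): Map<number, SliWindowDef[]> {
  const groups = new Map<number, SliWindowDef[]>();
  for (const sli of slis) {
    const bucket = groups.get(sli.windowMinutes) ?? [];
    bucket.push(sli);
    groups.set(sli.windowMinutes, bucket);
  }
  return groups;
}
```

Inside the evaluator, the loop would then iterate `groupByWindow(sliDefs)`, issue a single `prisma.pipelineMetric.aggregate` call per unique window, and evaluate every SLI in that group against the shared sums.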


Comment on lines +120 to +122 of docs/public/user-guide/pipelines.md
{% hint style="warning" %}
If no metric data is available for the evaluation window (for example, the pipeline was recently deployed or has no traffic), the SLI is treated as **breached** and the pipeline health will show as **Degraded**.
{% endhint %}
Documentation contradicts implementation for rate-based SLIs with zero throughput

The hint block says:

"If no metric data is available for the evaluation window... the SLI is treated as breached and the pipeline health will show as Degraded."

This is true when agg._count === 0 (no metric rows at all). However, sli-evaluator.ts handles a second distinct case: when metric rows do exist in the window but eventsIn === 0 for error_rate or discard_rate, the SLI returns "no_data" status (not "breached"), and the UI shows a "No Data" badge — not "Degraded".

A user who configures only rate-based SLIs on a pipeline that has stalled (heartbeats arriving but zero events) will see "No Data" rather than "Degraded", which directly contradicts what this hint tells them to expect.

The hint should be split to cover both cases:

Suggested change:

```suggestion
{% hint style="warning" %}
If no metric rows exist for the evaluation window (for example, the pipeline was recently deployed), the SLI is treated as **breached** and the pipeline health will show as **Degraded**.

For rate-based SLIs (`error_rate`, `discard_rate`), if metric rows exist but report zero events ingested, the SLI result is **No Data** — no health determination is made. Pair these SLIs with a `throughput_floor` SLI to catch a stalled pipeline as **Degraded**.
{% endhint %}
```

@TerrifiedBug TerrifiedBug merged commit 451a50d into main Mar 7, 2026
12 checks passed
@TerrifiedBug TerrifiedBug deleted the feat/pipeline-slis branch March 7, 2026 14:34
TerrifiedBug added a commit that referenced this pull request Mar 7, 2026
When SLIs are configured but the pipeline has no traffic yet, show a
gray "No Data" badge instead of hiding the indicator entirely. This was
lost during the merge of PR #33 due to conflict resolution.