Add PipelineSli table to track per-pipeline health SLI definitions (error_rate, throughput_floor, discard_rate) with configurable thresholds, conditions, and evaluation windows.
Evaluates pipeline health by checking configured SLI definitions (error_rate, discard_rate, throughput_floor) against recent pipeline metrics within the configured time window.
Add listSlis, upsertSli, deleteSli, and health procedures to the pipeline router. Update the list query to include healthStatus (healthy/degraded/no_data) for each deployed pipeline.
- Add Health column to pipeline list with green/yellow/gray badges
- Add SLI configuration form to pipeline settings (metric, condition, threshold, window with add/remove)
- Add health indicator dot next to process status in flow toolbar
Document available metrics (error_rate, discard_rate, throughput_floor), health badges, and step-by-step SLI configuration instructions.
Greptile Summary

This PR adds pipeline health SLIs end-to-end. All critical correctness bugs from the initial review have been addressed:
Remaining issue:
Confidence Score: 4/5
Last reviewed commit: 5cf87e0
- Add `@@unique([pipelineId, metric])` constraint to prevent duplicate SLIs and use atomic Prisma upsert instead of read-then-write
- Add `pipelineId` to `deleteSli` input so `withTeamAccess` can resolve team context, and verify `pipelineId` matches the SLI being deleted
- Replace unbounded `findMany` with `aggregate` in SLI evaluator to avoid transferring all metric rows to the application layer
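The first fix above can be sketched as a Prisma schema fragment (a minimal sketch; every field other than `pipelineId`, `metric`, and the threshold/condition/window settings named in this PR is an assumption):

```prisma
// Hypothetical PipelineSli model illustrating the @@unique constraint.
model PipelineSli {
  id            String @id @default(cuid())
  pipelineId    String
  metric        String // "error_rate" | "throughput_floor" | "discard_rate"
  condition     String
  threshold     Float
  windowMinutes Int

  // One SLI per (pipeline, metric) pair; duplicates are rejected at the DB level.
  @@unique([pipelineId, metric])
}
```

With the compound unique in place, Prisma generates a `pipelineId_metric` composite key input, so `upsertSli` can use an atomic `prisma.pipelineSli.upsert({ where: { pipelineId_metric: { pipelineId, metric } }, ... })` instead of a racy read-then-write.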
```typescript
for (const sli of sliDefs) {
  const since = new Date(Date.now() - sli.windowMinutes * 60_000);

  // Use aggregate to avoid transferring all metric rows to the application
  const agg = await prisma.pipelineMetric.aggregate({
    where: { pipelineId, timestamp: { gte: since } },
    _sum: { eventsIn: true, errorsTotal: true, eventsDiscarded: true },
    _count: true,
  });
```
Aggregate window re-fetches the same rows for pipelines with multiple SLIs sharing the same windowMinutes
The loop calls `prisma.pipelineMetric.aggregate(...)` once per SLI, even when multiple SLIs share the same `windowMinutes`. For a pipeline with `error_rate` (5-minute window) and `discard_rate` (5-minute window), identical aggregate queries are issued twice and the DB computes the same full-table scan twice.
The same aggregated sums (eventsIn, errorsTotal, eventsDiscarded) are needed by all three metric types. Deduplicating by windowMinutes (collecting all SLIs that share a window, running one aggregate per unique window, then distributing results to each SLI) would halve or eliminate the redundant queries in the common case where SLIs share a window. The current design is also fragile: if a fourth metric type is added later, it's easy to forget to add its aggregate field and produce silent incorrect results.
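The deduplication described above can be sketched as a pure grouping step (a sketch only; the `SliDef` shape is assumed from the fields this PR mentions):

```typescript
// Group SLI definitions by windowMinutes so each unique window is aggregated once.
interface SliDef {
  metric: string;
  windowMinutes: number;
}

function groupByWindow(sliDefs: SliDef[]): Map<number, SliDef[]> {
  const groups = new Map<number, SliDef[]>();
  for (const sli of sliDefs) {
    const bucket = groups.get(sli.windowMinutes) ?? [];
    bucket.push(sli);
    groups.set(sli.windowMinutes, bucket);
  }
  return groups;
}

// In the evaluator, the loop would then run one aggregate per unique window
// and hand the shared sums to every SLI in that window, e.g.:
//
// for (const [windowMinutes, slis] of groupByWindow(sliDefs)) {
//   const since = new Date(Date.now() - windowMinutes * 60_000);
//   const agg = await prisma.pipelineMetric.aggregate({ /* as above */ });
//   for (const sli of slis) evaluate(sli, agg);
// }
```

For the common case of `error_rate` and `discard_rate` sharing one window, this halves the aggregate queries.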
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/server/services/sli-evaluator.ts
Line: 25-33
How can I resolve this? If you propose a fix, please make it concise.

```markdown
{% hint style="warning" %}
If no metric data is available for the evaluation window (for example, the pipeline was recently deployed or has no traffic), the SLI is treated as **breached** and the pipeline health will show as **Degraded**.
{% endhint %}
```
Documentation contradicts implementation for rate-based SLIs with zero throughput
The hint block says:
"If no metric data is available for the evaluation window... the SLI is treated as breached and the pipeline health will show as Degraded."
This is true when agg._count === 0 (no metric rows at all). However, sli-evaluator.ts handles a second distinct case: when metric rows do exist in the window but eventsIn === 0 for error_rate or discard_rate, the SLI returns "no_data" status (not "breached"), and the UI shows a "No Data" badge — not "Degraded".
A user who configures only rate-based SLIs on a pipeline that has stalled (heartbeats arriving but zero events) will see "No Data" rather than "Degraded", which directly contradicts what this hint tells them to expect.
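The two distinct cases described above can be sketched as a pure decision function (a hedged sketch of the behavior attributed to `sli-evaluator.ts`; the `WindowAgg` shape and the `>` comparison are assumptions, since the actual condition is configurable):

```typescript
type SliStatus = "ok" | "breached" | "no_data";

interface WindowAgg {
  rowCount: number;    // agg._count: number of metric rows in the window
  eventsIn: number;    // agg._sum.eventsIn ?? 0
  errorsTotal: number; // agg._sum.errorsTotal ?? 0
}

function evaluateErrorRate(agg: WindowAgg, threshold: number): SliStatus {
  // Case 1: no metric rows at all in the window -> breached ("Degraded" in the UI).
  if (agg.rowCount === 0) return "breached";
  // Case 2: rows exist but zero throughput -> rate is undefined, so "no_data"
  // ("No Data" badge), not "breached".
  if (agg.eventsIn === 0) return "no_data";
  const rate = agg.errorsTotal / agg.eventsIn;
  return rate > threshold ? "breached" : "ok";
}
```

This is why a stalled pipeline (heartbeat rows arriving, zero events) shows "No Data" for rate-based SLIs rather than "Degraded".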
The hint should be split to cover both cases:
```diff
- {% hint style="warning" %}
- If no metric data is available for the evaluation window (for example, the pipeline was recently deployed or has no traffic), the SLI is treated as **breached** and the pipeline health will show as **Degraded**.
- {% endhint %}
+ {% hint style="warning" %}
+ If no metric rows exist for the evaluation window (for example, the pipeline was recently deployed), the SLI is treated as **breached** and the pipeline health will show as **Degraded**.
+ For rate-based SLIs (`error_rate`, `discard_rate`), if metric rows exist but report zero events ingested, the SLI result is **No Data** — no health determination is made. Pair these SLIs with a `throughput_floor` SLI to catch a stalled pipeline as **Degraded**.
+ {% endhint %}
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: docs/public/user-guide/pipelines.md
Line: 120-122
How can I resolve this? If you propose a fix, please make it concise.

When SLIs are configured but the pipeline has no traffic yet, show a gray "No Data" badge instead of hiding the indicator entirely. This was lost during the merge of PR #33 due to conflict resolution.
Summary
- `PipelineSli` model and database migration for storing per-pipeline health SLI definitions (error_rate, throughput_floor, discard_rate) with configurable thresholds, conditions, and evaluation windows

Test plan

- Run `npx prisma migrate dev` against a real database and verify the `PipelineSli` table is created