Overview
Currently only the median (P50) is computed. Extending this to P25, P75, P95, and P99 would give full box-plot coverage and make outlier detection practical without leaving dbt-profiler.
Motivation
- P25/P75 (IQR) are the standard for identifying outliers and understanding distribution shape
- P95/P99 are essential for SLA/latency-style columns and catching extreme values
- Users doing data quality monitoring almost always want these alongside mean and stddev
Implementation notes
The per-adapter measure_median infrastructure is already in place. Each adapter override would need additional percentile calls:
| Adapter |
Approach |
| Default (Postgres, Redshift) |
percentile_cont(0.25) within group (order by col) |
| BigQuery |
approx_quantiles(col, 100)[offset(25)] |
| SQL Server |
Extend the existing ROW_NUMBER() approach |
| Snowflake |
percentile_cont(0.25) within group (order by col) |
| Databricks/Athena |
percentile_approx(col, 0.25) |
API design
New optional measures (excluded via exclude_measures like existing ones):
{{ dbt_profiler.get_profile(relation=ref('my_model'), exclude_measures=['p25', 'p75', 'p95', 'p99']) }}
Or grouped under a single toggle (e.g. exclude_measures=['percentiles']) to avoid cluttering the interface — worth discussing.
Overview
Currently only the median (P50) is computed. Extending this to P25, P75, P95, and P99 would give full box-plot coverage and make outlier detection practical without leaving dbt-profiler.
Motivation
Implementation notes
The per-adapter
measure_medianinfrastructure is already in place. Each adapter override would need additional percentile calls:percentile_cont(0.25) within group (order by col)approx_quantiles(col, 100)[offset(25)]ROW_NUMBER()approachpercentile_cont(0.25) within group (order by col)percentile_approx(col, 0.25)API design
New optional measures (excluded via
exclude_measureslike existing ones):{{ dbt_profiler.get_profile(relation=ref('my_model'), exclude_measures=['p25', 'p75', 'p95', 'p99']) }}Or grouped under a single toggle (e.g.
exclude_measures=['percentiles']) to avoid cluttering the interface — worth discussing.