
ClickHouse: ORDER BY tuple() on internal metadata tables causes TBs/day of wasted read I/O at scale #3806

@achow235

Description


dlt version

1.24.0

Describe the problem

The three dlt internal metadata tables (_dlt_version, _dlt_pipeline_state, _dlt_loads) are created on ClickHouse with MergeTree ORDER BY tuple() — no primary key, no ordering, no deduplication. Combined with dlt's append-only write pattern (rows are inserted but never deleted), these tables grow linearly with the number of pipeline runs.

dlt queries these tables at the start of every pipeline run but only needs the latest row. Without an ORDER BY key, ClickHouse performs a full table scan on every read.

In the ClickHouse destination (clickhouse.py), table engine selection skips ReplacingMergeTree for tables with write_disposition: "skip" (which covers _dlt_version and _dlt_loads). All three tables end up as plain MergeTree with no column hints for sort or primary_key, resulting in ORDER BY tuple().

The queries themselves are efficient (ORDER BY inserted_at DESC LIMIT 1), but without an index on the underlying table, ClickHouse must scan every row to satisfy the sort.
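For illustration, the state lookup has roughly this shape (a paraphrase based on the query patterns described here, not the exact SQL dlt generates):

```sql
-- Illustrative paraphrase of dlt's state lookup, not the exact generated SQL.
-- With ENGINE = MergeTree ORDER BY tuple(), ClickHouse has no sparse index
-- to serve the filter or the sort, so it reads every row in the table.
SELECT state
FROM <dataset>._dlt_pipeline_state
WHERE pipeline_name = 'my_pipeline'
ORDER BY _dlt_load_id DESC
LIMIT 1;
```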

Expected behavior

Internal metadata tables should use proper ORDER BY keys matching dlt's query patterns so reads are index lookups, not full table scans. Superseded rows should be cleaned up after writes so tables stay bounded regardless of how many pipeline runs occur.

| Table | Suggested ORDER BY | Query pattern |
| --- | --- | --- |
| `_dlt_version` | `(schema_name, inserted_at)` | `WHERE schema_name = ? ORDER BY inserted_at DESC` |
| `_dlt_pipeline_state` | `(pipeline_name, _dlt_load_id)` | `WHERE pipeline_name = ? ORDER BY load_id DESC LIMIT 1` |
| `_dlt_loads` | `(load_id)` | Joined on `load_id` |
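As a sketch, the resulting DDL could look like the following (column lists elided; the key columns come from the table above and are suggestions, not what dlt currently emits):

```sql
-- Sketch of suggested DDL; column definitions elided, keys per the table above.
-- Each ORDER BY key lets the corresponding lookup become an index read.
CREATE TABLE <dataset>._dlt_version (...)
ENGINE = MergeTree
ORDER BY (schema_name, inserted_at);

CREATE TABLE <dataset>._dlt_pipeline_state (...)
ENGINE = MergeTree
ORDER BY (pipeline_name, _dlt_load_id);

CREATE TABLE <dataset>._dlt_loads (...)
ENGINE = MergeTree
ORDER BY (load_id);
```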

Steps to reproduce

  1. Create a dlt pipeline with the ClickHouse destination
  2. Run the pipeline repeatedly (the more runs, the more pronounced the issue)
  3. Inspect the internal tables:
    SHOW CREATE TABLE <dataset>._dlt_version;
    -- Observe: ENGINE = MergeTree ORDER BY tuple()
    
    SELECT count() FROM <dataset>._dlt_version;
    -- Row count grows with every pipeline run
    
    SELECT formatReadableSize(sum(length(schema))) FROM <dataset>._dlt_version;
    -- Total size of schema JSON blobs grows unbounded
  4. Check system.query_log to observe full table scans on every pipeline run:
    SELECT
        normalized_query_hash,
        count() AS executions,
        formatReadableSize(sum(read_bytes)) AS total_read,
        left(any(query), 200) AS sample_query
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND query LIKE '%_dlt_version%'
    GROUP BY normalized_query_hash
    ORDER BY sum(read_bytes) DESC
    LIMIT 5

The issue is not immediately visible on small deployments. It becomes significant when pipelines run frequently (hundreds to thousands of runs/day) over weeks/months, as each run appends rows that all subsequent runs must scan.

Operating system

Linux

Runtime environment

Kubernetes

Python version

3.13

dlt data source

Affects all sources

dlt destination

No response

Other deployment details

No response

Additional information

The current workaround is a daily job that prunes the metadata tables.
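Such a prune job can be expressed as a ClickHouse mutation; the sketch below keeps only the newest row per schema in `_dlt_version` (column names assumed from the query patterns above; adapt per table, and note that mutations rewrite parts, so schedule them off-peak):

```sql
-- Hypothetical prune for _dlt_version: delete every row older than the
-- newest row for its schema_name. Mutations rewrite affected parts,
-- so run this during a low-traffic window.
ALTER TABLE <dataset>._dlt_version
DELETE WHERE (schema_name, inserted_at) NOT IN (
    SELECT schema_name, max(inserted_at)
    FROM <dataset>._dlt_version
    GROUP BY schema_name
);
```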
