
ClickHouse: ORDER BY tuple() on internal metadata tables causes TBs/day of wasted read I/O at scale #3806

@achow235

Description


dlt version

1.24.0

Describe the problem

The three dlt internal metadata tables (_dlt_version, _dlt_pipeline_state, _dlt_loads) are created on ClickHouse with MergeTree ORDER BY tuple() — no primary key, no ordering, no deduplication. Combined with dlt's append-only write pattern (rows are inserted but never deleted), these tables grow linearly with the number of pipeline runs.

dlt queries these tables at the start of every pipeline run but only needs the latest row. Without an ORDER BY key, ClickHouse performs a full table scan on every read.

In the ClickHouse destination (clickhouse.py), table engine selection skips ReplacingMergeTree for tables with write_disposition: "skip" (which covers _dlt_version and _dlt_loads). All three tables end up as plain MergeTree with no column hints for sort or primary_key, resulting in ORDER BY tuple().

The queries themselves are efficient (ORDER BY inserted_at DESC LIMIT 1), but without an index on the underlying table, ClickHouse must scan every row to satisfy the sort.
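For illustration, the state lookup has roughly this shape (a paraphrase based on the query patterns described here, not the exact SQL dlt generates):

```sql
-- Illustrative paraphrase of dlt's state lookup, not the exact generated SQL.
-- With ENGINE = MergeTree ORDER BY tuple(), ClickHouse has no sparse index
-- to serve the filter or the sort, so it reads every row in the table.
SELECT state
FROM <dataset>._dlt_pipeline_state
WHERE pipeline_name = 'my_pipeline'
ORDER BY _dlt_load_id DESC
LIMIT 1;
```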

Expected behavior

Internal metadata tables should use proper ORDER BY keys matching dlt's query patterns so reads are index lookups, not full table scans. Superseded rows should be cleaned up after writes so tables stay bounded regardless of how many pipeline runs occur.

| Table | Suggested ORDER BY | Query pattern |
| --- | --- | --- |
| `_dlt_version` | `(schema_name, inserted_at)` | `WHERE schema_name = ? ORDER BY inserted_at DESC` |
| `_dlt_pipeline_state` | `(pipeline_name, _dlt_load_id)` | `WHERE pipeline_name = ? ORDER BY load_id DESC LIMIT 1` |
| `_dlt_loads` | `(load_id)` | Joined on `load_id` |
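As a sketch, the resulting DDL could look like the following (column lists elided; the key columns come from the table above and are suggestions, not what dlt currently emits):

```sql
-- Sketch of suggested DDL; column definitions elided, keys per the table above.
-- Each ORDER BY key lets the corresponding lookup become an index read.
CREATE TABLE <dataset>._dlt_version (...)
ENGINE = MergeTree
ORDER BY (schema_name, inserted_at);

CREATE TABLE <dataset>._dlt_pipeline_state (...)
ENGINE = MergeTree
ORDER BY (pipeline_name, _dlt_load_id);

CREATE TABLE <dataset>._dlt_loads (...)
ENGINE = MergeTree
ORDER BY (load_id);
```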

Steps to reproduce

  1. Create a dlt pipeline with the ClickHouse destination
  2. Run the pipeline repeatedly (the more runs, the more pronounced the issue)
  3. Inspect the internal tables:
    SHOW CREATE TABLE <dataset>._dlt_version;
    -- Observe: ENGINE = MergeTree ORDER BY tuple()
    
    SELECT count() FROM <dataset>._dlt_version;
    -- Row count grows with every pipeline run
    
    SELECT formatReadableSize(sum(length(schema))) FROM <dataset>._dlt_version;
    -- Total size of schema JSON blobs grows unbounded
  4. Check system.query_log to observe full table scans on every pipeline run:
    SELECT
        normalized_query_hash,
        count() AS executions,
        formatReadableSize(sum(read_bytes)) AS total_read,
        left(any(query), 200) AS sample_query
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND query LIKE '%_dlt_version%'
    GROUP BY normalized_query_hash
    ORDER BY sum(read_bytes) DESC
    LIMIT 5

The issue is not immediately visible on small deployments. It becomes significant when pipelines run frequently (hundreds to thousands of runs/day) over weeks/months, as each run appends rows that all subsequent runs must scan.

Operating system

Linux

Runtime environment

Kubernetes

Python version

3.13

dlt data source

Affects all sources

dlt destination

No response

Other deployment details

No response

Additional information

The current workaround is a daily job that prunes the metadata tables.
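Such a prune job can be expressed as a ClickHouse mutation; the sketch below keeps only the newest row per schema in `_dlt_version` (column names assumed from the query patterns above; adapt per table, and note that mutations rewrite parts, so schedule them off-peak):

```sql
-- Hypothetical prune for _dlt_version: delete every row older than the
-- newest row for its schema_name. Mutations rewrite affected parts,
-- so run this during a low-traffic window.
ALTER TABLE <dataset>._dlt_version
DELETE WHERE (schema_name, inserted_at) NOT IN (
    SELECT schema_name, max(inserted_at)
    FROM <dataset>._dlt_version
    GROUP BY schema_name
);
```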
