Releases: MTSWebServices/data-rentgen
0.5.0 (2026-03-19)
OpenLineage-related features
Extracting dataset & job tags
Now DataRentgen extracts tags from OpenLineage events:
- dataset tags (currently not reported by any integration)
- job & run tags

Some tags are created based on engine versions:
- airflow.version
- dbt.version
- flink.version
- hive.version
- spark.version
- openlineage_adapter.version
- openlineage_client.version (only for Python client v1.38.0 or higher)

Note that passing job & run tags depends on the integration. For example, tags can be set up for Spark, Airflow and dbt, but not for Flink or Hive. Tags are also configured differently in each integration.
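For orientation, job & run tags arrive via the OpenLineage tags facets. Below is a rough sketch of an event fragment; the namespace, tag keys, values and source strings are hypothetical, and the exact facet layout depends on the OpenLineage version and integration.
Event example
{
  "job": {
    "namespace": "yarn://my_cluster",
    "name": "my_job",
    "facets": {
      "tags": {
        "tags": [
          {"key": "environment", "value": "production", "source": "CONFIG"},
          {"key": "team", "value": "my_awesome_team", "source": "CONFIG"}
        ]
      }
    }
  }
}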
Extracting nominalTime
Now DataRentgen extracts the nominalTime run facet and stores its values in the run.expected_start_at and run.expected_end_at fields.
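The nominalTime facet is part of the standard OpenLineage spec; a minimal run fragment looks roughly like this (the runId and timestamps are illustrative):
Event example
{
  "run": {
    "runId": "01908224-8410-79a2-8de6-a769ad6944c9",
    "facets": {
      "nominalTime": {
        "nominalStartTime": "2024-07-05T09:00:00Z",
        "nominalEndTime": "2024-07-05T10:00:00Z"
      }
    }
  }
}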
Extracting jobDependencies
Now DataRentgen extracts information from the jobDependencies facet and stores it in the job_dependency table. For now this is just a simple tuple of from_dataset_id, to_dataset_id, type (an arbitrary string provided by the integration, not an enum). This may change in future versions of DataRentgen.
Currently the only integration providing this kind of information is Airflow, and only in the most recent versions of the OpenLineage provider for Airflow (2.10 or higher). For now the provider also doesn't send the facet with information about direct task -> task dependencies - only indirect ones (declared via Asset) are included. So there is a fallback for Airflow which extracts these dependencies from the downstream_task_ids and upstream_task_ids task fields.
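For illustration, each stored dependency is essentially one row per edge; the ids and the type string below are hypothetical (the type is whatever the integration reports):
Row example
{
  "from_dataset_id": 101,
  "to_dataset_id": 202,
  "type": "ASSET"
}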
REST API features
Added GET /v1/jobs/hierarchy endpoint
This endpoint can be used to retrieve the job hierarchy graph (parents, dependencies) for a given job. (#407, #412)
Response example
{
"relations": {
"parents": [
{
"from": {"kind": "JOB", "id": "1"},
"to": {"kind": "JOB", "id": "2"}
}
],
"dependencies": [
{
"from": {"kind": "JOB", "id": "3"},
"to": {"kind": "JOB", "id": "1"},
"type": "DIRECT_DEPENDENCY"
},
{
"from": {"kind": "JOB", "id": "1"},
"to": {"kind": "JOB", "id": "4"},
"type": "DIRECT_DEPENDENCY"
}
]
},
"nodes": {
"jobs": {
"1": {
"id": 1,
"parent_job_id": null,
"name": "my_job",
"type": "SPARK_APPLICATION",
"location": {
"name": "my_cluster",
"type": "YARN"
}
},
"2": {
"id": 2,
"parent_job_id": 1,
"name": "my_job.child_task",
"type": "SPARK_APPLICATION",
"location": {
"name": "my_cluster",
"type": "YARN"
}
},
"3": {
"id": 3,
"parent_job_id": null,
"name": "source_job",
"type": "SPARK_APPLICATION",
"location": {
"name": "my_cluster",
"type": "YARN"
}
},
"4": {
"id": 4,
"parent_job_id": null,
"name": "target_job",
"type": "SPARK_APPLICATION",
"location": {
"name": "my_cluster",
"type": "YARN"
}
}
}
}
}
Added parent relation between jobs
Jobs can now reference a parent job via the parent_job_id field. (#394)
Before:
Response example
{
"meta": { ... },
"items": [
{
"id": "42",
"data": {
"id": "42",
"name": "my-spark-task",
"type": "SPARK_APPLICATION",
"location": { ... }
}
}
]
}
After:
Response example
{
"meta": { ... },
"items": [
{
"id": "42",
"data": {
"id": "42",
"name": "my-spark-task",
"type": "SPARK_APPLICATION",
"location": { ... },
"parent_job_id": "10"
}
}
]
}
Added JOB-JOB and RUN-RUN relations to lineage API
For example, it is now possible to get the Airflow DAG → Airflow Task → Spark app chain from a single response. (#392, #399, #401)
Before:
Response example
{
"relations": {
"parents": [
{"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "RUN", "id": "parent-run-uuid"}},
{"from": {"kind": "JOB", "id": "2"}, "to": {"kind": "RUN", "id": "run-uuid"}}
],
"symlinks": [],
"inputs": [...],
"outputs": [...]
},
"nodes": {...}
}
After:
Response example
{
"relations": {
"parents": [
{"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "RUN", "id": "parent-run-uuid"}},
{"from": {"kind": "JOB", "id": "2"}, "to": {"kind": "RUN", "id": "run-uuid"}},
# NEW:
{"from": {"kind": "JOB", "id": "1"}, "to": {"kind": "JOB", "id": "2"}},
{"from": {"kind": "RUN", "id": "parent-run-uuid"}, "to": {"kind": "RUN", "id": "run-uuid"}}
],
"symlinks": [],
"inputs": [...],
"outputs": [...]
},
"nodes": {...}
}
Include job in GET /v1/runs response
This allows showing the job type & name for a specific run without sending additional requests. #411
Before:
Response example
{
"meta": {
"page": 1,
"page_size": 20,
"total_count": 1,
"pages_count": 1,
"has_next": False,
"has_previous": False,
"next_page": None,
"previous_page": None,
},
"items": [
{
"id": "01908224-8410-79a2-8de6-a769ad6944c9",
"data": {
"id": "01908224-8410-79a2-8de6-a769ad6944c9",
"created_at": "2024-07-05T09:05:49.584000",
"job_id": "123",
...
},
"statistics": { ... }
}
]
}
After:
Response example
{
"meta": {
"page": 1,
"page_size": 20,
"total_count": 1,
"pages_count": 1,
"has_next": False,
"has_previous": False,
"next_page": None,
"previous_page": None,
},
"items": [
{
"id": "01908224-8410-79a2-8de6-a769ad6944c9",
"data": {
"id": "01908224-8410-79a2-8de6-a769ad6944c9",
"created_at": "2024-07-05T09:05:49.584000",
"job_id": "123",
...
},
"job": {
"id": "123",
"name": "myjob",
...
},
"statistics": { ... }
}
]
}
Include last_run field in GET /v1/jobs response
This allows showing the last start time, status and duration for each job in the list without additional requests. #387
Before:
Response example
{
"meta": { ... },
"items": [
{
"id": "42",
"data": {
"id": "42",
"name": "my-spark-task",
"type": "SPARK_APPLICATION",
"location": { ... },
"parent_job_id": "10"
}
}
]
}After:
Response example
{
"meta": { ... },
"items": [
{
"id": "42",
"data": {
"id": "42",
"name": "my-spark-task",
"type": "SPARK_APPLICATION",
"location": { ... },
"parent_job_id": "10"
},
"last_run": {
"id": "01908224-8410-79a2-8de6-a769ad6944c9",
"created_at": "2024-07-05T09:05:49.584000",
"job_id": "123",
...
}
}
]
}
0.4.8 (2026-01-26)
Fixed an issue with updating a Location's external_id field - the server returned response code 200 but ignored the input value.
0.4.7 (2026-01-20)
Dependency-only updates.
0.4.6 (2026-01-12)
Dependency-only updates.
0.4.5 (2025-12-24)
Improvements
Allow disabling SessionMiddleware, as it is only required by KeycloakAuthProvider.
0.4.4 (2025-11-21)
Bug Fixes
- The 0.4.3 release broke inputs with 0-byte statistics; this is now fixed.
0.4.3 (2025-11-21)
Features
- Disable server.session.enabled by default. It is required only by KeycloakAuthProvider, which is not used by default.
Bug Fixes
- Escape unprintable ASCII symbols in SQL queries before storing them in Postgres. Previously, saving queries containing the \x00 symbol led to exceptions.
- The Kafka topic for malformed messages no longer has to use the same number of partitions as the input topics.
- Prevent OpenLineage from reporting events which claim to read 8 exabytes of data; this is actually a Spark quirk.
0.4.2 (2025-10-29)
Bug Fixes
- Fix search query filter on UI Run list page.
- Fix passing multiple filters to GET /v1/runs.
Doc-only Changes
- Document the DATA_RENTGEN__UI__AUTH_PROVIDER config variable.
0.4.1 (2025-10-08)
Features
- Add new GET /v1/locations/types endpoint returning a list of all known location types. (#328)
- Add new filter to GET /v1/jobs (#328):
  - location_type: list[str]
- Add new filter to GET /v1/datasets (#328):
  - location_type: list[str]
- Allow passing multiple location_type filters to GET /v1/locations. (#328)
- Allow passing multiple values to GET endpoints with filters like job_id, parent_run_id, and so on. (#329)
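For example, assuming values are passed as repeated query parameters (the ids below are hypothetical), such a request could look like:
Request example
GET /v1/runs?job_id=123&job_id=456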
0.4.0 (2025-10-03)
Features
- Introduce new http2kafka component. (#281)
  It allows using DataRentgen with OpenLineage HttpTransport. Authentication is done using personal tokens.
- Add REST API endpoints for managing personal tokens. (#276)
  List of endpoints:
  - GET /personal-tokens - get personal tokens of the current user.
  - POST /personal-tokens - create a new personal token for the current user.
  - PATCH /personal-tokens/:id - refresh a personal token (revoke the token and create a new one).
  - DELETE /personal-tokens/:id - revoke a personal token.
- Add new entities Tag and TagValue. #268
  Tags can be used as additional properties for other entities. This feature is still under construction.
- Added endpoint GET /v1/tags. #289
  Tag names and values can be paginated, searched, or fetched by ids.
  Response example
  [
    {
      "id": 1,
      "name": "env",
      "values": [
        { "id": 1, "value": "dev" },
        { "id": 2, "value": "prod" }
      ]
    }
  ]
- Updated GET /v1/datasets to include tags: [...] in the response. #289
  Dataset response examples
  Before:
  {
    "id": "8400",
    "location": {...},
    "name": "dataset_name",
    "schema": {}
  }
  After:
  {
    "id": "25896",
    "location": {...},
    "name": "dataset_name",
    "schema": {...},
    "tags": [  # <---
      {
        "id": "1",
        "name": "environment",
        "values": [
          { "id": "2", "value": "production" }
        ]
      },
      {
        "id": "2",
        "name": "team",
        "values": [
          { "id": "4", "value": "my_awesome_team" }
        ]
      }
    ]
  }
- Added new filters to GET /v1/datasets endpoint. (#294, #289)
  Query params:
  - location_id: int
  - tag_value_id: list[int] - if multiple values are passed, the dataset should have all of them.
- Added new filters for GET /v1/jobs endpoint. #319
  Query params:
  - location_id: int
  - job_type: list[str]
- Added new filters to GET /v1/runs endpoint. (#322, #323)
  Query params:
  - job_type: list[str]
  - status: list[RunStatus]
  - started_since: datetime | None
  - started_until: datetime | None
  - ended_since: datetime | None
  - ended_until: datetime | None
  - job_location_id: int | None
  - started_by_user: list[str] | None
- Added new endpoint GET /v1/jobs/types. #319
- Add custom dataRentgen_run and dataRentgen_operation facets. #265
  These facets allow:
  - passing custom external_id, persistent_log_url and other fields of Run;
  - passing custom name, description, group, position fields of Operation;
  - marking an event as containing only Operation data, or both Run + Operation data.
  See the sketch after this list.
- Set output.type based on the executed SQL query, e.g. INSERT, UPDATE, DELETE, and so on. #310
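A rough sketch of the custom run facet follows. The field values are hypothetical, and the exact facet schema is defined by DataRentgen, so treat this only as an orientation for which fields can be passed:
Event example
{
  "run": {
    "runId": "01908224-8410-79a2-8de6-a769ad6944c9",
    "facets": {
      "dataRentgen_run": {
        "external_id": "scheduled__2024-07-05T09:00:00",
        "persistent_log_url": "https://my-airflow.example.com/dags/my_dag/grid"
      }
    }
  }
}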
Improvements
- Improve consumer performance by reducing DB load on reading operations. #314
- Add a workaround for OpenLineage emitting Spark application events with job.name=unknown. #263
  This requires installing OpenLineage with this fix merged: OpenLineage/OpenLineage#3848.
- Dataset symlinks with no inputs/outputs are no longer removed from the lineage graph. #269
- Make matching for addresses and locations more deterministic by converting them to lowercase. #313
  Items oracle://host:1521 and ORACLE://HOST:1521 are now the same item oracle://host:1521.
- Make matching for datasets, jobs, tags and user names case-insensitive by using unique indexes on the lower(name) expression. #313
  Items database.schema.table and DATABASE.SCHEMA.TABLE are now the same item. As the dataset canonical name depends on the database naming convention (UPPERCASE for Oracle, lowercase for Postgres), we can't convert names to one specific case (upper or lower). Instead, the first received value is used as the canonical one.
Bug Fixes
- For lineage with granularity=DATASET, return the real lineage graph. #264
  Previously lineage was resolved by run_id, which could produce a wrong lineage graph; it is now resolved by operation_id.
- Exclude self-referencing lineage edges in case granularity=DATASET. #261
  If some run uses the same table as both input and output (e.g. merging duplicates or performing some checks before writing), DataRentgen excludes dataset1 -> dataset1 relations from the lineage.
  This doesn't affect chains like dataset1 -> job1 -> dataset1 or dataset1 -> dataset2 -> dataset1.