[skip-ci][ntuple][doc] Add reference doc for RNTuple merging #20191

silverweed · 2025-10-24T13:37:56Z

Following this comment's suggestion.
This documentation is meant to be useful for future reference for ourselves (and maybe advanced users who need this information) and we should try to keep it up-to-date as the Merger is updated.

vepadulano

Very nice! A couple of comments for discussion.

vepadulano · 2025-10-24T14:43:06Z

tree/ntuple/doc/Merging.md

+## Goal
+The goal of the RNTuple merging process is producing one output RNTuple from *N* input RNTuples that can be used as if it were produced directly in the merged state. This means that:
+
+* R1: All fields in the output RNTuple are accessible and have a type compatible with the original fields of the input RNTuples.


Regarding the "compatible" bit: it would be good to also clarify how compatibility is defined in this context.

This is currently kept vague on purpose as we don't have a precise definition valid for the merger yet. At the moment "compatible" for the merger means "exactly the same type", but we know we can do better in principle. I can add a note that says this in the document if you'd like.

vepadulano · 2025-10-24T14:46:24Z

tree/ntuple/doc/Merging.md

+
+The first input is attached in `EDescriptorDeserializeMode::kForWriting` mode, which doesn't collate the extended header with the non-extended header. Since we use the first input's descriptor as the output schema (barring late model extensions, see later), opening in `kForWriting` mode allows us to write the output to disk preserving the non-extended schema description as per requirement R3. A consequence of this choice is that the merger never produces (new) deferred columns in the output RNTuple's header.
+
+In `Union` mode only, we allow any following input RNTuple to define new fields that don't appear in the first input. These fields, after being validated, are late model extended into the output model and will thus appear in the output RNTuple's extended header on disk. This means that all columns that were not part of the first input's schema become deferred columns in the output RNTuple (unless the first source had 0 entries).


unless the first source had 0 entries

In which case the non-extended output schema is equal to the non-extended schema of the second input? Or is there something more?

No, the schema will still be the union of the first two inputs, but the columns of the second won't be deferred because they still start at index 0 (since the first input had 0 entries).

Btw you made me realize we weren't testing this case so I added a test: #20241
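To make the zero-entry corner case concrete, here is a minimal toy model in Python. It is only a sketch of the reasoning in this thread, not the Merger's implementation: `Source`, `first_entry_index` and `deferred_fields` are hypothetical names, and "deferred" is modelled simply as "the field's first entry index in the output is greater than 0".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Source:
    """Hypothetical stand-in for one input RNTuple (not a ROOT class)."""
    fields: List[str]
    n_entries: int


def first_entry_index(sources: List[Source]) -> Dict[str, int]:
    """Union-mode toy model: map each output field to the output entry
    index at which it first appears."""
    start: Dict[str, int] = {}
    entries_so_far = 0
    for src in sources:
        for name in src.fields:
            # Record the index only the first time a field is seen.
            start.setdefault(name, entries_so_far)
        entries_so_far += src.n_entries
    return start


def deferred_fields(sources: List[Source]) -> List[str]:
    """Assumption of this sketch: a field is 'deferred' iff it starts
    at an entry index greater than 0."""
    return sorted(n for n, idx in first_entry_index(sources).items() if idx > 0)


# First input has entries: the new field "y" starts late, so it is deferred.
regular = [Source(["x"], 10), Source(["x", "y"], 5)]
# First input is empty: "y" still starts at index 0, so nothing is deferred.
empty_first = [Source(["x"], 0), Source(["x", "y"], 5)]
```

In this model the output schema is the union `{x, y}` in both cases; only the starting index of `y` differs, which is what decides whether it counts as deferred.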

pcanal · 2025-10-24T20:37:15Z

tree/ntuple/doc/Merging.md

+- any field that is projected in the destination must be also projected in the source and must be projected to the same field;
+- any field that is not projected in the destination must also not be projected in the source;
+- the field types names must be **identical** (*this could probably be relaxed in the future to allow for different but compatible types*)
+- the type checksums, if present, must be identical. Note that if a field has a type checksum and the other doesn't, we consider this valid (*is this sound?*);


Side note: mode (L4) would allow this to be fully relaxed:

- for the input, any checksum will do (within the constraints of regular schema evolution support);
- the destination checksum would have the same relationship to the in-memory class layout/checksum as with a regular RNTuple write (i.e. it would be the same).
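As a rough illustration of the four validation rules quoted in the diff above (as currently written, without the relaxation suggested in this comment), the check could be sketched as follows. This is an assumption-laden sketch in Python, not the Merger's code: `FieldInfo` and `validate_field` are hypothetical names, and a projection is modelled as the name of the field being projected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldInfo:
    """Hypothetical, simplified view of a field (not a ROOT class)."""
    name: str
    type_name: str
    projected_from: Optional[str] = None  # target field name if projected
    type_checksum: Optional[int] = None   # None means "no checksum present"


def validate_field(dst: FieldInfo, src: FieldInfo) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    if dst.projected_from is not None:
        # Rule 1: projected in destination => projected in source, same target.
        if src.projected_from is None:
            errors.append(f"{dst.name}: projected in destination but not in source")
        elif src.projected_from != dst.projected_from:
            errors.append(f"{dst.name}: projected to different fields")
    elif src.projected_from is not None:
        # Rule 2: not projected in destination => not projected in source.
        errors.append(f"{dst.name}: projected in source but not in destination")
    # Rule 3: type names must be identical.
    if dst.type_name != src.type_name:
        errors.append(f"{dst.name}: type mismatch ({dst.type_name} vs {src.type_name})")
    # Rule 4: checksums are compared only when both sides have one.
    if (dst.type_checksum is not None and src.type_checksum is not None
            and dst.type_checksum != src.type_checksum):
        errors.append(f"{dst.name}: type checksum mismatch")
    return errors
```

Note that rule 4 deliberately accepts the case where only one side carries a checksum, mirroring the open "*is this sound?*" question in the quoted text.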

enirolf · 2025-10-28T08:31:14Z

tree/ntuple/doc/Merging.md

+## High-level description
+The merging process requires at least 1 input, consisting in a `RPageSource`.
+
+The first input is attached in `EDescriptorDeserializeMode::kForWriting` mode, which doesn't collate the extended header with the non-extended header. Since we use the first input's descriptor as the output schema (barring late model extensions, see later), opening in `kForWriting` mode allows us to write the output to disk preserving the non-extended schema description as per requirement R3. A consequence of this choice is that the merger never produces (new) deferred columns in the output RNTuple's header.


Suggested change

The first input is attached in `EDescriptorDeserializeMode::kForWriting` mode, which doesn't collate the extended header with the non-extended header. Since we use the first input's descriptor as the output schema (barring late model extensions, see later), opening in `kForWriting` mode allows us to write the output to disk preserving the non-extended schema description as per requirement R3. A consequence of this choice is that the merger never produces (new) deferred columns in the output RNTuple's header.

The first input is attached in `EDescriptorDeserializeMode::kForWriting` mode, which doesn't collate the extended header with the non-extended header. Since we use the first input's descriptor as the output schema (barring late model extensions, see later), opening in `kForWriting` mode allows us to write the output to disk, preserving the non-extended schema description as per requirement R3. A consequence of this choice is that the merger never produces (new) deferred columns in the output RNTuple's header.

Maybe a "while" instead of the comma would communicate the intended message better? (The idea is that we could still write the output to disk, but we wouldn't be preserving the schema description, whereas the comma makes it sound as if "writing to disk" is the thing that this mode allows.)

[skip-ci][ntuple][doc] Add reference doc for RNTuple merging

c7a43fa

silverweed requested review from enirolf, hahnjo and pcanal October 24, 2025 13:37

silverweed self-assigned this Oct 24, 2025

silverweed requested a review from jblomer as a code owner October 24, 2025 13:37

silverweed added the in:RNTuple label Oct 24, 2025

vepadulano reviewed Oct 24, 2025

pcanal reviewed Oct 24, 2025

jblomer approved these changes Oct 27, 2025

enirolf reviewed Oct 28, 2025

[skip-ci][ntuple][doc] some corrections to Merging.md

a137752

silverweed force-pushed the ntuple_merge_doc branch from 1d85e41 to a137752 Compare October 30, 2025 09:36

