Commit 1d85e41

[skip-ci][ntuple][doc] some corrections to Merging.md
1 parent c7a43fa commit 1d85e41

1 file changed: +23 -12 lines changed

tree/ntuple/doc/Merging.md

Lines changed: 23 additions & 12 deletions
@@ -18,9 +18,11 @@ Please note that the RNTupleMerger is currently experimental and the content of
1818
## Goal
1919
The goal of the RNTuple merging process is producing one output RNTuple from *N* input RNTuples that can be used as if it were produced directly in the merged state. This means that:
2020

21-
* R1: All fields in the output RNTuple are accessible and have a type compatible with the original fields of the input RNTuples.
21+
* R1: All fields in the output RNTuple are accessible and have a type compatible<sup>1</sup> with the original fields of the input RNTuples.
2222
* R2: The values of those fields are a concatenation of the original fields. If the first input RNTuple had *M* entries, the first *M* entries of the output RNTuple map to those entries; entry *M+1* maps to the first entry of the second input RNTuple, and so on.
2323

24+
<sup>1</sup>: currently "compatible" means "identical". This may be extended in the future to include fields that have convertible types.
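
Requirement R2 can be illustrated with a minimal sketch of the entry mapping it implies. This is not part of the RNTupleMerger: `EntryLocation` and `LocateEntry` are hypothetical names introduced here for illustration, and the sketch uses 0-based entry indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper type, not part of the RNTuple API.
struct EntryLocation {
   std::size_t fInputIndex;   // which input RNTuple the entry came from (merge order)
   std::uint64_t fLocalEntry; // 0-based entry index inside that input
};

// Maps a 0-based output entry index to its originating input and local entry,
// given the number of entries of each input in merge order (requirement R2).
std::optional<EntryLocation>
LocateEntry(const std::vector<std::uint64_t> &nEntriesPerInput, std::uint64_t outputEntry)
{
   std::uint64_t firstEntry = 0; // output index of the current input's first entry
   for (std::size_t i = 0; i < nEntriesPerInput.size(); ++i) {
      if (outputEntry < firstEntry + nEntriesPerInput[i])
         return EntryLocation{i, outputEntry - firstEntry};
      firstEntry += nEntriesPerInput[i];
   }
   return std::nullopt; // index is past the end of the merged RNTuple
}
```

With inputs of 3 and 2 entries, output entry 3 (the fourth entry) resolves to the first entry of the second input, matching the wording of R2.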
25+
2426
At a lower level, we require that:
2527

2628
* R3: the output RNTuple has the **same non-extended schema description** as the **first input RNTuple**;
@@ -36,40 +38,49 @@ Consequences of R3 and R4:
3638
The following properties are currently true but they are subject to change:
3739

3840
* P1: all output pages have the **same compression** (which may be different from the input pages' compression);
39-
* P2: the output clusters are **the same as the input clusters**;
40-
* P3: the output RNTuple **always has 1 cluster group**
41+
* P2: all pages in the same output column have the **same encoding** (which may be different from the inputs' encoding);
42+
* P3: the output clusters are **the same as the input clusters**;
43+
* P4: the output RNTuple **always has 1 cluster group**
44+
45+
Note that these properties influence and are influenced by the level of merging used.
46+
E.g. P1 and P2 are currently true because we only support L1 merging of pages with identical compressions. This is a limitation that we intend to lift at some point (both for L1 and L0 if we ever support it).
47+
P3 and P4 would not necessarily be true with L4 support (which might be desirable in some cases, e.g. to group pages into smaller/larger clusters).
48+
49+
Therefore we *will* want to drop these properties at some point, in order to improve the capabilities of the Merger.
4150

4251
## High-level description
43-
The merging process requires at least 1 input, consisting in a `RPageSource`.
52+
The merging process requires at least 1 input, in the form of an `RPageSource`.
4453

4554
The first input is attached in `EDescriptorDeserializeMode::kForWriting` mode, which doesn't collate the extended header with the non-extended header. Since we use the first input's descriptor as the output schema (barring late model extensions, see later), opening in `kForWriting` mode allows us to write the output to disk preserving the non-extended schema description as per requirement R3. A consequence of this choice is that the merger never produces (new) deferred columns in the output RNTuple's header.
4655

47-
In `Union` mode only, we allow any following input RNTuple to define new fields that don't appear in the first input. These fields, after being validated, are late model extended into the output model and will thus appear in the output RNTuple's extended header on disk. This means that all columns that were not part of the first input's schema become deferred columns in the output RNTuple (unless the first source had 0 entries).
56+
In `Union` mode only, we allow any subsequent input RNTuple to define new fields that don't appear in the first input. These fields, after being validated, are late model extended into the output model and will thus appear in the output RNTuple's extended header on disk. This means that all columns that were not part of the first input's schema become deferred columns in the output RNTuple (unless the first source had 0 entries).
4857

4958
## Descriptor compatibility and validation
5059
Whenever a new input is processed, we compare its descriptor with the output descriptor to verify that merging is possible.
5160

5261
The comparison function does 3 main things:
53-
- collect all "extra destination fields" (i.e. fields that exist in the destination but not in this input RNTuple)
54-
- collect all "extra source fields"
62+
- collect all "extra destination fields" (i.e. fields that exist in the output but not in this input RNTuple)
63+
- collect all "extra source fields" from the input RNTuple
5564
- collect and validate all common fields.
5665

5766
If the Merging Mode is set to **Filter** we require the "extra destination fields" list to be empty.
58-
If the Merging Mode is set to **Strict** we require both "extra destination fields" and "extra source fields" to be empty.
67+
If the Merging Mode is set to **Strict** we require both the "extra destination fields" and "extra source fields" lists to be empty.
5968
If the Merging Mode is set to **Union**, the "extra source fields" list is used to late model extend the destination model.
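
The three lists and the per-mode requirements above can be sketched as follows. This is a simplified illustration that matches top-level fields by name only; `MergingMode`, `FieldDiff`, `CompareFieldNames` and `IsMergeAllowed` are hypothetical names, not the merger's actual interface. The sketch also assumes that Union mode places no requirement on either list, since the text states none.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the merger's types (illustration only).
enum class MergingMode { kFilter, kStrict, kUnion };

struct FieldDiff {
   std::vector<std::string> fExtraDstFields; // in the output but not in this input
   std::vector<std::string> fExtraSrcFields; // in this input but not in the output
   std::vector<std::string> fCommonFields;   // present in both, still to be validated
};

static bool Contains(const std::vector<std::string> &names, const std::string &name)
{
   return std::find(names.begin(), names.end(), name) != names.end();
}

// Splits the field names of the output (dst) and of the current input (src)
// into the three lists described in the text.
FieldDiff CompareFieldNames(const std::vector<std::string> &dst, const std::vector<std::string> &src)
{
   FieldDiff diff;
   for (const auto &name : dst) {
      if (Contains(src, name))
         diff.fCommonFields.push_back(name);
      else
         diff.fExtraDstFields.push_back(name);
   }
   for (const auto &name : src) {
      if (!Contains(dst, name))
         diff.fExtraSrcFields.push_back(name);
   }
   return diff;
}

// Applies the per-mode requirement on the two "extra" lists.
bool IsMergeAllowed(MergingMode mode, const FieldDiff &diff)
{
   switch (mode) {
   case MergingMode::kFilter: return diff.fExtraDstFields.empty();
   case MergingMode::kStrict: return diff.fExtraDstFields.empty() && diff.fExtraSrcFields.empty();
   case MergingMode::kUnion: return true; // assumption: no requirement on either list
   }
   return false;
}
```

In this sketch the common fields are only collected; validating them is a separate step.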
6069

6170
As for common fields, they are matched by name and validated as follows:
6271
- any field that is projected in the destination must be also projected in the source and must be projected to the same field;
6372
- any field that is not projected in the destination must also not be projected in the source;
64-
- the field types names must be **identical** (*this could probably be relaxed in the future to allow for different but compatible types*)
73+
- the field type names must be **identical** (*this could probably be relaxed in the future to allow for different but compatible types - see requirement R1*)
6574
- the type checksums, if present, must be identical. Note that if a field has a type checksum and the other doesn't, we consider this valid (*is this sound?*);
6675
- the type versions must be identical;
6776
- the fields' structural roles must be identical;
68-
- the column representations must match, as follows:
77+
- the column representations must match<sup>1</sup>, as follows:
6978
- the source and destination fields must have the same number of columns;
7079
- the types of each column must either be identical or one must be the split/unsplit version of the other;
7180
- the bits on storage of both columns must be identical;
7281
- the value range of both columns must be identical;
73-
- the representation index of the each source column must be 0; (*why?*)
74-
- if the fields have subfields, the number of subfields must be identical, and each source subfield is recursively validated against its destination counterpart via all the rules described in this list.
82+
- the representation index of each source column must be 0 (i.e. we currently don't support multiple column representations while merging);
83+
- if the fields have subfields, the number of subfields must be identical, and each source subfield is recursively validated against its destination counterpart via all the rules described in this list.
84+
7585

86+
<sup>1</sup>: these restrictions will likely not be required for L4 merging.
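
The column representation rules listed above can be sketched as a single predicate. The types below are reduced, hypothetical stand-ins (a four-value `ColumnType`, a `ColumnInfo` struct) rather than the real column descriptor classes, and the recursive subfield validation is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical, reduced set of column types (illustration only).
enum class ColumnType { kInt32, kSplitInt32, kReal32, kSplitReal32 };

// Hypothetical summary of the column properties compared during validation.
struct ColumnInfo {
   ColumnType fType;
   std::uint16_t fBitsOnStorage;
   std::optional<std::pair<double, double>> fValueRange; // empty if no range is set
   std::uint32_t fRepresentationIndex;
};

// Maps a split type to its unsplit counterpart; other types are returned as-is.
ColumnType Unsplit(ColumnType type)
{
   switch (type) {
   case ColumnType::kSplitInt32: return ColumnType::kInt32;
   case ColumnType::kSplitReal32: return ColumnType::kReal32;
   default: return type;
   }
}

// Checks the columns of one source field against those of its destination field.
bool AreColumnsCompatible(const std::vector<ColumnInfo> &dst, const std::vector<ColumnInfo> &src)
{
   if (dst.size() != src.size())
      return false; // same number of columns
   for (std::size_t i = 0; i < dst.size(); ++i) {
      if (Unsplit(dst[i].fType) != Unsplit(src[i].fType))
         return false; // identical, or split/unsplit versions of each other
      if (dst[i].fBitsOnStorage != src[i].fBitsOnStorage)
         return false; // identical bits on storage
      if (dst[i].fValueRange != src[i].fValueRange)
         return false; // identical value range
      if (src[i].fRepresentationIndex != 0)
         return false; // only the first representation of a source column is accepted
   }
   return true;
}
```

Comparing types after mapping both to their unsplit form covers the "identical or split/unsplit" rule in one check.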
