-
Couldn't load subscription status.
- Fork 56
Description
Currently transform_attributes will only remove the quasi-delta encoding from parent_id column if there were some deleted rows. We're not materializing the parent IDs if there were some only replacements.
otel-arrow/rust/otel-arrow-rust/src/otap/transform.rs
Lines 556 to 569 in d4879f4
| // Possibly remove any delta-encoding on the parent ID column. If there were any | |
| // deletes, it could cause issues if the parent_ids are using the transport optimized | |
| // quasi-delta encoding. This is because subsequent runs of key-value pairs may be | |
| // joined deleted segments, meaning the delta encoding will change. | |
| let any_rows_deleted = keys_transform_result.keep_ranges.is_some(); | |
| let should_materialize_parent_ids = | |
| any_rows_deleted && schema.column_with_name(consts::PARENT_ID).is_some(); | |
| let (attrs_record_batch, schema) = if should_materialize_parent_ids { | |
| let rb = materialize_parent_id_for_attributes::<u16>(attrs_record_batch)?; | |
| let schema = rb.schema(); | |
| (rb, schema) | |
| } else { | |
| (attrs_record_batch.clone(), schema) | |
| }; |
The assumption was that deletes would be the only way for this method to create invalid sequence of quasi-delta encoding (see here for an explanation of how this could happen).
Thinking about this a bit more deeply, I'm not convinced this assumption is correct .. for example if there was a sequence of records like:
(type, key, value, parent_id)
str, "k1", "a", 0,
str, "k1", "a", 1,
str, "k2", "a", 0,
str, "k1", "a", 0,
str, "k1", "a", 1,
If the parent IDs were quasi-delta encoded, this would represent a sequence actual IDs like 0, 1, 0, 0, 1.
If we rename the attribute "k2" to "k1", the sequence becomes:
(type, key, value, parent_id)
str, "k1", "a", 0,
str, "k1", "a", 1,
str, "k1", "a", 0,
str, "k1", "a", 0,
str, "k1", "a", 1,
And then if the parent IDs are considered to be quasi-delta encoded, this sequence now represents plain encoded IDs of 0, 1, 1, 1, 2. This is not correct!
Note that the batch in this example is a bit unusual -- typically when the parent IDs are quasi-delta encoded, by default we always sort by the key column. But a user could technically create a batch like this if the second sequence of "k1" was created by a previous replacement of a key that would get sorted after "k2". e.g. user had previously renamed "k3" -> "k1", where "k3" was following "k2" -- so even though it's unusual, we shouldn't assume it would never happen.
To fix:
The safest & easiest way to avoid this is probably to always materialize the parent IDs when we transform attributes.
Although, we could actually do this in a more optimized way by checking if the rows bordering the contiguous ranges we transform have the same type/key/value, and if this never happens, there's no reason to materialize the IDs in any case. Note that this becomes somewhat non-trivial when there's multiple renamings happening at once as we need to check the sequences after all the renaming and deletes have occurred.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status