`transform_attributes` could create an invalid sequence of encoded `parent_ids` using `rename` with no deletes

Currently `transform_attributes` will only remove the quasi-delta encoding from `parent_id` column if there were some deleted rows. We're not materializing the parent IDs if there were some only replacements.

https://github.com/open-telemetry/otel-arrow/blob/d4879f499c1819a268de6bb399c842db626b7a4d/rust/otel-arrow-rust/src/otap/transform.rs#L556-L569

The assumption was that deletes would be the only way for this method to create invalid sequence of quasi-delta encoding (see [here](https://github.com/open-telemetry/otel-arrow/blob/d4879f499c1819a268de6bb399c842db626b7a4d/rust/otel-arrow-rust/src/otap/transform.rs#L2870-L2885) for an explanation of how this could happen).

Thinking about this a bit more deeply, I'm not convinced this assumption is correct .. for example if there was a sequence of records like:
```
(type, key, value, parent_id)
str, "k1", "a", 0,
str, "k1", "a", 1,
str, "k2", "a", 0,
str, "k1", "a", 0,
str, "k1", "a", 1,
```
If the parent IDs were quasi-delta encoded, this would represent a sequence actual IDs like `0, 1, 0, 0, 1`.

If we rename the attribute "k2" to "k1", the sequence becomes:
```
(type, key, value, parent_id)
str, "k1", "a", 0,
str, "k1", "a", 1,
str, "k1", "a", 0,
str, "k1", "a", 0,
str, "k1", "a", 1,
```
And then if the parent IDs are considered to be quasi-delta encoded, this sequence now represents plain encoded IDs of `0, 1, 1, 1, 2`. This is not correct!

Note that the batch in this example is a bit unusual -- typically when the parent IDs are quasi-delta encoded, by default we always sort by the key column. But a user _could_ technically create a batch like this if the second sequence of "k1" was created by a previous replacement of a key that would get sorted after "k2". e.g. user had previously renamed "k3" -> "k1", where "k3" was following "k2" -- so even though it's unusual, we shouldn't assume it would _never_ happen.

**To fix:**

The safest & easiest way to avoid this is probably to always materialize the parent IDs when we transform attributes. 

Although, we could actually do this in a more optimized way by checking if the rows bordering the contiguous ranges we transform have the same type/key/value, and if this never happens, there's no reason to materialize the IDs in any case. Note that this becomes somewhat non-trivial when there's multiple renamings happening at once as we need to check the sequences _after_ all the renaming and deletes have occurred.


	// Possibly remove any delta-encoding on the parent ID column. If there were any
	// deletes, it could cause issues if the parent_ids are using the transport optimized
	// quasi-delta encoding. This is because subsequent runs of key-value pairs may be
	// joined deleted segments, meaning the delta encoding will change.
	let any_rows_deleted = keys_transform_result.keep_ranges.is_some();
	let should_materialize_parent_ids =
	any_rows_deleted && schema.column_with_name(consts::PARENT_ID).is_some();
	let (attrs_record_batch, schema) = if should_materialize_parent_ids {
	let rb = materialize_parent_id_for_attributes::<u16>(attrs_record_batch)?;
	let schema = rb.schema();
	(rb, schema)
	} else {
	(attrs_record_batch.clone(), schema)
	};

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

`transform_attributes` could create an invalid sequence of encoded `parent_ids` using `rename` with no deletes #966

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

transform_attributes could create an invalid sequence of encoded parent_ids using rename with no deletes #966

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`transform_attributes` could create an invalid sequence of encoded `parent_ids` using `rename` with no deletes #966