Skip to content

transform_attributes could create an invalid sequence of encoded parent_ids using rename with no deletes #966

@albertlockett

Description

@albertlockett

Currently transform_attributes will only remove the quasi-delta encoding from parent_id column if there were some deleted rows. We're not materializing the parent IDs if there were some only replacements.

// Possibly remove any delta-encoding on the parent ID column. If there were any
// deletes, it could cause issues if the parent_ids are using the transport optimized
// quasi-delta encoding. This is because subsequent runs of key-value pairs may be
// joined deleted segments, meaning the delta encoding will change.
let any_rows_deleted = keys_transform_result.keep_ranges.is_some();
let should_materialize_parent_ids =
any_rows_deleted && schema.column_with_name(consts::PARENT_ID).is_some();
let (attrs_record_batch, schema) = if should_materialize_parent_ids {
let rb = materialize_parent_id_for_attributes::<u16>(attrs_record_batch)?;
let schema = rb.schema();
(rb, schema)
} else {
(attrs_record_batch.clone(), schema)
};

The assumption was that deletes would be the only way for this method to create invalid sequence of quasi-delta encoding (see here for an explanation of how this could happen).

Thinking about this a bit more deeply, I'm not convinced this assumption is correct .. for example if there was a sequence of records like:

(type, key, value, parent_id)
str, "k1", "a", 0,
str, "k1", "a", 1,
str, "k2", "a", 0,
str, "k1", "a", 0,
str, "k1", "a", 1,

If the parent IDs were quasi-delta encoded, this would represent a sequence actual IDs like 0, 1, 0, 0, 1.

If we rename the attribute "k2" to "k1", the sequence becomes:

(type, key, value, parent_id)
str, "k1", "a", 0,
str, "k1", "a", 1,
str, "k1", "a", 0,
str, "k1", "a", 0,
str, "k1", "a", 1,

And then if the parent IDs are considered to be quasi-delta encoded, this sequence now represents plain encoded IDs of 0, 1, 1, 1, 2. This is not correct!

Note that the batch in this example is a bit unusual -- typically when the parent IDs are quasi-delta encoded, by default we always sort by the key column. But a user could technically create a batch like this if the second sequence of "k1" was created by a previous replacement of a key that would get sorted after "k2". e.g. user had previously renamed "k3" -> "k1", where "k3" was following "k2" -- so even though it's unusual, we shouldn't assume it would never happen.

To fix:

The safest & easiest way to avoid this is probably to always materialize the parent IDs when we transform attributes.

Although, we could actually do this in a more optimized way by checking if the rows bordering the contiguous ranges we transform have the same type/key/value, and if this never happens, there's no reason to materialize the IDs in any case. Note that this becomes somewhat non-trivial when there's multiple renamings happening at once as we need to check the sequences after all the renaming and deletes have occurred.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinggood first issueGood for newcomersrustPull requests that update Rust code

    Type

    Projects

    Status

    Priority 2

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions