@rkrenzler
Reduce the number of segments returned by choose_checkpoints. Even if each key dimension is split only once, a large number of key dimensions still produces an excessive number of segments, which significantly degrades the performance of reladiff. For example, with 20 key columns, one split per dimension yields 2^20 = 1,048,576 segments.

@erezsh
Owner

erezsh commented Aug 7, 2025

Can you explain the intuition behind the new split_compound_key_space?

I understand the need to limit the number of segments, but I'm asking about the specific method.

@rkrenzler
Author

Remark: k splits lead to k+1 segments.
The current implementation derives the number of splits per dimension from the total number of splits for the whole key space. This happens in choose_checkpoints(...) with the formula count = int(count ** (1 / len(self.key_columns))) or 1.
The number of splits per dimension is stored in the variable count and is at least 1. split_compound_key_space then uses this count to split every key dimension. This means that even if we want to split the whole key space into only two segments, we get at least 2^n segments (all n dimensions are split at least once), where n is the dimension of the key space.
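To make the blow-up concrete, here is a minimal sketch of the arithmetic only. The function names are hypothetical and not part of reladiff; the first one just restates the formula quoted above, and the second assumes that every dimension is split with the same count.

```python
def old_splits_per_dimension(total_splits: int, n_key_columns: int) -> int:
    """Per-dimension split count, restating the formula quoted above."""
    return int(total_splits ** (1 / n_key_columns)) or 1


def old_segment_count(total_splits: int, n_key_columns: int) -> int:
    """k splits per dimension give k + 1 intervals; the mesh is their product."""
    k = old_splits_per_dimension(total_splits, n_key_columns)
    return (k + 1) ** n_key_columns


# One requested split over 20 key columns still splits every dimension once.
print(old_segment_count(1, 20))
```

Because of the `or 1` fallback, the per-dimension count never drops below 1, so the segment count never drops below 2^n.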

In the new implementation I do not split all the dimensions at once. I split the dimensions one by one, starting from the first, each by the number of splits estimated in choose_checkpoints(...). Each dimension is split with the function split_key_space(...). After each step I check whether more splitting is needed. Once there are enough segments, the splitting stops and the remaining dimensions are left unsplit. Because split_key_space() does not split a dimension when the difference between min and max is 1, at some point the first dimensions cannot be split any further, and the algorithm moves on to split the following dimensions.
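The idea can be sketched roughly as follows. This is an illustration for integer keys only, not the code from this PR: split_points is a hypothetical stand-in for split_key_space, and split_compound, splits_per_dim and max_segments are names assumed for the sketch.

```python
def split_points(lo: int, hi: int, count: int) -> list[int]:
    """Hypothetical stand-in for split_key_space: interior points of (lo, hi)."""
    size = hi - lo
    if size <= 1:
        return []  # a range of width 1 cannot be split further
    count = min(count, size - 1)
    return [lo + (size * i) // (count + 1) for i in range(1, count + 1)]


def split_compound(mins, maxs, splits_per_dim: int, max_segments: int):
    """Split dimensions left to right; stop once max_segments is reached."""
    mesh = [[lo, hi] for lo, hi in zip(mins, maxs)]  # unsplit by default
    segments = 1
    for i, (lo, hi) in enumerate(zip(mins, maxs)):
        if segments >= max_segments:
            break  # enough segments: leave the remaining dimensions unsplit
        points = split_points(lo, hi, splits_per_dim)
        mesh[i] = [lo, *points, hi]
        segments *= len(points) + 1
    return mesh


print(split_compound((1, 1, 1), (3, 3, 3), 1, 2))
print(split_compound((1, 1, 1), (2, 3, 3), 1, 2))
```

In the second call the first dimension has width 1, so it is skipped and the second dimension is split instead.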

Example: split the 3-dimensional key range from [1,1,1] to [3,3,3] into at most two segments (worst-case scenario).

Old implementation:

It splits every dimension once and creates a mesh [1,2,3], [1,2,3], [1,2,3], which splits the key space into 8 segments.
In the subsequent iterations it investigates [1,2], [1,2,3], [1,2,3] and [2,3], [1,2,3], [1,2,3] ([1,2] and [2,3] cannot be split further by split_key_space), and so on. We can see it in layers:
([1,2,3], [1,2,3], [1,2,3])
([1,2], [1,2,3], [1,2,3]) ([2,3], [1,2,3], [1,2,3])
([1,2], [1,2], [1,2,3]), ([1,2], [2,3], [1,2,3]), ([2,3], [1,2], [1,2,3]) ([2,3], [2,3], [1,2,3])
([1,2], [1,2], [1,2]), ([1,2], [1,2], [2,3]) ([1,2], [2,3], [1,2]), ([1,2], [2,3], [2,3]) ([2,3], [1,2], [1,2]), ([2,3], [1,2], [2,3]) ...

New implementation:
It splits the whole key space into at most two segments per call:
([1,2,3], [1,3], [1,3]) # the maximum of 2 segments for the whole space is reached after the first split; stop splitting until the next call
([1,2], [1,2,3], [1,3]), ([2,3], [1,2,3], [1,3]) # 2 segments reached after the second split, ...
([1,2], [1,2], [1,2,3]), ([1,2], [2,3], [1,2,3]), ([2,3], [1,2], [1,2,3]) ([2,3], [2,3], [1,2,3])
([1,2], [1,2], [1,2]), ([1,2], [1,2], [2,3]) ([1,2], [2,3], [1,2]), ([1,2], [2,3], [2,3]) ([2,3], [1,2], [1,2]), ([2,3], [1,2], [2,3]) ...
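
To compare the first layer of both variants, the meshes above can be expanded into boxes. This is only a sketch: mesh_to_segments is a hypothetical helper, not a reladiff function, and each box is written as a (min_key, max_key) pair.

```python
from itertools import product


def mesh_to_segments(mesh):
    """Expand per-dimension checkpoints into (min_key, max_key) boxes."""
    # Consecutive checkpoints in one dimension form that dimension's intervals.
    intervals = [list(zip(points, points[1:])) for points in mesh]
    return [
        (tuple(lo for lo, _ in box), tuple(hi for _, hi in box))
        for box in product(*intervals)
    ]


old_mesh = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]  # first layer, old implementation
new_mesh = [[1, 2, 3], [1, 3], [1, 3]]        # first layer, new implementation

print(len(mesh_to_segments(old_mesh)))
print(mesh_to_segments(new_mesh))
```

The old mesh expands into 8 boxes, while the new one expands into 2.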
