Description
Sometimes, loc with msk (or feature) selects different rows in the table_current and table_update frames.
Since table_update is updated by transforming values from table_current, this can cause the wrong values to be updated.
I became aware of this issue due to a seed frame with many zero values, which become non-zero after the IPF procedure.
Cause
In the Pandas version of the iteration method, a copy of the original df is created for each feature that is to be fitted in that iteration, before the fitting starts.
For each feature, each adjacent pair of these copies (referenced as table_current and table_update) in the list is then sorted by that feature column (through sort_index).
This can cause the sort order of frames to diverge with 3 or more features, specifically if there are non-unique values for the sort feature.
This in turn means that the mask for the feature values applies to different rows in the table_current and table_update frames, so the update is applied to incorrect totals. Interestingly, the solution still converges. The problem becomes apparent when some values in the seed are zero: multiplying by zero should always result in zero, but the resulting table sometimes has non-zero values in these positions.
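The effect on a zero seed cell can be sketched as follows. This is not the library's code: the total column, the scaling factor of 2, and the positional update through to_numpy() are assumptions made for illustration, and sort_index is called with kind="stable" so that tied rows keep a defined order.

```python
import pandas as pd

# Hypothetical seed with a "total" column; the cell (A=2, B=1) is zero.
seed = pd.DataFrame({
    "A": [2, 1, 2, 1],
    "B": [1, 1, 2, 2],
    "total": [0.0, 10.0, 20.0, 30.0],
})

# table_current was already sorted by A in an earlier step,
# table_update is still in the original row order.
table_current = (
    seed.set_index("A").sort_index(kind="stable")
    .reset_index().set_index("B").sort_index(kind="stable")
)
table_update = seed.set_index("B").sort_index(kind="stable")

# Positional update of the rows with B == 1 (assumed scaling factor).
factor = 2.0
scaled = table_current.loc[table_current.index == 1, "total"].to_numpy() * factor
table_update.loc[table_update.index == 1, "total"] = scaled

# Look up the cell that was zero in the seed.
result = table_update.reset_index()
zero_cell = result.loc[(result["A"] == 2) & (result["B"] == 1), "total"].iloc[0]
print(zero_cell)  # 20.0, although the seed value was 0.0
```

Both frames carry the same index values for B, so the mask selects two rows in each, but the rows sit in a different order and the scaled value of (A=1, B=1) ends up in the (A=2, B=1) cell.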
Minimal example
While not a full reconstruction of the inner workings of the method, this simple example should demonstrate when the issue occurs:
Setup
Consider a simple data frame:
import pandas as pd

df = pd.DataFrame(dict(
    A=[2, 1, 2, 1],
    B=[1, 1, 2, 2],
))

|   | A | B |
|---|---|---|
| 0 | 2 | 1 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 1 | 2 |
Both A and B consist of two levels (1 and 2), and each combination of levels occurs exactly once. Notably, A is not sorted monotonically.
Now assume IPF performs 3 (or more) per-feature fitting steps, called iterations below. (To keep the example minimal, this DF has only 2 features and no total column.) The code then copies the original df three times into the table list:
df_feature_1 = df.copy()
df_feature_2 = df.copy()
df_feature_3 = df.copy()
Iteration 1
In iteration 1, df_feature_1 is table_current, and df_feature_2 is table_update. For both, we set A as the index and sort the index:
df_feature_1 = df_feature_1.set_index('A').sort_index()
df_feature_2 = df_feature_2.set_index('A').sort_index()

Both frames look like this (where A is the index):
| A | B |
|---|---|
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
Note, however, that df_feature_3 still maintains the original sort order.
Iteration 2
Now df_feature_2 becomes table_current and df_feature_3 becomes table_update. Note how df_feature_2 and df_feature_3 have now been sorted differently. Sort both again:
df_feature_2 = df_feature_2.reset_index().set_index('B').sort_index()
df_feature_3 = df_feature_3.set_index('B').sort_index()

Now df_feature_2 (table_current) looks like this:
| B | A |
|---|---|
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
while df_feature_3 (table_update) looks like this:
| B | A |
|---|---|
| 1 | 2 |
| 1 | 1 |
| 2 | 2 |
| 2 | 1 |
Note how column A is ordered differently in the two tables, so the same row position refers to a different (A, B) combination in each.
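The walkthrough can be checked end to end with the sketch below, which also tries one possible remedy: sorting on all feature columns so that ties on the index feature are broken identically in both frames. The helper sort_with_tie_break is hypothetical and not part of the library, it assumes each feature combination occurs only once, and kind="stable" is added here so that tied rows keep a defined order.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 2, 1], "B": [1, 1, 2, 2]})

# One copy per fitting step, as in the walkthrough.
df_feature_1, df_feature_2, df_feature_3 = df.copy(), df.copy(), df.copy()

# Iteration 1: the first pair is indexed and sorted by A.
df_feature_1 = df_feature_1.set_index("A").sort_index(kind="stable")
df_feature_2 = df_feature_2.set_index("A").sort_index(kind="stable")

# Iteration 2: the second pair is indexed and sorted by B.
current = df_feature_2.reset_index().set_index("B").sort_index(kind="stable")
update = df_feature_3.set_index("B").sort_index(kind="stable")

same_index = current.index.equals(update.index)  # index values agree
current_a = current["A"].tolist()
update_a = update["A"].tolist()
rows_aligned = current_a == update_a  # but the rows behind them do not
print(same_index, current_a, update_a, rows_aligned)


def sort_with_tie_break(flat_frame, feature, all_features):
    """Hypothetical helper: sort by `feature`, then by the other features."""
    others = [f for f in all_features if f != feature]
    return flat_frame.sort_values([feature] + others).set_index(feature)


# Apply the same tie-breaking sort to both frames of the second pair.
fixed_current = sort_with_tie_break(df_feature_2.reset_index(), "B", ["A", "B"])
fixed_update = sort_with_tie_break(df_feature_3, "B", ["A", "B"])
fixed_aligned = fixed_current["A"].tolist() == fixed_update["A"].tolist()
print(fixed_aligned)
```

With the tie-break in place, the row order no longer depends on how a frame was sorted in earlier iterations, which is the property the positional update relies on.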