Pandas iteration version can cause wrong values to be updated #28

Sometimes, `loc` with `msk` (or `feature`) selects different rows in the `table_current` and `table_update` frames.

Since `table_update` is updated by transforming values from `table_current`, this can cause the wrong values to be updated.

I became aware of this issue due to a seed frame with many zero values, which become non-zero after the IPF procedure.

## Cause
In the Pandas version of the `iteration` method, a copy of the original `df` is created for each feature that is to be fitted in that iteration, before the fitting starts.

For each feature, each adjacent pair of these copies (referenced as `table_current` and `table_update`) in the list is then sorted by that feature column (through `sort_index`).

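With 3 or more features, this can cause the row order of the frames to diverge, specifically if there are non-unique values for the sort feature: the relative order of rows that tie on the sort key depends on the order they had going in, and the two frames in a pair have been sorted a different number of times.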

This in turn means that the mask for the feature values applies to different rows in the `table_current` and `table_update` frames, so the update is applied to the wrong totals. Interestingly, the procedure still converges, but the problem becomes apparent when some values in the seed are zero: multiplying by zero should always result in zero, yet the resulting table sometimes has non-zero values in these positions.
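The effect of a mask hitting different rows can be sketched in isolation. This is a minimal illustration, not the library's code: the two tiny frames, the `total` column, and `factor` are made up, and it assumes the update writes the transformed values positionally (as a plain array) into the masked rows.

```python
import pandas as pd

# Two frames holding the same rows, but in a different order.
# Both are indexed by feature B; the row with A == 1 has a zero seed.
current = pd.DataFrame(
    {"A": [1, 2], "total": [0.0, 10.0]},
    index=pd.Index([1, 1], name="B"),
)
update = pd.DataFrame(
    {"A": [2, 1], "total": [10.0, 0.0]},
    index=pd.Index([1, 1], name="B"),
)

factor = 2.0  # hypothetical scaling factor for the level B == 1

# The mask selects "all rows with B == 1" in both frames ...
msk_current = current.index == 1
msk_update = update.index == 1

# ... but the values are written by position, so each value lands on
# whichever row happens to sit at that position in `update`.
update.loc[msk_update, "total"] = (
    current.loc[msk_current, "total"].to_numpy() * factor
)

print(update)
```

In this sketch the row with `A == 1` started at zero and ends up at `20.0`, while the row with `A == 2` ends up at zero, which is the kind of symptom described above.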

## Minimal example
While not a full reconstruction of the inner workings of the method, this simple example should demonstrate when the issue occurs:

### Setup
Consider a simple data frame:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [2, 1, 2, 1],  # deliberately not sorted
    "B": [1, 1, 2, 2],
})
```

|   | A | B |
|---|---|---|
| 0 | 2 | 1 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 1 | 2 |


Both `A` and `B` have two levels (1 and 2), and each combination of levels occurs exactly once. Notably, `A` is not monotonically sorted.

Now assume IPF will perform 3 (or more) iterations (note: this DF only has 2 features and no `total` column, to keep the example minimal). The code then copies the original `df` three times into the `table` list.

```python
df_feature_1 = df.copy()
df_feature_2 = df.copy()
df_feature_3 = df.copy()
```

### Iteration 1
In iteration 1, `df_feature_1` is `table_current`, and `df_feature_2` is `table_update`. For both, we set `A` as the index and sort the index:

```python3
df_feature_1 = df_feature_1.set_index('A').sort_index()
df_feature_2 = df_feature_2.set_index('A').sort_index()
```

Both frames look like this (where `A` is the index):

| A | B |
|---|---|
| 1 | 1 | 
| 1 | 2 | 
| 2 | 1 | 
| 2 | 2 | 

Note, however, that `df_feature_3` still maintains the original row order.

### Iteration 2
Now `df_feature_2` becomes `table_current` and `df_feature_3` becomes `table_update`. Note that `df_feature_2` has already been sorted by `A`, while `df_feature_3` is still in its original order. Sort both again, this time by `B`:

```python3
df_feature_2 = df_feature_2.reset_index().set_index('B').sort_index()
df_feature_3 = df_feature_3.set_index('B').sort_index()
```

Now `df_feature_2` (`table_current`) looks like this:

| B | A |
|---|---|
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |

while `df_feature_3` (`table_update`) looks like this:

| B | A |
|---|---|
| 1 | 2 |
| 1 | 1 |
| 2 | 2 |
| 2 | 1 |

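Note that column `A` is ordered differently in the two tables, even though both have the same sorted index `B`. A mask on `B` therefore selects rows with different `A` values in each frame.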
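The two iterations can also be run end to end and checked programmatically. The sketch below follows the snippets in this example, with one deviation that is my own assumption: `kind="stable"` is passed to `sort_index` so that the order of tied rows is deterministic on every platform (the snippets above use the default sort kind). The names `same_index` and `same_rows` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 2, 1], "B": [1, 1, 2, 2]})

df_feature_2 = df.copy()
df_feature_3 = df.copy()

# Iteration 1: df_feature_2 is table_update and gets sorted by A.
# df_feature_3 is not touched yet.
df_feature_2 = df_feature_2.set_index("A").sort_index(kind="stable")

# Iteration 2: df_feature_2 is table_current, df_feature_3 is table_update.
# Both are sorted by B, but they enter the sort in different row orders.
df_feature_2 = (
    df_feature_2.reset_index().set_index("B").sort_index(kind="stable")
)
df_feature_3 = df_feature_3.set_index("B").sort_index(kind="stable")

# The index agrees, so a mask on B looks identical for both frames ...
same_index = df_feature_2.index.equals(df_feature_3.index)

# ... but the rows behind each position are not the same.
same_rows = bool(
    (df_feature_2["A"].to_numpy() == df_feature_3["A"].to_numpy()).all()
)

print("same index:", same_index)
print("same rows: ", same_rows)
```

With a stable sort the index comparison succeeds while the row comparison fails, so a stable sort on its own does not keep the two frames aligned here: tied rows keep their incoming order, and that order already differs between the frames.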





