Skip to content

Conversation

@mayer79
Copy link
Collaborator

@mayer79 mayer79 commented May 10, 2025

Implements #201

@mayer79 mayer79 self-assigned this May 10, 2025
@mayer79 mayer79 added the enhancement New feature or request label May 10, 2025
@mayer79 mayer79 marked this pull request as draft May 10, 2025 10:04
@mayer79 mayer79 changed the title Add compute_permutation_importance() Add Permutation Importance May 10, 2025
@mayer79
Copy link
Collaborator Author

mayer79 commented May 16, 2025

This is the current basic call:

import numpy as np
import polars as pl
from sklearn.linear_model import LinearRegression

from model_diagnostics.xai import plot_permutation_importance

rng = np.random.default_rng(1)
n = 1000

X = pl.DataFrame(
    {
        "area": rng.uniform(30, 120, n),
        "rooms": rng.choice([2.5, 3.5, 4.5], n),
        "age": rng.uniform(0, 100, n),
    }
)

y = X["area"] + 20 * X["rooms"] + rng.normal(0, 1, n)

model = LinearRegression()
model.fit(X, y)

_ = plot_permutation_importance(
    predict_function=model.predict,
    X=X,
    y=y,
)

image

The extended feature API allows to permute groups like this:

_ = plot_permutation_importance(
    predict_function=model.predict,
    features={"size": ["area", "rooms"], "age": "age"},
    X=X,
    y=y,
)

image

from model_diagnostics.scoring import SquaredError


def safe_copy(X):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to put safe_copy and safe_column_names into _utils.array and add tests for them. I think they cause the current CI failure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My local tests are failing for the Python 3.9 environment only (pandas and pyarrow). I will move the functions to _utils.array, draft some unit tests, and rename safe_column_names() to get_column_names().

@lorentzenchr
Copy link
Owner

This will be a great addition! Thanks @mayer79

@lorentzenchr
Copy link
Owner

The failing test is in the python 3.9 env with
numpy 1.22.0
polars 1.0.0
scipy 1.10.0
pandas 1.5.3
pyarrow 11.0.0

Could you check if increasing one of the versions fixes the problem, e.g. polars version?

@mayer79
Copy link
Collaborator Author

mayer79 commented May 24, 2025

The failing test is in the python 3.9 env with numpy 1.22.0 polars 1.0.0 scipy 1.10.0 pandas 1.5.3 pyarrow 11.0.0

Could you check if increasing one of the versions fixes the problem, e.g. polars version?

The following changes in the 3.9 env would be necessary. I don't know how much it would hurt to abandon pandas 1

  • pyarrow 11 -> 13
  • pandas 1.5 -> 2.0

I have added some additional unit tests and moved safe_copy() and get_column_names() to array.py.

@mayer79 mayer79 marked this pull request as ready for review May 29, 2025 08:37
@lorentzenchr
Copy link
Owner

fyi, CI will fail due to new versions of polars and numpy. I am working on a fix.

@lorentzenchr
Copy link
Owner

Fix in #203, you need to sync (e.g. merge) with the main branch (and maybe hatch env prune on your local machine).

Copy link
Owner

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of array utils.

Comment on lines +294 to +300
# if not x.index.is_unique:
# Pandas might error with:
# cannot reindex on an axis with duplicate labels
# Try reindexing ourselves.
x = x.reset_index(drop=True)
# if not pd_values.index.is_unique:
pd_values = pd_values.reset_index(drop=True)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you find a condition (if statement) such that this is only executed if (strictly) required?

Could you add a test case (to an existing test) or a new test that fails without this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. I think this test would fail with Pandas 1.5, because the index of the assignment is not matching:

def test_safe_assign_column_works_for_pandas_with_inconsistent_index():
    """Test that safe_assign_column works for pandas dfs regarding indices."""
    df = pd_DataFrame({"a": [0, 1, 2]}, index=[0, 1, 2])
    if isinstance(df, SkipContainer):
        pytest.skip("Module for data container not imported.")

    df = safe_assign_column(
        df, values=pd_Series([10, 20, 30], index=[1, 1, 0]), column_index=0
    )

    expected = pd_DataFrame({"a": [10, 20, 30]})
    assert_array_equal(df, expected)

But I don't know how to test because such test is skipped.

Now, when we remove pandas <2 support, safe_assign_column() does not need to be as strict as in the commited version.

@lorentzenchr
Copy link
Owner

lorentzenchr commented Jul 17, 2025

fyi: I am preparing to bump the minimum versions of python to 3.11 and numpy to 2. This implies polars 1.1.0, pandas >= 2.2.2 and pyarrow >= 16, see #206.

@mayer79
Copy link
Collaborator Author

mayer79 commented Jul 26, 2025

I have modified these aspects in the main functionality:

  • compute_permutation_importance() now returns both score differences and score ratios
  • Instead of standard deviations, the function returns standard errors
  • The plot function has received an argument which="difference" to select if score differences or ratios are to be plotted.
  • By default, the plot function shows approximate 95% CIs. The API as in your other functions, i.e., when confidence_level=0, no error bars are plotted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants