-
-
Couldn't load subscription status.
- Fork 1.2k
Fix drop_sel for a MultiIndex
#10863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Most labels passed into `drop_sel` can be handled by the underlying libraries, and will covert to an array as the current implementation does. xr.DataArray is a special case that is supported as set of labels but doesn't interact well with pandas coversion to an array.
|
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. |
| Data variables: | ||
| A (x, y) int64 32B 0 2 3 5 | ||
| """ | ||
| from xarray.core.dataarray import DataArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've imported DataArray here as that seems to fit with the style in other parts of the codebase (here), however I'm not entirely sure why this is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that dataarray.py imports dataset.py already (historically Dataset predates DataArray slightly), so this avoids a recursive import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for the explanation👍
|
I don't know this area that well, though it does seem to solve the immediate problem. would anyone have alternative approaches that get at the root of the issue? otherwise I would suggest merging this, maybe with a comment explaining, and then nothing prevents refactoring later... (adding @benbovy , the indexing Czar, though others will also know more) |
I have to admit the same! To try and provide more context I've dived down the git blame rabbit hole to explain the original reason for the cast to an array type within the code. It was introduced in this PR #3177. The PR intorduces a whole bunch of type hints for the old |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @owena11 !
| Data variables: | ||
| A (x, y) int64 32B 0 2 3 5 | ||
| """ | ||
| from xarray.core.dataarray import DataArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that dataarray.py imports dataset.py already (historically Dataset predates DataArray slightly), so this avoids a recursive import.
| # Most conversion to arrays is better handled in the indexer, however | ||
| # DataArrays are a special case where the underlying libraries don't provide | ||
| # a good conversition. | ||
| if isinstance(labels_for_dim, DataArray): | ||
| labels_for_dim = np.asarray(labels_for_dim) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wanted to make this a little safer, could add:
| # Most conversion to arrays is better handled in the indexer, however | |
| # DataArrays are a special case where the underlying libraries don't provide | |
| # a good conversition. | |
| if isinstance(labels_for_dim, DataArray): | |
| labels_for_dim = np.asarray(labels_for_dim) | |
| # Most conversion to arrays is better handled in the indexer, however | |
| # DataArrays are a special case where the underlying libraries don't provide | |
| # a good conversition. | |
| if isinstance(labels_for_dim, DataArray): | |
| if labels_for_dim.dims not in ((), (dim,)): | |
| raise ValueError( | |
| "cannot use drop_sel() with DataArray values with " | |
| "along dimensions other than the dimensions being " | |
| f"indexed along: {labels_for_dim}" | |
| ) | |
| labels_for_dim = np.asarray(labels_for_dim) |
But this LGTM to me! Definitely an incremental improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit and then reverted, highlighed by the tests it might break peoples current usage where a DataArray gets assigned the default dim names (i.e dim_0 etc) .
Also despite thinking this would nudge sel and drop_sel to be more consistent. After checking neither enforcing the alignment between dims for selecting with a DataArray:
>>> data = xr.Dataset({"x": ["a", "b"]})
>>> data.sel(x=xr.DataArray(["a",], dims=("y",)))
<xarray.Dataset> Size: 4B
Dimensions: (y: 1)
Coordinates:
x (y) <U1 4B 'a'
Dimensions without coordinates: y
Data variables:
*empty*So would propose leaving this change for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, sel() imposes the dimensions and coordinates of the indexer rather than checking for alignment. It is not obvious to me what the inverse of that would be!
Add additional validation for `drop_sel` such that when selecting with a DataArray the dimensions must be named consistently between the DataArray and the dimension you're dropping the selection from. This matches the behaviour with `sel`. Co-authored-by: Stephan Hoyer <shoyer@google.com>
for more information, see https://pre-commit.ci
This reverts commit ca372a1.
This reverts commit 293fa1f.
|
Thanks for the time reviewing @shoyer & @max-sixty! |

This is a proposed fix to #10862 to allow consistent usage between
selanddrop_selfor MultiIndexes. The current implementation seems to convert the labels to an array to handle the edge case of labels of typexr.DataArray. This change makes that edge case more explict and delegates the reponsibility for converting to an array if needed down to the indexer.Exisitng indexes already do this and so the change shouldn't have any preformance implications:
pandas.Index.drop- herepandas.MultiIndex.drop- hereCloses Differing behaviour between
selanddrop_selfor MultiIndexes #10862Tests added