fix: Ignore dy.Any columns in Schema.cast#315
Conversation
8114fec to
378f7c1
Compare
There was a problem hiding this comment.
Pull request overview
This PR fixes issue #314 by preventing dy.Any columns from being cast during schema validation. The Schema.cast method now skips casting for dy.Any columns since they should accept any data type, rather than attempting to cast to their default pl.Null() type. This resolves the error that occurred when roundtripping collections with parquet files containing dy.Any columns.
Changes:
- Modified
Schema.castmethod to check if a column is of typedy.Anyand skip casting for such columns - Added import for the
Anycolumn type asAnyColumnin schema.py - Added a test to verify that casting a DataFrame with an
Anycolumn preserves the original dtype
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
dataframely/schema.py |
Added import for Any column type and modified cast method to skip casting for dy.Any columns |
tests/column_types/test_any.py |
Added test verifying that Schema.cast preserves dtype for dy.Any columns |
Comments suppressed due to low confidence (1)
tests/column_types/test_any.py:5
- An extra blank line is added after the license header (line 4), creating two blank lines between the license header and imports. This doesn't match the convention in other test files (e.g., test_string.py) which have only one blank line. Consider removing the extra blank line to maintain consistency.
from typing import Any
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #315 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 56 56
Lines 3218 3234 +16
=========================================
+ Hits 3218 3234 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Andreas Albert (AndreasAlbertQC)
left a comment
There was a problem hiding this comment.
Thanks for finding this and providing a fix! I'd just like to refactor it a little
| lf = df.lazy().select( | ||
| pl.col(name).cast(col.dtype) for name, col in cls.columns().items() | ||
| # Skip casting for Any columns since they accept any type | ||
| pl.col(name) if isinstance(col, AnyColumn) else pl.col(name).cast(col.dtype) |
There was a problem hiding this comment.
Instead of building special treatment for Any here, how about we move ownership of casting into the Column itself? I.e. Column gets a cast method, and the default implementation is:
def cast(self, col: pl.Expr) -> pl.Expr:
return col.cast(self.dtype)In Any, we then implement the override:
def cast(self, col: pl.Expr) -> pl.Expr:
return colI think this would be neat because you never have to think about special casting logic outside the column implementations themselves
There was a problem hiding this comment.
I was thinking about this solution as well, and then I thought about the dy.Integer column. How should we manage this case ? Maybe this Column.cast function should take as well the type of the input expression, meaning that we need to wrapped it with pipe_with_schema.
I am away from the computer right now, I can have a deeper look on Tuesday.
Motivation
Fixes: #314
Changes
Schema.castignores thedy.Anycolumns.