Skip to content

Conversation

@Christos-Hadjinikolis
Copy link
Collaborator

@Christos-Hadjinikolis Christos-Hadjinikolis commented Apr 9, 2025

🚀 Migrate WithS3File to AWS Data Wrangler

Links

RND-13624

Summary

This PR refactors the WithS3File mixin to use [awswrangler](https://github.com/aws/aws-sdk-pandas) for all S3 I/O operations (CSV, JSON, Parquet, and HDF), removing reliance on local temp files, boto3, and the AWS CLI. HDF files are now streamed in-memory using boto3.client().download_fileobj and PyTables.

Key Changes

  • ✅ Migrated all S3 reads/writes to use awswrangler:
    • CSV, JSON, and Parquet are handled via direct wr.s3.read_*/to_* calls.
    • HDF files are streamed to memory using BytesIO.
  • ✅ Refactored I/O logic into _read_s3_{type}_file / _write_s3_{type}_file methods.
  • ✅ Updated allow_options decorator to:
    • Filter passthrough kwargs per backend (pandas, pyarrow, wrangler, etc).
    • Support argument containers like pandas_kwargs, pyarrow_additional_kwargs, s3_additional_kwargs.
  • ✅ Blocked automatic datetime parsing in JSON readers:
    • convert_dates=True is ignored to ensure consistency.
  • ✅ Introduced support for single-record JSON objects:
    • Added single_record=True option to force wrapping a single JSON object into a tabular DataFrame.
    • Detected and warned on suspiciously structured inputs (e.g. when pandas misparses a dict as a 1-column/2-row DataFrame).
  • ✅ Ensured full consistency between pandas.read_json and awswrangler.read_json for orient="records" and lines=True usage.
  • 🧪 Extended and updated test coverage, including:
    • Ensuring only schema-specified columns are loaded.
    • Verifying column order is preserved.
    • Warning is raised for likely single-record JSONs when option is missing.
    • Confirming timestamps are not auto-parsed.

Benefits

  • 🚀 Improved performance and native AWS integration via awswrangler.
  • 🧼 Cleaner, more maintainable, and safer I/O logic.
  • 🧪 Easier to test (no tmp files, no shell commands).
  • 🔍 Transparent, configurable behavior for edge cases like JSON formatting.

Notes

  • Legacy S3 logic (_s3_reader, _s3_writer, _s3_named_file_reader) has been fully removed.
  • awswrangler is now a required dependency.
  • Developers can enable consistent behavior for single-record JSON inputs via single_record=True.

@Christos-Hadjinikolis Christos-Hadjinikolis changed the title Rnd 13624 feat(RND-13624): Migrate WithS3File to AWS Data Wrangler Apr 9, 2025
@codecov
Copy link

codecov bot commented Apr 9, 2025

Codecov Report

Attention: Patch coverage is 91.25000% with 14 lines in your changes missing coverage. Please review.

Project coverage is 89.40%. Comparing base (e1f5c5f) to head (9cb7282).

Files with missing lines Patch % Lines
dynamicio/mixins/with_local.py 64.28% 5 Missing and 5 partials ⚠️
dynamicio/mixins/with_s3.py 96.29% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
+ Coverage   88.21%   89.40%   +1.18%     
==========================================
  Files          16       16              
  Lines        1256     1321      +65     
  Branches      142      153      +11     
==========================================
+ Hits         1108     1181      +73     
+ Misses        110       98      -12     
- Partials       38       42       +4     
Flag Coverage Δ
unittests 89.40% <91.25%> (+1.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@@ -1,4 +1,6 @@
"""Invokes dynamicio cli."""

# Application Imports
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Black does this now...

Comment on lines +22 to +30
"""Handles I/O operations for AWS Athena.
Note:
The `__abstractmethods__ = frozenset()` is used to silence false positives from pylint,
which might wrongly assume this class has abstract methods due to NotImplementedError.
This class is *not* an abstract base class and does not use `abc.ABC`.
"""

__abstractmethods__ = frozenset()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes a pylint false positive on abstract methods not implemented...

Comment on lines +142 to +153
# 🧼 Check if this is a single-record json file
if is_single_record:
# Re-wrap as single dict row — i.e., rehydrate the record
df = pd.DataFrame([{df.columns[0]: dict(zip(df.index, df.iloc[:, 0]))}])
elif (
df.shape[1] == 1
and df.columns.dtype == "object"
and df.index.dtype == "object"
and all(isinstance(i, str) for i in df.index)
and all(isinstance(v, (str, int, float, bool, type(None))) for v in df.iloc[:, 0])
):
logger.warning("[local-json-read] File appears to be a single-record JSON object. Pass 'single_record=True' in options to handle this case.")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Necessary for alignment with wrangler.read_json() behaviour.

@Christos-Hadjinikolis Christos-Hadjinikolis marked this pull request as draft April 23, 2025 11:04
@sonarqubecloud
Copy link

Copy link

@yuan-vortexa yuan-vortexa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@matteo-pallini matteo-pallini removed their request for review June 4, 2025 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants