
Refactor mdf_reader to use polars #213

Draft
jtsiddons wants to merge 50 commits into glamod:main from jtsiddons:polarising


@jtsiddons
Collaborator

@jtsiddons jtsiddons commented Jan 28, 2025

An in-progress rewrite of the pandas components of mdf_reader in polars. This could improve performance, in both memory usage and speed.

Todo:

  • Match Unit Tests
  • The current implementation uses pandas.MultiIndex for the columns, allowing core.YR access. Polars does not support multi-level column names.
  • Handling of missing_values and fields to ignore
  • Remove re-conversion to pandas.DataFrame
  • Allow chunking
  • Decoding Step
  • Conversion Step
  • Validation Step

@jtsiddons
Collaborator Author

@ludwiglierhammer: Do I need to add polars to the CI requirements/environment files?

@ludwiglierhammer
Collaborator

> @ludwiglierhammer: Do I need to add polars to the CI requirements/environment files?

Yes please.

@github-actions github-actions bot added the CI label Jan 28, 2025
@github-actions github-actions bot added the docs label Jan 30, 2025
@ludwiglierhammer
Collaborator

@jtsiddons: In general, we have to read the input data line by line, as the sentinels etc. can vary from line to line. I just wanted to let you know before you put more time and effort into it.

@jtsiddons
Collaborator Author

> @jtsiddons: In general, we have to read the input data line by line, as the sentinels etc. can vary from line to line. I just wanted to let you know before you put more time and effort into it.

Using polars, I scan the column for the sentinel. If the sentinel is present, I split the section out as a single column (values are None on lines where the sentinel is absent). I then slice sequentially into the section column, splitting out each of the fields.

If the sentinel is None or "", the section is assumed to always be present, and the slicing/splitting is performed on every line.

If a sentinel is defined but missing from a line, no splitting is performed on that line's values. Slicing on None results in two values of None.

So the operations are only performed on the lines they apply to.

@ludwiglierhammer
Collaborator

@jtsiddons: I did some code restructuring to unify code snippets in mdf_reader and cdm_mapper. Unfortunately, this created some merge conflicts in this PR. Could you resolve them, or should we try to fix them together? These conflicts can only be resolved using the command line.

@jtsiddons
Collaborator Author

> @jtsiddons: I did some code restructuring to unify code snippets in mdf_reader and cdm_mapper. Unfortunately, this created some merge conflicts in this PR. Could you resolve them, or should we try to fix them together? These conflicts can only be resolved using the command line.

No worries - I can fix the conflicts. I am working on other things this week so I'll resolve them on Monday morning.

@jtsiddons
Collaborator Author

@ludwiglierhammer: I have moved the decode, convert, and validate steps into a _read_loop method, so only a single TextParser loop is required. This avoids a significant refactor of the Configurator class to allow for conversion/validation in the open_* methods.

@jtsiddons
Collaborator Author

jtsiddons commented Mar 28, 2025

Current status: the validators are refactored for polars.DataFrame, and the mask is now generated in the Configurator.open_* methods and passed as input to the validator functions.

Next steps:

  • Ensure that the mask values are correctly assigned for missing sections/disabled sections/ignored sections/missing values
  • Ensure polars -> pandas type conversion is correct (note: pandas has two int64 types: int64 is not nullable, Int64 is nullable)
  • Ensure column names are not converted to tuples when the name has only one component.

edit: add extra todo step

@jtsiddons
Collaborator Author

This may be of some use here: https://narwhals-dev.github.io/narwhals/

It allows for interoperability between Python DataFrame libraries (e.g. polars and pandas).
