Refactor `mdf_reader` to use `polars` by jtsiddons · Pull Request #213 · glamod/cdm_reader_mapper

jtsiddons · 2025-01-28T07:33:59Z

An in-progress rewrite of the pandas components of mdf_reader in polars. This could improve performance, both in memory usage and in speed.

Todo:

- Match Unit Tests
- Current implementation uses pandas.MultiIndex for the columns, allowing core.YR access. Polars does not support this behaviour.
- Handling of missing_values and fields to ignore
- Remove re-conversion to pandas.DataFrame
- Allow chunking
- Decoding Step
- Conversion Step
- Validation Step
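
The MultiIndex item above can be illustrated with a small sketch. It assumes that two-level column names are flattened into single strings, since polars column names must be strings, and uses ":" as the delimiter, as in the commit "fix: use ":" as delimiter in column names" later in this PR. The helper names `flatten_columns` and `restore_column` are hypothetical and are not taken from the PR.

```python
import pandas as pd

DELIM = ":"  # delimiter named in the commit "fix: use ":" as delimiter in column names"


def flatten_columns(columns):
    """Join tuple column names into single strings (hypothetical helper)."""
    return [DELIM.join(col) if isinstance(col, tuple) else col for col in columns]


def restore_column(name):
    """Split a flattened name back into a tuple.

    Names with only one component stay plain strings rather than
    becoming one-element tuples.
    """
    parts = tuple(name.split(DELIM))
    return parts if len(parts) > 1 else parts[0]


cols = pd.MultiIndex.from_tuples([("core", "YR"), ("core", "MO"), ("c1", "BSI")])
flat = flatten_columns(cols)
restored = [restore_column(name) for name in flat]
```

With this scheme, `core.YR` style access is replaced by selecting the string column `"core:YR"`.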

jtsiddons · 2025-01-28T07:36:51Z

@ludwiglierhammer: Do I need to add polars to the ci requirements/environment files?

ludwiglierhammer · 2025-01-28T08:01:28Z

> @ludwiglierhammer: Do I need to add polars to the ci requirements/environment files?

Yes please.

ludwiglierhammer · 2025-02-10T12:47:09Z

@jtsiddons: In general, we have to read the input data line by line, as the sentinels etc. can vary from line to line. I just wanted to let you know, so that you are not putting time and effort into it unnecessarily.

jtsiddons · 2025-02-10T12:58:25Z

> @jtsiddons: In general, we have to read the input data line by line, as the sentinels etc. can vary from line to line. I just wanted to let you know, so that you are not putting time and effort into it unnecessarily.

Using polars, I scan the column for the sentinel. I then separate the section (as a single column) if the column is present (values are None if the sentinel is not present). I then sequentially slice into the section column, splitting out all of the fields.

If the sentinel is None or "", the section is assumed to always be present, and the slicing/splitting is performed.

If the sentinel is otherwise missing, no splitting on values is performed: slicing on None results in two values of None.

The operations are performed on the correct lines.

ludwiglierhammer · 2025-02-12T12:49:10Z

@jtsiddons: I did some code restructuring to unify code snippets in mdf_reader and cdm_mapper. Unfortunately, this created some merge conflicts in this PR. Could you resolve them, or should we try to fix them together? These conflicts can only be resolved using the command line.

jtsiddons · 2025-02-12T13:06:52Z

> @jtsiddons: I did some code restructuring to unify code snippets in mdf_reader and cdm_mapper. Unfortunately, this created some merge conflicts in this PR. Could you resolve them, or should we try to fix them together? These conflicts can only be resolved using the command line.

No worries - I can fix the conflicts. I am working on other things this week so I'll resolve them on Monday morning.

jtsiddons · 2025-03-27T14:44:12Z

@ludwiglierhammer: I have moved the decode/convert and validate steps into a _read_loop method, meaning that only a single TextParser loop is required. This avoids a significant refactor of the Configurator class to allow for conversion/validation in the open_* methods.
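
The single-loop idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's `_read_loop`: the stand-alone function `read_loop`, the `steps` argument and the toy step functions are hypothetical, and a chunked `pandas.read_csv` reader stands in for the TextParser.

```python
import io

import pandas as pd


def read_loop(parser, steps):
    """Apply every processing step to each chunk in one pass over the parser."""
    processed = []
    for chunk in parser:
        for step in steps:  # e.g. decode, convert, validate
            chunk = step(chunk)
        processed.append(chunk)
    return pd.concat(processed, ignore_index=True)


def convert(chunk):
    """Toy conversion step: scale the raw values."""
    return chunk.assign(a=chunk["a"] * 2)


def validate(chunk):
    """Toy validation step: flag values above a threshold."""
    return chunk.assign(ok=chunk["a"] > 2)


parser = pd.read_csv(io.StringIO("a\n1\n2\n3\n"), chunksize=2)
result = read_loop(parser, [convert, validate])
```

Running all steps inside one loop avoids writing intermediate results out and re-reading them between steps.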

jtsiddons · 2025-03-28T14:47:20Z

Current status: refactored the validators for polars.DataFrame. The mask is now generated in the Configurator.open_* methods and passed as input to the validator functions.

Next steps:

- Ensure that the mask values are correctly assigned for missing sections/disabled sections/ignored sections/missing values
- Ensure polars -> pandas type conversion is correct (note: pandas has two int64 types: int64 is not nullable, Int64 is nullable)
- Ensure column names are not converted to tuple if only one component to the name.

edit: add extra todo step

jtsiddons · 2025-04-28T06:13:57Z

May be of some use for this: https://narwhals-dev.github.io/narwhals/

It allows for interoperability between Python DataFrame libraries (e.g. polars and pandas).

jtsiddons added 3 commits January 27, 2025 15:26

deps: add polars

1c0a2e8

refactor!: add functions for reading sections using polars

90aedfb

chore: remove print

9cca765

github-actions bot added mdf_reader release labels Jan 28, 2025

[pre-commit.ci] auto fixes from pre-commit.com hooks

8bc58b1

for more information, see https://pre-commit.ci

deps: add polars to ci requirements/environment files

6ee5757

jtsiddons added 3 commits January 28, 2025 09:23

fix: typo in log message

9c98891

refactor: missing values are only assigned for missing sections

f286232

ignore: add uv.lock

c129c9b

github-actions bot added the CI label Jan 28, 2025

jtsiddons added 10 commits January 29, 2025 12:09

refactor: return polars Frame after reading netCDF

58bb58d

fix: use ":" as delimiter in column names

3f7f5a5

refactor: use polars operations. Remove chunksize option for polars.

9263d4e

refactor: set_missing_values to polars

e8b1219

feat: add polars dtypes

61a5216

refactor: decoders into polars

476e6c4

refactor: converters to polars

f8a78b7

refactor: remove chunk looping from convert_and_decode_entries

0ee0297

refactor: update properties to reflect polars, minor corrections

2eca5d0

docs: update and run example notebook

3d12a12

github-actions bot added the docs label Jan 30, 2025

pre-commit-ci bot and others added 6 commits January 30, 2025 08:48

[pre-commit.ci] auto fixes from pre-commit.com hooks

bfa2bbd

fix: simplify convert_dtype_to_default

25cd538

fix: replace todo comment with note

8df5370

fix: ensure output polars frame has index when reading sections from netCDF

a96ef2b

chore: ruff linter fixes

b1e7f5c

fix: get fields after checking if disable_read is True

1f65f8c

pre-commit-ci bot and others added 4 commits January 31, 2025 09:35

[pre-commit.ci] auto fixes from pre-commit.com hooks

9af7950

chore: remove unused function

d662b84

Merge branch 'main' into polarising

fc6e37a

[pre-commit.ci] auto fixes from pre-commit.com hooks

a7b4a87

Merge branch 'main' into polarising

ddfaa8b

jtsiddons added 4 commits March 27, 2025 07:28

Merge branch 'main' into polarising

b1219c7

refactor: use pandas to read, open_with -> format with 'text', 'netcdf' options

3b18679

refactor: Configurator open_ methods now return two polars Frames. Rename open_polars -> open_text.

fb75779

refactor: perform all read steps in one loop rather than repeatedly save/write out to StringIO

60e45e0

jtsiddons added 13 commits March 27, 2025 14:46

fix: remove duplicate "widths" argument being passed to _read_text method

d84e119

fix: set column name for full-string read by read_text

b738f2c

fix: correct call to get field name from _get_index

97e60fd

fix: cast binary to string

b59ccd5

fix: correct column name indexing and naming for mask

219f75a

fix: drop section from data if delimited

e3d450c

fix: add row index at return

e38b9a1

opt: use tail rather than slice

536406c

fix: don't convert to polars in read_loop

d905c23

fix: following polars method column name

2b077c5

fix: don't add index to data and mask polars frames

e849ff3

refactor(validators)!: polarise validators, pass mask as first argument

90fd5d9

chore: remove debug print statement

4973677

fix: reduce complexity, handle explicit None in schema for numeric bounds

56ab3fe

jtsiddons wants to merge 50 commits into glamod:main from jtsiddons:polarising
