
Memory optimization NOMAD nexus parser#750

Draft
mkuehbach wants to merge 8 commits into master from mem_optimization_parsing

Conversation

@mkuehbach
Collaborator

@mkuehbach mkuehbach commented Mar 17, 2026

Addressing one aspect of issue #737:

Currently, most datasets written by pynxtools plugins use the simple contiguous storage layout. Upon parsing, these are loaded fully into main memory, unpacked at once via hdf_node[...] if their dtype kind is one of iufc.
If the chunked storage layout is used, iterating over chunks respectively hyperslabs is not taken advantage of either.
Consequently, the old implementation unnecessarily loads entire datasets into main memory instead of processing them chunk by chunk. For large datasets, e.g. image and spectra stacks, the impact is significant and on laptops potentially a deal breaker: we should keep a flat RAM usage profile of a few MiB per chunk rather than provoke GiB-scale spikes that may exceed even the system's maximum RAM. Such spikes are particularly nasty on hosts shared by multiple users.
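A minimal sketch of the chunk-by-chunk alternative, using h5py's `Dataset.iter_chunks()`; the function name `chunked_min_max` and the demo dataset are illustrative, not the parser's actual code:

```python
import h5py
import numpy as np

def chunked_min_max(dset: h5py.Dataset):
    """Stream per-chunk min/max so only one chunk is resident at a time."""
    if dset.chunks is None:
        # contiguous layout: no chunk iteration available, fall back to full read
        block = dset[...]
        return block.min(), block.max()
    gmin, gmax = np.inf, -np.inf
    for slc in dset.iter_chunks():  # yields one slice tuple per stored chunk
        block = dset[slc]           # loads only this chunk into memory
        gmin = min(gmin, block.min())
        gmax = max(gmax, block.max())
    return gmin, gmax

# demo on an in-memory file (core driver, nothing written to disk)
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("data", data=np.arange(100.0).reshape(10, 10),
                            chunks=(4, 4))
    mn, mx = chunked_min_max(dset)  # mn == 0.0, mx == 99.0
```

The peak extra memory is then one chunk (a few MiB with sensible chunk shapes) instead of the whole dataset.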

Pitfalls: np.mean has optimized numerics (not only for speed but also to compensate precision loss); with chunking one may need one of https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

  • Implementation
  • Testing individual parts, specifically the Welford part
  • Testing in production
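For the Welford part mentioned above, a minimal single-pass sketch following the linked Wikipedia article (illustrative only, not this PR's implementation):

```python
import math

class Welford:
    """Single-pass running mean and population standard deviation."""

    def __init__(self):
        self.n = 0       # number of values seen
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # sum of squared deviations from the running mean

    def update(self, x: float):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the *updated* mean

    @property
    def std(self) -> float:
        # divides by n, i.e. ddof=0 / population standard deviation
        return math.sqrt(self.m2 / self.n) if self.n else float("nan")

w = Welford()
for x in (1.0, 0.0, 0.0):  # example values
    w.update(x)
# w.mean == 1/3, w.std == sqrt(2/9)
```

Note the per-element `update` call: this is exactly the non-vectorizable inner loop discussed below.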

…istics, for chunked need to compute by hand, take care for complex values
@mkuehbach mkuehbach changed the title Memory optimization parsing Memory optimization NOMAD nexus parser Mar 17, 2026
…e Welford needs incremental updates which won't end up getting vectorized. The only reason this comes up at all is that we wish to report the mean value (and show the population standard deviation). We should consider keeping min and max and just picking the first value; that would avoid using Welford and the like and ought to speed up the parsing significantly. The mean of 0.333 and the stdev of a coordinate-axis value triplet 1, 0, 0 are of limited value anyway. I understood @rettigl such that he uses summary stats for navigation; this might need discussion to find a better compromise. Mind that np.mean and np.std, even on a contiguous array, are not without imprecision, and in edge cases may even be weaker than Welford. I am supportive, though, of the view that for something that is just a value to show in the NOMAD GUI we should not invoke a potentially very costly algorithm. I kept both the incremental (non-vectorized) and the numpy batch implementation to support the discussion. My preference is to remove mean and stdev altogether; if people are interested in them, they should compute them in the pynxtools plugin, where the dataset is in main memory at some point anyway. We should not abuse the parser for these computations, as with the contiguous storage layout they currently create memory consumption spikes.
@mkuehbach
Collaborator Author

Current practice is that non-scalar iuf h5py.Datasets in NOMAD get the mean value as the field value. This is a convention going back to a suggestion from @sanbrock on how to reduce the number of Metainfo instances per NeXus concept when registering NeXus h5py.Dataset instances in Metainfo.

Also min, max, and the population standard deviation were computed (np.std is parameterized with ddof=0 by default). I would like to discuss whether we could live without mean and stdev and instead keep, of course, min, max, size, and ndim, and arbitrarily set the first value of the array in Metainfo.
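To make the ddof point concrete, NumPy's default behaves as follows:

```python
import numpy as np

# np.std defaults to ddof=0, i.e. the population standard deviation
# (divide by N); ddof=1 gives the sample estimate (divide by N - 1).
# np.mean takes no ddof argument at all.
x = np.array([1.0, 0.0, 0.0])
population_std = np.std(x)       # sqrt(2/9) ~ 0.4714
sample_std = np.std(x, ddof=1)   # sqrt(1/3) ~ 0.5774
```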

For complex non-scalar datasets we have no statistics at the moment anyway.

The motivation is that there is a tradeoff when reading efficiently from chunks: mean and stdev then need to be computed incrementally, and that is not trivially vectorizable, in particular not if we wish to stick with Python code.
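One way to keep the per-chunk work vectorized, with only the merge step in Python, is the pairwise-combination variant from the variance-algorithms article linked above; a hypothetical sketch, not code from this PR:

```python
import numpy as np

def chunk_stats(chunk):
    """Vectorized per-chunk reduction to a (count, mean, M2) triple."""
    c = np.asarray(chunk, dtype=float)
    return c.size, c.mean(), float(((c - c.mean()) ** 2).sum())

def combine(a, b):
    """Merge two (count, mean, M2) triples (Chan et al. update rule)."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

stats = chunk_stats([1, 2, 3])
for chunk in ([4, 5], [6, 7]):
    stats = combine(stats, chunk_stats(chunk))
n, mean, m2 = stats
# n == 7, mean == 4.0, population std == sqrt(m2 / n) == 2.0
```

The Python-level merge then runs once per chunk rather than once per element, which may soften but does not remove the cost argument above.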

I suggest that we rather offload all the summary statistics to the plugin and attach these as arrays if required.

That would offer those folks who wish to have a highly numerically accurate mean and std the option to compute them themselves, with the parser ending up doing just a lookup. In the parser, this computation was very costly for the contiguous storage layout and does not get cheaper when using chunked storage.

I also would like to motivate everybody to rather export their non-scalar arrays using the chunked storage layout.

Mind that this does not mean that you need to apply any compression. Chunking is a necessary condition for using HDF5 compression filters, but chunking alone does not imply compression.
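For illustration, enabling the chunked layout in h5py takes only the `chunks` keyword; a compression filter would be a separate, optional `compression=...` argument on top. Dataset name and shape below are placeholders, and the in-memory core driver is used only so the sketch leaves no file behind:

```python
import h5py
import numpy as np

with h5py.File("stack.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "entry/data/images",
        shape=(100, 512, 512),
        dtype=np.uint16,
        chunks=(1, 512, 512),  # one image per chunk
    )
    layout = dset.chunks       # (1, 512, 512)
    filt = dset.compression    # None: chunking does not imply compression
```

A chunk shape of one image per chunk matches the per-image access pattern of stacks and keeps the per-chunk footprint at a few hundred KiB here.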

Using chunking would enable us to reduce memory consumption peaks especially when we could agree to avoid computing mean and std in the parsing stage.

Not urgent but thoughts would be appreciated @sherjeelshabih @rettigl @lukaspie @RubelMozumder @sanbrock
