
Memory optimization NOMAD nexus parser#750

Draft
mkuehbach wants to merge 8 commits into master from mem_optimization_parsing

Conversation

@mkuehbach
Collaborator

@mkuehbach mkuehbach commented Mar 17, 2026

Addressing one aspect of issue #737:

Currently, most datasets written by pynxtools plugins use the simple contiguous storage layout. Upon parsing, these are loaded fully into main memory, unpacked at once via hdf_node[...] if their dtype kind is one of iufc.
If the chunked storage layout is used, iterating over chunks respectively hyperslabs is not taken advantage of either.
Consequently, the old implementation unnecessarily loads entire datasets into main memory instead of processing them chunk by chunk. For large datasets, e.g. image and spectra stacks, the impact is significant and on laptops potentially a deal breaker: we should keep a flat RAM usage profile of a few MiB per chunk rather than provoke GiB-scale spikes that may exceed even the system's maximum RAM. Such spikes are particularly nasty on hosts shared by multiple users.
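A minimal sketch of the chunk-by-chunk alternative, using h5py's `Dataset.iter_chunks()`; the function name `chunked_min_max` and the demo dataset are illustrative, not the parser's actual code:

```python
import h5py
import numpy as np

def chunked_min_max(dset: h5py.Dataset):
    """Stream per-chunk min/max so only one chunk is resident at a time."""
    if dset.chunks is None:
        # contiguous layout: no chunk iteration available, fall back to full read
        block = dset[...]
        return block.min(), block.max()
    gmin, gmax = np.inf, -np.inf
    for slc in dset.iter_chunks():  # yields one slice tuple per stored chunk
        block = dset[slc]           # loads only this chunk into memory
        gmin = min(gmin, block.min())
        gmax = max(gmax, block.max())
    return gmin, gmax

# demo on an in-memory file (core driver, nothing written to disk)
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("data", data=np.arange(100.0).reshape(10, 10),
                            chunks=(4, 4))
    mn, mx = chunked_min_max(dset)  # mn == 0.0, mx == 99.0
```

The peak extra memory is then one chunk (a few MiB with sensible chunk shapes) instead of the whole dataset.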

Pitfalls: np.mean has optimized numerics (not only for speed but also to compensate precision loss); with chunking one may need one of https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

  • Implementation
  • Testing individual parts, specifically the Welford part
  • Testing in production
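For the Welford part mentioned above, a minimal single-pass sketch following the linked Wikipedia article (illustrative only, not this PR's implementation):

```python
import math

class Welford:
    """Single-pass running mean and population standard deviation."""

    def __init__(self):
        self.n = 0       # number of values seen
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # sum of squared deviations from the running mean

    def update(self, x: float):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the *updated* mean

    @property
    def std(self) -> float:
        # divides by n, i.e. ddof=0 / population standard deviation
        return math.sqrt(self.m2 / self.n) if self.n else float("nan")

w = Welford()
for x in (1.0, 0.0, 0.0):  # example values
    w.update(x)
# w.mean == 1/3, w.std == sqrt(2/9)
```

Note the per-element `update` call: this is exactly the non-vectorizable inner loop discussed below.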

…istics, for chunked need to compute by hand, take care for complex values
@mkuehbach mkuehbach changed the title Memory optimization parsing Memory optimization NOMAD nexus parser Mar 17, 2026
…e Welford needs incremental updates which won't end up getting vectorized. The only reason this comes up at all is that we wish to report the mean value (and show the population standard deviation). We should consider keeping min and max and just picking the first value; that would avoid using Welford and the like and ought to speed up the parsing significantly. The mean of 0.333 and the stdev of a coordinate-axis value triplet 1, 0, 0 are of limited value anyway. I understood @rettigl such that he uses summary stats for navigation; this might need discussion to find a better compromise. Mind that np.mean and np.std, even on a contiguous array, are not without imprecision, and in edge cases may even be weaker than Welford. I am supportive, though, of the view that for something that is just a value to show in the NOMAD GUI we should not invoke a potentially very costly algorithm. I kept both the incremental (non-vectorized) and the numpy batch implementation to support the discussion. My preference is to remove mean and stdev altogether; if people are interested in them, they should compute them in the pynxtools plugin, where the dataset is in main memory at some point anyway. We should not abuse the parser for these computations, as with the contiguous storage layout they currently create memory consumption spikes.
@mkuehbach
Collaborator Author

Current practice is that non-scalar iuf h5py.Datasets in NOMAD get the mean value as the field value. This is a convention going back to a suggestion from @sanbrock on how to reduce the number of Metainfo instances per NeXus concept when registering NeXus h5py.Dataset instances in Metainfo.

Also min, max, and the population standard deviation were computed (np.std is parameterized with ddof=0 by default). I would like to discuss whether we could live without mean and stdev and instead keep, of course, min, max, size, and ndim, and arbitrarily set the first value of the array in Metainfo.
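To make the ddof point concrete, NumPy's default behaves as follows:

```python
import numpy as np

# np.std defaults to ddof=0, i.e. the population standard deviation
# (divide by N); ddof=1 gives the sample estimate (divide by N - 1).
# np.mean takes no ddof argument at all.
x = np.array([1.0, 0.0, 0.0])
population_std = np.std(x)       # sqrt(2/9) ~ 0.4714
sample_std = np.std(x, ddof=1)   # sqrt(1/3) ~ 0.5774
```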

For complex non-scalar datasets we have no statistics at the moment anyway.

The motivation is that there is a tradeoff when reading efficiently from chunks: mean and stdev then need to be computed incrementally, and that is not trivially vectorizable, in particular not if we wish to stick with Python code.
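One way to keep the per-chunk work vectorized, with only the merge step in Python, is the pairwise-combination variant from the variance-algorithms article linked above; a hypothetical sketch, not code from this PR:

```python
import numpy as np

def chunk_stats(chunk):
    """Vectorized per-chunk reduction to a (count, mean, M2) triple."""
    c = np.asarray(chunk, dtype=float)
    return c.size, c.mean(), float(((c - c.mean()) ** 2).sum())

def combine(a, b):
    """Merge two (count, mean, M2) triples (Chan et al. update rule)."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

stats = chunk_stats([1, 2, 3])
for chunk in ([4, 5], [6, 7]):
    stats = combine(stats, chunk_stats(chunk))
n, mean, m2 = stats
# n == 7, mean == 4.0, population std == sqrt(m2 / n) == 2.0
```

The Python-level merge then runs once per chunk rather than once per element, which may soften but does not remove the cost argument above.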

I suggest that we rather offload all the summary statistics to the plugin and attach these as arrays if required.

That would offer those folks who wish to have a highly numerically accurate mean and std the option to compute them themselves, with the parser ending up doing just a lookup. In the parser, this computation was very costly for the contiguous storage layout and does not get cheaper when using chunked storage.

I also would like to motivate everybody to rather export their non-scalar arrays using the chunked storage layout.

Mind that this does not mean that you need to apply any compression. Chunking is a necessary condition for using HDF5 compression filters, but chunking alone does not imply compression.
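For illustration, enabling the chunked layout in h5py takes only the `chunks` keyword; a compression filter would be a separate, optional `compression=...` argument on top. Dataset name and shape below are placeholders, and the in-memory core driver is used only so the sketch leaves no file behind:

```python
import h5py
import numpy as np

with h5py.File("stack.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "entry/data/images",
        shape=(100, 512, 512),
        dtype=np.uint16,
        chunks=(1, 512, 512),  # one image per chunk
    )
    layout = dset.chunks       # (1, 512, 512)
    filt = dset.compression    # None: chunking does not imply compression
```

A chunk shape of one image per chunk matches the per-image access pattern of stacks and keeps the per-chunk footprint at a few hundred KiB here.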

Using chunking would enable us to reduce memory consumption peaks especially when we could agree to avoid computing mean and std in the parsing stage.

Not urgent but thoughts would be appreciated @sherjeelshabih @rettigl @lukaspie @RubelMozumder @sanbrock
