Skip to content

Adding explicit support for blosc but keeping deflate the default#747

Open
mkuehbach wants to merge 5 commits intomasterfrom
add_blosc_but_keep_deflate_the_default
Open

Adding explicit support for blosc but keeping deflate the default#747
mkuehbach wants to merge 5 commits intomasterfrom
add_blosc_but_keep_deflate_the_default

Conversation

@mkuehbach
Copy link
Collaborator

@mkuehbach mkuehbach commented Mar 17, 2026

  • Adding whats required to use blosc2 but deactivated
  • A comment to the compression md in the learning section with instruction on how to activate would be good to add

@mkuehbach mkuehbach changed the title carried over from NXapm run-through Adding explicit support for blosc but keeping deflate the default Mar 17, 2026
Comment on lines +47 to +51
if (
PYNX_ENABLE_BLOSC
and importlib.util.find_spec("hdf5plugin") is not None
and importlib.util.find_spec("blosc2") is not None
):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either we add these two dependencies to optional in the pyproject.toml then we have a check for them here. Or we keep the pyproject with these as mandatory dependencies and skip the check.

Comment on lines +55 to +56
NTHREADS_BLOSC = blosc2.set_nthreads(max(int(os.cpu_count() / 2), 1))
# do not oversubscribe to use hyperthreading cores
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? Aren't there any sane defaults in the hdf5lib already?

Comment on lines +197 to +209
if compression_filter == "gzip":
compression_config = dict(
compression=compression_filter,
compression_opts=compression_strength,
)
else: # by virtue of construction blosc
compression_config = hdf5plugin.Blosc2(cname="zstd", clevel=9)
grp.create_dataset(
entry_name,
data=data["compress"],
compression=compression_filter,
chunks=chunking_strategy(data),
compression_opts=compression_strength,
**compression_config,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have one function that deals with our compression needs instead of if statements here.

Comment on lines +31 to +37
# use only when it is acceptable to work with blosc2-compressed content downstreams
# mind that doing so in C/C++, Matlab, and Fortran application requires specific
# linking of these apps with a customized HDF5 library that links to the blosc library
# consider that using blosc sets explicit a certain number of cores eligible for
# doing compression and decompression work that may drain resources when pynxtools
# is used in conjunction with other apps and services like NOMAD
# check the set_nthreads in writer.py to modify accordingly for your best practice
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much of an in code comment. Let's simplify and use some defaults or add this to a more accessible README.md section/link to another .md and/or to the docs.

@sherjeelshabih
Copy link
Collaborator

Can you also add a small text in this PR even for now of what changes this introduces to the way the user interacts with this? I believe the writer.py now expects a "filter" key in the Template object. It will be nice to know the "user interface" changes in the PR.

Thanks for introducing this. I hope it makes it easier for the large datasets we run into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants