@eschnett

This is a continuation of braingram#1. This PR discusses two improvements to the current interface to blosc:

  1. The compressor does not pass the data type size to blosc. Knowing the element size lets the shuffle filter reorder bytes across elements, exposing more regularity that the compression algorithm can exploit.
  2. The decompressor should be able to write into a preallocated buffer instead of allocating its own output buffer. This saves memory bandwidth and should improve decompression speed slightly. (A sketch of both points follows below.)
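
To make both points concrete, here is a minimal sketch written directly against the c-blosc C API (not this repository's wrapper); the compression level and the `roundtrip` helper are placeholders:

  #include <blosc.h>

  #include <cstddef>
  #include <vector>

  // Round-trip a float64 array through blosc, illustrating both points.
  void roundtrip(const std::vector<double> &rho) {
    blosc_init(); // in real code, call once at program startup

    const size_t typesize = sizeof(double); // 8: lets shuffle see whole elements
    const size_t nbytes = rho.size() * sizeof(double);

    // (1) Pass the element size so the shuffle filter can reorder bytes
    //     across elements before the compression stage runs.
    std::vector<char> compressed(nbytes + BLOSC_MAX_OVERHEAD);
    int csize = blosc_compress(/*clevel=*/5, BLOSC_SHUFFLE, typesize, nbytes,
                               rho.data(), compressed.data(), compressed.size());

    // (2) Decompress into a buffer the caller already owns, instead of
    //     having the decompressor allocate (and later copy) its own.
    std::vector<double> out(rho.size());
    int dsize = blosc_decompress(compressed.data(), out.data(), nbytes);

    (void)csize;
    (void)dsize;
    blosc_destroy();
  }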

I experimented with creating a large 3d float64 array (1000 x 1000 x 250 elements) and compressing it with the shuffle filter, using a type size of either 8 (matching the float64 elements) or 1. The array was filled with:

  // ni, nj, nk are the array extents (1000, 1000, 250), getidx maps (i, j, k)
  // to a linear index, and rho is the flat float64 array being filled.
  for (int64_t i = 0; i < ni; ++i)
    for (int64_t j = 0; j < nj; ++j)
      for (int64_t k = 0; k < nk; ++k) {
        int64_t idx = getidx(i, j, k);
        rho.at(idx) = 1.0 / (1.1 * i + 1.2 * j + 1.3 * k + 1);
      }

The resulting file sizes are:

  -rw-r--r--   1 eschnett staff 1993847361 Nov 16 11:31 large-new-shuffle-typesize-1.asdf
  -rw-r--r--   1 eschnett staff  395927299 Nov 16 11:29 large-new-shuffle-typesize-8.asdf

In this case the compressed file is about 5 times larger when the wrong type size is used (1,993,847,361 bytes with type size 1 vs. 395,927,299 bytes with type size 8).
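
The gap comes from what the shuffle filter does: with the correct type size it groups the k-th byte of every element together, so the slowly varying high-order bytes of neighbouring float64 values form long, compressible runs, whereas with a type size of 1 the shuffle degenerates to a plain copy. A minimal (non-SIMD) sketch of the byte shuffle:

  #include <cstddef>

  // Group the k-th byte of each of the n elements together; dst must hold
  // n * typesize bytes. With typesize == 1 this is just a copy.
  void byte_shuffle(const unsigned char *src, unsigned char *dst,
                    size_t n, size_t typesize) {
    for (size_t i = 0; i < n; ++i)
      for (size_t k = 0; k < typesize; ++k)
        dst[k * n + i] = src[i * typesize + k];
  }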
