Bug Report: NPZ/NPY Reader Issues — CRC Overhead, Memory Copies, O_DIRECT Decode-All, eval() Safety
Repo: dlio_benchmark
Date: March 2026
Severity: Medium–High (performance regression for all NPZ filesystem/S3 readers; correctness concern for O_DIRECT; security concern in O_DIRECT parser)
Affects: NPZ and NPY readers across filesystem, S3-simple, and O_DIRECT paths
Background
A review of all NPZ/NPY reader implementations against an expected behavior table
(covering CRC verification, member materialization, allocation count, and I/O API)
revealed four distinct issues. The S3-iterable path (NPZReaderS3Iterable /
NPYReaderS3Iterable via _S3IterableMixin) is unaffected — it never calls
np.load() and never decodes numpy.
Related Bug
This report expands on issue #223 with more detail and a proposed solution.
Issue 1 — np.load() CRC-32 Cannot Be Disabled; No Bypass Exists
Files affected
dlio_benchmark/reader/npz_reader.py (NPZReader.open)
dlio_benchmark/reader/npz_reader_s3.py (NPZReaderS3.open)
Description
Both NPZReader and NPZReaderS3 call np.load() to read NPZ files:
# npz_reader.py
return np.load(filename, allow_pickle=True)['x']
# npz_reader_s3.py
return np.load(io.BytesIO(data), allow_pickle=True)['x']
np.load() opens NPZ files via Python's zipfile.ZipExtFile. That class
always performs CRC-32 verification on every read() call — there is no
verify_crc=False parameter in np.load() or in zipfile.
Confirmed from zipfile.ZipExtFile source:
def _update_crc(self, newdata):
    if self._expected_crc is None:
        return
    self._running_crc = crc32(newdata, self._running_crc)
    if self._eof and self._running_crc != self._expected_crc:
        raise BadZipFile("Bad CRC-32 for file %r" % self.name)
For a storage benchmark, CRC verification adds pure CPU overhead on every file
read with zero benefit: the files are synthetic, generated by the benchmark itself,
and the benchmark discards the decoded data immediately after reading
(DLIO yields self._args.resized_image, not the decoded bytes).
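To put a rough number on this overhead, one option is to time zlib.crc32 (the routine zipfile uses when zlib is available) over a buffer about the size of one sample file. The sketch below is an illustration only: the helper name crc32_throughput_gbps and the 64 MiB default are assumptions, not part of DLIO, and the result depends on the CPU and the zlib build.

```python
import time
import zlib


def crc32_throughput_gbps(size_bytes=64 * 1024 * 1024, repeats=3):
    """Return the best observed CRC-32 throughput in GB/s for one buffer.

    Hypothetical helper: it times only the checksum, not decompression or I/O.
    """
    buf = bytes(size_bytes)  # zero-filled; CRC-32 cost does not depend on content
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        zlib.crc32(buf)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very small buffers or coarse timers.
    return size_bytes / max(best, 1e-12) / 1e9


if __name__ == "__main__":
    print(f"CRC-32 throughput: {crc32_throughput_gbps():.2f} GB/s")
```

Dividing the sample file size by the reported throughput gives an estimate of the CPU time each read spends on verification alone.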
Expected behavior
An optimized buffered reader that manually parses the ZIP local-file header and
decompresses the member data without performing CRC verification — as the O_DIRECT
reader (npz_reader_odirect.py) already does via parse_npz() + parse_npy().
Workaround
None for the filesystem and S3-simple paths. The S3-iterable readers
(NPZReaderS3Iterable) already avoid this by not decoding numpy at all.
Issue 2 — npz_reader_odirect.py Decodes ALL Members, Not One
File affected
dlio_benchmark/reader/npz_reader_odirect.py (NPZReaderODIRECT.parse_npz)
Description
parse_npz() iterates the entire ZIP local-file stream, decoding and storing
every member it encounters before returning the requested one:
def parse_npz(self, mem_view):
    files = {}
    pos = 0
    while pos < len(mem_view):
        local_header_signature = mem_view[pos:pos+4].tobytes()
        if local_header_signature != b'\x50\x4b\x03\x04':
            break
        # ... parse compressed_size, uncompressed_size, filename ...
        compressed_data = mem_view[pos:pos+compressed_size]
        pos += compressed_size
        files[filename] = self.parse_npy(uncompressed_data)  # ← decodes ALL
    return files  # caller picks ["x"] and discards the rest
For DLIO-generated NPZ files that contain exactly one member (x), this is
equivalent to reading one member. However:
- The code is misleadingly documented — the table accompanying this issue
describes O_DIRECT as reading "exactly one (target)" member, which is only
accidentally true.
- Any NPZ file with multiple members will have all of them decoded and allocated
in memory simultaneously, then discarded — wasted CPU and allocation.
- The loop should break early once the target member is found.
Expected behavior
parse_npz() should accept an optional only_key parameter (defaulting to the
value of DLIO_NPZ_KEY env var, then "x"). When set, the loop breaks
immediately after the target member is parsed:
def parse_npz(self, mem_view, only_key=None):
    target = only_key or os.environ.get("DLIO_NPZ_KEY", "x")
    files = {}
    pos = 0
    while pos < len(mem_view):
        ...
        files[filename] = self.parse_npy(uncompressed_data)
        if filename == target:
            break  # ← exit early
    return files
Additional fragility
parse_npz() raises ValueError("Unexpected file in npz: {filename}") for any
ZIP entry that does not end in .npy. Standard NPZ files written by NumPy only
contain .npy entries, but this makes the parser unnecessarily brittle. Non-npy
entries should be skipped, not treated as errors.
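A minimal sketch of a member walk that combines both points: it skips every entry other than the target without decoding it (including non-.npy entries) and returns as soon as the target is found. It uses only struct and zlib and performs no CRC check. The names read_member and _zip64_sizes are hypothetical, not existing DLIO functions. The sketch assumes entries are stored or deflated, carry their sizes in the local header (no data descriptors), and that key "x" maps to the entry name "x.npy", as NumPy writes it.

```python
import struct
import zlib

# Fixed 30-byte ZIP local file header: signature, version, flags, method,
# mod time, mod date, crc32, compressed size, uncompressed size,
# filename length, extra-field length.
_LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
_LOCAL_SIG = b"PK\x03\x04"


def _zip64_sizes(extra, usize, csize):
    """Replace 0xFFFFFFFF size placeholders with values from a ZIP64 extra field."""
    pos = 0
    while pos + 4 <= len(extra):
        tag, length = struct.unpack_from("<HH", extra, pos)
        if tag == 0x0001:  # ZIP64 extended information
            field = pos + 4
            if usize == 0xFFFFFFFF:
                (usize,) = struct.unpack_from("<Q", extra, field)
                field += 8
            if csize == 0xFFFFFFFF:
                (csize,) = struct.unpack_from("<Q", extra, field)
            break
        pos += 4 + length
    return usize, csize


def read_member(mem_view, target="x"):
    """Return the raw NPY payload of '<target>.npy', skipping all other entries."""
    wanted = target + ".npy"
    pos = 0
    while pos + _LOCAL_HEADER.size <= len(mem_view):
        (sig, _ver, flags, method, _time, _date, _crc,
         csize, usize, name_len, extra_len) = _LOCAL_HEADER.unpack_from(mem_view, pos)
        if sig != _LOCAL_SIG:
            break  # reached the central directory
        if flags & 0x08:
            raise ValueError("data-descriptor entries are not handled by this sketch")
        pos += _LOCAL_HEADER.size
        name = bytes(mem_view[pos:pos + name_len]).decode("utf-8")
        pos += name_len
        extra = bytes(mem_view[pos:pos + extra_len])
        usize, csize = _zip64_sizes(extra, usize, csize)
        pos += extra_len
        if name == wanted:
            payload = mem_view[pos:pos + csize]
            if method == 0:  # stored: a slice of the input, no copy
                return payload
            if method == 8:  # raw deflate stream
                return zlib.decompress(payload, wbits=-15)
            raise ValueError(f"unsupported compression method {method}")
        pos += csize  # skip this entry without decoding it
    raise KeyError(target)
```

The returned payload would then be handed to parse_npy(). Because non-target entries are skipped by size rather than decoded, entries placed before the target cost no decode work either.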
Issue 3 — npz_reader_s3.py Allocates Three Copies of Each File
File affected
dlio_benchmark/reader/npz_reader_s3.py (NPZReaderS3.open)
Description
NPZReaderS3.open() has three sequential allocations for every file read:
def open(self, filename):
    data = self.storage.get_data(filename, None)    # copy 1: bytes from S3
    image = io.BytesIO(data)                        # copy 2: BytesIO internal buffer
    return np.load(image, allow_pickle=True)['x']   # copy 3: decompressed ndarray
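io.BytesIO(data) does not reliably wrap the existing buffer. CPython 3.5+ shares an
exact bytes object until the stream is written to, but any other input type
(bytearray, memoryview) is copied into a new internal buffer, so whether copy 2
occurs depends on what storage.get_data() returns. In the worst case, peak RSS
per file is approximately:
3 × file_size_on_wire (plus zipfile's internal decompression workspace).
For a 150 MB file this is up to 450 MB of peak allocation per file per thread.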
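Whether io.BytesIO(data) duplicates the payload depends on the interpreter and on the type of data, so it is worth measuring on the target setup rather than assuming. The sketch below uses tracemalloc to report how many bytes constructing the stream allocates. The helper name bytesio_extra_bytes is hypothetical, and tracemalloc only sees allocations made through Python's allocators.

```python
import io
import tracemalloc


def bytesio_extra_bytes(payload):
    """Return the bytes newly allocated by io.BytesIO(payload), per tracemalloc.

    Hypothetical helper: the payload is created by the caller beforehand,
    so only the stream's own allocations are counted.
    """
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        stream = io.BytesIO(payload)  # kept alive until after the measurement
        after, _peak = tracemalloc.get_traced_memory()
        del stream
    finally:
        tracemalloc.stop()
    return after - before


if __name__ == "__main__":
    size = 16 * 1024 * 1024
    print("bytes input    :", bytesio_extra_bytes(bytes(size)))
    print("bytearray input:", bytesio_extra_bytes(bytearray(size)))
```

Running this with the object type that storage.get_data() actually returns shows whether the BytesIO step contributes a full extra copy in a given deployment.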
Expected behavior
Use the same buffer-based path as npz_reader_odirect.py:
replace io.BytesIO(data) + np.load() with bytearray(data) →
memoryview(buf) → parse_npz(mem_view)["x"]. The bytearray is the only copy made;
the returned ndarray is a view over it. This eliminates the BytesIO
copy and the CRC computation (Issue 1) simultaneously:
def open(self, filename):
    data = self.storage.get_data(filename, None)
    buf = bytearray(data)                     # one allocation (same size as data)
    return parse_npz(memoryview(buf))["x"]    # zero-copy ndarray view
The same pattern applies to NPYReaderS3.open():
# current (2 copies):
data = self.storage.get_data(filename, None)
return np.load(io.BytesIO(data), allow_pickle=True)
# better (1 copy):
data = self.storage.get_data(filename, None)
return parse_npy(memoryview(bytearray(data)))
Issue 4 — eval() on File Content in parse_npy() (Security / Correctness)
Files affected
dlio_benchmark/reader/npy_reader_odirect.py (NPYReaderODirect.parse_npy)
dlio_benchmark/reader/npz_reader_odirect.py (inherits via NPYReaderODirect)
Description
The NPY header parser uses Python's eval() to parse the NPY file header:
header_dict = eval(header.decode('latin1'))
The NPY header is a Python literal string
(e.g. {'descr': '<f4', 'fortran_order': False, 'shape': (1, 224, 224), })
that NumPy's own loader also parses. For DLIO's use case, reading files generated
by the benchmark itself, the content is always safe.
However, eval() on binary file content is arbitrary code execution if the file
is ever sourced from an untrusted location. NumPy's own numpy.lib.format
module uses ast.literal_eval() for this parsing, which is safe:
import ast
header_dict = ast.literal_eval(header.decode('latin1'))
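ast.literal_eval() accepts only Python literal structures (strings, bytes, numbers,
tuples, lists, dicts, sets, booleans, None). It raises ValueError for any other
expression, such as a function call, and SyntaxError for text that is not valid
Python, instead of executing it.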
Expected behavior
Replace eval(...) with ast.literal_eval(...) in parse_npy().
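As an illustration of the safer parsing, the sketch below reads an NPY header with struct and ast.literal_eval() and returns the header dict plus the offset where array data begins. The function name parse_npy_header is hypothetical and is not the existing parse_npy(). It follows the published NPY layout: a 6-byte magic string, two version bytes, then a 2-byte (version 1) or 4-byte (versions 2 and 3) little-endian header length.

```python
import ast
import struct

_MAGIC = b"\x93NUMPY"


def parse_npy_header(buf):
    """Return (header_dict, data_offset) for an NPY blob without using eval()."""
    view = memoryview(buf)
    if bytes(view[:6]) != _MAGIC:
        raise ValueError("not an NPY file")
    major = view[6]
    if major == 1:
        (header_len,) = struct.unpack_from("<H", view, 8)
        start = 10
    elif major in (2, 3):
        (header_len,) = struct.unpack_from("<I", view, 8)
        start = 12
    else:
        raise ValueError(f"unsupported NPY version {major}")
    # Version 3 headers are UTF-8; earlier versions are latin1.
    encoding = "utf-8" if major == 3 else "latin1"
    text = bytes(view[start:start + header_len]).decode(encoding)
    # literal_eval builds plain literals only; a call such as __import__(...)
    # raises ValueError instead of running.
    header = ast.literal_eval(text)
    if not isinstance(header, dict) or not header.keys() >= {"descr", "fortran_order", "shape"}:
        raise ValueError("malformed NPY header")
    return header, start + header_len
```

A header that contains anything other than a literal dict is rejected before any array data is touched, which is the property the eval() call lacks.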
Summary Table
| # | Issue | Files | Impact |
|---|-------|-------|--------|
| 1 | np.load() always runs CRC-32 via zipfile; no bypass | npz_reader.py, npz_reader_s3.py | CPU overhead on every file read; no way to disable |
| 2 | parse_npz() decodes all ZIP members, not just the target | npz_reader_odirect.py | Wasted decode + allocation for multi-member files; misleading docs |
| 3 | NPZReaderS3.open() makes 3 copies (bytes + BytesIO + ndarray) | npz_reader_s3.py, npy_reader_s3.py | 3× peak memory per file; BytesIO is an avoidable copy |
| 4 | parse_npy() uses eval() on file header content | npy_reader_odirect.py | Potential arbitrary code execution on untrusted input; use ast.literal_eval() |
Proposed Fix Path
The O_DIRECT reader already has the right approach in parse_npz() + parse_npy().
Refactor those two methods into a shared module (e.g. _npz_parser.py) and use it
in all three paths:
| Reader | Current | After fix |
|--------|---------|-----------|
| NPZReader (filesystem) | np.load(filename) (CRC on) | open(f,'rb') → bytearray(f.read()) → parse_npz(mv)["x"] (CRC off) |
| NPZReaderS3 (S3-simple) | np.load(BytesIO(data)) (CRC on, 3 copies) | bytearray(data) → parse_npz(mv)["x"] (CRC off, 1 copy) |
| NPZReaderODIRECT (O_DIRECT) | parse_npz(mv)["x"] (decodes all, eval()) | parse_npz(mv, only_key="x")["x"] (early exit, ast.literal_eval()) |
| NPZReaderS3Iterable | no decode (byte count only) | no change needed |
The NPY readers (NPYReader, NPYReaderS3) follow the same pattern but without
the ZIP layer — only Issue 3 (BytesIO extra copy) and Issue 4 (eval()) apply.
Reproduction
No special setup needed — these are code-path issues visible from static analysis:
# Confirm CRC is always on in zipfile:
python3 -c "
import zipfile, inspect
src = inspect.getsource(zipfile.ZipExtFile)
for i, line in enumerate(src.split('\n')):
    if 'crc' in line.lower() or 'BadZip' in line:
        print(f'{i}: {line}')
"
# Show eval() in parse_npy:
grep -n 'eval(' dlio_benchmark/dlio_benchmark/reader/npy_reader_odirect.py
# Show BytesIO double-copy in S3 readers:
grep -n 'BytesIO' dlio_benchmark/dlio_benchmark/reader/npz_reader_s3.py \
dlio_benchmark/dlio_benchmark/reader/npy_reader_s3.py
# Show parse_npz has no early exit (the only break is the signature check; no only_key or DLIO_NPZ_KEY):
grep -n 'break\|only_key\|DLIO_NPZ_KEY' dlio_benchmark/dlio_benchmark/reader/npz_reader_odirect.py