Opschema metadata by mzient · Pull Request #6280 · NVIDIA/DALI

mzient · 2026-04-03T18:03:58Z

Co-authored-by: Rostan Tabet <rtabet@nvidia.com>

Category:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change

Description:

This change adds static metadata inference (ndim, layout, dtype) to OpSchema. Most operators can infer it from OpSpec.
OpSpec now carries the statically inferred metadata.
Actual inputs and outputs, as seen in the workspace, are now automatically validated against OpSpec in OperatorBase.
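To make the idea of per-output metadata callbacks concrete, here is a minimal Python sketch. Everything in it is hypothetical: `InOutDesc` only borrows the name of the C++ struct, `Schema` and `infer_outputs` are invented for illustration, and the pass-through default (copy metadata from input 0 when an output has no callback) is an assumption, not necessarily what DALI's default inference does. A `None` field stands for "not statically known".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class InOutDesc:
    """Hypothetical mirror of OpSpec::InOutDesc; None means 'unknown'."""
    ndim: Optional[int] = None
    dtype: Optional[str] = None
    layout: Optional[str] = None


# A callback maps all input descriptors to one output descriptor.
OutputFn = Callable[[Sequence[InOutDesc]], InOutDesc]


def passthrough(inputs: Sequence[InOutDesc]) -> InOutDesc:
    # Assumed default: reuse the first input's metadata (nothing known if there are no inputs).
    return inputs[0] if inputs else InOutDesc()


class Schema:
    def __init__(self, num_outputs: int, output_fns: Optional[Dict[int, OutputFn]] = None):
        self.num_outputs = num_outputs
        self.output_fns = output_fns or {}

    def infer_outputs(self, inputs: Sequence[InOutDesc]) -> List[InOutDesc]:
        # Use the registered callback for each output, falling back to the default.
        return [self.output_fns.get(i, passthrough)(inputs) for i in range(self.num_outputs)]


# A cast-like operator overrides only dtype; ndim and layout flow through.
cast_like = Schema(1, {0: lambda ins: InOutDesc(ins[0].ndim, "float32", ins[0].layout)})
print(cast_like.infer_outputs([InOutDesc(3, "uint8", "HWC")]))
# [InOutDesc(ndim=3, dtype='float32', layout='HWC')]
```

Most operators then only need a callback for the fields they actually change.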

Breaking change: because there is now a default way to handle metadata inference, custom operators may break and need user attention. This is not ideal; some way to mitigate it would be nice (e.g., making DALI_SCHEMA behave differently when compiling libdali_operators).

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

greptile-apps · 2026-04-03T18:21:27Z

Greptile Summary

This PR adds static metadata inference (ndim, dtype, layout) to OpSchema and propagates it through the pipeline graph into OpSpec, then validates actual workspace tensors against those inferred descriptors at runtime in OperatorBase::Setup/Run. Roughly 60 operators are annotated with callbacks, and the experimental dynamic mode gains lazy metadata resolution without requiring full tensor evaluation.

P1 – join.cc negative axis bug: axis = desc.layout->ndim() - axis should be + axis; for any negative axis the normalized value is always out of range, so OutputLayout silently returns nullopt instead of the correct layout.
P1 – DataNode.__str__: dtype and layout are both formatted with the label ndim=, making the repr actively misleading.

Confidence Score: 4/5

Safe to merge after fixing the join.cc negative-axis logic bug and the DataNode.__str__ label errors.

Two P1 defects are present: the negative-axis normalization in join.cc silently produces wrong layout metadata for any negative axis value, and DataNode.__str__ mislabels dtype and layout as ndim, making debug output unreliable. The remaining findings are P2 (dead code, typos, stale docstrings). The core schema/spec/graph infrastructure is well-structured and the validation logic is correct.

dali/operators/generic/join.cc (negative axis bug) and dali/python/nvidia/dali/data_node.py (wrong str labels and typos)

Important Files Changed

Filename	Overview
dali/pipeline/operator/op_schema.h	New metadata inference API (OutputDType/NDim/Layout with callbacks, ExpandedDims, CalculateOutput* methods); copy-pasted docstrings on OutputNDimFn and OutputLayoutFn
dali/pipeline/operator/op_schema.cc	Implements metadata calculation with ancestor traversal and lazy caching; dead code block in CalculateOutputDType (computed local variable never used)
dali/pipeline/operator/op_spec.h	InOutDesc struct extended with optional ndim/dtype/layout; AddInput/AddArgumentInput accept metadata; OutputDesc/InputDesc accessors added; output map key type updated
dali/pipeline/operator/op_spec.cc	AddInput/AddArgumentInput/AddOutput now reset metadata-inferred flag; InferOutputMetadata delegates to schema's Calculate* methods
dali/pipeline/operator/operator.cc	ValidateInputMetadata and ValidateOutputMetadata added; typo "layuot" in comment; debug-mode guard via __debug argument skips validation for eager mode
dali/pipeline/operator/operator.h	Setup/Run virtual methods gain validate_metadata parameter (default=true); ValidateInput/OutputMetadata declared; breaking change for direct Run overriders
dali/pipeline/graph/node_meta.cc	New file: DFS propagation of output metadata through the op graph, feeding each node's input descriptors before calling InferOutputMetadata
dali/python/nvidia/dali/data_node.py	DataNode gains ndim/dtype/layout populated from OpSpec.OutputDesc; __str__ has two bugs: dtype and layout are both labeled "ndim="; error messages misspell "Mismatch" as "Msmatch"
dali/operators/generic/join.cc	OutputLayout callback added; negative axis normalization uses subtraction instead of addition, making layout inference always fail for any negative axis value
dali/pipeline/operator/sequence_operator.h	Setup/Run correctly propagate validate_metadata; expanded workspace path calls ValidateInputMetadata separately since base Setup is called with false
dali/pipeline/operator/eager_operator.h	Adds __debug=true to eager operator spec to bypass metadata validation which would fail without full graph-build metadata propagation
dali/test/python/experimental_mode/test_output_metadata.py	New test file: validates that deferred-mode operators return correct dtype/ndim/layout without triggering early evaluation

Sequence Diagram

sequenceDiagram
    participant PY as Python ops/__init__.py
    participant DN as DataNode
    participant OS as OpSpec
    participant SC as OpSchema
    participant GM as graph/node_meta.cc
    participant OP as OperatorBase

    PY->>DN: DataNode(name, device, source, index)
    DN->>OS: spec.OutputDesc(index)
    OS-->>DN: (name, device, ndim, dtype, layout)

    PY->>OS: AddInput(name, device, ndim, dtype, layout)
    PY->>OS: AddOutput(name, device)

    Note over GM: Pipeline::Build()
    GM->>GM: PropagateDataNodeMetadata (DFS)
    GM->>OS: MutableInputDesc(i) ← producer OutputDesc
    GM->>OS: InferOutputMetadata()
    OS->>SC: CalculateOutputDType/NDim/Layout(i, spec)
    SC-->>OS: optional<dtype/ndim/layout>

    Note over OP: Execution
    OP->>OP: Setup(ws, validate_metadata=true)
    OP->>OP: ValidateInputMetadata(ws, spec)
    OP->>OP: RunImpl(ws)
    OP->>OP: ValidateOutputMetadata(ws, spec)
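The graph-build step in the diagram (PropagateDataNodeMetadata, a DFS that feeds each node's input descriptors from its producers before calling InferOutputMetadata) can be sketched in Python as follows. `Desc`, `Node`, and `propagate` are hypothetical names, and the cycle check and per-node caching are assumptions about how such a traversal would reasonably behave; this is not the C++ code in node_meta.cc.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Desc:
    ndim: Optional[int] = None
    dtype: Optional[str] = None
    layout: Optional[str] = None


@dataclass
class Node:
    # (producer node name, producer output index) for each input
    inputs: List[Tuple[str, int]]
    # Stand-in for OpSpec::InferOutputMetadata: input descriptors -> output descriptors
    infer: Callable[[List[Desc]], List[Desc]]
    # Filled in by propagate(); None until the node has been visited
    outputs: Optional[List[Desc]] = None


def propagate(nodes: Dict[str, Node]) -> None:
    """Depth-first metadata propagation: producers are inferred before consumers."""
    in_progress = set()

    def visit(name: str) -> List[Desc]:
        node = nodes[name]
        if node.outputs is not None:  # already inferred; reuse the cached result
            return node.outputs
        if name in in_progress:
            raise ValueError(f"cycle detected at node {name!r}")
        in_progress.add(name)
        # Feed each input descriptor from the producer's output descriptor...
        in_descs = [visit(src)[idx] for src, idx in node.inputs]
        # ...then infer this node's outputs.
        node.outputs = node.infer(in_descs)
        in_progress.remove(name)
        return node.outputs

    for name in nodes:
        visit(name)


# Declaration order does not matter: DFS reaches the reader first anyway.
graph = {
    "crop": Node([("decode", 0)], lambda ins: [ins[0]]),  # pass-through
    "decode": Node([("reader", 0)], lambda ins: [Desc(3, "uint8", "HWC")]),
    "reader": Node([], lambda ins: [Desc(1, "uint8")]),
}
propagate(graph)
print(graph["crop"].outputs)  # [Desc(ndim=3, dtype='uint8', layout='HWC')]
```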

Reviews (1): Last reviewed commit: "TODO(michalz): Fix exception tests."

greptile-apps · 2026-04-03T18:21:31Z

dali/operators/generic/join.cc

+      if (!desc.layout || desc.layout->empty())
+        continue;
+      if (axis < 0)
+        axis = desc.layout->ndim() - axis;


Wrong negative-axis normalization sign

The subtraction ndim - axis produces the wrong index for every negative axis. For axis = -1 with ndim = 2, the result is 2 - (-1) = 3, which immediately fails the range check on the next line and silently returns nullopt — so OutputLayout for any join with a negative axis is always nullopt. Standard normalization is ndim + axis.

Suggested change

-        axis = desc.layout->ndim() - axis;
+        axis = desc.layout->ndim() + axis;
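A Python rendering of the normalization and range check, showing why the subtraction can never yield a valid index for a negative axis. Both function names are hypothetical, and the sketch covers only the axis arithmetic; the rest of the OutputLayout callback in join.cc is C++ and is not shown in the review.

```python
from typing import Optional


def normalize_axis(axis: int, ndim: int) -> Optional[int]:
    """Map axis in [-ndim, ndim) to [0, ndim); return None when out of range."""
    if axis < 0:
        axis = ndim + axis  # -1 -> ndim - 1, -ndim -> 0
    return axis if 0 <= axis < ndim else None


def normalize_axis_buggy(axis: int, ndim: int) -> Optional[int]:
    """The flagged pattern: subtracting a negative axis overshoots ndim."""
    if axis < 0:
        axis = ndim - axis  # -1 -> ndim + 1, always out of range
    return axis if 0 <= axis < ndim else None


layout = "HW"
print(normalize_axis(-1, len(layout)))        # 1 (axis name 'W')
print(normalize_axis_buggy(-1, len(layout)))  # None
```

For any negative `axis = -k`, the buggy form gives `ndim + k >= ndim + 1`, so the range check always fails and the callback falls back to "unknown".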

greptile-apps · 2026-04-03T18:21:32Z

dali/python/nvidia/dali/data_node.py

+        if self.dtype is not None:
+            s += f", ndim={self.dtype}"
+        if self.layout is not None:
+            s += f", ndim={repr(self.layout)}"


Wrong field labels in __str__ for dtype and layout

Both dtype and layout are formatted with the prefix ndim=, making the repr actively misleading for debugging. The labels should match the field names.

Suggested change

-        if self.dtype is not None:
-            s += f", ndim={self.dtype}"
-        if self.layout is not None:
-            s += f", ndim={repr(self.layout)}"
+        if self.ndim is not None:
+            s += f", ndim={self.ndim}"
+        if self.dtype is not None:
+            s += f", dtype={self.dtype}"
+        if self.layout is not None:
+            s += f", layout={repr(self.layout)}"
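One way to rule out this class of copy-paste mislabel is to derive each label from the attribute name. The sketch uses a hypothetical `MetaNode` class rather than the real DataNode, whose full `__str__` format is not shown in the review; note also that it applies `repr()` to every field, whereas the suggested change uses it only for layout.

```python
from typing import Optional


class MetaNode:
    """Hypothetical stand-in for a DataNode that carries optional static metadata."""

    _META_FIELDS = ("ndim", "dtype", "layout")

    def __init__(self, name: str, device: str, ndim: Optional[int] = None,
                 dtype: Optional[str] = None, layout: Optional[str] = None):
        self.name = name
        self.device = device
        self.ndim = ndim
        self.dtype = dtype
        self.layout = layout

    def __str__(self) -> str:
        s = f"MetaNode(name={self.name!r}, device={self.device!r}"
        # The label is the attribute name itself, so it cannot drift from the value.
        for field in self._META_FIELDS:
            value = getattr(self, field)
            if value is not None:
                s += f", {field}={value!r}"
        return s + ")"


print(MetaNode("img", "gpu", ndim=3, dtype="uint8", layout="HWC"))
# MetaNode(name='img', device='gpu', ndim=3, dtype='uint8', layout='HWC')
```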

greptile-apps · 2026-04-03T18:21:33Z

dali/pipeline/operator/op_schema.cc

+  const decltype(output_dtype_fn_) &output_dtype_fn = [&]() {
+    if (!output_dtype_fn_.empty())
+      return output_dtype_fn_;
+    for (auto *parent : GetParents())
+      if (!parent->output_dtype_fn_.empty())
+        return parent->output_dtype_fn_;
+    return output_dtype_fn_;
+  }();


Dead code — output_dtype_fn variable is never used

The local variable output_dtype_fn is computed by the immediately-invoked lambda but is never referenced again; the rest of the function delegates to OutputDTypeFn() / OutputDTypeFuncs() which already handle inheritance. This block can be removed without changing behaviour, and the compiler may warn about an unused variable.
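Keeping the inheritance lookup in a single accessor (the role the review attributes to OutputDTypeFn() / OutputDTypeFuncs()) is what makes the inline lambda redundant. A Python sketch of such a lookup follows; `SchemaNode`, `resolved_output_dtype_fn`, and the depth-first, declaration-order search are assumptions, grounded only in the summary's mention of "ancestor traversal". Unlike the removed lambda, which checks only direct parents, this walks the whole ancestor chain.

```python
from typing import Callable, Optional, Sequence

DTypeFn = Callable[..., str]


class SchemaNode:
    """Hypothetical schema with an optional dtype callback and parent schemas."""

    def __init__(self, name: str, parents: Sequence["SchemaNode"] = (),
                 output_dtype_fn: Optional[DTypeFn] = None):
        self.name = name
        self.parents = list(parents)
        self.output_dtype_fn = output_dtype_fn

    def resolved_output_dtype_fn(self) -> Optional[DTypeFn]:
        # The schema's own callback wins.
        if self.output_dtype_fn is not None:
            return self.output_dtype_fn
        # Otherwise search ancestors depth-first, in parent declaration order.
        for parent in self.parents:
            fn = parent.resolved_output_dtype_fn()
            if fn is not None:
                return fn
        return None


base = SchemaNode("Base", output_dtype_fn=lambda: "float32")
middle = SchemaNode("Middle", parents=[base])
leaf = SchemaNode("Leaf", parents=[middle])
print(leaf.resolved_output_dtype_fn()())  # float32
```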

greptile-apps · 2026-04-03T18:21:34Z

dali/python/nvidia/dali/data_node.py

+                raise ValueError("Msmatch between OpSpec and explicit `ndim` argument.")
+            if dtype is not None and self.dtype is not None and dtype != self.dtype:
+                raise ValueError("Msmatch between OpSpec and explicit `dtype` argument.")
+            if layout is not None and self.layout is not None and layout != self.layout:
+                raise ValueError("Msmatch between OpSpec and explicit `layout` argument.")


Typo "Msmatch" in error messages

All three ValueError messages say "Msmatch" instead of "Mismatch".

Suggested change

-                raise ValueError("Msmatch between OpSpec and explicit `ndim` argument.")
-            if dtype is not None and self.dtype is not None and dtype != self.dtype:
-                raise ValueError("Msmatch between OpSpec and explicit `dtype` argument.")
-            if layout is not None and self.layout is not None and layout != self.layout:
-                raise ValueError("Msmatch between OpSpec and explicit `layout` argument.")
+            if ndim is not None and self.ndim is not None and ndim != self.ndim:
+                raise ValueError("Mismatch between OpSpec and explicit `ndim` argument.")
+            if dtype is not None and self.dtype is not None and dtype != self.dtype:
+                raise ValueError("Mismatch between OpSpec and explicit `dtype` argument.")
+            if layout is not None and self.layout is not None and layout != self.layout:
+                raise ValueError("Mismatch between OpSpec and explicit `layout` argument.")
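The three checks share one shape, so a single helper with a message template avoids repeating a typo three times. `reconcile` is a hypothetical name, and returning the explicit value when the OpSpec has none is an assumption; the reviewed snippet shows only the mismatch check.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def reconcile(field: str, from_spec: Optional[T], explicit: Optional[T]) -> Optional[T]:
    """Check an explicit metadata argument against the OpSpec-inferred value.

    Raises ValueError when both are known and disagree; otherwise returns the
    known value (spec first, then explicit), or None if neither is known.
    """
    if explicit is not None and from_spec is not None and explicit != from_spec:
        # One template for every field, so the wording stays consistent.
        raise ValueError(f"Mismatch between OpSpec and explicit `{field}` argument.")
    return from_spec if from_spec is not None else explicit


print(reconcile("ndim", 3, None))  # 3
```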

greptile-apps · 2026-04-03T18:21:35Z

dali/pipeline/operator/operator.cc

+      const OpSpec::InOutDesc &against,
+      std::string_view category,
+      NameType &&index_or_name) {
+  if (what.num_samples() == 0)  // empty batch may have improper ndim/layuot, but we don't care


Typo "layuot" in comment

Suggested change

-  if (what.num_samples() == 0)  // empty batch may have improper ndim/layuot, but we don't care
+  if (what.num_samples() == 0)  // empty batch may have improper ndim/layout, but we don't care
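For context, a Python sketch of a metadata validation step with the empty-batch escape hatch from the comment above. `Desc`, `Batch`, `validate`, and the error wording are hypothetical; treating a `None` descriptor field as "don't check" is an assumption consistent with InOutDesc's fields being optional, and the sketch takes ndim from the first sample, assuming all samples in a batch share it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Desc:
    ndim: Optional[int] = None
    dtype: Optional[str] = None
    layout: Optional[str] = None


@dataclass
class Batch:
    """Hypothetical stand-in for a workspace tensor list."""
    shapes: List[Tuple[int, ...]]  # per-sample shapes
    dtype: str
    layout: str = ""


def validate(what: Batch, against: Desc, category: str, index: int) -> None:
    # An empty batch may carry an improper ndim/layout, so it is not checked.
    if len(what.shapes) == 0:
        return
    ndim = len(what.shapes[0])
    if against.ndim is not None and ndim != against.ndim:
        raise ValueError(f"{category} {index}: expected ndim={against.ndim}, got {ndim}")
    if against.dtype is not None and what.dtype != against.dtype:
        raise ValueError(f"{category} {index}: expected dtype={against.dtype}, got {what.dtype}")
    if against.layout is not None and what.layout != against.layout:
        raise ValueError(f"{category} {index}: expected layout={against.layout!r}, got {what.layout!r}")


expected = Desc(3, "uint8", "HWC")
validate(Batch([], "float32"), expected, "output", 0)                    # empty: skipped
validate(Batch([(480, 640, 3)], "uint8", "HWC"), expected, "output", 0)  # matches
```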

greptile-apps · 2026-04-03T18:21:36Z

dali/pipeline/operator/op_schema.h

+  /** Gets the function that computes the output dtype for the given output.
+   *
+   * The returned function may be inherited from a parent schema.
+   */
+  OutputNDimFunc OutputNDimFn(int index) const;
+
+  /** Gets the function that computes the output dtype for the given output.
+   *
+   * The returned function may be inherited from a parent schema.
+   */
+  OutputLayoutFunc OutputLayoutFn(int index) const;


Copy-pasted docstrings for OutputNDimFn and OutputLayoutFn

Both functions are documented as "Gets the function that computes the output dtype for the given output", which is the docstring for OutputDTypeFn. The descriptions for OutputNDimFn and OutputLayoutFn should reference ndim and layout respectively.

rostan-t and others added 9 commits April 3, 2026 20:04

Support specifying output dtype, ndim and layout in OpSchema

4239610

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

Infer output ndim, dtype, and layout without full evaluation

35078ec

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

Start adding output metadata inference to operator schemas

7ed71ca

Signed-off-by: Rostan Tabet <rtabet@nvidia.com>

Metadata inference working in pipeline mode.

8a9c0d8

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Revert files with only copyright change.

2b128f5

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Fixes.

c522579

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

More fixes.

5fd7ecf

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Fixed static inference all operators and their tests. Fix debug mode.

af53ff9

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

Rebase cleanup.

156fd04

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

mzient force-pushed the opschema-metadata branch from b0330dc to 156fd04 (April 3, 2026 18:08)

TODO(michalz): Fix exception tests.

644da43

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

mzient wants to merge 10 commits into NVIDIA:main from mzient:opschema-metadata
