
Opschema metadata#6280

Open
mzient wants to merge 10 commits into NVIDIA:main from mzient:opschema-metadata

Conversation

@mzient
Contributor

@mzient mzient commented Apr 3, 2026

Co-authored-by: Rostan Tabet <rtabet@nvidia.com>

Category:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change

Description:

This change adds static metadata inference (ndim, layout, dtype) to OpSchema. Most operators can infer it from OpSpec.
OpSpec now carries the statically inferred metadata.
Actual inputs and outputs, as seen in the workspace, are now automatically validated against OpSpec in OperatorBase.

Breaking change - since there's a default way to handle metadata inference, custom operators may become broken and need user attention. This is not ideal - some way to handle it would be nice (e.g. make DALI_SCHEMA behave differently when compiling libdali_operators).

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: N/A

rostan-t and others added 9 commits April 3, 2026 20:04
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Rostan Tabet <rtabet@nvidia.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@mzient mzient force-pushed the opschema-metadata branch from b0330dc to 156fd04 Compare April 3, 2026 18:08
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@greptile-apps
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR adds static metadata inference (ndim, dtype, layout) to OpSchema and propagates it through the pipeline graph into OpSpec, then validates actual workspace tensors against those inferred descriptors at runtime in OperatorBase::Setup/Run. Roughly 60 operators are annotated with callbacks, and the experimental dynamic mode gains lazy metadata resolution without requiring full tensor evaluation.

  • P1 – join.cc negative axis bug: axis = desc.layout->ndim() - axis should be + axis; for any negative axis the normalized value is always out of range, so OutputLayout silently returns nullopt instead of the correct layout.
  • P1 – DataNode.__str__: dtype and layout are both formatted with the label ndim=, making the repr actively misleading.

Confidence Score: 4/5

Safe to merge after fixing the join.cc negative-axis logic bug and the DataNode.__str__ label errors.

Two P1 defects are present: the negative-axis normalization in join.cc silently produces wrong layout metadata for any negative axis value, and DataNode.__str__ mislabels dtype and layout as ndim, making debug output unreliable. The remaining findings are P2 (dead code, typos, stale docstrings). The core schema/spec/graph infrastructure is well-structured and the validation logic is correct.

dali/operators/generic/join.cc (negative axis bug) and dali/python/nvidia/dali/data_node.py (wrong __str__ labels and typos)

Important Files Changed

Filename | Overview
dali/pipeline/operator/op_schema.h | New metadata inference API (OutputDType/NDim/Layout with callbacks, ExpandedDims, CalculateOutput* methods); copy-pasted docstrings on OutputNDimFn and OutputLayoutFn
dali/pipeline/operator/op_schema.cc | Implements metadata calculation with ancestor traversal and lazy caching; dead code block in CalculateOutputDType (computed local variable never used)
dali/pipeline/operator/op_spec.h | InOutDesc struct extended with optional ndim/dtype/layout; AddInput/AddArgumentInput accept metadata; OutputDesc/InputDesc accessors added; output map key type updated
dali/pipeline/operator/op_spec.cc | AddInput/AddArgumentInput/AddOutput now reset metadata-inferred flag; InferOutputMetadata delegates to schema's Calculate* methods
dali/pipeline/operator/operator.cc | ValidateInputMetadata and ValidateOutputMetadata added; typo "layuot" in comment; debug-mode guard via __debug argument skips validation for eager mode
dali/pipeline/operator/operator.h | Setup/Run virtual methods gain validate_metadata parameter (default=true); ValidateInput/OutputMetadata declared; breaking change for direct Run overriders
dali/pipeline/graph/node_meta.cc | New file: DFS propagation of output metadata through the op graph, feeding each node's input descriptors before calling InferOutputMetadata
dali/python/nvidia/dali/data_node.py | DataNode gains ndim/dtype/layout populated from OpSpec.OutputDesc; __str__ has two bugs: dtype and layout both labeled "ndim="; error messages spell "Msmatch"
dali/operators/generic/join.cc | OutputLayout callback added; negative axis normalization uses subtraction instead of addition, making layout inference always fail for any negative axis value
dali/pipeline/operator/sequence_operator.h | Setup/Run correctly propagate validate_metadata; expanded workspace path calls ValidateInputMetadata separately since base Setup is called with false
dali/pipeline/operator/eager_operator.h | Adds __debug=true to eager operator spec to bypass metadata validation which would fail without full graph-build metadata propagation
dali/test/python/experimental_mode/test_output_metadata.py | New test file: validates that deferred-mode operators return correct dtype/ndim/layout without triggering early evaluation

Sequence Diagram

sequenceDiagram
    participant PY as Python ops/__init__.py
    participant DN as DataNode
    participant OS as OpSpec
    participant SC as OpSchema
    participant GM as graph/node_meta.cc
    participant OP as OperatorBase

    PY->>DN: DataNode(name, device, source, index)
    DN->>OS: spec.OutputDesc(index)
    OS-->>DN: (name, device, ndim, dtype, layout)

    PY->>OS: AddInput(name, device, ndim, dtype, layout)
    PY->>OS: AddOutput(name, device)

    Note over GM: Pipeline::Build()
    GM->>GM: PropagateDataNodeMetadata (DFS)
    GM->>OS: MutableInputDesc(i) ← producer OutputDesc
    GM->>OS: InferOutputMetadata()
    OS->>SC: CalculateOutputDType/NDim/Layout(i, spec)
    SC-->>OS: optional<dtype/ndim/layout>

    Note over OP: Execution
    OP->>OP: Setup(ws, validate_metadata=true)
    OP->>OP: ValidateInputMetadata(ws, spec)
    OP->>OP: RunImpl(ws)
    OP->>OP: ValidateOutputMetadata(ws, spec)

Reviews (1): Last reviewed commit: "TODO(michalz): Fix exception tests."
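The graph-build step in the sequence diagram (producers visited before consumers, then each node's output metadata inferred from its inputs) can be illustrated with a small Python sketch. This is not the code in node_meta.cc: `OpNode`, `Meta`, `propagate` and the single-output simplification are hypothetical names and assumptions made for illustration, and the graph is assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# (ndim, dtype, layout); any field may be unknown (None). Hypothetical type.
Meta = Tuple[Optional[int], Optional[str], Optional[str]]


@dataclass
class OpNode:
    """Hypothetical single-output operator node."""
    name: str
    inputs: List[str]                     # names of producer nodes
    infer: Callable[[List[Meta]], Meta]   # stand-in for a schema callback
    output: Optional[Meta] = None         # filled in by propagate()


def propagate(nodes: Dict[str, OpNode]) -> None:
    """Depth-first walk: resolve every producer, then infer the node's output."""
    visited = set()

    def visit(name: str) -> None:
        if name in visited:
            return
        visited.add(name)
        node = nodes[name]
        for producer in node.inputs:
            visit(producer)
        node.output = node.infer([nodes[p].output for p in node.inputs])

    for name in nodes:
        visit(name)


reader = OpNode("reader", [], lambda ins: (3, "uint8", "HWC"))
cast = OpNode("cast", ["reader"], lambda ins: (ins[0][0], "float", ins[0][2]))
graph = {"cast": cast, "reader": reader}  # consumer listed first on purpose
propagate(graph)
```

Listing the consumer before its producer shows that the traversal order, not the container order, decides when a node is inferred.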

if (!desc.layout || desc.layout->empty())
  continue;
if (axis < 0)
  axis = desc.layout->ndim() - axis;

P1 Wrong negative-axis normalization sign

The subtraction ndim - axis produces the wrong index for every negative axis. For axis = -1 with ndim = 2, the result is 2 - (-1) = 3, which immediately fails the range check on the next line and silently returns nullopt — so OutputLayout for any join with a negative axis is always nullopt. Standard normalization is ndim + axis.

Suggested change
-   axis = desc.layout->ndim() - axis;
+   axis = desc.layout->ndim() + axis;
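The arithmetic in this comment can be checked with a short Python sketch that compares the two normalizations. The function names are hypothetical, and the valid range `[0, ndim)` is an assumption; the actual range check in join.cc is not shown in this thread.

```python
from typing import Optional


def normalize_axis(axis: int, ndim: int) -> Optional[int]:
    """Map a possibly negative axis into [0, ndim); None if out of range."""
    if axis < 0:
        axis = ndim + axis  # standard normalization
    return axis if 0 <= axis < ndim else None


def normalize_axis_subtracting(axis: int, ndim: int) -> Optional[int]:
    """Same check, but with the subtraction flagged in the review."""
    if axis < 0:
        axis = ndim - axis  # for axis < 0 this is always greater than ndim
    return axis if 0 <= axis < ndim else None


fixed = normalize_axis(-1, 2)
flagged = normalize_axis_subtracting(-1, 2)
```

With `ndim = 2` and `axis = -1`, the first function yields 1 and the second yields `None`, because `2 - (-1) = 3` falls outside the assumed range.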

Comment on lines +97 to +100
if self.dtype is not None:
    s += f", ndim={self.dtype}"
if self.layout is not None:
    s += f", ndim={repr(self.layout)}"

P1 Wrong field labels in __str__ for dtype and layout

Both dtype and layout are formatted with the prefix ndim=, making the repr actively misleading for debugging. The labels should match the field names.

Suggested change
- if self.dtype is not None:
-     s += f", ndim={self.dtype}"
- if self.layout is not None:
-     s += f", ndim={repr(self.layout)}"
+ if self.ndim is not None:
+     s += f", ndim={self.ndim}"
+ if self.dtype is not None:
+     s += f", dtype={self.dtype}"
+ if self.layout is not None:
+     s += f", layout={repr(self.layout)}"
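One way to rule out this kind of copy-paste slip is to derive each label from the field name instead of typing it three times. The sketch below assumes a hypothetical `NodeSketch` class with the same three optional fields; it is not the actual DataNode implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeSketch:
    """Hypothetical stand-in for DataNode with only the metadata fields."""
    name: str
    ndim: Optional[int] = None
    dtype: Optional[str] = None
    layout: Optional[str] = None

    def __str__(self) -> str:
        s = f"NodeSketch({self.name!r}"
        for field_name in ("ndim", "dtype", "layout"):
            value = getattr(self, field_name)
            if value is not None:
                # layout is shown via repr(), mirroring the suggested change
                shown = repr(value) if field_name == "layout" else value
                s += f", {field_name}={shown}"
        return s + ")"


text = str(NodeSketch("images", ndim=3, dtype="uint8", layout="HWC"))
```

Because the label is taken from the loop variable, a field can no longer be printed under another field's name.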

Comment on lines +1068 to +1075
const decltype(output_dtype_fn_) &output_dtype_fn = [&]() {
  if (!output_dtype_fn_.empty())
    return output_dtype_fn_;
  for (auto *parent : GetParents())
    if (!parent->output_dtype_fn_.empty())
      return parent->output_dtype_fn_;
  return output_dtype_fn_;
}();

P2 Dead code — output_dtype_fn variable is never used

The local variable output_dtype_fn is computed by the immediately-invoked lambda but is never referenced again; the rest of the function delegates to OutputDTypeFn() / OutputDTypeFuncs() which already handle inheritance. This block can be removed without changing behaviour, and the compiler may warn about an unused variable.
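The inheritance lookup that the comment says an accessor already performs can be sketched in Python as follows. `SchemaSketch` and `output_dtype_fn` are hypothetical names, and the recursive walk over ancestors is an assumption; the real accessor in op_schema.cc may resolve parents differently.

```python
from typing import Callable, List, Optional


class SchemaSketch:
    """Hypothetical schema holding an optional dtype callback and its parents."""

    def __init__(self,
                 dtype_fn: Optional[Callable[[], str]] = None,
                 parents: Optional[List["SchemaSketch"]] = None):
        self.dtype_fn = dtype_fn
        self.parents = parents or []

    def output_dtype_fn(self) -> Optional[Callable[[], str]]:
        """Return the schema's own callback, else the first one found in an ancestor."""
        if self.dtype_fn is not None:
            return self.dtype_fn
        for parent in self.parents:
            fn = parent.output_dtype_fn()
            if fn is not None:
                return fn
        return None


base = SchemaSketch(dtype_fn=lambda: "float")
child = SchemaSketch(parents=[base])
grandchild = SchemaSketch(parents=[child])
resolved = grandchild.output_dtype_fn()
```

Keeping the lookup in one accessor means callers never need a local copy of the same search, which is the redundancy the comment points out.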

Comment on lines +80 to +84
    raise ValueError("Msmatch between OpSpec and explicit `ndim` argument.")
if dtype is not None and self.dtype is not None and dtype != self.dtype:
    raise ValueError("Msmatch between OpSpec and explicit `dtype` argument.")
if layout is not None and self.layout is not None and layout != self.layout:
    raise ValueError("Msmatch between OpSpec and explicit `layout` argument.")

P2 Typo "Msmatch" in error messages

All three ValueError messages say "Msmatch" instead of "Mismatch".

Suggested change
-     raise ValueError("Msmatch between OpSpec and explicit `ndim` argument.")
- if dtype is not None and self.dtype is not None and dtype != self.dtype:
-     raise ValueError("Msmatch between OpSpec and explicit `dtype` argument.")
- if layout is not None and self.layout is not None and layout != self.layout:
-     raise ValueError("Msmatch between OpSpec and explicit `layout` argument.")
+ if ndim is not None and self.ndim is not None and ndim != self.ndim:
+     raise ValueError("Mismatch between OpSpec and explicit `ndim` argument.")
+ if dtype is not None and self.dtype is not None and dtype != self.dtype:
+     raise ValueError("Mismatch between OpSpec and explicit `dtype` argument.")
+ if layout is not None and self.layout is not None and layout != self.layout:
+     raise ValueError("Mismatch between OpSpec and explicit `layout` argument.")

    const OpSpec::InOutDesc &against,
    std::string_view category,
    NameType &&index_or_name) {
  if (what.num_samples() == 0)  // empty batch may have improper ndim/layuot, but we don't care

P2 Typo "layuot" in comment

Suggested change
-   if (what.num_samples() == 0)  // empty batch may have improper ndim/layuot, but we don't care
+   if (what.num_samples() == 0)  // empty batch may have improper ndim/layout, but we don't care
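The early return quoted above (skip validation for an empty batch, otherwise compare actual data against the inferred descriptor) can be sketched in Python. `validate_batch` and its parameters are hypothetical, and treating an unknown (`None`) expectation as "accept anything" is an assumption based on the optional ndim/dtype/layout fields described in this PR, not a copy of ValidateInputMetadata.

```python
from typing import List, Optional, Tuple


def validate_batch(shapes: List[Tuple[int, ...]],
                   layout: str,
                   expected_ndim: Optional[int],
                   expected_layout: Optional[str]) -> None:
    """Raise ValueError if a non-empty batch contradicts the inferred metadata."""
    if not shapes:
        # An empty batch carries no reliable ndim/layout, so nothing is checked.
        return
    if expected_ndim is not None:
        for i, shape in enumerate(shapes):
            if len(shape) != expected_ndim:
                raise ValueError(
                    f"sample {i}: got {len(shape)} dims, expected {expected_ndim}")
    if expected_layout is not None and layout != expected_layout:
        raise ValueError(f"got layout {layout!r}, expected {expected_layout!r}")


validate_batch([], "", expected_ndim=3, expected_layout="HWC")            # skipped
validate_batch([(4, 4, 3)], "HWC", expected_ndim=3, expected_layout="HWC")  # passes
```

A batch with a 2-D sample and `expected_ndim=3` would raise `ValueError`, while an unknown expectation (`None`) never rejects anything.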

Comment on lines +314 to +324
/** Gets the function that computes the output dtype for the given output.
 *
 * The returned function may be inherited from a parent schema.
 */
OutputNDimFunc OutputNDimFn(int index) const;

/** Gets the function that computes the output dtype for the given output.
 *
 * The returned function may be inherited from a parent schema.
 */
OutputLayoutFunc OutputLayoutFn(int index) const;

P2 Copy-pasted docstrings for OutputNDimFn and OutputLayoutFn

Both functions are documented as "Gets the function that computes the output dtype for the given output", which is the docstring for OutputDTypeFn. The descriptions for OutputNDimFn and OutputLayoutFn should reference ndim and layout respectively.


2 participants