Skip to content

ENH: OoC optimizations for NeighborOrientationCorrelation and Morphological Filters#1554

Open
joeykleingers wants to merge 10 commits intoBlueQuartzSoftware:developfrom
joeykleingers:worktree-neighbor-orientation-correlation-optimization
Open

ENH: OoC optimizations for NeighborOrientationCorrelation and Morphological Filters#1554
joeykleingers wants to merge 10 commits intoBlueQuartzSoftware:developfrom
joeykleingers:worktree-neighbor-orientation-correlation-optimization

Conversation

@joeykleingers
Copy link
Contributor

@joeykleingers joeykleingers commented Feb 28, 2026

Summary

Optimizes all 5 Group C (Multi-Iteration Morphological / Neighbor Replacement) filters for out-of-core performance using Z-slice rolling buffers and sequential per-array data transfer.

Algorithm: Z-Slice Rolling Buffer

A 3-slice rolling buffer holds the current and adjacent Z-slices in local std::vectors. All 6 face-neighbor reads come from RAM buffers instead of the OOC DataStore. Single shared codepath for both in-core and OOC — no DispatchAlgorithm needed. The buffers are lightweight enough (~2-64 MB depending on dataset) that they benefit or are neutral to in-core performance.

Per-iteration sweep:

  1. Pre-read Z-slices 0 and 1 into buffer slots
  2. For each Z from 0 to dimZ-1: read next slice, process all (y, x) from buffers, rotate slots
  3. After the full Z-sweep, apply recorded changes to all cell data arrays

Data transfer optimization: Changed from ParallelTaskAlgorithm (multiple arrays processed concurrently, causing chunk evictions) to sequential per-array-per-voxel processing. Each array's chunks stay cached while all voxels are processed before moving to the next array.

Per-Filter Changes

NeighborOrientationCorrelation

  • Z-slice rolling buffer for quaternions (4-comp float32), phases (int32), and confidence index (float32)
  • Removed dead neighborDiffCount computation (computed but never read)
  • Pre-reads all neighbor quats/phases before pairwise comparison loop
  • Store bestNeighbor as const reference in transfer functor (avoids 61 MB per-task copy)
  • Clear bestNeighbor between cleanup levels to eliminate stale entries

ErodeDilateBadData

  • Z-slice rolling buffer for FeatureIds
  • Transfer changed from ParallelTaskAlgorithm to sequential per-array loop

ErodeDilateCoordinationNumber

  • Z-slice rolling buffer for FeatureIds with in-place update
  • Fixed DataArrayCopyTupleFunctor by-value DataArray copy to use reference
  • Removed redundant second computeValidFaceNeighbors call

ErodeDilateMask

  • Z-slice rolling buffer for both mask and maskCopy arrays
  • Replaced std::vector<bool> (bit-packed) with std::vector<uint8> for direct byte access

ReplaceElementAttributesWithNeighborValues

  • Z-slice rolling buffer for input comparison array
  • Transfer reordered from per-voxel-per-array to per-array-per-voxel
  • Added null-check guard for dynamic_cast<IDataArray&> on AttributeMatrix children

Code Quality Fixes

  • Removed Doxygen comments from .cpp files (documentation belongs in .hpp headers only)
  • Fixed single-line Doxygen to multi-line format on destructor
  • Fixed wrong @class tags in algorithm headers (ErodeDilateBadData, ErodeDilateCoordinationNumber, ErodeDilateMask)
  • Fixed misleading "highest similarity count" comment to "positive similarity count"
  • Renamed benchmark constants from kDimX to k_DimX convention (all 5 test files)

Performance (200x200x200 benchmarks)

Filter IC Before IC After IC Speedup OOC Before OOC After OOC Speedup
ErodeDilateBadData 1.37s 0.31s 4.42x 25.09s 15.57s 1.61x
ErodeDilateCoordinationNumber 0.35s 0.22s 1.59x 12.43s 3.13s 3.97x
ErodeDilateMask 0.13s 0.11s 1.18x 6.43s 3.27s 1.97x
ReplaceElementAttributesWithNeighborValues 0.35s 0.24s 1.46x 6.05s 4.90s 1.23x
NeighborOrientationCorrelation 2.19s 1.57s 1.39x 67.94s 17.24s 3.94x

Why This Algorithm Was Chosen

  1. Single shared codepath: Z-slice buffers are small enough that they don't hurt in-core performance. No need for separate in-core/OOC algorithm classes or DispatchAlgorithm.
  2. Eliminates neighbor chunk thrashing: All 6-neighbor reads come from RAM buffers, regardless of chunk boundaries.
  3. Sequential transfer prevents cache eviction: Processing one array at a time keeps that array's chunks in the 6-slot FIFO cache.
  4. Minimal memory overhead: Buffers are O(3 Z-slices), well under 64 MB for even 1000x1000 datasets.

Tradeoffs

Aspect Original Optimized (Z-Slice Buffer)
In-core speed Baseline 1.18x-4.42x faster (buffer reads are cache-friendly + bug fixes)
OOC speed 3x-110x slowdown 1.23x-3.97x speedup over baseline OOC
RAM usage O(1) per-voxel accesses O(3 Z-slices) buffer
Code complexity Simple neighbor reads Buffer management + slot rotation
Maintenance Single code path Single code path (no dispatch)

Optimization Ceiling Analysis

The filter-level optimizations are at their ceiling. The remaining OOC overhead is dominated by per-element operator[] overhead in ZarrStore. Each getValue()/setValue() call acquires and releases a mutex lock (std::lock_guard<std::mutex>), plus performs a per-element Zarr chunk lookup (converting flat index to N-D chunk position, scanning the 6-slot FIFO cache, indexing into the chunk). Per-element overhead: ~55-75ns vs ~1ns for in-core DataStore (raw pointer access). This is intrinsic to the ZarrStore implementation and cannot be avoided by any filter-level algorithm change.

Infrastructure-level changes that would improve further:

  1. Bulk read/write API on AbstractDataStore: Add getValues(startIndex, count, T* dest) and setValues(startIndex, count, const T* src) virtual methods. DataStore implements with std::memcpy. ZarrStore implements with a single mutex lock around a bulk loop — eliminating millions of redundant mutex lock/unlock cycles in the transfer phases.

  2. Chunk-level bulk transfer in FileCore: A truly bulk Zarr API that reads/writes contiguous ranges directly from/to the chunk's internal buffer via memcpy, bypassing the per-element chunk lookup entirely. Estimated 3-5x additional improvement on transfer phases. Requires changes deep inside the FileCore library's IArray/Block classes.

  3. Larger or per-array FIFO cache: The current 6-slot global FIFO means sequential per-array transfer is required to avoid cache thrashing. Per-array cache isolation or a larger cache would allow parallel array processing again.

These are infrastructure-level changes that would benefit all OOC-optimized filters, not just this group.

Test Plan

  • In-core unit tests pass for all 5 filters (7/7 tests)
  • Out-of-core unit tests pass for all 5 filters (7/7 tests, verified via "chunk shape:" output)
  • Results match exemplar data identically on both configurations
  • 200x200x200 benchmark test cases added for all 5 filters
  • Post-review-fix benchmarks confirm no regressions (small improvements across the board)

@joeykleingers joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch 2 times, most recently from b9c5baf to 2bca843 Compare February 28, 2026 19:44
@joeykleingers joeykleingers enabled auto-merge (squash) March 2, 2026 06:52
@joeykleingers joeykleingers marked this pull request as draft March 4, 2026 18:38
auto-merge was automatically disabled March 4, 2026 18:38

Pull request was converted to draft

@joeykleingers joeykleingers changed the title ENH: Optimize NeighborOrientationCorrelation for out-of-core performance WIP: ENH: Optimize NeighborOrientationCorrelation for out-of-core performance Mar 4, 2026
@joeykleingers joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch 4 times, most recently from 27b764c to eb7b727 Compare March 4, 2026 21:18
@joeykleingers joeykleingers changed the title WIP: ENH: Optimize NeighborOrientationCorrelation for out-of-core performance ENH: Optimize Group C filters for out-of-core performance Mar 4, 2026
@imikejackson imikejackson changed the title ENH: Optimize Group C filters for out-of-core performance ENH: OoC optimizations for NeighborOrientationCorrelation and Morphological Filters Mar 4, 2026
Replace random-access data reads with Z-slice buffering that maintains a
rolling window of 3 adjacent Z-slices for quaternion and phase arrays,
plus 1 slice for confidence index. This eliminates repeated chunk
decompression in OOC mode where arrays are stored as compressed Zarr
chunks on disk. Also removed dead neighborDiffCount computation and
pre-read all neighbor data before pairwise comparisons.

Filter-only performance (Small_IN100 dataset, 189x201x117 voxels):
- In-core:      3044 ms -> 2360 ms (1.29x speedup)
- Out-of-core: 88166 ms -> 20452 ms (4.31x speedup)

Added Doxygen documentation to all methods in all filter files. Updated
unit test with PreferencesSentinel to force OOC behavior.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Add 3-slice rolling buffer for face-neighbor lookups in:
- ErodeDilateBadData: Buffer FeatureIds
- ErodeDilateMask: Buffer mask + maskCopy
- ErodeDilateCoordinationNumber: Buffer FeatureIds (with in-place update)
- ReplaceElementAttributesWithNeighborValues: Buffer input comparison array

All neighbor reads now come from RAM buffers instead of the OOC store,
eliminating chunk thrashing during the per-voxel neighbor scan phase.
…lters

Add PreferencesSentinel to force ZarrStore-backed arrays in all existing
test cases, and add 200x200x200 benchmark test cases for performance
measurement.

Filters updated:
- ErodeDilateBadData (Erode + Dilate)
- ErodeDilateMask (Dilate + Erode)
- ErodeDilateCoordinationNumber
- ReplaceElementAttributesWithNeighborValues
@joeykleingers joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch from caa307b to 69fb0ae Compare March 5, 2026 15:23
@joeykleingers joeykleingers marked this pull request as ready for review March 5, 2026 16:16
Copy link
Contributor Author

@joeykleingers joeykleingers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR 1554 Code Review

Source: review of commit 69fb0ae. Issues grouped by severity.


Memory / Lifetime Issues

  • m_BestNeighbor copies entire vector per parallel task
    File: NeighborOrientationCorrelation.cpp (NeighborOrientationCorrelationTransferDataImpl)
    The member std::vector<int64> m_BestNeighbor stores the bestNeighbor vector by value. Since ParallelTaskAlgorithm creates one functor instance per DataArray, each task copies the full vector. For a 200³ dataset that's ~61 MB per task copy. Consider storing a const std::vector<int64>& instead — the outer scope's bestNeighbor outlives all tasks since parallelTask.execute() blocks until completion.

CPU / Algorithm Efficiency

  • bestNeighbor vector not cleared between cleanup levels
    File: NeighborOrientationCorrelation.cpp (operator())
    bestNeighbor is initialized to -1 once but never reset between the currentLevel iterations. A voxel whose CI was raised above the threshold in a prior level retains its stale bestNeighbor entry, causing a redundant (but idempotent) copy in the transfer phase. Adding std::fill(bestNeighbor.begin(), bestNeighbor.end(), -1) at the top of each level would eliminate these redundant copies. (Pre-existing behavior preserved by this PR.)

  • std::vector<bool> bit-packing overhead in tight inner loop
    File: ErodeDilateMask.cpp (operator())
    maskSlices and maskCopySlices use std::vector<bool>, which stores bits rather than bytes. Each access requires bit extraction/insertion. For 200×200 per-slice sequential access in a tight 3-deep loop, std::vector<uint8_t> would give direct byte access at the cost of ~8× more memory per slice (~40 KB → ~320 KB, still negligible). (Deferred — profile first to see if this is measurable.)

  • DataArrayCopyTupleFunctor copies DataArray by value
    File: ErodeDilateCoordinationNumber.cpp (DataArrayCopyTupleFunctor::operator())
    DataArrayType outputArray = dynamic_cast<DataArrayType&>(outputIDataArray); creates a by-value copy of the DataArray. Should be DataArrayType& outputArray. (Pre-existing bug surfaced by this PR.)

  • computeValidFaceNeighbors called twice per voxel
    File: ErodeDilateCoordinationNumber.cpp (operator())
    The second call (for the featureCount reset loop) computes the same result as the first call. Remove the second call. (Pre-existing behavior preserved by this PR.)

Naming Consistency

  • Benchmark constants use kDimX instead of k_DimX
    Files: All 5 test files (benchmark TEST_CASEs)
    Block-scoped constexpr constants like kDimX, kDimY, kDimZ, kBlockSize, kBlocksPerDim, kTotalVoxels use the k prefix without underscore. The project convention is k_ prefix for constants (e.g., k_DimX).

Const-Correctness

No issues found.

Readability

  • Comment says "highest similarity count" but code picks last positive
    File: NeighborOrientationCorrelation.cpp (operator(), best-neighbor selection loop)
    The comment reads "Find the best neighbor (last valid face with highest similarity count)" but the code selects the last face neighbor with any neighborSimCount > 0, not the one with the highest count. The comment should say "last valid face with positive similarity count." (Pre-existing algorithm behavior.)

UX / Human Interface Guidelines

No UI changes in this PR.

Robustness / Defensive

  • dynamic_cast<IDataArray&> may throw for non-array AttributeMatrix children
    File: ReplaceElementAttributesWithNeighborValues.cpp (ExecuteTemplate::operator())
    If any child of the AttributeMatrix is not an IDataArray, this throws std::bad_cast. A guard using dynamic_cast<IDataArray*> with a null check + continue would be more defensive. (Pre-existing code, not introduced by this PR.)

Bugs

No bugs found.

Documentation

  • Doxygen comments added to .cpp files — skill requires .hpp only
    Files: NeighborOrientationCorrelation.cpp, NeighborOrientationCorrelationFilter.cpp
    The doxygen-comments convention states: "Do NOT add comments to .cpp files -- only .hpp headers." This PR adds 4 Doxygen blocks to NeighborOrientationCorrelation.cpp (class NeighborOrientationCorrelationTransferDataImpl, its constructor, its operator(), and the main algorithm operator()()) and 11 Doxygen blocks to NeighborOrientationCorrelationFilter.cpp (one per method: name(), className(), uuid(), humanName(), defaultTags(), parameters(), parametersVersion(), clone(), preflightImpl(), executeImpl(), FromSIMPLJson()). All .cpp-file Doxygen should be removed; the documentation belongs on the declarations in the corresponding .hpp headers.

  • Single-line Doxygen format on destructor
    File: NeighborOrientationCorrelation.hpp (destructor)
    /** @brief Default destructor. */ uses single-line format. Convention requires multi-line format for ALL Doxygen comments:

    /**
     * @brief Default destructor.
     */
  • Inconsistent Doxygen coverage across modified algorithm files
    The NeighborOrientationCorrelation .hpp files received comprehensive Doxygen. The other 4 algorithm .hpp files (ErodeDilateBadData.hpp, ErodeDilateCoordinationNumber.hpp, ErodeDilateMask.hpp, ReplaceElementAttributesWithNeighborValues.hpp) were not modified by this PR so this is not blocking — but the .cpp implementations were substantially rewritten. If the intent is to document these algorithms, the Doxygen should go on the .hpp declarations.

  • Pre-existing wrong @class tag in ErodeDilateBadData.hpp
    File: ErodeDilateBadData.hpp
    The existing Doxygen reads @class ConditionalSetValueFilter but the class is ErodeDilateBadData. (Pre-existing bug surfaced by this PR.)

  • No user-facing documentation changes needed — confirmed. Filter behavior (inputs, outputs, results) is unchanged; only internal algorithm performance was optimized.

Confirmed Correct (no action needed)

  • Z-slice buffer rotation and write-back timing in ErodeDilateMask — slot 0 (z-1) is written back only after z has been fully processed, ensuring erode modifications from current z are captured. Final slice write-back handles both dims[2]==1 and dims[2]>1 correctly.

  • FeatureIds processed last in ErodeDilateBadData transfer — sequential transfer correctly processes all non-FeatureIds arrays first, then FeatureIds last, because the conditional check depends on original unmodified FeatureIds values.

  • In-place buffer update after copyTuple in ErodeDilateCoordinationNumberfeatureIdSlices[1][inSlice] = featureIds[voxelIndex] correctly re-reads the modified voxel from the backing store into the buffer, ensuring subsequent neighbor lookups see the updated FeatureId.

  • Sequential per-array transfer pattern — All 5 filters correctly changed from parallel multi-array transfer to sequential per-array transfer, preventing OOC chunk cache thrashing.

  • currentLevel = currentLevel - 1 double-decrement — The for-loop decrement plus the explicit decrement produces a net decrement of 2 per iteration (6→4→2), matching the original algorithm's cleanup level progression.

  • Benchmark tests properly reload from .dream3d — All 5 benchmarks write to disk then reload via LoadDataStructure, forcing ZarrStore backing when OOC is enabled. Temp file cleanup via fs::remove() is correct.

Adds AlgorithmDispatch.hpp include and ForceOocAlgorithmGuard(GENERATE(false))
to all Group C morphological filter test cases. This ensures in-core tests only
run the in-core path and OOC tests only run the OOC path.
Replace ParallelTaskAlgorithm transfer in ErodeDilateBadData with
sequential per-array loop, and reorder ReplaceElementAttributes transfer
from per-voxel-per-array to per-array-per-voxel. Processing one array
at a time prevents multiple arrays' chunks from evicting each other in
the OOC cache. ErodeDilateBadData OOC benchmark improved from 21s to 16s.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Group C filters use a single shared algorithm (Z-slice rolling buffer)
for both in-core and OOC modes. DispatchAlgorithm is not used, so
ForceOocAlgorithmGuard + GENERATE is unnecessary — the same code path
is always tested regardless of build configuration.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Adds a dedicated benchmark test case with programmatic block-based EBSD
data (25^3 grain blocks, low CI at boundaries, cubic crystal structure).
Benchmark results: IC 2.19s→1.74s (1.26x), OOC 67.94s→18.62s (3.65x).

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
- Store bestNeighbor as const reference in transfer functor to avoid
  61 MB per-task vector copies
- Clear bestNeighbor between cleanup levels to eliminate stale entries
- Fix DataArrayCopyTupleFunctor by-value DataArray copy to use reference
- Remove redundant computeValidFaceNeighbors call in coordination number
- Replace std::vector<bool> with std::vector<uint8> in ErodeDilateMask
- Add null-check guard for dynamic_cast<IDataArray&> in
  ReplaceElementAttributesWithNeighborValues
- Rename benchmark constants from kDimX to k_DimX convention (all 5 tests)
- Remove Doxygen comments from .cpp files (belongs in .hpp only)
- Fix single-line Doxygen to multi-line format on destructor
- Fix wrong @Class tags in algorithm headers
- Fix misleading "highest similarity count" comment
@joeykleingers joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch 2 times, most recently from 8db9373 to 3d2b969 Compare March 5, 2026 19:05
@joeykleingers joeykleingers force-pushed the worktree-neighbor-orientation-correlation-optimization branch from 3d2b969 to 3a20273 Compare March 5, 2026 19:16
Remove dead RUN_TASK macros and unused faceNeighborInternalIdx variable
from NeighborOrientationCorrelation. Use getDataRefAs instead of getDataAs
in ErodeDilateBadData. Mark getCancel() const across all four SimplnxCore
algorithm classes. Add Doxygen for operator() in algorithm headers.
Improve transfer-strategy comments for OOC rationale.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants