ENH: OoC optimizations for NeighborOrientationCorrelation and Morphological Filters#1554
Conversation
b9c5baf to
2bca843
Compare
Pull request was converted to draft
27b764c to
eb7b727
Compare
Replace random-access data reads with Z-slice buffering that maintains a rolling window of 3 adjacent Z-slices for quaternion and phase arrays, plus 1 slice for confidence index. This eliminates repeated chunk decompression in OOC mode where arrays are stored as compressed Zarr chunks on disk. Also removed dead neighborDiffCount computation and pre-read all neighbor data before pairwise comparisons. Filter-only performance (Small_IN100 dataset, 189x201x117 voxels): - In-core: 3044 ms -> 2360 ms (1.29x speedup) - Out-of-core: 88166 ms -> 20452 ms (4.31x speedup) Added Doxygen documentation to all methods in all filter files. Updated unit test with PreferencesSentinel to force OOC behavior. Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Add 3-slice rolling buffer for face-neighbor lookups in: - ErodeDilateBadData: Buffer FeatureIds - ErodeDilateMask: Buffer mask + maskCopy - ErodeDilateCoordinationNumber: Buffer FeatureIds (with in-place update) - ReplaceElementAttributesWithNeighborValues: Buffer input comparison array All neighbor reads now come from RAM buffers instead of the OOC store, eliminating chunk thrashing during the per-voxel neighbor scan phase.
…lters Add PreferencesSentinel to force ZarrStore-backed arrays in all existing test cases, and add 200x200x200 benchmark test cases for performance measurement. Filters updated: - ErodeDilateBadData (Erode + Dilate) - ErodeDilateMask (Dilate + Erode) - ErodeDilateCoordinationNumber - ReplaceElementAttributesWithNeighborValues
caa307b to
69fb0ae
Compare
There was a problem hiding this comment.
PR 1554 Code Review
Source: review of commit
69fb0ae. Issues grouped by severity.
Memory / Lifetime Issues
-
m_BestNeighborcopies entire vector per parallel task
File:NeighborOrientationCorrelation.cpp(NeighborOrientationCorrelationTransferDataImpl)
The memberstd::vector<int64> m_BestNeighborstores thebestNeighborvector by value. SinceParallelTaskAlgorithmcreates one functor instance per DataArray, each task copies the full vector. For a 200³ dataset that's ~61 MB per task copy. Consider storing aconst std::vector<int64>&instead — the outer scope'sbestNeighboroutlives all tasks sinceparallelTask.execute()blocks until completion.
CPU / Algorithm Efficiency
-
bestNeighborvector not cleared between cleanup levels
File:NeighborOrientationCorrelation.cpp(operator())
bestNeighboris initialized to -1 once but never reset between thecurrentLeveliterations. A voxel whose CI was raised above the threshold in a prior level retains its stalebestNeighborentry, causing a redundant (but idempotent) copy in the transfer phase. Addingstd::fill(bestNeighbor.begin(), bestNeighbor.end(), -1)at the top of each level would eliminate these redundant copies. (Pre-existing behavior preserved by this PR.) -
std::vector<bool>bit-packing overhead in tight inner loop
File:ErodeDilateMask.cpp(operator())
maskSlicesandmaskCopySlicesusestd::vector<bool>, which stores bits rather than bytes. Each access requires bit extraction/insertion. For 200×200 per-slice sequential access in a tight 3-deep loop,std::vector<uint8_t>would give direct byte access at the cost of ~8× more memory per slice (~40 KB → ~320 KB, still negligible). (Deferred — profile first to see if this is measurable.) -
DataArrayCopyTupleFunctorcopies DataArray by value
File:ErodeDilateCoordinationNumber.cpp(DataArrayCopyTupleFunctor::operator())
DataArrayType outputArray = dynamic_cast<DataArrayType&>(outputIDataArray);creates a by-value copy of the DataArray. Should beDataArrayType& outputArray. (Pre-existing bug surfaced by this PR.) -
computeValidFaceNeighborscalled twice per voxel
File:ErodeDilateCoordinationNumber.cpp(operator())
The second call (for the featureCount reset loop) computes the same result as the first call. Remove the second call. (Pre-existing behavior preserved by this PR.)
Naming Consistency
- Benchmark constants use
kDimXinstead ofk_DimX
Files: All 5 test files (benchmark TEST_CASEs)
Block-scopedconstexprconstants likekDimX,kDimY,kDimZ,kBlockSize,kBlocksPerDim,kTotalVoxelsuse thekprefix without underscore. The project convention isk_prefix for constants (e.g.,k_DimX).
Const-Correctness
No issues found.
Readability
- Comment says "highest similarity count" but code picks last positive
File:NeighborOrientationCorrelation.cpp(operator(), best-neighbor selection loop)
The comment reads "Find the best neighbor (last valid face with highest similarity count)" but the code selects the last face neighbor with anyneighborSimCount > 0, not the one with the highest count. The comment should say "last valid face with positive similarity count." (Pre-existing algorithm behavior.)
UX / Human Interface Guidelines
No UI changes in this PR.
Robustness / Defensive
-
dynamic_cast<IDataArray&>may throw for non-array AttributeMatrix children
File:ReplaceElementAttributesWithNeighborValues.cpp(ExecuteTemplate::operator())
If any child of the AttributeMatrix is not an IDataArray, this throwsstd::bad_cast. A guard usingdynamic_cast<IDataArray*>with a null check +continuewould be more defensive. (Pre-existing code, not introduced by this PR.)
Bugs
No bugs found.
Documentation
-
Doxygen comments added to
.cppfiles — skill requires.hpponly
Files:NeighborOrientationCorrelation.cpp,NeighborOrientationCorrelationFilter.cpp
Thedoxygen-commentsconvention states: "Do NOT add comments to.cppfiles -- only.hppheaders." This PR adds 4 Doxygen blocks toNeighborOrientationCorrelation.cpp(classNeighborOrientationCorrelationTransferDataImpl, its constructor, itsoperator(), and the main algorithmoperator()()) and 11 Doxygen blocks toNeighborOrientationCorrelationFilter.cpp(one per method:name(),className(),uuid(),humanName(),defaultTags(),parameters(),parametersVersion(),clone(),preflightImpl(),executeImpl(),FromSIMPLJson()). All.cpp-file Doxygen should be removed; the documentation belongs on the declarations in the corresponding.hppheaders. -
Single-line Doxygen format on destructor
File:NeighborOrientationCorrelation.hpp(destructor)
/** @brief Default destructor. */uses single-line format. Convention requires multi-line format for ALL Doxygen comments:/** * @brief Default destructor. */
-
Inconsistent Doxygen coverage across modified algorithm files
The NeighborOrientationCorrelation.hppfiles received comprehensive Doxygen. The other 4 algorithm.hppfiles (ErodeDilateBadData.hpp,ErodeDilateCoordinationNumber.hpp,ErodeDilateMask.hpp,ReplaceElementAttributesWithNeighborValues.hpp) were not modified by this PR so this is not blocking — but the.cppimplementations were substantially rewritten. If the intent is to document these algorithms, the Doxygen should go on the.hppdeclarations. -
Pre-existing wrong
@classtag inErodeDilateBadData.hpp
File:ErodeDilateBadData.hpp
The existing Doxygen reads@class ConditionalSetValueFilterbut the class isErodeDilateBadData. (Pre-existing bug surfaced by this PR.) -
No user-facing documentation changes needed — confirmed. Filter behavior (inputs, outputs, results) is unchanged; only internal algorithm performance was optimized.
Confirmed Correct (no action needed)
-
Z-slice buffer rotation and write-back timing in ErodeDilateMask — slot 0 (z-1) is written back only after z has been fully processed, ensuring erode modifications from current z are captured. Final slice write-back handles both dims[2]==1 and dims[2]>1 correctly.
-
FeatureIds processed last in ErodeDilateBadData transfer — sequential transfer correctly processes all non-FeatureIds arrays first, then FeatureIds last, because the conditional check depends on original unmodified FeatureIds values.
-
In-place buffer update after
copyTuplein ErodeDilateCoordinationNumber —featureIdSlices[1][inSlice] = featureIds[voxelIndex]correctly re-reads the modified voxel from the backing store into the buffer, ensuring subsequent neighbor lookups see the updated FeatureId. -
Sequential per-array transfer pattern — All 5 filters correctly changed from parallel multi-array transfer to sequential per-array transfer, preventing OOC chunk cache thrashing.
-
currentLevel = currentLevel - 1double-decrement — The for-loop decrement plus the explicit decrement produces a net decrement of 2 per iteration (6→4→2), matching the original algorithm's cleanup level progression. -
Benchmark tests properly reload from .dream3d — All 5 benchmarks write to disk then reload via
LoadDataStructure, forcing ZarrStore backing when OOC is enabled. Temp file cleanup viafs::remove()is correct.
Adds AlgorithmDispatch.hpp include and ForceOocAlgorithmGuard(GENERATE(false)) to all Group C morphological filter test cases. This ensures in-core tests only run the in-core path and OOC tests only run the OOC path.
Replace ParallelTaskAlgorithm transfer in ErodeDilateBadData with sequential per-array loop, and reorder ReplaceElementAttributes transfer from per-voxel-per-array to per-array-per-voxel. Processing one array at a time prevents multiple arrays' chunks from evicting each other in the OOC cache. ErodeDilateBadData OOC benchmark improved from 21s to 16s. Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Group C filters use a single shared algorithm (Z-slice rolling buffer) for both in-core and OOC modes. DispatchAlgorithm is not used, so ForceOocAlgorithmGuard + GENERATE is unnecessary — the same code path is always tested regardless of build configuration. Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Adds a dedicated benchmark test case with programmatic block-based EBSD data (25^3 grain blocks, low CI at boundaries, cubic crystal structure). Benchmark results: IC 2.19s→1.74s (1.26x), OOC 67.94s→18.62s (3.65x). Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
- Store bestNeighbor as const reference in transfer functor to avoid 61 MB per-task vector copies - Clear bestNeighbor between cleanup levels to eliminate stale entries - Fix DataArrayCopyTupleFunctor by-value DataArray copy to use reference - Remove redundant computeValidFaceNeighbors call in coordination number - Replace std::vector<bool> with std::vector<uint8> in ErodeDilateMask - Add null-check guard for dynamic_cast<IDataArray&> in ReplaceElementAttributesWithNeighborValues - Rename benchmark constants from kDimX to k_DimX convention (all 5 tests) - Remove Doxygen comments from .cpp files (belongs in .hpp only) - Fix single-line Doxygen to multi-line format on destructor - Fix wrong @Class tags in algorithm headers - Fix misleading "highest similarity count" comment
8db9373 to
3d2b969
Compare
3d2b969 to
3a20273
Compare
Remove dead RUN_TASK macros and unused faceNeighborInternalIdx variable from NeighborOrientationCorrelation. Use getDataRefAs instead of getDataAs in ErodeDilateBadData. Mark getCancel() const across all four SimplnxCore algorithm classes. Add Doxygen for operator() in algorithm headers. Improve transfer-strategy comments for OOC rationale.
Summary
Optimizes all 5 Group C (Multi-Iteration Morphological / Neighbor Replacement) filters for out-of-core performance using Z-slice rolling buffers and sequential per-array data transfer.
Algorithm: Z-Slice Rolling Buffer
A 3-slice rolling buffer holds the current and adjacent Z-slices in local
std::vectors. All 6 face-neighbor reads come from RAM buffers instead of the OOC DataStore. Single shared codepath for both in-core and OOC — no DispatchAlgorithm needed. The buffers are lightweight enough (~2-64 MB depending on dataset) that they benefit or are neutral to in-core performance.Per-iteration sweep:
Data transfer optimization: Changed from
ParallelTaskAlgorithm(multiple arrays processed concurrently, causing chunk evictions) to sequential per-array-per-voxel processing. Each array's chunks stay cached while all voxels are processed before moving to the next array.Per-Filter Changes
NeighborOrientationCorrelation
neighborDiffCountcomputation (computed but never read)ErodeDilateBadData
ParallelTaskAlgorithmto sequential per-array loopErodeDilateCoordinationNumber
DataArrayCopyTupleFunctorby-value DataArray copy to use referencecomputeValidFaceNeighborscallErodeDilateMask
std::vector<bool>(bit-packed) withstd::vector<uint8>for direct byte accessReplaceElementAttributesWithNeighborValues
dynamic_cast<IDataArray&>on AttributeMatrix childrenCode Quality Fixes
.cppfiles (documentation belongs in.hppheaders only)@classtags in algorithm headers (ErodeDilateBadData, ErodeDilateCoordinationNumber, ErodeDilateMask)kDimXtok_DimXconvention (all 5 test files)Performance (200x200x200 benchmarks)
Why This Algorithm Was Chosen
Tradeoffs
Optimization Ceiling Analysis
The filter-level optimizations are at their ceiling. The remaining OOC overhead is dominated by per-element
operator[]overhead inZarrStore. EachgetValue()/setValue()call acquires and releases a mutex lock (std::lock_guard<std::mutex>), plus performs a per-element Zarr chunk lookup (converting flat index to N-D chunk position, scanning the 6-slot FIFO cache, indexing into the chunk). Per-element overhead: ~55-75ns vs ~1ns for in-coreDataStore(raw pointer access). This is intrinsic to theZarrStoreimplementation and cannot be avoided by any filter-level algorithm change.Infrastructure-level changes that would improve further:
Bulk read/write API on
AbstractDataStore: AddgetValues(startIndex, count, T* dest)andsetValues(startIndex, count, const T* src)virtual methods.DataStoreimplements withstd::memcpy.ZarrStoreimplements with a single mutex lock around a bulk loop — eliminating millions of redundant mutex lock/unlock cycles in the transfer phases.Chunk-level bulk transfer in FileCore: A truly bulk Zarr API that reads/writes contiguous ranges directly from/to the chunk's internal buffer via
memcpy, bypassing the per-element chunk lookup entirely. Estimated 3-5x additional improvement on transfer phases. Requires changes deep inside the FileCore library'sIArray/Blockclasses.Larger or per-array FIFO cache: The current 6-slot global FIFO means sequential per-array transfer is required to avoid cache thrashing. Per-array cache isolation or a larger cache would allow parallel array processing again.
These are infrastructure-level changes that would benefit all OOC-optimized filters, not just this group.
Test Plan