ENH: OoC optimizations for CCL/Segmentation filters #1557

Closed
joeykleingers wants to merge 3 commits into BlueQuartzSoftware:develop from joeykleingers:worktree-FilterOOCOptimizations

@joeykleingers joeykleingers commented Mar 4, 2026

Depends on: #1545 (AlgorithmDispatch infrastructure) — merge #1545 first, then rebase this PR onto develop.

Summary

Optimize 5 CCL/Segmentation filters for out-of-core performance using BFS/CCL algorithm dispatch:

  • IdentifySample: Split into BFS flood fill (in-core) and scanline CCL with union-find (OOC, 2-slice rolling buffer)
  • FillBadData: Split into BFS (in-core) and CCL with on-disk deferred fill (OOC, O(slice) memory)
  • ScalarSegmentFeatures: Add executeCCL() to shared SegmentFeatures base with 2-slice rolling buffer + union-find
  • EBSDSegmentFeatures: CCL dispatch via isValidVoxel/areNeighborsSimilar overrides
  • CAxisSegmentFeatures: CCL dispatch via isValidVoxel/areNeighborsSimilar overrides

All filters use DispatchAlgorithm from #1545 to select the optimal path at runtime. Tests updated with ForceOocAlgorithmGuard + GENERATE(false, true) to exercise both algorithm paths, plus 200x200x200 benchmark test cases.

Algorithm Details

Original Algorithm: DFS Flood-Fill

All five filters used a depth-first search (DFS) flood-fill to find connected components:

  1. Scan forward to find the next unlabeled valid voxel (the "seed")
  2. Push the seed onto a stack, pop voxels, check 6 neighbors via determineGrouping()
  3. If a neighbor matches, push it onto the stack
  4. Repeat until the stack is empty (entire component labeled)
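
The four steps above can be sketched as follows. This is a minimal illustration, not the filters' code: `labelComponentsDfs` and its parameters are hypothetical names, the grid is a flat array with X varying fastest, and a plain validity flag stands in for the per-filter `determineGrouping()` check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: labels 6-connected components of valid voxels.
// Returns one label per voxel; 0 means "unlabeled / invalid".
std::vector<int32_t> labelComponentsDfs(const std::vector<uint8_t>& valid, std::size_t dimX, std::size_t dimY, std::size_t dimZ)
{
  assert(valid.size() == dimX * dimY * dimZ);
  const std::size_t sliceSize = dimX * dimY;
  std::vector<int32_t> labels(valid.size(), 0);
  std::vector<std::size_t> stack;
  int32_t nextLabel = 0;

  // Step 1: scan forward for the next unlabeled valid voxel (the seed).
  for(std::size_t seed = 0; seed < valid.size(); seed++)
  {
    if(valid[seed] == 0 || labels[seed] != 0)
    {
      continue;
    }
    labels[seed] = ++nextLabel;
    stack.push_back(seed);

    // Steps 2-4: pop a voxel, test its 6 neighbors, push the matches.
    while(!stack.empty())
    {
      const std::size_t idx = stack.back();
      stack.pop_back();
      const std::size_t x = idx % dimX;
      const std::size_t y = (idx / dimX) % dimY;
      const std::size_t z = idx / sliceSize;

      // The neighbor index is only dereferenced when inBounds is true, so an
      // unsigned wrap-around in the argument is never used.
      auto visit = [&](bool inBounds, std::size_t neighbor) {
        if(inBounds && valid[neighbor] != 0 && labels[neighbor] == 0)
        {
          labels[neighbor] = nextLabel;
          stack.push_back(neighbor);
        }
      };
      visit(x > 0, idx - 1);
      visit(x + 1 < dimX, idx + 1);
      visit(y > 0, idx - dimX);
      visit(y + 1 < dimY, idx + dimX);
      visit(z > 0, idx - sliceSize);
      visit(z + 1 < dimZ, idx + sliceSize);
    }
  }
  return labels;
}
```

Note that the `idx - sliceSize` and `idx + sliceSize` hops jump a full Z-slice away from the current voxel, which is the access pattern that hurts on chunked storage.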

Why it's slow OOC: Each stack pop accesses an arbitrary voxel, then reads 6 scattered neighbors. With chunked storage, each jump may evict the current chunk and load a new one from disk, causing 50x-621x slowdown.

Optimized Algorithm: Chunk-Sequential CCL with Union-Find

A two-phase scanline algorithm that processes the grid in strict Z-Y-X order, never accessing data out of sequence.

Phase 1 — Forward Labeling with Rolling Buffer

  • Iterate every voxel in Z-Y-X order, checking only backward neighbors (-X, -Y, -Z)
  • A rolling 2-slice buffer (size = 2 × dimX × dimY) holds labels for current and previous Z-slices
  • Union-Find tracks label equivalences with path-halving compression and union-by-rank
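
A rough sketch of Phase 1, assuming a flat X-fastest grid and a plain validity flag in place of the per-filter similarity test. `UnionFindSketch` and `forwardLabel` are hypothetical names, not the contents of UnionFind.hpp or `executeCCL()`; the `out` vector stands in for the on-disk feature-ID array, which is only ever written sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical minimal union-find; label 0 is reserved for "no label".
class UnionFindSketch
{
public:
  UnionFindSketch() : m_Parent{0}, m_Rank{0} {}

  int32_t makeSet()
  {
    m_Parent.push_back(static_cast<int32_t>(m_Parent.size()));
    m_Rank.push_back(0);
    return m_Parent.back();
  }

  int32_t find(int32_t x)
  {
    while(m_Parent[x] != x)
    {
      m_Parent[x] = m_Parent[m_Parent[x]]; // path halving: point at the grandparent
      x = m_Parent[x];
    }
    return x;
  }

  void unite(int32_t a, int32_t b)
  {
    a = find(a);
    b = find(b);
    if(a == b) { return; }
    if(m_Rank[a] < m_Rank[b]) { std::swap(a, b); } // union by rank
    m_Parent[b] = a;
    if(m_Rank[a] == m_Rank[b]) { m_Rank[a]++; }
  }

private:
  std::vector<int32_t> m_Parent;
  std::vector<uint8_t> m_Rank;
};

// Forward labeling: only the -X, -Y and -Z neighbors are consulted, and only
// two Z-slices of labels are held in memory at any time.
UnionFindSketch forwardLabel(const std::vector<uint8_t>& valid, std::size_t dimX, std::size_t dimY, std::size_t dimZ, std::vector<int32_t>& out)
{
  assert(valid.size() == dimX * dimY * dimZ);
  const std::size_t sliceSize = dimX * dimY;
  std::vector<int32_t> buffer(2 * sliceSize, 0); // rolling 2-slice buffer
  UnionFindSketch uf;
  out.assign(valid.size(), 0);

  for(std::size_t z = 0; z < dimZ; z++)
  {
    int32_t* cur = buffer.data() + (z % 2) * sliceSize;
    const int32_t* prev = buffer.data() + ((z + 1) % 2) * sliceSize;
    for(std::size_t y = 0; y < dimY; y++)
    {
      for(std::size_t x = 0; x < dimX; x++)
      {
        const std::size_t inSlice = y * dimX + x;
        int32_t label = 0;
        if(valid[z * sliceSize + inSlice] != 0)
        {
          const int32_t backward[3] = {x > 0 ? cur[inSlice - 1] : 0, y > 0 ? cur[inSlice - dimX] : 0, z > 0 ? prev[inSlice] : 0};
          for(int32_t neighbor : backward)
          {
            if(neighbor == 0) { continue; }
            if(label == 0) { label = neighbor; }
            else { uf.unite(label, neighbor); }
          }
          if(label == 0) { label = uf.makeSet(); } // no labeled backward neighbor
        }
        cur[inSlice] = label; // also clears stale labels from two slices ago
        out[z * sliceSize + inSlice] = label;
      }
    }
  }
  return uf;
}
```

On a U-shaped 3x2x1 input the two arms first receive provisional labels 1 and 2, and the union-find records that they are equivalent once the scan reaches the voxel that joins them.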

Phase 2 — Resolution and Relabeling

  1. Flatten the Union-Find (single O(K) pass)
  2. Sequential pass to build provisional-to-final label mapping (preserves seed-discovery order)
  3. Sequential pass to replace provisional labels with final IDs
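
Steps 2 and 3 might look like the sketch below. It assumes step 1 has already produced a flattened table `root`, where `root[p]` is the representative of provisional label `p` and `root[0] == 0`; `relabelInPlace` is a hypothetical name, and an in-memory vector stands in for the feature-ID array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rewrites provisional labels to final IDs and returns the feature count.
// Final IDs are handed out in the order components first appear in the scan,
// which is the same order a seed-scanning flood fill would discover them.
int32_t relabelInPlace(std::vector<int32_t>& labels, const std::vector<int32_t>& root)
{
  std::vector<int32_t> finalId(root.size(), 0);
  int32_t nextId = 0;

  // Sequential pass 1: build the provisional-to-final mapping.
  for(int32_t provisional : labels)
  {
    if(provisional == 0)
    {
      continue;
    }
    assert(static_cast<std::size_t>(provisional) < root.size());
    int32_t& id = finalId[root[provisional]];
    if(id == 0)
    {
      id = ++nextId;
    }
  }

  // Sequential pass 2: replace every provisional label with its final ID.
  for(int32_t& label : labels)
  {
    if(label != 0)
    {
      label = finalId[root[label]];
    }
  }
  return nextId;
}
```

Both passes touch the label array strictly front to back, so they keep the sequential I/O pattern of Phase 1.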

Dispatch Strategy

Each filter checks IsOutOfCore(*featureIdsArray) || ForceOocAlgorithm():

  • True → executeCCL() (chunk-sequential CCL)
  • False → execute() (original DFS flood-fill)
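
The shape of that check, and of the scoped override the tests use, can be illustrated with stand-in types. Everything below is hypothetical: `ArraySketch`, `forceOocFlag`, `ForceOocGuardSketch` and `selectPath` are not the AlgorithmDispatch.hpp API from #1545, whose real signatures are not shown in this PR.

```cpp
#include <cassert>
#include <string>

// Stand-in for a data array that knows whether it lives in chunked storage.
struct ArraySketch
{
  bool storedOutOfCore = false;
};

// Stand-in for the global "force the OOC algorithm" switch.
bool& forceOocFlag()
{
  static bool s_Force = false;
  return s_Force;
}

// RAII guard: sets the switch for one scope and restores the old value on exit,
// so a test can exercise both paths on in-memory data.
class ForceOocGuardSketch
{
public:
  explicit ForceOocGuardSketch(bool force)
  : m_Previous(forceOocFlag())
  {
    forceOocFlag() = force;
  }
  ~ForceOocGuardSketch()
  {
    forceOocFlag() = m_Previous;
  }
  ForceOocGuardSketch(const ForceOocGuardSketch&) = delete;
  ForceOocGuardSketch& operator=(const ForceOocGuardSketch&) = delete;

private:
  bool m_Previous;
};

// Returns the name of the path that would run for this array.
std::string selectPath(const ArraySketch& featureIds)
{
  return (featureIds.storedOutOfCore || forceOocFlag()) ? "executeCCL" : "execute";
}
```

With a guard like this, a test parameterized over `false` and `true` covers both algorithms without needing a chunked data store.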

Per-Filter Notes

| Filter | isValidVoxel() | areNeighborsSimilar() |
|---|---|---|
| ScalarSegmentFeatures | Mask check | Type-dispatched CompareFunctor::compare() (11 data types) |
| EBSDSegmentFeatures | Mask + phase > 0 | Same phase + quaternion misorientation via LaueOps |
| CAxisSegmentFeatures | Mask + phase > 0 | Same phase + c-axis angle (handles directional ambiguity) |

  • IdentifySample: Uses base CCL + optional hole-filling phase
  • FillBadData: Own 4-phase CCL: negative labels for bad-data, Union-Find, size classification, iterative morphological dilation with on-disk deferred fill
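
To show how two overrides let one CCL loop serve several filters, here is a sketch with a scalar-tolerance comparison. The class names, the index-based signatures, and `shouldUnite` are assumptions made for illustration; the real `isValidVoxel`/`areNeighborsSimilar` signatures in SegmentFeatures are not given in this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical hook interface a shared CCL loop could call.
class SegmentHooksSketch
{
public:
  virtual ~SegmentHooksSketch() = default;
  virtual bool isValidVoxel(std::size_t index) const = 0;
  virtual bool areNeighborsSimilar(std::size_t a, std::size_t b) const = 0;
};

// Scalar variant: valid if the mask is set, similar if values are within a tolerance.
class ScalarHooksSketch final : public SegmentHooksSketch
{
public:
  ScalarHooksSketch(std::vector<uint8_t> mask, std::vector<float> values, float tolerance)
  : m_Mask(std::move(mask))
  , m_Values(std::move(values))
  , m_Tolerance(tolerance)
  {
    assert(m_Mask.size() == m_Values.size());
  }

  bool isValidVoxel(std::size_t index) const override
  {
    return m_Mask[index] != 0;
  }

  bool areNeighborsSimilar(std::size_t a, std::size_t b) const override
  {
    return std::abs(m_Values[a] - m_Values[b]) <= m_Tolerance;
  }

private:
  std::vector<uint8_t> m_Mask;
  std::vector<float> m_Values;
  float m_Tolerance;
};

// The question the labeling pass asks before merging a voxel with a backward neighbor.
bool shouldUnite(const SegmentHooksSketch& hooks, std::size_t voxel, std::size_t backwardNeighbor)
{
  return hooks.isValidVoxel(voxel) && hooks.isValidVoxel(backwardNeighbor) && hooks.areNeighborsSimilar(voxel, backwardNeighbor);
}
```

An EBSD or c-axis variant would keep the same two entry points and swap the comparison for the misorientation or angle test listed in the table.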

Tradeoffs

| Aspect | Original DFS | Optimized CCL |
|---|---|---|
| In-core speed | Excellent (good cache locality) | Good (~5-10% overhead from buffer management) |
| OOC speed | Catastrophic (50x-621x slowdown) | Excellent (strictly sequential I/O) |
| RAM usage | All arrays must fit in RAM | Rolling buffer = O(2 slices) + Union-Find = O(features) |
| Code complexity | Simple DFS loop (~70 lines) | Three-phase algorithm + Union-Find (~300 lines in base class) |
| Feature ID ordering | Deterministic seed-discovery order | Matches DFS ordering via Phase 2 remapping |
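
For a sense of scale on the RAM row, the arithmetic for the 200x200x200 benchmark grid is worked below. It assumes 4-byte (`int32_t`) labels and counts only the label storage, not the Union-Find tables or the input arrays; the function names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to hold one label per voxel for the whole volume.
constexpr std::size_t fullLabelBytes(std::size_t dimX, std::size_t dimY, std::size_t dimZ, std::size_t bytesPerLabel)
{
  return dimX * dimY * dimZ * bytesPerLabel;
}

// Bytes needed for the rolling buffer: 2 x dimX x dimY labels.
constexpr std::size_t rollingBufferBytes(std::size_t dimX, std::size_t dimY, std::size_t bytesPerLabel)
{
  return 2 * dimX * dimY * bytesPerLabel;
}

static_assert(fullLabelBytes(200, 200, 200, sizeof(int32_t)) == 32000000, "whole volume: 32 MB of labels");
static_assert(rollingBufferBytes(200, 200, sizeof(int32_t)) == 320000, "two slices: 320 KB of labels");
```

The ratio between the two is dimZ / 2, so the saving grows with the depth of the volume.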

Performance (200x200x200 programmatic datasets)

Per-Filter Results

IdentifySample — BFS (in-core) / scanline CCL with union-find (OOC)

| Config | Before | After | Speedup |
|---|---|---|---|
| In-core | 0.23s | 0.16s (BFS) | 1.4x |
| OOC | 841s | 4.14s (CCL) | 203x |

ScalarSegmentFeatures — base executeCCL() with type-dispatched comparator

| Config | Before (DFS) | After (CCL) | Speedup |
|---|---|---|---|
| In-core | 0.36s | 0.23s | 1.6x |
| OOC | >1500s (timeout) | 12.9s | >115x |

EBSDSegmentFeatures — base executeCCL() with quaternion misorientation

| Config | Before (DFS) | After (CCL) | Speedup |
|---|---|---|---|
| In-core | 0.77s | 0.62s | 1.2x |
| OOC | >1500s (timeout) | 35.9s | >42x |

CAxisSegmentFeatures — base executeCCL() with c-axis angle

| Config | Before (DFS) | After (CCL) | Speedup |
|---|---|---|---|
| In-core | 0.60s | 0.55s | ~1.1x |
| OOC | >1500s (timeout) | 32.8s | >46x |

FillBadData — BFS (in-core) / 4-phase CCL with on-disk deferred fill (OOC)

| Config | Before (BFS) | After (CCL) | Speedup | Notes |
|---|---|---|---|---|
| In-core | 0.18s | 0.28s | 0.6x | CCL adds ~0.1s overhead; BFS still used for in-core |
| OOC | 6.02s | 6.05s | ~1.0x | Equivalent speed, O(slice) RAM instead of O(N) |

FillBadData's OOC baseline was already fast (6s), so the optimization is primarily RAM reduction (O(N) → O(slice)) rather than speed.

Group Summary

| Filter | OOC Speedup | In-Core Impact | Key Benefit |
|---|---|---|---|
| IdentifySample | 203x | 1.4x faster | Eliminated random access flood-fill |
| ScalarSegmentFeatures | >115x | 1.6x faster | Chunk-sequential CCL replaces random DFS |
| EBSDSegmentFeatures | >42x | 1.2x faster | Same CCL base, misorientation math dominates |
| CAxisSegmentFeatures | >46x | ~1.1x faster | Same CCL base, c-axis angle math dominates |
| FillBadData | ~1.0x | 0.6x slower (CCL) | RAM: O(N) → O(slice); BFS still used in-core |

Test Plan

  • All existing correctness tests pass on both in-core and OOC configurations
  • Both BFS and CCL algorithm paths tested via ForceOocAlgorithmGuard + GENERATE(false, true)
  • 200x200x200 benchmark tests pass on both configurations
  • OOC verified via "chunk shape:" printouts in verbose test output

Add reusable AlgorithmDispatch.hpp utility with IsOutOfCore(),
AnyOutOfCore(), ForceOocAlgorithm(), ForceOocAlgorithmGuard, and
DispatchAlgorithm<InCore, OOC>() so filters can dispatch to separate
in-core and out-of-core algorithm implementations at runtime.

Includes documentation in docs/AlgorithmDispatch.md.

No filters are using this infrastructure yet — it is provided as
reusable scaffolding for future OOC optimization work.

Consolidate OOC filter optimizations from identify-sample-optimizations worktree:

- Add AlgorithmDispatch.hpp and UnionFind.hpp utilities
- SegmentFeatures: Add executeCCL() with 2-slice rolling buffer + Union-Find
- ScalarSegmentFeatures: CCL dispatch + CompareFunctor::compare()
- EBSDSegmentFeatures: CCL dispatch + isValidVoxel/areNeighborsSimilar
- CAxisSegmentFeatures: CCL dispatch + isValidVoxel/areNeighborsSimilar
- Tests: PreferencesSentinel, ForceOocAlgorithmGuard, 200^3 benchmarks

Update IdentifySample and FillBadData to use the AlgorithmDispatch
BFS/CCL split pattern instead of monolithic inlined algorithms.
Add BFS/CCL split files, update tests with ForceOocAlgorithmGuard,
PreferencesSentinel, and 200x200x200 benchmark test cases.
@imikejackson changed the title from "ENH: OOC optimizations for Group D (CCL/Segmentation filters)" to "ENH: OOC optimizations for CCL/Segmentation filters" on Mar 4, 2026
@imikejackson changed the title from "ENH: OOC optimizations for CCL/Segmentation filters" to "ENH: OoC optimizations for CCL/Segmentation filters" on Mar 4, 2026
@joeykleingers deleted the worktree-FilterOOCOptimizations branch on March 5, 2026 01:10
@joeykleingers

Reopening this under a different PR.
