
kaizen: fix duplicate states during closure traversal #494

Open
sayrer wants to merge 20 commits into timbray:main from sayrer:dedup_fix

Conversation

@sayrer
Contributor

@sayrer sayrer commented Feb 16, 2026

Here's what dedup_fix does relative to main:

Problem: During FA merges (especially with shell-style * patterns),
the same smallTable can end up referenced by multiple faState
nodes. When epsilon closures are computed, these duplicate table
pointers cause the same logical state to appear multiple times,
leading to redundant work during NFA traversal.

Solution — table-pointer dedup in epsilon closure
(epsi_closure.go):

  • Adds a tableSeen map[*smallTable]bool to closureBuffers, tracked
    alongside the existing closureSet (which is keyed by *faState).
  • In traverseEpsilons, before adding a non-epsilon-only state to the
    closure, it checks whether that state's *smallTable has already
    been seen. If so, the state is skipped as a duplicate.
  • This eliminates duplicate NFA states at the source — the closure
    itself is smaller, so every downstream consumer benefits.

Consequence — removal of runtime dedup in traverseNFA (nfa.go):

  • The len(nextStates) > 500 sort+compact block is removed. This was
    a safety valve for "toxically-complex regexps" where epsilon
    loops caused huge nextStates buildups. With duplicates eliminated
    at closure-computation time, this is no longer needed.
  • The cmp and slices imports are dropped accordingly.

In short: instead of deduplicating at traversal time
(expensive, per-event), this branch deduplicates at build time
(once, when patterns are added) by recognizing that distinct
faState pointers sharing the same smallTable are logically
identical.
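The mechanism can be sketched in a few lines of Go. This is a minimal, self-contained illustration, not quamina's actual code: smallTable, faState, and closureBuffers here are hypothetical stand-ins carrying only the fields the dedup needs.

```go
package main

import "fmt"

// Hypothetical stand-ins for quamina's internals; the real types carry
// transitions, epsilon lists, and field matchers.
type smallTable struct{ id int }
type faState struct{ table *smallTable }

type closureBuffers struct {
	closureSet map[*faState]bool    // existing: dedup by state pointer
	tableSeen  map[*smallTable]bool // added: dedup by table pointer
}

func traverseEpsilons(bufs *closureBuffers, s *faState, closure *[]*faState) {
	if bufs.closureSet[s] {
		return // already visited this exact state
	}
	bufs.closureSet[s] = true
	// Two distinct faState pointers sharing one *smallTable are treated
	// as logically identical, so the second one is skipped here.
	if bufs.tableSeen[s.table] {
		return
	}
	bufs.tableSeen[s.table] = true
	*closure = append(*closure, s)
	// (the real code would also recurse into s's epsilon transitions)
}

func main() {
	t := &smallTable{id: 1}
	a, b := &faState{table: t}, &faState{table: t}
	bufs := &closureBuffers{
		closureSet: map[*faState]bool{},
		tableSeen:  map[*smallTable]bool{},
	}
	var closure []*faState
	traverseEpsilons(bufs, a, &closure)
	traverseEpsilons(bufs, b, &closure)
	fmt.Println(len(closure)) // b shares a's table, so only one entry
}
```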

    Timing (sec/op)                                                                                                                                                                                                                        
  ┌──────────────────────────────────────────┬─────────┬───────────┬──────────────────────────────────────┐                                                                                                                              
  │                Benchmark                 │  main   │ dedup_fix │                change                │                                                                                                                              
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤                                                                                                                              
  │ CityLots                                 │ 5.878µs │ 5.774µs   │ -1.78% (p=0.002)                     │                                                                                                                              
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤                                                                                                                              
  │ JsonFlattener_LastField                  │ 9.648µs │ 9.428µs   │ -2.27% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_LastField         │ 9.777µs │ 9.535µs   │ -2.47% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_ContextFields     │ 431.4ns │ 425.6ns   │ -1.33% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ JsonFlattener_Evaluate_MiddleNestedField │ 9.154µs │ 9.029µs   │ -1.37% (p=0.002)                     │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ ShellstyleMultiMatch                     │ 32.94µs │ 32.49µs   │ ~1.4% faster (not stat. significant) │
  ├──────────────────────────────────────────┼─────────┼───────────┼──────────────────────────────────────┤
  │ Others                                   │         │           │ no significant change                │
  └──────────────────────────────────────────┴─────────┴───────────┴──────────────────────────────────────┘
  
  Memory: CityLots dropped from 15 to 14 B/op (-6.67%). Everything
  else unchanged. Zero allocations across the board (as before).

  Summary: Statistically significant 1.3-2.5% speedups across most
  benchmarks, with the geomean improvement at -0.66%. The gains are
  modest but consistent — the table-pointer dedup is eliminating
  redundant work in epsilon closures. The biggest wins are in the
  flattener benchmarks that exercise deeper automata. No regressions
  anywhere.

sayrer and others added 6 commits February 13, 2026 12:25
Nested quantifiers like (([abc]?)*)+ create epsilon loops that cause
duplicate states to compound exponentially. The previous fix used
sort+compact at a threshold of 500 states. This replaces it with an
O(n) in-place dedup using a per-traversal generation counter, lowering
the threshold to 64 to catch growth earlier. Zero overhead for the
common case since the dedup only activates when nextStates exceeds 64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
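The generation-counter dedup in this commit can be sketched as follows. Types and names are hypothetical stand-ins; quamina's real faState and traversal loop carry much more.

```go
package main

import "fmt"

// Hypothetical minimal state: a gen field records the traversal
// generation in which the state was last kept.
type faState struct {
	name string
	gen  uint64
}

var generation uint64 // bumped once per traversal; no map, no clearing

// dedupInPlace keeps the first occurrence of each state and compacts
// the slice in place: O(n), zero allocation.
func dedupInPlace(states []*faState) []*faState {
	generation++
	kept := states[:0]
	for _, s := range states {
		if s.gen != generation {
			s.gen = generation
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	a, b := &faState{name: "a"}, &faState{name: "b"}
	states := []*faState{a, b, a, a, b}
	states = dedupInPlace(states)
	fmt.Println(len(states)) // 2
}
```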
…in epsilon closure

Two faState pointers can share the same *smallTable (e.g. the + quantifier
wraps an inner state's table in a new faState). During epsilon closure, both
get added as distinct states producing duplicate destinations that compound
exponentially. Dedup by tracking seen tables in closureBuffers, skipping
states whose table was already visited. This removes the need for the
runtime generation-counter dedup in traverseNFA.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nter dedup

Replace tableSeen map in closureBuffers with a global generation counter
(tableSeenGeneration) and a per-smallTable tableSeenGen field, eliminating
the map allocation and clear overhead.

Cache the active level's set pointer in transmap.activeSet on push(),
removing the tm.levels[tm.depth].set indirection from the hot path in
transmap.add(). This was causing a ~7% regression in ShellstyleMultiMatch
due to the slice index + struct field dereference on every call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
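The activeSet caching described above can be sketched like this, with hypothetical stand-in types for transmap and fieldMatcher:

```go
package main

import "fmt"

type fieldMatcher struct{ name string }

type level struct{ set map[*fieldMatcher]bool }

type transmap struct {
	levels    []level
	depth     int
	activeSet map[*fieldMatcher]bool // cached on push()
}

func (tm *transmap) push() {
	tm.levels = append(tm.levels, level{set: map[*fieldMatcher]bool{}})
	tm.depth = len(tm.levels) - 1
	// Resolve the slice index + field dereference once, not per add().
	tm.activeSet = tm.levels[tm.depth].set
}

func (tm *transmap) add(fm *fieldMatcher) {
	tm.activeSet[fm] = true // hot path: no tm.levels[tm.depth].set chase
}

func main() {
	tm := &transmap{}
	tm.push()
	tm.add(&fieldMatcher{name: "x"})
	fmt.Println(len(tm.activeSet)) // 1
}
```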
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve transmap struct conflict: take main's flat [][]*fieldMatcher
design with separate fieldSet on nfaBuffers, consistent with the
auto-resolved traverseNFA and n2dNode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timbray
Owner

timbray commented Feb 17, 2026

Have the benchmark outputs been updated following all those additional pushes? Because they're kind of disappointing as written.

sayrer and others added 2 commits February 17, 2026 15:44
Track the representative faState per smallTable instead of just a bool.
When a second state shares the same table, compare fieldTransitions:
if they match, skip (fast path); if they differ, include the state
in the closure to preserve correctness.

Add TestTablePointerDedupPreservesFieldTransitions to cover the case
where two faState nodes share a smallTable with different field
matchers, verifying epsilon closure, NFA traversal, and DFA conversion
all preserve both matchers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
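The representative-state defense can be sketched like this (hypothetical types; quamina's actual fieldTransitions comparison may differ):

```go
package main

import "fmt"

type smallTable struct{ id int }
type fieldMatcher struct{ name string }

type faState struct {
	table            *smallTable
	fieldTransitions []*fieldMatcher
}

func sameTransitions(a, b []*fieldMatcher) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// addToClosure reports whether s must be kept in the closure, tracking a
// representative faState per smallTable instead of just a bool.
func addToClosure(tableRep map[*smallTable]*faState, s *faState) bool {
	rep, seen := tableRep[s.table]
	if !seen {
		tableRep[s.table] = s
		return true
	}
	// Same table, same transitions: logically identical, safe to skip.
	if sameTransitions(rep.fieldTransitions, s.fieldTransitions) {
		return false
	}
	// Invariant violated: include the state to preserve correctness.
	return true
}

func main() {
	t := &smallTable{}
	fm := &fieldMatcher{name: "a"}
	rep := map[*smallTable]*faState{}
	s1 := &faState{table: t, fieldTransitions: []*fieldMatcher{fm}}
	s2 := &faState{table: t, fieldTransitions: []*fieldMatcher{fm}}
	s3 := &faState{table: t} // different (empty) transitions
	fmt.Println(addToClosure(rep, s1), addToClosure(rep, s2), addToClosure(rep, s3)) // true false true
}
```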
Explain in closureBuffers why tableRep exists, what invariant it
relies on (same *smallTable implies same state in current merge
paths), and how the defense works when the invariant is violated
(include the state, skip recursion). Add function-level doc to
traverseEpsilons describing the dedup and fallback behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

Have the benchmark outputs been updated following all those additional pushes? Because they're kind of disappointing as written.

Yeah, they aren't good. But they do show we're in the right department, and there is a new test at least. I was looking in the context of DFA conversion https://github.com/sayrer/quamina/tree/lazy_dfa. (I will keep hunting tomorrow, but negative results are valuable too)

sayrer and others added 5 commits February 17, 2026 17:46
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merges 12 multi-wildcard shellstyle + 6 pathological regexp patterns on
the same field, exercising the table-pointer dedup in traverseEpsilons.
Shows 69% speedup vs main (27.7µs → 8.5µs per op).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies match results for the merged shell-style + regexp pattern mix
that exercises table-pointer dedup. Ground truth confirmed identical on
both main and dedup_fix branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the removed >500 sort+compact with a map-based dedup that fires
when nextStates exceeds 64 entries. This prevents combinatorial blowup
when many closure states step to the same next faState, without adding
overhead in the common case.

Also adds heavy-pattern correctness and timeout tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestBreak500Limit exercises 2925 overlapping wildcard patterns with
varied input strategies. TestMeasureNextStates instruments per-byte
NFA traversal to measure state expansion and dedup ratios. Both skip
under -short due to pattern build time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

  ┌──────────────────────┬─────────┬───────────┬────────────────────────┐                                                                                    
  │      Benchmark       │  main   │ dedup_fix │         change         │                                                                                    
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤                                                                                    
  │ CityLots             │ 5.937µs │ 5.774µs   │ -2.75% (p=0.026)       │                                                                                    
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤                                                                                    
  │ 8259Example          │ 5.631µs │ 5.660µs   │ ~ (p=0.093, no change) │                                                                                    
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤                                                                                    
  │ ShellstyleMultiMatch │ 32.17µs │ 32.30µs   │ ~ (no change)          │                                                                                    
  ├──────────────────────┼─────────┼───────────┼────────────────────────┤                                                                                    
  │ PathologicalEpsilon  │ 27.70µs │ 8.54µs    │ -69.18% (p=0.002)      │
  └──────────────────────┴─────────┴───────────┴────────────────────────┘

TestBreak500Limit and TestMeasureNextStates build 2925 patterns,
which is too slow for CI under the race detector. Run them with
go test -tags stress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 18, 2026

Got to this summary for BenchmarkShellstyleMultiMatch, so I think we want to take some of these benchmarks with the ones in #492, and then revisit. I think this and any nfa2dfa approach will stack up. I don't think we need all the tests in this PR right now, just showing my work to see whether this matters, and I think it does.

"This benchmark only has ~25 patterns. The automaton is small enough that the epsilon closures are trivially sized — no shared-table dedup opportunities.

The table-pointer dedup is targeting a specific scenario: after many merges, spinner states create splice states pointing to the same underlying tables. With 25 patterns there aren't enough merges to create significant duplication.

To find a real win, we'd need a benchmark with hundreds of merged shell-style patterns on the same field. That's what lazy_dfa's ShellstyleManyMatchers does — and it showed gains at higher pattern counts. This branch doesn't have
that benchmark though.

The honest conclusion: the table-pointer dedup on dedup_fix is a correctness/safety improvement (with the defense) but doesn't produce measurable throughput gains on any existing benchmark. The build-time closure optimization is
real but the closures in these tests are already small. Want to add a stress benchmark, or is this branch good as-is?"

@timbray
Owner

timbray commented Feb 18, 2026 via email

sayrer and others added 3 commits February 19, 2026 15:31
Tests closure sizes and matching speed across workloads with nested
quantifier regexps and overlapping character classes, where the
table-pointer dedup has the most impact.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 20, 2026

Now we're getting somewhere with this test I added called TestEpsilonClosureSizes:

  ┌──────────────────────────┬────────────────────────┬─────────────────────────┬────────────┬─────────────┬────────────────────┬─────────────────────┬─────────┐
  │         Workload         │ Closure Entries (main) │ Closure Entries (dedup) │ Max (main) │ Max (dedup) │ Matches/sec (main) │ Matches/sec (dedup) │ Speedup │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ 6-regexp + 12-shell      │ 4,392                  │ 4,371                   │ 19         │ 13          │ 240K               │ 380K                │ 1.6x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ 20 nested regexps        │ 334                    │ 261                     │ 60         │ 40          │ 1.1M               │ 3.4M                │ 3.1x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ deeply nested            │ 342                    │ 220                     │ 46         │ 26          │ 681K               │ 3.2M                │ 4.7x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ overlapping char classes │ 240                    │ 156                     │ 48         │ 24          │ 1.6M               │ 4.5M                │ 2.9x    │
  ├──────────────────────────┼────────────────────────┼─────────────────────────┼────────────┼─────────────┼────────────────────┼─────────────────────┼─────────┤
  │ shell + deep overlap     │ 3,488                  │ 3,410                   │ 37         │ 21          │ 204K               │ 297K                │ 1.5x    │
  └──────────────────────────┴────────────────────────┴─────────────────────────┴────────────┴─────────────┴────────────────────┴─────────────────────┴─────────┘

  The deeply-nested workload ((((a?)*b?)*c?)* patterns) shows the biggest win:
  max closure drops from 46→26, closure entries from 342→220 (36% fewer), and
  matching is 4.7x faster. The overlapping-char-classes workload ((([abc]?)*)+,
  (([bcd]?)*)+, etc.) is nearly 3x faster with max closure cut in half (48→24).

  The pattern that triggers dedup_fix most strongly is nested quantifiers over
  overlapping character classes — (([abc]?)*)+ merged with (([bcd]?)*)+ creates
  many epsilon paths converging on shared tables.

Remove exploratory/proof-of-concept test files (dedup_500_test.go,
dedup_bench_test.go, dedup_measure_test.go, dedup_timeout_test.go).

Replace TestEpsilonClosureSizes with TestTablePointerDedup which
asserts max closure bounds and expected match counts. Add
BenchmarkTablePointerDedup using testing.B.Loop() for the same
workloads.

Retained: TestPathologicalCorrectness (dedup_correctness_test.go)
and TestTablePointerDedupPreservesFieldTransitions (epsi_closure_test.go).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sayrer
Contributor Author

sayrer commented Feb 20, 2026

Here are the comparison results (median of 3 runs, 3s benchtime each):

go test -bench=BenchmarkTablePointerDedup -benchtime=3s -count=3 -run='^$'

  ┌──────────────────────────┬──────────────┬───────────────────┬─────────┐
  │         Workload         │ main (ns/op) │ dedup_fix (ns/op) │ Speedup │                                                                                                                                                 
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ 6-regexps-12-shell       │        8,874 │             7,900 │   1.12x │                                                                                                                                                 
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤                                                                                                                                                 
  │ 20-nested-regexps        │        2,006 │               902 │   2.22x │                                                                                                                                                 
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤                                                                                                                                                 
  │ deeply-nested            │        3,349 │               941 │   3.56x │                                                                                                                                                 
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤                                                                                                                                                 
  │ overlapping-char-classes │        1,520 │               674 │   2.25x │
  ├──────────────────────────┼──────────────┼───────────────────┼─────────┤
  │ shell+deep-overlap       │       12,243 │            10,260 │   1.19x │
  └──────────────────────────┴──────────────┴───────────────────┴─────────┘


  All workloads are 0 allocs on both branches. The dedup_fix branch shows
  significant improvement on nested quantifier regexp workloads:

  - deeply-nested: 3.56x faster (3,349 → 941 ns/op)
  - 20-nested-regexps: 2.22x faster (2,006 → 902 ns/op)
  - overlapping-char-classes: 2.25x faster (1,520 → 674 ns/op)
  - Mixed workloads with shell patterns show more modest 12-19% gains since
    shell patterns have zero table sharing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestTablePointerDedup already covers the same patterns and verifies
both match correctness and structural properties.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@timbray
Owner

timbray commented Feb 22, 2026

This PR is starting to feel like buying a PC in the 1990s - wait a little while and next month's model will be way better. My Quamina time has been rather limited, so I haven't been keeping a close eye on this. Let me know when you think you've got it where you want it.

@timbray
Owner

timbray commented Feb 22, 2026

OK, I pulled down the latest and ran the (old-version) ShellStyleBuildTest and #494 slows down both the patterns/sec and events/sec by a lot. On my other fave, Benchmark8259, no difference. I guess I should try your newer benchmark too.

@timbray
Owner

timbray commented Feb 22, 2026

(Starting to think this is a hard problem.)

@sayrer
Contributor Author

sayrer commented Feb 22, 2026

  ┌──────────────┬─────────┬───────────┬──────────────┐                                                                                                                                                                     
  │              │  main   │ dedup_fix │    delta     │                                                                                                                                                                     
  ├──────────────┼─────────┼───────────┼──────────────┤                                                                                                                                                                     
  │ ns/op        │ ~47,700 │ ~116,100  │ +2.4x slower │                                                                                                                                                                     
  ├──────────────┼─────────┼───────────┼──────────────┤                                                                                                                                                                     
  │ B/op         │ ~797    │ 4,243     │ +5.3x        │                                                                                                                                                                     
  ├──────────────┼─────────┼───────────┼──────────────┤                                                                                                                                                                     
  │ allocs/op    │ 0       │ 4         │ +4           │                                                                                                                                                                     
  ├──────────────┼─────────┼───────────┼──────────────┤                                                                                                                                                                     
  │ patterns/sec │ ~3,150  │ ~2,460    │ -22%         │                                                                                                                                                                     
  ├──────────────┼─────────┼───────────┼──────────────┤                                                                                                                                                                     
  │ events/sec   │ ~41,900 │ ~17,200   │ -59%         │
  └──────────────┴─────────┴───────────┴──────────────┘
  The dedup_fix branch is about 2.4x slower on build time and ~59% slower on event matching in this benchmark.

@sayrer
Contributor Author

sayrer commented Feb 22, 2026

I'll look more tomorrow, here's the summary:

Two separate costs from the dedup_fix changes:

  1. traverseNFA nextStates dedup — the big one (~430ms new map cost)

In nfa.go, the old code on main used slices.SortFunc + slices.Compact (threshold >500) to deduplicate nextStates. dedup_fix replaced this with a map[*faState]bool (threshold >64). The profile shows:

  ┌──────────────────────────────────────┬───────────┬──────┐                                                                                                                                                               
  │                 Line                 │ dedup_fix │ main │                                                                                                                                                               
  ├──────────────────────────────────────┼───────────┼──────┤
  │ clear(stateSet) (307)                │ 40ms      │ n/a  │
  ├──────────────────────────────────────┼───────────┼──────┤
  │ if !stateSet[s] (310) — mapaccess    │ 160ms     │ n/a  │
  ├──────────────────────────────────────┼───────────┼──────┤
  │ stateSet[s] = true (311) — mapassign │ 230ms     │ n/a  │
  ├──────────────────────────────────────┼───────────┼──────┤
  │ Total map overhead                   │ ~430ms    │ 0    │
  └──────────────────────────────────────┴───────────┴──────┘

The old sort+compact on main doesn't even appear in the profile because it fires rarely (threshold >500). The new map approach fires at >64 and the per-access cost of map[*faState]bool (hashing pointers, probing
buckets) is substantial. The map is cleared every iteration of the outer byte loop, which is also expensive.

  2. closureForStateWithBufs — the tableRep map (~30ms)

The extra clear(bufs.tableRep) at line 80 adds ~30ms, and the tableRep lookups in traverseEpsilons add more. This is the smaller of the two costs.

  3. The irony: smaller closures → fewer table.step calls, but map overhead eats the savings

The dedup successfully reduces closure sizes (main spends 460ms in table.step vs dedup_fix's 230ms), but that 230ms saving is more than consumed by the 430ms of map operations in the nextStates dedup loop.

Bottom line: The regression is almost entirely from replacing sort+compact (rarely triggered at >500) with a map-based dedup (triggered at >64) in traverseNFA. The tableRep map in epsilon closure is a minor secondary
cost.

@sayrer
Contributor Author

sayrer commented Feb 22, 2026

❯ what if we change the threshold to 500 on dedup_fix as well?                                                                                                                                                              

That recovers it completely:                                                                                                                                                                                              
  ┌────────────┬─────────┬───────────────┬────────────────┐                                                                                                                                                                 
  │            │  main   │ dedup_fix @64 │ dedup_fix @500 │                                                                                                                                                                 
  ├────────────┼─────────┼───────────────┼────────────────┤                                                                                                                                                                 
  │ ns/op      │ ~48,000 │ ~116,000      │ ~49,000        │                                                                                                                                                                 
  ├────────────┼─────────┼───────────────┼────────────────┤                                                                                                                                                                 
  │ events/sec │ ~41,900 │ ~17,200       │ ~41,000        │                                                                                                                                                                 
  ├────────────┼─────────┼───────────────┼────────────────┤                                                                                                                                                                 
  │ B/op       │ ~797    │ 4,243         │ ~830           │                                                                                                                                                                 
  └────────────┴─────────┴───────────────┴────────────────┘
  Essentially at parity with main. The map-based dedup just never fires for this workload
  at the 500 threshold, same as the old sort+compact.
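The threshold guard can be sketched as follows (names are hypothetical stand-ins; quamina's traverseNFA differs in detail). The point is that typical events never pay any map cost because the guard short-circuits first:

```go
package main

import "fmt"

const dedupThreshold = 500 // was 64 in the regressing version

type faState struct{ id int }

// maybeDedup leaves nextStates untouched in the common case and only
// pays for map-based dedup when the slice grows past the threshold.
func maybeDedup(nextStates []*faState, stateSet map[*faState]bool) []*faState {
	if len(nextStates) <= dedupThreshold {
		return nextStates // common case: no map traffic at all
	}
	clear(stateSet)
	kept := nextStates[:0]
	for _, s := range nextStates {
		if !stateSet[s] {
			stateSet[s] = true
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	small := []*faState{{1}, {2}}
	fmt.Println(len(maybeDedup(small, map[*faState]bool{}))) // untouched: 2
	s := &faState{id: 0}
	big := make([]*faState, 0, 600)
	for i := 0; i < 600; i++ {
		big = append(big, s)
	}
	fmt.Println(len(maybeDedup(big, map[*faState]bool{}))) // deduped: 1
}
```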
