kaizen: fix duplicate states during closure traversal #494
sayrer wants to merge 20 commits into timbray:main
Conversation
Nested quantifiers like (([abc]?)*)+ create epsilon loops that cause duplicate states to compound exponentially. The previous fix used sort+compact at a threshold of 500 states. This replaces it with an O(n) in-place dedup using a per-traversal generation counter, lowering the threshold to 64 to catch growth earlier. Zero overhead for the common case since the dedup only activates when nextStates exceeds 64. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…in epsilon closure Two faState pointers can share the same *smallTable (e.g. the + quantifier wraps an inner state's table in a new faState). During epsilon closure, both get added as distinct states producing duplicate destinations that compound exponentially. Dedup by tracking seen tables in closureBuffers, skipping states whose table was already visited. This removes the need for the runtime generation-counter dedup in traverseNFA. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nter dedup Replace tableSeen map in closureBuffers with a global generation counter (tableSeenGeneration) and a per-smallTable tableSeenGen field, eliminating the map allocation and clear overhead. Cache the active level's set pointer in transmap.activeSet on push(), removing the tm.levels[tm.depth].set indirection from the hot path in transmap.add(). This was causing a ~7% regression in ShellstyleMultiMatch due to the slice index + struct field dereference on every call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve transmap struct conflict: take main's flat [][]*fieldMatcher design with separate fieldSet on nfaBuffers, consistent with the auto-resolved traverseNFA and n2dNode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Has the benchmark output been updated following all those additional pushes? Because the numbers are kind of disappointing as written.
Track the representative faState per smallTable instead of just a bool. When a second state shares the same table, compare fieldTransitions: if they match, skip (fast path); if they differ, include the state in the closure to preserve correctness. Add TestTablePointerDedupPreservesFieldTransitions to cover the case where two faState nodes share a smallTable with different field matchers, verifying epsilon closure, NFA traversal, and DFA conversion all preserve both matchers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain in closureBuffers why tableRep exists, what invariant it relies on (same *smallTable implies same state in current merge paths), and how the defense works when the invariant is violated (include the state, skip recursion). Add function-level doc to traverseEpsilons describing the dedup and fallback behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Yeah, they aren't good. But they do show we're in the right department, and there is a new test at least. I was looking in the context of DFA conversion https://github.com/sayrer/quamina/tree/lazy_dfa. (I will keep hunting tomorrow, but negative results are valuable too)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merges 12 multi-wildcard shellstyle + 6 pathological regexp patterns on the same field, exercising the table-pointer dedup in traverseEpsilons. Shows 69% speedup vs main (27.7µs → 8.5µs per op). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies match results for the merged shell-style + regexp pattern mix that exercises table-pointer dedup. Ground truth confirmed identical on both main and dedup_fix branches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the removed >500 sort+compact with a map-based dedup that fires when nextStates exceeds 64 entries. This prevents combinatorial blowup when many closure states step to the same next faState, without adding overhead in the common case. Also adds heavy-pattern correctness and timeout tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestBreak500Limit exercises 2925 overlapping wildcard patterns with varied input strategies. TestMeasureNextStates instruments per-byte NFA traversal to measure state expansion and dedup ratios. Both skip under -short due to pattern build time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestBreak500Limit and TestMeasureNextStates build 2925 patterns, which is too slow for CI under the race detector. Run them with go test -tags stress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Got to this summary for BenchmarkShellstyleMultiMatch, so I think we want to take some of these benchmarks with the ones in #492, and then revisit. I think this and any nfa2dfa approach will stack up:
"This benchmark only has ~25 patterns. The automaton is small enough that the epsilon closures are trivially sized — no shared-table dedup opportunities. The table-pointer dedup is targeting a specific scenario: after many merges, spinner states create splice states pointing to the same underlying tables. With 25 patterns there aren't enough merges to create significant duplication. To find a real win, we'd need a benchmark with hundreds of merged shell-style patterns on the same field. That's what lazy_dfa's ShellstyleManyMatchers does — and it showed gains at higher pattern counts. This branch doesn't have that benchmark though. The honest conclusion: the table-pointer dedup on dedup_fix is a correctness/safety improvement (with the defense) but doesn't produce measurable throughput gains on any existing benchmark. The build-time closure optimization is real but the closures in these tests are already small. Want to add a stress benchmark, or is this branch good as-is?"
The benchmark that most stresses out the epsilons is TestShellStyleBuildTime(); glance at it and you can see how it's easy to adjust the size/complexity up and down. The one that actually forced me to add that dedupe-at-500 count is TestToxicStack. -T
…On Feb 18, 2026 at 11:00:32 AM, RS ***@***.***> wrote:
*sayrer* left a comment (timbray/quamina#494)
<#494 (comment)>
Got to this summary for BenchmarkShellstyleMultiMatch, so I think we want to take some of these benchmarks with the ones in #492, and then revisit. I think this and any nfa2dfa approach will stack up:
"This benchmark only has ~25 patterns. The automaton is small enough that the epsilon closures are trivially sized — no shared-table dedup opportunities.
The table-pointer dedup is targeting a specific scenario: after many merges, spinner states create splice states pointing to the same underlying tables. With 25 patterns there aren't enough merges to create significant duplication.
To find a real win, we'd need a benchmark with hundreds of merged shell-style patterns on the same field. That's what lazy_dfa's ShellstyleManyMatchers does — and it showed gains at higher pattern counts. This branch doesn't have that benchmark though.
The honest conclusion: the table-pointer dedup on dedup_fix is a correctness/safety improvement (with the defense) but doesn't produce measurable throughput gains on any existing benchmark. The build-time closure optimization is real but the closures in these tests are already small. Want to add a stress benchmark, or is this branch good as-is?"
Tests closure sizes and matching speed across workloads with nested quantifier regexps and overlapping character classes, where the table-pointer dedup has the most impact. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Now we're getting somewhere with this test I added called
Remove exploratory/proof-of-concept test files (dedup_500_test.go, dedup_bench_test.go, dedup_measure_test.go, dedup_timeout_test.go). Replace TestEpsilonClosureSizes with TestTablePointerDedup which asserts max closure bounds and expected match counts. Add BenchmarkTablePointerDedup using testing.B.Loop() for the same workloads. Retained: TestPathologicalCorrectness (dedup_correctness_test.go) and TestTablePointerDedupPreservesFieldTransitions (epsi_closure_test.go). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Here are the comparison results (median of 3 runs, 3s benchtime each):
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestTablePointerDedup already covers the same patterns and verifies both match correctness and structural properties. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This PR is starting to feel like buying a PC in the 1990s - wait a little while and next month's model will be way better. My Quamina time has been rather limited, so I haven't been keeping a close eye on this. Let me know when you think you've got it where you want it.
OK, I pulled down the latest and ran the (old-version) ShellStyleBuildTest and #494 slows down both the patterns/sec and events/sec by a lot. On my other fave, Benchmark8259, no difference. I guess I should try your newer benchmark too.
(Starting to think this is a hard problem.)
I'll look more tomorrow, here's the summary. Two separate costs from the dedup_fix changes:
1. In nfa.go, the old code on main used slices.SortFunc + slices.Compact (threshold >500) to deduplicate nextStates. dedup_fix replaced this with a map[*faState]bool (threshold >64). The old sort+compact doesn't even appear in the profile because it fires rarely (threshold >500); the new map approach fires at >64, and the per-access cost of map[*faState]bool (hashing pointers, probing buckets) adds up.
2. The extra clear(bufs.tableRep) at line 80 adds ~30ms, and the tableRep lookups in traverseEpsilons add more. This is the smaller of the two costs.
The dedup successfully reduces closure sizes (main spends 460ms in table.step vs dedup_fix's 230ms), but that 230ms saving is more than consumed by the 430ms of map operations in the nextStates dedup loop. Bottom line: the regression is almost entirely from replacing sort+compact (rarely triggered at >500) with a map-based dedup (triggered at >64) in traverseNFA. The tableRep map in epsilon closure is a minor secondary cost.
Here's what dedup_fix does relative to main:
Problem: During FA merges (especially with shell-style * patterns), the same smallTable can end up referenced by multiple faState nodes. When epsilon closures are computed, these duplicate table pointers cause the same logical state to appear multiple times, leading to redundant work during NFA traversal.
Solution — table-pointer dedup in epsilon closure (epsi_closure.go):
- A set of already-seen *smallTable pointers is tracked alongside the existing closureSet (which is keyed by *faState).
- As each state is considered for the closure, it checks whether that state's *smallTable has already been seen. If so, the state is skipped as a duplicate.
- The computed closure itself is smaller, so every downstream consumer benefits.
Consequence — removal of runtime dedup in traverseNFA (nfa.go): the runtime dedup there was a safety valve for "toxically-complex regexps" where epsilon loops caused huge nextStates buildups. With duplicates eliminated at closure-computation time, this is no longer needed.
In short: instead of deduplicating at traversal time (expensive, per-event), this branch deduplicates at build time (once, when patterns are added) by recognizing that distinct faState pointers sharing the same smallTable are logically identical.