
Speed up nfa2dfa#500

Merged
timbray merged 7 commits into timbray:main from sayrer:intern_state_lists
Feb 28, 2026

Conversation


@sayrer sayrer commented Feb 28, 2026

These are the numbers for BenchmarkNfa2Dfa. This took a little bit of iteration, but it was fun. About 30-40 minutes, but the benchmarks were running at -count=6, so lots of waiting. All of the optimization ideas were good, which is frequently not the case.

I have to go watch the Lakers play the Warriors, so I might not respond to comments until tomorrow or Monday.

  Main vs Optimized                                                                                                                                                            
  ┌─────────────┬──────────────┬───────────────────┬─────────┬─────────────┬────────────┬─────────────────┐                                                                    
  │   Subcase   │ Main (ns/op) │ Optimized (ns/op) │ Speedup │ Main allocs │ Opt allocs │ Alloc reduction │                                                                    
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤                                                                    
  │ single_star │ 922k         │ 63k               │ 14.6x   │ 9,555       │ 115        │ -99%            │                                                                    
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤                                                                    
  │ two_stars   │ 1,966k       │ 151k              │ 13.0x   │ 21,279      │ 220        │ -99%            │                                                                    
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤
  │ three_stars │ 626k         │ 35k               │ 17.8x   │ 6,122       │ 75         │ -99%            │
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤
  │ five_stars  │ 879k         │ 48k               │ 18.3x   │ 8,566       │ 96         │ -99%            │
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤
  │ eight_stars │ 1,256k       │ 66k               │ 19.0x   │ 12,223      │ 123        │ -99%            │
  └─────────────┴──────────────┴───────────────────┴─────────┴─────────────┴────────────┴─────────────────┘

sayrer and others added 7 commits February 28, 2026 12:40
Replace the dedup map with a generation counter (reusing
faState.closureSetGen), and add reusable scratch buffers for the
sorted-uniques slice and key bytes. On cache hits the compiler's
string(bytes) map-lookup optimization avoids the key allocation
entirely. Only cache misses allocate (key string + stored slice).

Reduces nfa2Dfa allocations by ~35-40% and wall time by ~12%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
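The two techniques this commit describes can be sketched together. This is a minimal illustration, not quamina's actual code: `faState`, `deduper`, and the toy one-byte key are stand-ins. It shows (a) dedup via a generation counter stored on the state itself, so no per-call map is needed, and (b) a cache lookup written as `m[string(byteSlice)]`, which the Go compiler optimizes so the string conversion allocates only on a miss.

```go
package main

import "fmt"

// faState stands in for the NFA state type; seenGen plays the role
// of the reused closureSetGen generation-counter field.
type faState struct {
	id      int
	seenGen uint64
}

// deduper dedups state lists with a generation counter instead of a
// per-call map, and reuses a scratch slice across calls.
type deduper struct {
	gen     uint64
	scratch []*faState
	cache   map[string][]*faState
}

// key packs state IDs into a byte key (toy version: one byte per ID).
func key(states []*faState) []byte {
	buf := make([]byte, 0, len(states))
	for _, s := range states {
		buf = append(buf, byte(s.id))
	}
	return buf
}

func (d *deduper) intern(in []*faState) []*faState {
	// Bump the generation; a state already tagged with the new
	// generation has been kept during this call, so skip it.
	d.gen++
	d.scratch = d.scratch[:0] // reuse the scratch buffer
	for _, s := range in {
		if s.seenGen != d.gen {
			s.seenGen = d.gen
			d.scratch = append(d.scratch, s)
		}
	}
	k := key(d.scratch)
	// The compiler avoids allocating the string for this lookup on
	// a hit; only a miss pays for the stored key and slice.
	if v, ok := d.cache[string(k)]; ok {
		return v
	}
	stored := append([]*faState(nil), d.scratch...)
	d.cache[string(k)] = stored
	return stored
}

func main() {
	d := &deduper{cache: map[string][]*faState{}}
	a, b := &faState{id: 1}, &faState{id: 2}
	fmt.Println(len(d.intern([]*faState{a, b, a}))) // duplicate a removed
}
```

The generation counter makes "have I seen this state in this call?" an integer compare on the state itself, which is why no dedup map needs to be allocated or cleared.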
Instead of calling addByteStep (unpack, set one byte, repack) for
each of up to 256 byte values, unpack the DFA table once, set all
transitions into the unpacked table, then pack once at the end.

Also adds BenchmarkNfa2Dfa to measure the nfa2Dfa conversion cost
across patterns with varying wildcard counts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
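The batching idea here is generic: hoist an expensive unpack/pack pair out of a per-byte loop. A minimal sketch, with a hypothetical `packedTable` standing in for quamina's real packed DFA representation (the point is the call pattern, not the packing itself):

```go
package main

import "fmt"

// packedTable stands in for a compressed DFA table that is costly
// to unpack and repack.
type packedTable struct{ steps [256]int }

func unpack(p *packedTable) *[256]int { t := p.steps; return &t }
func pack(t *[256]int) *packedTable   { return &packedTable{steps: *t} }

// addStepsBatched sets many byte transitions with one unpack and one
// pack, instead of an unpack/set/pack round trip per byte value.
func addStepsBatched(p *packedTable, steps map[byte]int) *packedTable {
	t := unpack(p) // once, not up to 256 times
	for b, dest := range steps {
		t[int(b)] = dest
	}
	return pack(t) // once, at the end
}

func main() {
	p := addStepsBatched(&packedTable{}, map[byte]int{'a': 1, 'b': 2})
	fmt.Println(p.steps['a'], p.steps['b'])
}
```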
Replace the byte-by-byte append loop (8 appends per state, each
with bounds checks) with a pre-sized buffer and a single
binary.LittleEndian.PutUint64 per state. ~20% faster in nfa2Dfa.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
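The shape of that change looks roughly like this (a sketch with a hypothetical `keyFor` helper, assuming each state contributes an 8-byte ID to the key):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// keyFor builds a byte key from 64-bit state IDs. The buffer is
// pre-sized, and each state is written with a single PutUint64
// instead of eight appends (each with its own bounds check).
func keyFor(ids []uint64) []byte {
	buf := make([]byte, 8*len(ids)) // exact size: no growth, no re-slicing
	for i, id := range ids {
		binary.LittleEndian.PutUint64(buf[i*8:], id)
	}
	return buf
}

func main() {
	k := keyFor([]uint64{1, 0x0102030405060708})
	fmt.Println(len(k)) // 16: two states, 8 bytes each
}
```

`PutUint64` compiles to a single unaligned store on common architectures, which is where the speedup comes from.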
Replace separate lists and dfaStates maps with a single map of
internEntry structs, eliminating the second map lookup on cache
hits. ~9-18% faster in nfa2Dfa.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
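The single-map structure can be sketched like this; `internEntry`, `dfaState`, and `cache` are illustrative names, not quamina's exact types. The point is that a hit pays for one map lookup, not a lookup in a lists map followed by another in a dfaStates map:

```go
package main

import "fmt"

type dfaState struct{ id int }

// internEntry bundles the interned list with its DFA state, so both
// are found (or created) with a single map access.
type internEntry struct {
	list []int
	dfa  *dfaState
}

type cache struct {
	entries map[string]*internEntry
	nextID  int
}

func (c *cache) lookup(key []byte, list []int) *internEntry {
	if e, ok := c.entries[string(key)]; ok {
		return e // hot path: one lookup, no second map
	}
	e := &internEntry{
		list: append([]int(nil), list...),
		dfa:  &dfaState{id: c.nextID},
	}
	c.nextID++
	c.entries[string(key)] = e
	return e
}

func main() {
	c := &cache{entries: map[string]*internEntry{}}
	e1 := c.lookup([]byte("ab"), []int{1, 2})
	e2 := c.lookup([]byte("ab"), []int{1, 2})
	fmt.Println(e1 == e2) // hit returns the same entry
}
```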
Hoist the rawStates slice above the 256-iteration byte loop and
reset with [:0] each iteration instead of allocating a new slice.
Eliminates ~95% of nfa2Dfa allocations, ~35-48% faster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
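The hoisting pattern is a standard Go idiom and easy to show in isolation (this is a generic sketch, not the real `n2dNode` code):

```go
package main

import "fmt"

// collect walks all 256 byte values. rawStates is allocated once,
// above the loop, and reset with [:0] each iteration: the length
// goes to zero but the backing array (and its capacity) is reused,
// so the loop body allocates only if an iteration outgrows it.
func collect(table [256][]int) int {
	total := 0
	rawStates := make([]int, 0, 64) // hoisted: one allocation, not 256
	for b := 0; b < 256; b++ {
		rawStates = rawStates[:0] // reset length, keep capacity
		rawStates = append(rawStates, table[b]...)
		total += len(rawStates)
	}
	return total
}

func main() {
	var table [256][]int
	table['x'] = []int{1, 2, 3}
	fmt.Println(collect(table)) // 3
}
```

The caveat with this pattern is that `rawStates` must not escape the iteration (e.g. be stored somewhere), since the next iteration overwrites the same backing array, which is why interning/copying on a cache miss matters in the commits above.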
Cover the same patterns used in BenchmarkNfa2Dfa to verify
correctness of the optimized intern() and n2dNode paths with
larger epsilon closures and heavier dedup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

timbray commented Feb 28, 2026

Oh well, since my fingers are hovering over the keyboard on memory_cost and it involves changes to nfa2dfa, I better do this first.

@timbray timbray merged commit 3b9069b into timbray:main Feb 28, 2026
7 checks passed

sayrer commented Mar 1, 2026

There's another, tougher one lurking. I wondered why the two_stars case looked more expensive than the others. It's because its literals are longer, like "abc" or "foo" instead of "a". The automata transition on each byte of a literal, but they don't necessarily need to. Not for this patch, though.
