
Speed up nfa2dfa#500

Merged
timbray merged 7 commits into timbray:main from sayrer:intern_state_lists
Feb 28, 2026

Conversation


@sayrer sayrer commented Feb 28, 2026

These are the numbers for BenchmarkNfa2Dfa. This took a little bit of iteration, but it was fun. About 30-40 minutes, but the benchmarks were running at -count=6, so lots of waiting. All of the optimization ideas were good, which is frequently not the case.

I have to go watch the Lakers play the Warriors, so I might not respond to comments until tomorrow or Monday.

  Main vs Optimized                                                                                                                                                            
  ┌─────────────┬──────────────┬───────────────────┬─────────┬─────────────┬────────────┬─────────────────┐                                                                    
  │   Subcase   │ Main (ns/op) │ Optimized (ns/op) │ Speedup │ Main allocs │ Opt allocs │ Alloc reduction │                                                                    
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤                                                                    
  │ single_star │ 922k         │ 63k               │ 14.6x   │ 9,555       │ 115        │ -99%            │                                                                    
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤                                                                    
  │ two_stars   │ 1,966k       │ 151k              │ 13.0x   │ 21,279      │ 220        │ -99%            │                                                                    
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤
  │ three_stars │ 626k         │ 35k               │ 17.8x   │ 6,122       │ 75         │ -99%            │
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤
  │ five_stars  │ 879k         │ 48k               │ 18.3x   │ 8,566       │ 96         │ -99%            │
  ├─────────────┼──────────────┼───────────────────┼─────────┼─────────────┼────────────┼─────────────────┤
  │ eight_stars │ 1,256k       │ 66k               │ 19.0x   │ 12,223      │ 123        │ -99%            │
  └─────────────┴──────────────┴───────────────────┴─────────┴─────────────┴────────────┴─────────────────┘

sayrer and others added 7 commits February 28, 2026 12:40
Replace the dedup map with a generation counter (reusing
faState.closureSetGen), and add reusable scratch buffers for the
sorted-uniques slice and key bytes. On cache hits the compiler's
string(bytes) map-lookup optimization avoids the key allocation
entirely. Only cache misses allocate (key string + stored slice).

Reduces nfa2Dfa allocations by ~35-40% and wall time by ~12%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
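The two techniques this commit describes can be sketched together. This is a minimal illustration, not quamina's actual code: `faState`, `deduper`, and the toy one-byte key are stand-ins. It shows (a) dedup via a generation counter stored on the state itself, so no per-call map is needed, and (b) a cache lookup written as `m[string(byteSlice)]`, which the Go compiler optimizes so the string conversion allocates only on a miss.

```go
package main

import "fmt"

// faState stands in for the NFA state type; seenGen plays the role
// of the reused closureSetGen generation-counter field.
type faState struct {
	id      int
	seenGen uint64
}

// deduper dedups state lists with a generation counter instead of a
// per-call map, and reuses a scratch slice across calls.
type deduper struct {
	gen     uint64
	scratch []*faState
	cache   map[string][]*faState
}

// key packs state IDs into a byte key (toy version: one byte per ID).
func key(states []*faState) []byte {
	buf := make([]byte, 0, len(states))
	for _, s := range states {
		buf = append(buf, byte(s.id))
	}
	return buf
}

func (d *deduper) intern(in []*faState) []*faState {
	// Bump the generation; a state already tagged with the new
	// generation has been kept during this call, so skip it.
	d.gen++
	d.scratch = d.scratch[:0] // reuse the scratch buffer
	for _, s := range in {
		if s.seenGen != d.gen {
			s.seenGen = d.gen
			d.scratch = append(d.scratch, s)
		}
	}
	k := key(d.scratch)
	// The compiler avoids allocating the string for this lookup on
	// a hit; only a miss pays for the stored key and slice.
	if v, ok := d.cache[string(k)]; ok {
		return v
	}
	stored := append([]*faState(nil), d.scratch...)
	d.cache[string(k)] = stored
	return stored
}

func main() {
	d := &deduper{cache: map[string][]*faState{}}
	a, b := &faState{id: 1}, &faState{id: 2}
	fmt.Println(len(d.intern([]*faState{a, b, a}))) // duplicate a removed
}
```

The generation counter makes "have I seen this state in this call?" an integer compare on the state itself, which is why no dedup map needs to be allocated or cleared.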
Instead of calling addByteStep (unpack, set one byte, repack) for
each of up to 256 byte values, unpack the DFA table once, set all
transitions into the unpacked table, then pack once at the end.

Also adds BenchmarkNfa2Dfa to measure the nfa2Dfa conversion cost
across patterns with varying wildcard counts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
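The batching idea here is generic: hoist an expensive unpack/pack pair out of a per-byte loop. A minimal sketch, with a hypothetical `packedTable` standing in for quamina's real packed DFA representation (the point is the call pattern, not the packing itself):

```go
package main

import "fmt"

// packedTable stands in for a compressed DFA table that is costly
// to unpack and repack.
type packedTable struct{ steps [256]int }

func unpack(p *packedTable) *[256]int { t := p.steps; return &t }
func pack(t *[256]int) *packedTable   { return &packedTable{steps: *t} }

// addStepsBatched sets many byte transitions with one unpack and one
// pack, instead of an unpack/set/pack round trip per byte value.
func addStepsBatched(p *packedTable, steps map[byte]int) *packedTable {
	t := unpack(p) // once, not up to 256 times
	for b, dest := range steps {
		t[int(b)] = dest
	}
	return pack(t) // once, at the end
}

func main() {
	p := addStepsBatched(&packedTable{}, map[byte]int{'a': 1, 'b': 2})
	fmt.Println(p.steps['a'], p.steps['b'])
}
```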
Replace the byte-by-byte append loop (8 appends per state, each
with bounds checks) with a pre-sized buffer and a single
binary.LittleEndian.PutUint64 per state. ~20% faster in nfa2Dfa.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
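The shape of that change looks roughly like this (a sketch with a hypothetical `keyFor` helper, assuming each state contributes an 8-byte ID to the key):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// keyFor builds a byte key from 64-bit state IDs. The buffer is
// pre-sized, and each state is written with a single PutUint64
// instead of eight appends (each with its own bounds check).
func keyFor(ids []uint64) []byte {
	buf := make([]byte, 8*len(ids)) // exact size: no growth, no re-slicing
	for i, id := range ids {
		binary.LittleEndian.PutUint64(buf[i*8:], id)
	}
	return buf
}

func main() {
	k := keyFor([]uint64{1, 0x0102030405060708})
	fmt.Println(len(k)) // 16: two states, 8 bytes each
}
```

`PutUint64` compiles to a single unaligned store on common architectures, which is where the speedup comes from.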
Replace separate lists and dfaStates maps with a single map of
internEntry structs, eliminating the second map lookup on cache
hits. ~9-18% faster in nfa2Dfa.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
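The single-map structure can be sketched like this; `internEntry`, `dfaState`, and `cache` are illustrative names, not quamina's exact types. The point is that a hit pays for one map lookup, not a lookup in a lists map followed by another in a dfaStates map:

```go
package main

import "fmt"

type dfaState struct{ id int }

// internEntry bundles the interned list with its DFA state, so both
// are found (or created) with a single map access.
type internEntry struct {
	list []int
	dfa  *dfaState
}

type cache struct {
	entries map[string]*internEntry
	nextID  int
}

func (c *cache) lookup(key []byte, list []int) *internEntry {
	if e, ok := c.entries[string(key)]; ok {
		return e // hot path: one lookup, no second map
	}
	e := &internEntry{
		list: append([]int(nil), list...),
		dfa:  &dfaState{id: c.nextID},
	}
	c.nextID++
	c.entries[string(key)] = e
	return e
}

func main() {
	c := &cache{entries: map[string]*internEntry{}}
	e1 := c.lookup([]byte("ab"), []int{1, 2})
	e2 := c.lookup([]byte("ab"), []int{1, 2})
	fmt.Println(e1 == e2) // hit returns the same entry
}
```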
Hoist the rawStates slice above the 256-iteration byte loop and
reset with [:0] each iteration instead of allocating a new slice.
Eliminates ~95% of nfa2Dfa allocations, ~35-48% faster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
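The hoisting pattern is a standard Go idiom and easy to show in isolation (this is a generic sketch, not the real `n2dNode` code):

```go
package main

import "fmt"

// collect walks all 256 byte values. rawStates is allocated once,
// above the loop, and reset with [:0] each iteration: the length
// goes to zero but the backing array (and its capacity) is reused,
// so the loop body allocates only if an iteration outgrows it.
func collect(table [256][]int) int {
	total := 0
	rawStates := make([]int, 0, 64) // hoisted: one allocation, not 256
	for b := 0; b < 256; b++ {
		rawStates = rawStates[:0] // reset length, keep capacity
		rawStates = append(rawStates, table[b]...)
		total += len(rawStates)
	}
	return total
}

func main() {
	var table [256][]int
	table['x'] = []int{1, 2, 3}
	fmt.Println(collect(table)) // 3
}
```

The caveat with this pattern is that `rawStates` must not escape the iteration (e.g. be stored somewhere), since the next iteration overwrites the same backing array, which is why interning/copying on a cache miss matters in the commits above.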
Cover the same patterns used in BenchmarkNfa2Dfa to verify
correctness of the optimized intern() and n2dNode paths with
larger epsilon closures and heavier dedup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

timbray commented Feb 28, 2026

Oh well, since my fingers are hovering over the keyboard on memory_cost and it involves changes to nfa2dfa, I better do this first.

@timbray timbray merged commit 3b9069b into timbray:main Feb 28, 2026
7 checks passed

sayrer commented Mar 1, 2026

There's another, tougher one lurking. I wondered why the two_stars case looked more expensive than the others. It's because its literals are longer, like "abc" or "foo" instead of "a". The automata transition on each byte of a literal, but they don't necessarily need to. Not for this patch, though.
