Strange time complexity for regexset building #1309
Hi! I'm currently looking into an issue where configuration reloads for a service are taking too long. I've narrowed the issue down to some fairly large `RegexSet`s being built. The regex sets are constructed in one of three ways, all built the same way.

I created a quick benchmark measuring the number of items vs. construction time. The data is synthetic and not very realistic, but the numbers line up with what we see in production. That case insensitivity takes longer than case sensitivity is not that surprising, but the magnitude of the difference and the shape of the curve are. Is this difference in magnitude and time complexity expected?

- Starts with, case sensitive
- Starts with, case insensitive
- Ends with, case sensitive (that bump is there no matter how many times I rerun, and it's always at the same place)
- Ends with, case insensitive
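For reference, pattern generation for a benchmark like this might look roughly as follows. This is a hypothetical reconstruction, not the actual benchmark: the literal text, naming, and sizes are all assumptions; only the starts-with/ends-with anchoring and the `(?i)` toggle come from the description above.

```rust
// Generate n synthetic patterns for one benchmark configuration.
// `starts_with` picks between `^lit` and `lit$` anchoring; `case_insensitive`
// prepends the `(?i)` flag. The resulting strings would then be handed to
// `RegexSet` construction.
fn make_patterns(n: usize, starts_with: bool, case_insensitive: bool) -> Vec<String> {
    (0..n)
        .map(|i| {
            let lit = format!("literal{i}"); // placeholder literal
            let flag = if case_insensitive { "(?i)" } else { "" };
            if starts_with {
                format!("{flag}^{lit}") // anchored at the start
            } else {
                format!("{flag}{lit}$") // anchored at the end
            }
        })
        .collect()
}

fn main() {
    for p in make_patterns(3, false, true) {
        println!("{p}"); // e.g. (?i)literal0$
    }
}
```

Timing `RegexSet::new` over these pattern lists for growing `n` would reproduce the curves described above.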
Thank you for the nice repro!

So the optimizations in play here are complicated, and some patterns get different optimization strategies than others. That can make one regex take longer to compile than another. The baffling thing is that, if you're not familiar with `regex` internals, it can be difficult to predict when an optimization is in play. To debug this yourself in the future, you can enable the `logging` feature of `regex`, then use a crate like `env_logger` with the log level set to `trace` to get a bunch of details about regex compilation. For example, adapting your ends-with case insensitive benchmark to use an input size of 10 produces one trace, and doing the same for the starts-with variant produces another. A lot of this is pretty inside baseball, but you can probably immediately see a very large difference: the starts-with trace shows a chunk of work being skipped entirely.

This comes from this part of the code:

regex/regex-automata/src/meta/strategy.rs Lines 85 to 104 in 5ea3eb1

This skipping only applies to regexes anchored at the start. The comment there is a very hand-wavy description of why I did that, based on my perception of typical usage patterns. If that special skipping logic is removed, then the starts-with benchmarks slow down in the same way.

The reason the case insensitive benchmarks exacerbate this is the combinatorial explosion of literals generated by emitting (what seems like) all possible literals given the case insensitive toggle. Now, that combinatorial explosion sounds like a violation of the expected time complexity, but there's a cheat: extraction of those literals is supposed to be limited to some small number. That is:

regex/regex-syntax/src/hir/literal.rs Lines 334 to 388 in 5ea3eb1

And there are some tests for this case:

regex/regex-syntax/src/hir/literal.rs Lines 2819 to 2857 in 5ea3eb1

But the logs above clearly show a very large number of literals being extracted. They are definitely being culled (because not literally every possible string is generated), but perhaps not aggressively enough.

A separate but related issue is that the "reverse anchored" strategy is selected. (That means the regex will be executed in reverse on every haystack instead of starting the search from the beginning.) Which means the prefix literals extracted from the regex won't even be used, so building the Aho-Corasick automaton is completely wasted work.
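To make the combinatorial explosion concrete, here's a toy sketch of ASCII case expansion with a cap on the number of literals. This is not the actual `regex-syntax` extraction code; the function name, the cap value, and the culling-by-truncation policy are all invented for illustration. It only shows why an n-character case-insensitive literal wants to become 2^n literals, and why a limit is needed.

```rust
/// Expand a literal into its ASCII case variants, bounding the set at `cap`.
/// A toy stand-in for the culling that real literal extraction performs.
fn case_variants(lit: &str, cap: usize) -> Vec<String> {
    let mut out = vec![String::new()];
    for ch in lit.chars() {
        let mut next = Vec::new();
        for prefix in &out {
            // Each alphabetic character doubles the set: lower and upper.
            for v in [ch.to_ascii_lowercase(), ch.to_ascii_uppercase()] {
                let mut s = prefix.clone();
                s.push(v);
                next.push(s);
            }
        }
        // Non-alphabetic characters produce adjacent duplicates; drop them.
        next.dedup();
        if next.len() > cap {
            // A real extractor would cut the set short here rather than
            // let it keep growing exponentially.
            next.truncate(cap);
        }
        out = next;
    }
    out
}

fn main() {
    // Without a cap, an n-letter literal yields 2^n variants.
    assert_eq!(case_variants("abc", usize::MAX).len(), 8);
    // With a cap, growth stays bounded no matter the literal length.
    assert_eq!(case_variants("abcdefgh", 16).len(), 16);
    println!("ok");
}
```

The point of the discussion above is that the real culling apparently lets far more literals through than this kind of hard bound suggests.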
My assumption was that the number of literals would always be pretty small and building such things would be rather cheap. But that's not true here. We should probably avoid building a prefilter unless we're certain we'll use it. That will require some refactoring.
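Deferring prefilter construction until a search actually needs it could look something like the following. This is a minimal std-only illustration of the idea, not the crate's real design: the `Strategy` and `Prefilter` types, their fields, and the suffix check are all invented for the sketch.

```rust
use std::cell::OnceCell;

/// Toy stand-in for an expensive prefilter (e.g. an Aho-Corasick automaton).
struct Prefilter {
    literals: Vec<String>,
}

impl Prefilter {
    fn build(literals: &[String]) -> Prefilter {
        // Imagine this being the expensive automaton construction.
        Prefilter { literals: literals.to_vec() }
    }
}

struct Strategy {
    literals: Vec<String>,
    reverse_anchored: bool,
    // Built lazily on first use instead of eagerly at regex compile time.
    prefilter: OnceCell<Prefilter>,
}

impl Strategy {
    fn search(&self, haystack: &str) -> bool {
        if self.reverse_anchored {
            // A reverse-anchored search never consults the prefilter,
            // so with lazy construction it is never built at all.
            return haystack.ends_with("suffix");
        }
        let pre = self
            .prefilter
            .get_or_init(|| Prefilter::build(&self.literals));
        pre.literals.iter().any(|l| haystack.contains(l.as_str()))
    }
}

fn main() {
    let s = Strategy {
        literals: vec!["foo".to_string()],
        reverse_anchored: true,
        prefilter: OnceCell::new(),
    };
    assert!(s.search("has suffix"));
    // The expensive prefilter was never constructed.
    assert!(s.prefilter.get().is_none());
    println!("ok");
}
```

With an eager design, the `Prefilter::build` cost is paid at compile time even when the reverse-anchored path makes it dead weight, which is exactly the wasted work described above.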