Skip to content

Conversation

@rutefig
Copy link
Member

@rutefig rutefig commented Oct 30, 2025

Problem

When users create DecomposedRegexConfig with Pattern parts (non-public patterns), they can accidentally use capturing groups (...) which breaks the circuit's expectation of capture group numbering.

Example:

DecomposedRegexConfig {
    parts: vec![
        RegexPart::Pattern("(a|b)prefix:".to_string()),  // Creates capture group 1
        RegexPart::PublicPattern("(.+?)".to_string()),   // Expected to be group 1, actually group 2
    ]
}

The PublicPattern should be capture group 1 for the circuit, but the capturing group in the Pattern part shifts it to group 2, causing circuit verification failures.

This PR fixes this issue by converting capture groups inside of private patterns into non capture groups while still preserving special groups or even non capture groups if the user intentionally writes them.

Summary by CodeRabbit

  • Refactor

    • Internal control-flow was clarified in graph traversal and epsilon-transition handling for clearer, more maintainable logic.
  • Bug Fixes

    • Regex compilation now treats certain capturing groups as non-capturing when composing generated patterns, preventing unintended captures in produced expressions.
  • Tests

    • Added unit and end-to-end tests covering non-capturing conversions, lookarounds, nested groups, and generated-pattern validation.

@rutefig rutefig self-assigned this Oct 30, 2025
@coderabbitai
Copy link

coderabbitai bot commented Oct 30, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Refactors control-flow in two IR modules by splitting combined conditional guards into nested if blocks without changing behavior. Introduces a private helper to convert bare capturing groups to non-capturing ones, applies it when composing regex parts, and adds comprehensive unit and end-to-end tests validating regex conversion and capture counts.

Changes

Cohort / File(s) Summary
IR control-flow refactors
compiler/src/ir/graph.rs, compiler/src/ir/intermediate.rs
Split combined conditional guards into separate nested if blocks (preserving original semantics); only formatting/control-flow structure changed.
Regex capturing-group normalization & tests
compiler/src/utils.rs, compiler/src/utils_test.rs, Cargo.toml, compiler/.../gen_from_decomposed*
Added private convert_capturing_to_non_capturing(pattern: &str) -> String; updated decomposed_to_composed_regex() to run conversion on RegexPart::Pattern entries; added unit tests for edge cases (escaped parens, classes, lookarounds, named captures, nesting) and end-to-end tests that assert generated NFA regex strings and capture counts reflect non-capturing conversions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review the scanner/state logic in convert_capturing_to_non_capturing() for correctness with nested groups, escapes, character classes, and lookarounds.
  • Verify decomposed_to_composed_regex() only converts intended private Pattern parts and preserves capture semantics where required.
  • Run and inspect new unit and end-to-end tests to ensure expected NFA regex strings and capture counts.
  • Confirm IR changes in graph.rs / intermediate.rs are purely structural with no behavioral regressions.

Poem

🐇 I nudged a branch and changed a brace,
Turned greedy captures into a gentler place.
Tests sang softly, NFAs twirled,
A rabbit’s tweak delights the world. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "Rutefig/reg 574 fix prevent capture groups inside of private patterns" directly corresponds to the main objective of the changeset. According to the PR objectives, the fix converts capturing groups inside private Pattern parts into non-capturing groups to prevent them from shifting global capture group numbering. The title accurately captures this intent—preventing capture groups in private patterns—which is the primary change across all modified files, particularly the new logic in compiler/src/utils.rs that implements the convert_capturing_to_non_capturing function. While the title includes a branch prefix that adds some formatting noise, the core message is clear, specific, and directly related to the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch rutefig/reg-574-fix-prevent-capture-groups-inside-of-private-patterns

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between faeabb4 and dfe8085.

📒 Files selected for processing (1)
  • compiler/src/utils.rs (3 hunks)
🔇 Additional comments (4)
compiler/src/utils.rs (4)

37-111: Excellent implementation—past concerns fully addressed!

The character-by-character scanning with state tracking correctly handles all the edge cases raised in previous reviews:

  • Escaped parentheses: The escaped flag properly preserves \( and \) as literal characters
  • Character classes: The in_char_class flag prevents processing ( inside [...]
  • Named captures: Both PCRE (?<name>...) and Rust (?P<name>...) styles are correctly converted to (?:...)
  • Special groups: Lookaheads, lookbehinds, and other (?...) constructs are preserved

The look-ahead approach (checking chars[i + 1] while incrementing i) correctly skips over named group identifiers without off-by-one errors.

Minor note: The comments on lines 78 and 91 say "Skip '?' and '<'" but the code actually moves i to point at < (not past it), then the while loop skips the name characters. The behavior is correct, just the comment is slightly imprecise—but this doesn't affect functionality.


140-142: Perfect integration of the conversion function.

Applying convert_capturing_to_non_capturing to private Pattern parts before concatenation ensures that only PublicPattern parts create capture groups. This directly solves the issue described in the PR objectives where private pattern captures were shifting public pattern group indices.


202-387: Comprehensive test coverage validates the implementation.

The 24 unit tests thoroughly cover:

  • Basic transformations (bare groups, multiple groups)
  • Preservation of special groups (lookaheads, lookbehinds, non-capturing, comments)
  • Named captures (both PCRE (?<name>...) and Rust (?P<name>...) styles)
  • Critical edge cases: escaped parentheses (line 301), character classes (lines 309, 317), and mixed scenarios (line 325)

The tests directly address the concerns raised in past reviews and provide strong confidence in correctness.


441-463: End-to-end test confirms the fix resolves the stated problem.

This test validates the exact scenario from the PR description:

  • Private Pattern("(a|b)prefix:") → converted to (?:a|b)prefix:
  • Public PublicPattern((".+?", 20)) → becomes capture group 1 (not 2)
  • Resulting NFA regex is (?:a|b)prefix:(.+?) with num_capture_groups = 1

This directly confirms that the fix prevents private pattern captures from shifting public capture group indices, which was causing circuit verification failures.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@rutefig
Copy link
Member Author

rutefig commented Oct 30, 2025

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Oct 30, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cursor[bot]

This comment was marked as outdated.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09b8a86 and bb530c5.

📒 Files selected for processing (3)
  • compiler/src/ir/graph.rs (1 hunks)
  • compiler/src/ir/intermediate.rs (2 hunks)
  • compiler/src/utils.rs (3 hunks)

…group conversion

The convert_capturing_to_non_capturing function was naively converting
all '(' to '(?:' without considering:
- Escaped parentheses like \( which match literal parens
- Parentheses inside character classes like [()] which are literals

This caused regex semantics to break. For example:
- \(foo\) would become \(?:foo\) (wrong)
- [()] would become [(?:)] (matches extra chars)

Fixed by tracking two states during string scan:
- in_char_class: are we inside [...]?
- escaped: was previous char a backslash?

Only convert ( to (?:) when it's not escaped, not in a char class,
and not already followed by ?.

Added 5 new tests covering escaped parens, character classes, and
mixed cases. All 23 tests pass.

Addresses PR review feedback: #109 (comment)
@rutefig
Copy link
Member Author

rutefig commented Oct 30, 2025

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Oct 30, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Extended convert_capturing_to_non_capturing to handle both PCRE-style
(?<name>...) and Rust-style (?P<name>...) named capture groups, not
just bare capturing groups. This prevents named captures in private
patterns from interfering with public capture group numbering.

Key changes:
- Detect and convert both named capture syntaxes to (?:...)
- Distinguish lookbehind assertions (?<=, ?<!) from named captures
- Updated test_preserve_named_groups to test_convert_pcre_named_groups
  with correct expected behavior
- Added 7 new tests covering both named capture styles and edge cases

Fixes code review finding that named captures were being preserved
instead of converted, which would cause capture group numbering
mismatches in generated circuits.
@rutefig
Copy link
Member Author

rutefig commented Oct 30, 2025

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Oct 30, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@rutefig rutefig merged commit 3d37c31 into feat/new-compiler Oct 31, 2025
6 checks passed
@rutefig rutefig deleted the rutefig/reg-574-fix-prevent-capture-groups-inside-of-private-patterns branch October 31, 2025 09:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants