feat: add India patterns — Aadhaar, PAN, phone (v1.4.0)#25

Merged
FelipeMorandini merged 1 commit into main from feat/india-patterns on Mar 20, 2026

Conversation

@FelipeMorandini
Owner

Summary

  • Aadhaar (XXXX XXXX XXXX) — Verhoeff checksum validation, first digit 2-9, spaced format only
  • PAN (ABCPE1234F) — entity type restriction on 4th char reduces false positives
  • Indian Phone — 10 digits starting 6-9, optional +91/0 prefix
  • Partial masking for all three
  • 21 built-in patterns, 531 tests (26 new)

Test plan

  • 531/531 tests pass
  • All checks clean
  • CI passes
  • Copilot review

Add Aadhaar (12-digit spaced format with Verhoeff checksum validation),
PAN (XXXXX0000X with entity type restriction on 4th char), and Indian
phone numbers (10 digits starting 6-9, optional +91/0 prefix).

21 built-in patterns total. 531 tests including 26 new.
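
For readers unfamiliar with the checksum named above, here is a minimal sketch of Verhoeff validation applied to a spaced Aadhaar candidate. The tables are the standard published Verhoeff tables. The function names `verhoeff_valid` and `is_plausible_aadhaar` are hypothetical and are not taken from `src/hushlog/_patterns.py`, whose validator is not shown on this page.

```python
# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P holds the eight position-dependent permutations.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(digits: str) -> bool:
    """Return True if an ASCII digit string (check digit last) passes Verhoeff."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    c = 0
    # Walk right to left; position 0 is the check digit itself.
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def is_plausible_aadhaar(candidate: str) -> bool:
    """Hypothetical helper: 'XXXX XXXX XXXX', first digit 2-9, checksum OK."""
    groups = candidate.split(" ")
    if len(groups) != 3 or any(len(g) != 4 for g in groups):
        return False
    digits = "".join(groups)
    return digits[0] in "23456789" and verhoeff_valid(digits)


print(verhoeff_valid("2363"))  # True: 3 is the Verhoeff check digit of 236
print(verhoeff_valid("2364"))  # False: single-digit error is detected
```

For any fixed 11-digit prefix exactly one final digit passes the check, which is why a checksum step cuts false positives from arbitrary 12-digit strings by about a factor of ten.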

Copilot AI left a comment

Pull request overview

This PR bumps HushLog to v1.4.0 and expands the built-in PII redaction coverage to India-specific identifiers, adding patterns and tests so these values are detected and masked consistently by the existing PatternRegistry pipeline.

Changes:

  • Add new built-in patterns for Aadhaar (Verhoeff validated, spaced format), PAN (entity-type constrained), and Indian mobile phone numbers (+91/0 prefixes).
  • Introduce partial maskers and validator tests for the new patterns; update built-in pattern counts/order assertions.
  • Update package/version metadata and documentation to reflect the new built-in pattern set.
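
To illustrate what "entity-type constrained" can mean for PAN, the sketch below restricts the 4th character to the commonly documented holder-type letters. The letter set and the names `PAN_ENTITY_TYPES` and `PAN_RE` are assumptions for illustration; the regex merged in `src/hushlog/_patterns.py` is not shown on this page and may differ.

```python
import re

# Assumed holder-type codes for the 4th character:
# A (association of persons), B (body of individuals), C (company),
# F (firm), G (government), H (Hindu undivided family),
# J (artificial juridical person), L (local authority),
# P (individual), T (trust).
PAN_ENTITY_TYPES = "ABCFGHJLPT"

# Five letters (4th constrained), four digits, one letter.
PAN_RE = re.compile(
    r"\b[A-Z]{3}[" + PAN_ENTITY_TYPES + r"][A-Z][0-9]{4}[A-Z]\b"
)

for text in ("PAN ABCPE1234F on file", "order ABCDE1234F", "ref ABCXE1234F"):
    match = PAN_RE.search(text)
    print(text, "->", match.group() if match else None)
```

Only the first sample matches: `D` and `X` are not in the assumed entity set, so look-alike tokens such as order references are skipped.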

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/hushlog/_patterns.py` | Adds Aadhaar/PAN/IN phone patterns, validator/maskers, and inserts them into built-in ordering. |
| `tests/test_patterns.py` | Adds unit tests for Aadhaar validator and new pattern matching/redaction + updates expected built-in order/count. |
| `tests/test_registry.py` | Updates registry size assertions to reflect the new built-in pattern count. |
| `docs/index.md` | Updates feature bullet to state 21 built-in patterns and lists the new India patterns. |
| `docs/configuration.md` | Documents the new `disable_patterns` names (`aadhaar`, `pan`, `in_phone`). |
| `README.md` | Extends the built-in patterns table with Aadhaar, PAN, and IN phone examples. |
| `ROADMAP.md` | Marks India items as completed. |
| `src/hushlog/__init__.py` | Bumps `__version__` to 1.4.0. |
| `pyproject.toml` | Bumps project version to 1.4.0. |
| `uv.lock` | Updates locked editable package version to 1.4.0. |
| `tests/integration/test_logging_pipeline.py` | Updates integration test expectation for the version string. |

[8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_VERHOEFF_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

Copilot AI Mar 20, 2026

_VERHOEFF_INV is defined but never referenced anywhere in this module. Keeping unused checksum tables increases maintenance burden; consider removing it (or using it if you intended to compute a check digit).

Suggested change (delete this line)
- _VERHOEFF_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

Owner Author

Valid. _VERHOEFF_INV is part of the standard Verhoeff algorithm table set. It's used for generating check digits (not just validating). Keeping for completeness and potential future use in partial masking.
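
As a sketch of the check-digit generation this reply refers to, the inverse table is what turns the running checksum into the digit to append. The tables are the standard Verhoeff tables; `verhoeff_check_digit` is a hypothetical name and no such function is known to exist in the module.

```python
# Standard Verhoeff tables (multiplication, permutation, inverse).
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_check_digit(payload: str) -> int:
    """Return the digit that makes payload + digit pass Verhoeff validation."""
    if not (payload.isascii() and payload.isdigit()):
        raise ValueError("payload must be a non-empty ASCII digit string")
    c = 0
    # Positions are shifted by one because slot 0 is reserved for the
    # check digit that will be appended.
    for i, ch in enumerate(reversed(payload)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


print(verhoeff_check_digit("236"))    # 3
print(verhoeff_check_digit("12345"))  # 1
```

Validation alone never consults `INV`, which is consistent with the review observation that the table is unreferenced in a validate-only module.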

# --- Aadhaar (Indian 12-digit ID) ---
# Matches spaced format only (XXXX XXXX XXXX) to minimize false positives.
# First digit must be 2-9. Validated with Verhoeff checksum.
_AADHAAR_RE = re.compile(r"\b[2-9][0-9]{3}[\s][0-9]{4}[\s][0-9]{4}\b")

Copilot AI Mar 20, 2026

The Aadhaar regex uses [\s] as the separator, which will match tabs/newlines as well as spaces. Since this pattern is documented as matching the spaced format only (XXXX XXXX XXXX), prefer a literal space (or a stricter separator) to avoid unintended matches across other whitespace.

Suggested change
- _AADHAAR_RE = re.compile(r"\b[2-9][0-9]{3}[\s][0-9]{4}[\s][0-9]{4}\b")
+ _AADHAAR_RE = re.compile(r"\b[2-9][0-9]{3} [0-9]{4} [0-9]{4}\b")

Owner Author

Same pattern as IBAN discussion (PR #24). Aadhaar numbers are formatted with spaces in practice, never tabs. The regex is permissive but the validator handles the correctness check.
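
The difference under discussion can be seen by running both regexes from this thread against the same inputs. The sample strings are made up for the demonstration and are not checksum-valid Aadhaar numbers; only the regex stage is exercised here.

```python
import re

# Regex as merged: any whitespace character between groups.
permissive = re.compile(r"\b[2-9][0-9]{3}[\s][0-9]{4}[\s][0-9]{4}\b")
# Regex from the suggested change: literal spaces only.
strict = re.compile(r"\b[2-9][0-9]{3} [0-9]{4} [0-9]{4}\b")

samples = {
    "spaces": "2345 6789 0123",
    "newline and tab": "2345\n6789\t0123",
}

for label, text in samples.items():
    print(
        label,
        "| permissive:", permissive.search(text) is not None,
        "| strict:", strict.search(text) is not None,
    )
```

Both regexes accept the space-separated sample, while only the permissive one accepts digits split across a newline and a tab. Whether such a span is then redacted depends on the checksum validator that runs afterwards.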

Comment on lines +691 to +692
r"(?:\+91[\s.-]?)?" # Optional +91 prefix
r"(?:0)?" # Optional 0 prefix for STD

Copilot AI Mar 20, 2026

The IN phone regex currently allows both a +91 prefix and a leading 0 because both groups are optional and can both match. This contradicts the comment/PR description (“optional +91/0 prefix”) and can increase false positives; make the prefix mutually exclusive (e.g., either +91 or 0 or nothing).

Suggested change
- r"(?:\+91[\s.-]?)?" # Optional +91 prefix
- r"(?:0)?" # Optional 0 prefix for STD
+ r"(?:(?:\+91[\s.-]?)|0)?" # Optional, mutually exclusive +91 or 0 prefix

Owner Author

Valid edge case. +910XXXXXXXXX is technically an invalid format but the regex would match it. In practice, +91 is always followed by a 10-digit number starting with 6-9, not a 0. The false positive risk is very low.
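
The edge case can be reproduced with a small comparison of the two prefix variants from this thread. Only the prefix fragments come from the diff; the tail `[6-9][0-9]{9}` is an assumption based on the PR description ("10 digits starting 6-9"), and the full merged regex, including any boundaries, is not shown here.

```python
import re

# Assumed remainder of the pattern: 10 digits, first one 6-9.
TAIL = r"[6-9][0-9]{9}"

# Prefix as merged: +91 and 0 are independently optional.
as_merged = re.compile(r"(?:\+91[\s.-]?)?" r"(?:0)?" + TAIL)
# Prefix from the suggested change: +91 or 0, never both.
exclusive = re.compile(r"(?:(?:\+91[\s.-]?)|0)?" + TAIL)

samples = [
    "9876543210",      # bare number
    "+91 9876543210",  # country code
    "09876543210",     # trunk prefix
    "+9109876543210",  # both prefixes at once
]

for s in samples:
    print(
        f"{s!r:18}",
        "as_merged:", as_merged.fullmatch(s) is not None,
        "exclusive:", exclusive.fullmatch(s) is not None,
    )
```

Under these assumptions the two variants agree on the first three samples and differ only on the last one, which is the `+910XXXXXXXXXX` shape described in the reply.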

@FelipeMorandini FelipeMorandini merged commit 028e4cb into main Mar 20, 2026
10 checks passed
@FelipeMorandini FelipeMorandini deleted the feat/india-patterns branch March 20, 2026 01:47