feat: add India patterns — Aadhaar, PAN, phone (v1.4.0) #25
FelipeMorandini merged 1 commit into main
Add Aadhaar (12-digit spaced format with Verhoeff checksum validation), PAN (XXXXX0000X with entity type restriction on 4th char), and Indian phone numbers (10 digits starting 6-9, optional +91/0 prefix). 21 built-in patterns total. 531 tests including 26 new.
Pull request overview
This PR bumps HushLog to v1.4.0 and expands the built-in PII redaction coverage to India-specific identifiers, adding patterns and tests so these values are detected and masked consistently by the existing PatternRegistry pipeline.
Changes:
- Add new built-in patterns for Aadhaar (Verhoeff validated, spaced format), PAN (entity-type constrained), and Indian mobile phone numbers (+91/0 prefixes).
- Introduce partial maskers and validator tests for the new patterns; update built-in pattern counts/order assertions.
- Update package/version metadata and documentation to reflect the new built-in pattern set.
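As a rough illustration of the PAN entity-type restriction mentioned above, the sketch below constrains the 4th character to the commonly documented holder-category letters. The name `PAN_RE`, the helper `find_pans`, and the exact character class are assumptions for illustration, not the code in `src/hushlog/_patterns.py`.

```python
import re

# Hypothetical pattern: 5 letters, 4 digits, 1 letter (XXXXX0000X).
# The 4th letter is limited to the commonly documented PAN holder types:
# A, B, C, F, G, H, J, L, P, T (assumed set, not taken from the PR diff).
PAN_RE = re.compile(r"\b[A-Z]{3}[ABCFGHJLPT][A-Z][0-9]{4}[A-Z]\b")


def find_pans(text: str) -> list[str]:
    """Return every PAN-shaped token found in text."""
    return PAN_RE.findall(text)


if __name__ == "__main__":
    print(find_pans("user pan=ABCPE1234F ok"))  # 4th char 'P' is allowed
    print(find_pans("order ref ABCDE1234F"))    # 4th char 'D' is rejected
```

Restricting one position to ten letters rejects most random ten-character alphanumeric tokens that otherwise share the PAN shape, which is where the false-positive reduction would come from.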
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/hushlog/_patterns.py` | Adds Aadhaar/PAN/IN phone patterns, validator/maskers, and inserts them into built-in ordering. |
| `tests/test_patterns.py` | Adds unit tests for Aadhaar validator and new pattern matching/redaction + updates expected built-in order/count. |
| `tests/test_registry.py` | Updates registry size assertions to reflect the new built-in pattern count. |
| `docs/index.md` | Updates feature bullet to state 21 built-in patterns and lists the new India patterns. |
| `docs/configuration.md` | Documents the new `disable_patterns` names (`aadhaar`, `pan`, `in_phone`). |
| `README.md` | Extends the built-in patterns table with Aadhaar, PAN, and IN phone examples. |
| `ROADMAP.md` | Marks India items as completed. |
| `src/hushlog/__init__.py` | Bumps `__version__` to 1.4.0. |
| `pyproject.toml` | Bumps project version to 1.4.0. |
| `uv.lock` | Updates locked editable package version to 1.4.0. |
| `tests/integration/test_logging_pipeline.py` | Updates integration test expectation for the version string. |
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_VERHOEFF_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]
_VERHOEFF_INV is defined but never referenced anywhere in this module. Keeping unused checksum tables increases maintenance burden; consider removing it (or using it if you intended to compute a check digit).
Suggested change: remove `_VERHOEFF_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]`.
Valid. `_VERHOEFF_INV` is part of the standard Verhoeff table set; it is needed to generate check digits, not to validate them. Keeping it for completeness and for potential future use in partial masking.
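To make the distinction in this thread concrete, here is a minimal sketch of both Verhoeff operations using the standard published tables. The function and table names (`verhoeff_is_valid`, `verhoeff_check_digit`, `D`, `P`, `INV`) are hypothetical and are not taken from `_patterns.py`; only the generator needs the inverse table.

```python
# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P the position-dependent permutation, INV the group inverse.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_is_valid(number: str) -> bool:
    """Validate a digit string whose last digit is the check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> int:
    """Compute the check digit to append to payload (this is where INV is used)."""
    c = 0
    for i, ch in enumerate(reversed(payload)):
        # Positions are shifted by one to leave room for the check digit.
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return INV[c]


if __name__ == "__main__":
    payload = "23456789012"  # arbitrary 11-digit payload, not a real Aadhaar
    full = payload + str(verhoeff_check_digit(payload))
    print(full, verhoeff_is_valid(full))
```

A validator that only calls something like `verhoeff_is_valid` never touches `INV`, which is consistent with the reviewer's observation that the table is unreferenced.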
# --- Aadhaar (Indian 12-digit ID) ---
# Matches spaced format only (XXXX XXXX XXXX) to minimize false positives.
# First digit must be 2-9. Validated with Verhoeff checksum.
_AADHAAR_RE = re.compile(r"\b[2-9][0-9]{3}[\s][0-9]{4}[\s][0-9]{4}\b")
The Aadhaar regex uses [\s] as the separator, which will match tabs/newlines as well as spaces. Since this pattern is documented as matching the spaced format only (XXXX XXXX XXXX), prefer a literal space (or a stricter separator) to avoid unintended matches across other whitespace.
Suggested change:
- _AADHAAR_RE = re.compile(r"\b[2-9][0-9]{3}[\s][0-9]{4}[\s][0-9]{4}\b")
+ _AADHAAR_RE = re.compile(r"\b[2-9][0-9]{3} [0-9]{4} [0-9]{4}\b")
Same pattern as IBAN discussion (PR #24). Aadhaar numbers are formatted with spaces in practice, never tabs. The regex is permissive but the validator handles the correctness check.
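The difference between the two separators discussed in this thread can be checked directly. The sketch below compiles both regexes exactly as quoted; the variable names `WHITESPACE_SEP` and `SPACE_SEP` are hypothetical, and the sample digits are arbitrary (no checksum is applied here).

```python
import re

# Regex as merged: any whitespace character is accepted between groups.
WHITESPACE_SEP = re.compile(r"\b[2-9][0-9]{3}[\s][0-9]{4}[\s][0-9]{4}\b")
# Regex as suggested by the reviewer: only a literal space is accepted.
SPACE_SEP = re.compile(r"\b[2-9][0-9]{3} [0-9]{4} [0-9]{4}\b")

samples = {
    "spaces": "id 2345 6789 0123 end",
    "tab and newline": "2345\t6789\n0123",
}

if __name__ == "__main__":
    for label, text in samples.items():
        print(
            label,
            bool(WHITESPACE_SEP.search(text)),
            bool(SPACE_SEP.search(text)),
        )
```

With the permissive separator, three digit groups split across a tab or a line break are still candidates, so rejecting them falls entirely to the checksum validator.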
r"(?:\+91[\s.-]?)?" # Optional +91 prefix
r"(?:0)?" # Optional 0 prefix for STD
The IN phone regex currently allows both a +91 prefix and a leading 0 because both groups are optional and can both match. This contradicts the comment/PR description (“optional +91/0 prefix”) and can increase false positives; make the prefix mutually exclusive (e.g., either +91 or 0 or nothing).
Suggested change:
- r"(?:\+91[\s.-]?)?" # Optional +91 prefix
- r"(?:0)?" # Optional 0 prefix for STD
+ r"(?:(?:\+91[\s.-]?)|0)?" # Optional, mutually exclusive +91 or 0 prefix
Valid edge case. +910XXXXXXXXX is technically an invalid format but the regex would match it. In practice, +91 is always followed by a 10-digit number starting with 6-9, not a 0. The false positive risk is very low.
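The edge case in this thread can be reproduced with a small sketch. Only the two prefix fragments are quoted from the diff; the subscriber part `[6-9][0-9]{9}` is an assumption based on the PR description (10 digits starting 6-9), and the names `NUMBER`, `BOTH_OPTIONAL`, and `EXCLUSIVE` are hypothetical.

```python
import re

# Assumed subscriber part: 10 digits starting with 6-9 (not quoted in the diff).
NUMBER = r"[6-9][0-9]{9}"

# Prefix as merged: +91 and 0 are independently optional, so both can match.
BOTH_OPTIONAL = re.compile(r"(?:\+91[\s.-]?)?(?:0)?" + NUMBER)
# Prefix as suggested: +91 or 0 or nothing, never both.
EXCLUSIVE = re.compile(r"(?:(?:\+91[\s.-]?)|0)?" + NUMBER)

samples = ["9876543210", "09876543210", "+91 9876543210", "+9109876543210"]

if __name__ == "__main__":
    for s in samples:
        print(
            s,
            bool(BOTH_OPTIONAL.fullmatch(s)),
            bool(EXCLUSIVE.fullmatch(s)),
        )
```

Only the last sample, which combines `+91` with a leading `0`, is treated differently by the two variants.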
Summary
- Aadhaar (XXXX XXXX XXXX): Verhoeff checksum validation, first digit 2-9, spaced format only
- PAN (ABCPE1234F): entity type restriction on 4th char reduces false positives

Test plan