Detect adversarial Unicode in pull requests before it reaches your codebase. Language-aware, diff-aware, policy-driven. 19 rules, single-pass scanner, SARIF output.
Catches bidi attacks (Trojan Source), invisible character injection, homoglyph spoofing, confusable identifier collisions (Unicode TR39), variation selector abuse (Glassworm), and more.
Zero config. Language-agnostic. No dependencies.
name: Unicode Safety
on: [pull_request]
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: dcondrey/unicode-safety-check@v2That's it. Findings appear as inline PR annotations. Only changed lines get the full check; critical checks (bidi, tag chars) run on all lines.
| Rule | Category | What it catches |
|---|---|---|
| USC001 | Bidi controls | U+202A-202E, U+2066-2069, U+200E/F, U+061C |
| USC002 | Invisible format chars | ZWS, ZWNJ, ZWJ, word joiner, soft hyphen |
| USC007 | Control characters | Any Cc character except tab/LF/CR |
| USC008 | Invalid encoding | Non-UTF-8 byte sequences |
| USC009 | Misplaced BOM | Byte-order mark after start of file |
| USC010 | Variation selectors | 3+ per line (Glassworm-style payload encoding) |
| USC011 | Private Use Area | U+E000-F8FF, supplementary PUA planes |
| USC012 | Tag characters | U+E0001-E007F (Glassworm payload encoding) |
| USC013 | Deprecated format | U+206A-206F, U+FFF0-FFF8 |
| USC014 | Annotation anchors | U+FFF9-FFFB |
| USC015 | Bidi pairing | Unbalanced embedding/override/isolate controls |
| Rule | Category | What it catches |
|---|---|---|
| USC003 | Mixed-script identifier | paylоad mixing Latin + Cyrillic in one identifier |
| USC004 | Confusable collision | scope and scоpe in the same file |
| USC016 | Default-ignorable | Code points that render as nothing |
| USC017 | Homoglyphs | Cyrillic/Greek/Armenian chars that look like Latin |
| Rule | Category | What it catches |
|---|---|---|
| USC005 | Suspicious spacing | NBSP, figure space, thin space, ideographic space |
| USC006 | Normalization drift | Text that changes under NFC/NFKC |
| USC018 | Mixed line endings | CRLF + LF in the same file |
| USC019 | Non-ASCII identifier | Non-ASCII in identifiers when policy is ascii-only |
Rules apply differently based on where the character appears:
| Context | Behavior | Example |
|---|---|---|
| Identifiers | Strictest. Mixed-script, confusable collisions, invisible chars all fail. | paylоad = True |
| Comments | Medium. Most things warn, confusable collisions ignored. | # TODO: fix lаter |
| Strings | Loosest. Only invisible chars warn. | "caf\u00e9" |
20+ languages supported: Python, JavaScript/TypeScript, Go, Rust, Java, C/C++, C#, Ruby, PHP, Shell, YAML, SQL, Swift, Kotlin, Zig, and more.
Implements Unicode TR39 confusable skeleton computation. If two identifiers in the same file collapse to the same skeleton, the check fails with both locations:
HIGH [USC004 confusable-collision] src/auth.py:42:0
confusable with 'scope'
'scоpe' has same skeleton as 'scope' (line 12:0)
near: "scоpe"
Only newly added or modified lines get the full 19-rule check. Critical checks (bidi, tag chars, PUA) always run on all lines. This makes the action usable on existing codebases without noise.
| Risk | File types | Behavior |
|---|---|---|
| High | .py, .js, .go, .rs, .yml, Dockerfile, etc. |
Critical + high findings fail |
| Medium | .md, .html, .css, .xml |
Only critical findings fail |
| Low | .po, locales/** |
Only critical findings fail |
Every finding shows the code point, Unicode name, escaped representation, and surrounding snippet:
CRITICAL [USC001 bidi-control] src/auth.py:118:5
U+202E RIGHT-TO-LEFT OVERRIDE
Bidirectional control character U+202E RIGHT-TO-LEFT OVERRIDE in identifier
near: "isAdmin\u{202E} \u{2066}// check later\u{2069}"
Findings appear in the Security > Code scanning tab:
name: Unicode Safety
on: [pull_request, push]
permissions:
security-events: write
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: dcondrey/unicode-safety-check@v2
id: scan
with:
sarif-file: unicode-safety.sarif
continue-on-error: true
- uses: github/codeql-action/upload-sarif@v3
if: always()
with:
sarif_file: unicode-safety.sarif- uses: dcondrey/unicode-safety-check@v2
with:
scan-mode: all- uses: dcondrey/unicode-safety-check@v2
with:
policy-file: .unicode-safety.yml- uses: dcondrey/unicode-safety-check@v2
id: scan
- run: |
echo "Findings: ${{ steps.scan.outputs.findings }}"
echo "Critical: ${{ steps.scan.outputs.critical }}"
echo "High: ${{ steps.scan.outputs.high }}"| Input | Default | Description |
|---|---|---|
scan-mode |
changed |
changed scans PR diff only; all scans entire repo |
policy-file |
Path to .unicode-safety.yml policy file |
|
exclude-patterns |
Newline-separated path patterns to exclude | |
fail-on-warn |
false |
Treat medium/low findings as errors |
disable-annotations |
false |
Skip inline PR annotations |
sarif-file |
Path to write SARIF output |
| Output | Description |
|---|---|
findings |
Total finding count |
files-scanned |
Number of files scanned |
critical |
Critical severity count |
high |
High severity count |
sarif-file |
Path to SARIF file (if generated) |
Copy policy.default.yml to .unicode-safety.yml in your repo and customize:
version: 1
encoding: utf-8-only
identifier_policy: ascii-only # or: latin-extended, permitted-scripts
# Allowlist: path + character + reason (all required)
allow:
- paths: ["locales/ar/**/*.json"]
characters: [ZWJ, ZWNJ]
reason: "Arabic text shaping requires ZWJ/ZWNJ"
- paths: ["docs/**/*.md"]
characters: [NBSP]
reason: "Non-breaking spaces in documentation"
# Per-context rules: fail, warn, or ignore
contexts:
identifier:
mixed-script: fail
confusable-collision: fail
invisible-format: fail
comment:
mixed-script: warn
invisible-format: warn
string:
invisible-format: warnStrict defaults. Explicit exceptions. Diff-aware checks. Useful failure messages.
- A ZWJ in an Arabic localization string is not the same as a ZWJ in a variable name.
- Allowlists require a path, character, and justification.
- Critical/high findings fail the check. Medium/low findings warn. Teams can tune via policy.
- Single-pass character scanner for performance. No line is iterated more than once.
This action shuts down the class of attacks that rely on hidden controls, invisibles, and homoglyph collisions. It will not solve malicious but fully visible text, parser disagreements downstream, or font-level confusables in every editor.
python3 src/main.py --all # scan everything
python3 src/main.py src/auth.py # scan specific files
python3 src/main.py --all --policy .unicode-safety.yml # with policy
python3 src/main.py --all --sarif-file results.sarif # SARIF outputactions/checkout@v4withfetch-depth: 0- Python 3.8+ (pre-installed on all GitHub runners)
- No pip dependencies;
pyyamlauto-installed only if using a YAML policy file