feat: Add compatibility with chardet 6.0.0+ and fix encoding issues #1141

Merged
nathan-stender merged 4 commits into main from fix/chardet-6-compatibility
Mar 12, 2026

Conversation

@nathan-stender
Collaborator

Summary

  • Updates allotropy to be compatible with chardet 6.0.0+ (including chardet 7.x)
  • Fixes encoding detection issues that were causing test failures
  • Corrects mojibake in test data files

Changes

  • Updated chardet requirement from < 6.0.0 to >= 6.0.0 in pyproject.toml
  • Improved encoding detection logic in src/allotropy/parsers/utils/encoding.py:
    • Added fallback to Windows-1252 for very low confidence detections (< 0.3)
    • Better handling of single-byte special characters (en dash \x96, registered trademark \xae, micro symbol \xb5)
    • Added BOM (Byte Order Mark) stripping for UTF-16 and UTF-8 files
  • Fixed test data that contained mojibake from incorrect encoding detection:
    • Corrected the mojibake rendering of "®" in Luminex test JSON files
    • Replaced "�" (the U+FFFD replacement character) with the correct "µl" in test expectations
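
The fallback and BOM-stripping behavior listed above could look roughly like the sketch below. It is not the code in src/allotropy/parsers/utils/encoding.py: the function names, the BOM table, and the choice to pass in a chardet-style result dict (with "encoding" and "confidence" keys) are all assumptions made for illustration. Only the 0.3 threshold and the Windows-1252 fallback come from the description.

```python
import codecs

LOW_CONFIDENCE = 0.3  # threshold named in the PR description
FALLBACK_ENCODING = "windows-1252"

# Hypothetical lookup table for this sketch, not allotropy's actual code.
# UTF-32 BOMs are deliberately not handled here.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def pick_encoding(detection: dict) -> str:
    """Choose an encoding from a chardet-style result dict."""
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    # Very low confidence (or no guess at all): fall back to Windows-1252.
    if encoding is None or confidence < LOW_CONFIDENCE:
        return FALLBACK_ENCODING
    return encoding


def decode_stripping_bom(raw: bytes, detection: dict) -> str:
    """Decode raw bytes, trusting a BOM over the detector when one is present."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Drop the BOM bytes so they do not leak into the decoded text.
            return raw[len(bom):].decode(encoding)
    return raw.decode(pick_encoding(detection))
```

In real use the `detection` argument would be the dict returned by `chardet.detect(raw)`. Taking it as a parameter keeps the sketch independent of the installed chardet version.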

Background

Chardet 6.0.0 was released on February 22, 2026, followed by 7.0.x releases in March 2026. These versions introduced breaking changes in how they detect character encodings, particularly for:

  • Short byte sequences with Windows-1252 characters
  • Files with UTF-16 LE encoding (used by SoftMax Pro)
  • Single-byte special characters like en dash, registered trademark, and micro symbol
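
To make the failure modes concrete, the following minimal sketch reproduces both kinds of corruption mentioned in this PR using only the standard library. The helper names are hypothetical and exist only for this illustration. The first pair shows how UTF-8 text read as Latin-1 becomes mojibake and how that can be reversed. The last helper shows how a lone Windows-1252 byte turns into the replacement character when forced through UTF-8.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Simulate a detector wrongly reporting Latin-1 for UTF-8 bytes."""
    return text.encode("utf-8").decode("latin-1")


def repair_latin1_mojibake(garbled: str) -> str:
    """Undo the misread above by round-tripping through Latin-1."""
    return garbled.encode("latin-1").decode("utf-8")


def decode_as_utf8_lossy(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="replace")


garbled = misread_utf8_as_latin1("Luminex\u00ae")  # each non-ASCII char becomes two
restored = repair_latin1_mojibake(garbled)          # back to the original string

micro_cp1252 = b"\xb5l"                             # "µl" encoded as Windows-1252
lossy = decode_as_utf8_lossy(micro_cp1252)          # micro sign lost to U+FFFD
correct = micro_cp1252.decode("windows-1252")       # micro sign preserved
```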

Testing

  • All 815 tests pass with chardet 7.0.1
  • Specifically tested:
    • Luminex Intelliflex parser (UTF-8 with special characters)
    • Luminex xPONENT parser (Windows-1252 µ character)
    • MolDev SoftMax Pro parser (UTF-16 LE with BOM)
    • Encoding utility tests for Windows-1252 detection

🤖 Generated with Claude Code

nathan-stender and others added 2 commits March 11, 2026 17:26
- Update pyproject.toml to require chardet >= 6.0.0
- Improve encoding detection logic to handle chardet 7.x behavior changes:
  - Add fallback to windows-1252 for very low confidence detections (<0.3)
  - Better handling of single-byte special characters (en dash, ®, µ)
- Add BOM (Byte Order Mark) stripping for UTF-16 and UTF-8 files
- Fix test data that contained mojibake from incorrect encoding detection
  - Corrected the mojibake rendering of "®" in expected JSON files
  - Replaced "�" (replacement character) with the proper "µ" symbol

These changes ensure proper handling of various file encodings including
UTF-16 LE (used by SoftMax Pro), Windows-1252, and UTF-8 with BOM.

Co-Authored-By: Claude Opus 4.1 <noreply@anthropic.com>
@nathan-stender nathan-stender requested review from a team and slopez-b as code owners March 11, 2026 21:37
nathan-stender and others added 2 commits March 11, 2026 17:50
- Changed chardet requirement from >= 6.0.0 to >= 5.2.0 to allow consumers flexibility
- Enhanced encoding detection to handle differences between chardet versions:
  - Always try UTF-8 first when Latin-1 family encodings are detected
  - This prevents mojibake when chardet 5.x misdetects UTF-8 as ISO-8859-1
- Tests now pass with both chardet 5.2.0 and 7.0.1

This allows consumers to use any chardet version >= 5.2.0 without being forced to upgrade.
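
A minimal sketch of the "try UTF-8 first" rule described in this commit, assuming the detector's guess arrives as a plain string. The function name and the exact membership of the Latin-1 family set are hypothetical, not taken from allotropy.

```python
# Assumed membership of the "Latin-1 family"; the real list may differ.
LATIN1_FAMILY = {"iso-8859-1", "latin-1", "windows-1252", "cp1252"}


def decode_preferring_utf8(raw: bytes, detected_encoding: str) -> str:
    """Decode raw bytes, attempting strict UTF-8 before a Latin-1 family guess."""
    if detected_encoding.lower() in LATIN1_FAMILY:
        try:
            # Strict UTF-8 rejects most genuine Latin-1 data with high bytes,
            # so success here suggests the detector misread UTF-8 input.
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            pass  # not valid UTF-8: keep the detector's guess
    return raw.decode(detected_encoding)
```

The design relies on UTF-8 being self-validating: a lone byte such as \xb5 is not valid UTF-8, so genuine Windows-1252 input falls through to the detected encoding, while pure ASCII input decodes identically either way.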

Co-Authored-By: Claude Opus 4.1 <noreply@anthropic.com>
@nathan-stender nathan-stender merged commit 37f3d19 into main Mar 12, 2026
7 checks passed
@nathan-stender nathan-stender deleted the fix/chardet-6-compatibility branch March 12, 2026 16:42
nathan-stender added a commit that referenced this pull request Mar 17, 2026
### Added

- Cytiva Biacore Insight - Add support for Affinity and Concentration
analysis files (#1137)
- Add compatibility with chardet 6.0.0+ and fix encoding issues (#1141)

### Fixed

- Fix Perkin Elmer Envision parser to recognize A450 labels as
absorbance (#1152)
- Optimize test encoding detection for 4x speedup (#1143)
- Fix GitHub Actions hatch/virtualenv compatibility (#1140)