Perl 5.30.2 replaces some invalid UTF-8 byte sequences in a way that is inconsistent with current best practices.

The Unicode specification says:

> An increasing number of implementations are adopting the handling of
> ill-formed subsequences as specified in the W3C standard for encoding
> to achieve consistent U+FFFD replacements.

See:

* [Unicode 14.0](https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf) -- Unicode 14.0 Specification -- Conformance page 126, section 3.9.
* [w3.org Encoding](http://www.w3.org/TR/encoding/) -- w3.org encoding

For example, the hex byte sequence:

<e0 80 7f>

is decoded by Perl to (shown as UTF-8 bytes):

<ef bf bd 7f>

instead of:

<ef bf bd ef bf bd 7f>

Here are a few more examples:

Perl decode: e0 80 80
expected: ef bf bd ef bf bd ef bf bd
     got: ef bf bd

Perl decode: f0 80 80 80
expected: ef bf bd ef bf bd ef bf bd ef bf bd
     got: ef bf bd

Perl decode: ed ae 80 ed b0 80
expected: ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
     got: ef bf bd ef bf bd
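
To make the "expected" rows above easier to check, here is a minimal Python sketch of the byte-by-byte UTF-8 decoding steps described in the W3C/WHATWG Encoding standard. It is only an illustration under my reading of that standard. The names `decode_utf8_w3c` and `to_hex` are made up for this example, and the code is neither Perl's implementation nor the test code from the utf8tests repository.

```python
def decode_utf8_w3c(data: bytes) -> str:
    """Decode bytes as UTF-8, replacing ill-formed input with U+FFFD.

    Sketch of the Encoding standard's UTF-8 decoder: a failed sequence
    yields one U+FFFD, and the offending byte is then processed again
    as the possible start of a new sequence.
    """
    out = []
    code_point = 0
    needed = 0                  # continuation bytes still expected
    lower, upper = 0x80, 0xBF   # allowed range for the next continuation byte
    i = 0
    while i < len(data):
        b = data[i]
        if needed == 0:
            if b <= 0x7F:
                out.append(chr(b))
            elif 0xC2 <= b <= 0xDF:
                needed, code_point = 1, b & 0x1F
            elif 0xE0 <= b <= 0xEF:
                if b == 0xE0:
                    lower = 0xA0    # rejects overlong 3-byte forms
                elif b == 0xED:
                    upper = 0x9F    # rejects UTF-16 surrogates
                needed, code_point = 2, b & 0x0F
            elif 0xF0 <= b <= 0xF4:
                if b == 0xF0:
                    lower = 0x90    # rejects overlong 4-byte forms
                elif b == 0xF4:
                    upper = 0x8F    # rejects values above U+10FFFF
                needed, code_point = 3, b & 0x07
            else:
                out.append("\ufffd")    # byte can never start a sequence
            i += 1
        elif lower <= b <= upper:
            lower, upper = 0x80, 0xBF
            code_point = (code_point << 6) | (b & 0x3F)
            needed -= 1
            if needed == 0:
                out.append(chr(code_point))
                code_point = 0
            i += 1
        else:
            # Bad continuation byte: one U+FFFD for the bytes consumed so
            # far, then reprocess this byte (i is deliberately not advanced).
            code_point, needed = 0, 0
            lower, upper = 0x80, 0xBF
            out.append("\ufffd")
    if needed:
        out.append("\ufffd")    # input ended in the middle of a sequence
    return "".join(out)


def to_hex(text: str) -> str:
    """Show the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    for sample in ("e0 80 7f", "e0 80 80", "f0 80 80 80", "ed ae 80 ed b0 80"):
        print(f"{sample} -> {to_hex(decode_utf8_w3c(bytes.fromhex(sample)))}")
```

With this sketch, each byte of an overlong or surrogate sequence produces its own U+FFFD, which matches the "expected" rows above rather than the "got" rows.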

See https://github.com/flenniken/utf8tests for more information.

