Perl 5.30.2 replaces some invalid UTF-8 byte sequences inconsistent with current best practices. #166

@flenniken

Perl 5.30.2 replaces some invalid UTF-8 byte sequences in a way that is inconsistent with current best practices.

The Unicode specification says:

An increasing number of implementations are adopting the handling of
ill-formed subsequences as specified in the W3C standard for encoding
to achieve consistent U+FFFD replacements.

See:

For example, the hex byte sequence:

<e0 80 7f>

is decoded and written back out as:

<ef bf bd 7f>

instead of:

<ef bf bd ef bf bd 7f>

Here are a few more examples:

Perl decode: e0 80 80
expected: ef bf bd ef bf bd ef bf bd
got: ef bf bd

Perl decode: f0 80 80 80
expected: ef bf bd ef bf bd ef bf bd ef bf bd
got: ef bf bd

Perl decode: ed ae 80 ed b0 80
expected: ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
got: ef bf bd ef bf bd
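
The expected outputs above can be reproduced with a small reference decoder. The following is a minimal sketch in Python (not Perl, and not the code Perl or the linked test suite uses). It assumes the replacement policy is "emit one U+FFFD for the lead byte plus any continuation bytes that were still valid, then resume at the offending byte", with the allowed byte ranges taken from the Unicode Standard's table of well-formed UTF-8 byte sequences. The names `decode_utf8_replace`, `second_byte_range`, and `to_hex` are hypothetical helpers introduced only for this illustration.

```python
REPLACEMENT = "\ufffd"


def second_byte_range(lead):
    """Allowed range for the byte immediately after `lead` (assumed from
    the Unicode table of well-formed UTF-8 byte sequences)."""
    if lead == 0xE0:
        return 0xA0, 0xBF  # excludes overlong 3-byte forms
    if lead == 0xED:
        return 0x80, 0x9F  # excludes UTF-16 surrogates
    if lead == 0xF0:
        return 0x90, 0xBF  # excludes overlong 4-byte forms
    if lead == 0xF4:
        return 0x80, 0x8F  # excludes values above U+10FFFF
    return 0x80, 0xBF


def decode_utf8_replace(data):
    """Decode bytes as UTF-8, emitting one U+FFFD per ill-formed subpart."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:  # ASCII
            out.append(chr(lead))
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:
            # Stray continuation byte or a byte that can never start a sequence.
            out.append(REPLACEMENT)
            i += 1
            continue

        code = lead & (0x3F >> need)  # payload bits of the lead byte
        low, high = second_byte_range(lead)
        j = i + 1
        complete = True
        for _ in range(need):
            if j < n and low <= data[j] <= high:
                code = (code << 6) | (data[j] & 0x3F)
                j += 1
                low, high = 0x80, 0xBF  # later bytes use the plain range
            else:
                complete = False
                break

        out.append(chr(code) if complete else REPLACEMENT)
        i = j  # on failure, resume at the byte that broke the sequence
    return "".join(out)


def to_hex(text):
    """Render text as space-separated UTF-8 hex bytes, as in the examples."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


if __name__ == "__main__":
    for sample in ("e0 80 7f", "e0 80 80", "f0 80 80 80", "ed ae 80 ed b0 80"):
        print(sample, "->", to_hex(decode_utf8_replace(bytes.fromhex(sample))))
```

Running the script prints each input next to its re-encoded output, which can be compared against the "expected" lines above. The key design choice is that a failed sequence consumes only the bytes that were valid so far, so each remaining byte is examined on its own and gets its own U+FFFD if it cannot start a sequence.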

See https://github.com/flenniken/utf8tests for more information.
