Detect invalid UTF-8 data at end of file when using PerlIO :encoding(utf-8) #59

@hakonhagland

The PerlIO layer :encoding(utf-8) does not seem to report malformed data at the end of a file.
Suppose a file $fn contains valid UTF-8 except for its final character, which has an invalid UTF-8 encoding. I would like a warning about invalid UTF-8 to be printed to STDERR when reading this file, but I have not found a way to get one.
For example:

use feature qw(say);
use strict;
use warnings;

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $bytes = "\x{61}\x{E5}";  # 2 bytes in iso 8859-1: aå
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;

Now $fn contains invalid UTF-8 (the last byte). If I then read the file using the PerlIO layer :encoding(utf-8):

my $str = '';
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
$str = do { local $/; <$fh> };
close $fh;
say "Read string: '$str'";

the output is

Read string: 'a'

Note that no warning such as "\xE5" does not map to Unicode is printed in this case; the invalid byte is silently dropped.

However, if I read the file as bytes and then call Encode::decode() on the raw data, the warning is printed:

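use Encode qw(decode);  # needed for decode(); not loaded in the snippets above

open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };  # declared with my, as required by strict
close $fh;
my $str2 = decode( 'utf-8', $raw_data, Encode::FB_WARN | Encode::LEAVE_SRC );
# warning is printed to STDERR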

Why can't the same be achieved with PerlIO::encoding? Is this a bug?
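
As a stopgap, the raw-read-then-decode approach above could be wrapped in a small helper that dies on malformed input instead of warning. This is only a sketch: slurp_utf8_strict is a hypothetical name, not part of Encode or PerlIO, and it assumes the whole file fits in memory.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode or PerlIO): read a file as raw
# bytes and decode it strictly, dying on malformed UTF-8, including an
# invalid or incomplete sequence at the very end of the file.
sub slurp_utf8_strict {
    my ($fn) = @_;
    open( my $fh, '<:raw', $fn ) or die "Could not open file '$fn': $!";
    my $raw_data = do { local $/; <$fh> };
    close $fh;
    # FB_CROAK makes decode() die instead of warn;
    # LEAVE_SRC keeps $raw_data unmodified.
    return Encode::decode( 'utf-8', $raw_data,
        Encode::FB_CROAK | Encode::LEAVE_SRC );
}

# Usage: recreate the file from the example above (ends in a stray 0xE5 byte).
my $fn = 'test.txt';
open( my $out, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $out "\x{61}\x{E5}";
close $out;

my $str = eval { slurp_utf8_strict($fn) };
warn "Invalid UTF-8 in '$fn': $@" unless defined $str;
```

This does not answer whether the PerlIO behaviour is a bug; it only avoids the layer, so that the final byte is checked by decode() itself.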
