@andreww
Owner

@andreww andreww commented Apr 3, 2018

Skip any Byte Order Marks and keep trying. We may be able to read some files.

andreww and others added 2 commits April 3, 2018 17:06
Try reading two files with Byte Order Marks.
test_sax_fsm_1_utf8_bom.in is UTF-8 encoded, although
the XML declaration does not mention the encoding.
We should be able to read this as long as we don't
trip over the BOM. test_sax_fsm_1_utf16_bom.in
is UTF-16 encoded, with a BOM and with the encoding
declared in the XML. We should not be able to read this
(we should get a non-well-formed error).
For a UTF-8 encoded XML file that starts with a Byte Order Mark
and otherwise contains only characters that are also ASCII
characters, we should be able to read the file. If the first
character is not recognisable, assume we are dealing with a BOM,
skip it, and carry on. We'll then either read the file OK or
end up with something that is not well-formed (e.g. because
it is in a different encoding).

2 participants