Skip to content

Handle CDATA with UTF-8 characters when partial parsing#133

Open
tomtaylor wants to merge 1 commit intoqcam:masterfrom
tomtaylor:cdata-fix
Open

Handle CDATA with UTF-8 characters when partial parsing#133
tomtaylor wants to merge 1 commit intoqcam:masterfrom
tomtaylor:cdata-fix

Conversation

@tomtaylor
Copy link
Copy Markdown

@tomtaylor tomtaylor commented Aug 5, 2024

Follows on from #122.

In partial mode, UTF-8 encoded characters might be split across multiple chunks. When this happens for a character such as £, which is encoded as <<0xC2, 0xA3>>, the 0xC2 is neither an ASCII character (<= 127), nor does it match the <<codepoint::utf-8>> clause, and Saxy throws a parser error.

This fixes that by just parsing all the bytes inside a CDATA element regardless of their code point. It drops the UTF-8 character optimisation, but I suspect that's probably a minor performance improvement for most documents.

@qcam is this a more prevalent issue than my use case? I can see why matching on UTF-8 codepoint and swallowing the whole character is a nice optimisation, but I wonder if it might cause issues in other places when partial parsing.

Don't assume that we're always seeing a full UTF-8 character. In partial mode,
UTF-8 encoded characters might be split across multiple chunks.
@tomtaylor tomtaylor changed the title Handle CDATA containing partial UTF-8 characters Handle CDATA with UTF-8 characters when partial parsing Aug 5, 2024
@tomtaylor
Copy link
Copy Markdown
Author

@qcam any thoughts on this?

element_cdata(rest, more?, original, pos, state, len + 1)

<<codepoint::utf8>> <> rest ->
element_cdata(rest, more?, original, pos, state, len + Utils.compute_char_len(codepoint))
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can the same way how dangling UTF-8 fragments is handled

For example https://github.com/qcam/saxy/blob/master/lib/saxy/parser/builder.ex#L540-L541

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow I'm afraid. Would you prefer to take this PR over if it's a quick fix at your end?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants