This repository was archived by the owner on Apr 26, 2020. It is now read-only.

Remove control characters from input HTML #14

Open
jagermesh wants to merge 1 commit into TobiaszCudnik:master from jagermesh:master

Conversation

@jagermesh

Control characters in the input cause a problem on servers with libxml 2.6.7 (most current Linux servers). In some cases the resulting HTML becomes malformed, with extra opening/closing body/html tags added:

Input:

string(128) "<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
</head></body><b>BEL</b><b>normal</b></body></html>"

Output:

string(250) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0
Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="Content-Type"
content="text/html;charset=UTF-8"></head>
<body><b></b></body>
<html><b>normal</b></html>
</html>
"

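The remedy named in the title, removing control characters from the HTML before it reaches the parser, can be illustrated with a minimal sketch. The actual patch is not shown on this page, so this is only an assumption about the general approach, written in Python rather than the project's own language. The function name `strip_control_chars` and the exact character range are hypothetical.

```python
import re

# Assumption: strip the C0 control characters and DEL (0x7F), but keep
# tab (0x09), line feed (0x0A) and carriage return (0x0D), which are
# legitimate whitespace in HTML.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(html: str) -> str:
    """Return `html` with non-whitespace control characters removed."""
    return _CONTROL_CHARS.sub("", html)


# Example: a BEL character (0x07) inside the first <b> element.
sample = "<body><b>\x07</b><b>normal</b></body>"
cleaned = strip_control_chars(sample)
```

Cleaning the markup before parsing keeps the workaround independent of the libxml version, at the cost of silently dropping characters the author may have intended to keep.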