This repository was archived by the owner on Jan 9, 2025. It is now read-only.

Handle non-ascii characters in url #193

@kolesar-andras


The Zombie driver fails when a URL contains "high bytes", i.e. non-ASCII characters. The following example contains a valid Hungarian word with accented characters.

https://hu.wikipedia.org/wiki/Műemlék

Desktop browsers and Mink Goutte driver translate the high bytes correctly:

https://hu.wikipedia.org/wiki/M%C5%B1eml%C3%A9k

The Zombie driver sends the string as-is to JavaScript, and characters above 0x7f then get corrupted somewhere in Zombie:

https://hu.wikipedia.org/wiki/Mqeml\xe9k

It's a bit strange how characters are truncated:

  • letter é becomes \xe9, which is its character code in ISO-8859-1
  • letter ű becomes q because this character does not exist in that code page

Characters that don't exist in ISO-8859-1 are replaced with regular letters, for example q, so the damage is irreversible.
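
One possible explanation, offered as an assumption rather than something confirmed in Zombie's code: the output matches what you get if each character's Unicode code point is cut down to its lowest 8 bits. The letter ű is U+0171, and 0x71 is the ASCII code of q; the letter é is U+00E9, which stays \xe9. A minimal Python sketch of that hypothesis (`truncate_to_low_byte` is a hypothetical helper, not part of Zombie or Mink):

```python
def truncate_to_low_byte(text: str) -> str:
    """Keep only the lowest 8 bits of every code point.

    This models the *suspected* corruption; it is an assumption about
    what happens to the URL, not code taken from Zombie.
    """
    return "".join(chr(ord(ch) & 0xFF) for ch in text)


mangled = truncate_to_low_byte("Műemlék")

# Every code point is now below 0x100, so the string can be written
# out as ISO-8859-1 bytes, which is what ends up in the request.
print(mangled.encode("latin-1"))
```

If this is what happens, the original character cannot be recovered from the result, because ű and q both end up as the same byte.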

The example shows that desktop browsers translate non-ASCII characters to percent-encoded bytes using their UTF-8 encoding:

  • letter é becomes %C3%A9
  • letter ű becomes %C5%B1

That's correct; web servers expect URLs in this form.
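
To illustrate the expected translation, here is a rough sketch in Python of percent-encoding the path before the URL is handed to the browser. `encode_url_path` is a hypothetical helper, not an existing function of the driver, and it only handles the path; query strings and internationalized host names would need separate treatment. Treating `%` as safe is a simplifying assumption so that already-encoded URLs are not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the URL path as UTF-8 bytes."""
    parts = urlsplit(url)
    # quote() encodes as UTF-8 by default; "/" keeps path separators,
    # "%" keeps escapes that are already present (assumption: any "%"
    # in the input starts a valid escape sequence).
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


print(encode_url_path("https://hu.wikipedia.org/wiki/Műemlék"))
# https://hu.wikipedia.org/wiki/M%C5%B1eml%C3%A9k
```

Passing the already-encoded URL through the same helper leaves it unchanged, so it would be safe to apply to every URL the driver visits.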
