
Feedback: Unicode is a mess #17

@tommai78101

Credits to /u/Ladis_Wascheharuum for providing me constructive feedback.

Sorry, but I don't like this at all. If this is meant to be a primer, you need to introduce the concepts in a way that someone new can understand them. Instead, you jump in deep and kinda go all over the place. Plus there are a few errors that make it more confusing.

The Unicode Standard defines the information of a Unicode character, namely the Unicode Transformation Formats (UTF),

What? The "information" of a Unicode character would be its code point, class, decomposition, etc. A UTF is not a property (or "information") of any character, it's an encoding format that applies to code points generally.

To briefly explain what UCS-2 is, this scheme uses a single “code value” containing one or more “code points” assigned to the “code space” between 0 and 65,535 for each character, and allows 2 bytes, or 1 16-bit word, to represent that value. Thus, the “2” in UCS-2 refers to the “2-byte encoding” scheme.

This is headache-inducing to anyone who isn't already familiar with all these terms. It's also technically wrong. UCS-2 code values correspond directly to code points, one-to-one. The "or more" applies to UTF-16, not UCS-2.
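To make the distinction concrete, here is a minimal Python sketch that lists the 16-bit code units UTF-16 produces for a single character. The helper name `utf16_code_units` is made up for illustration and is not from the primer. A character inside the original 16-bit code space takes exactly one unit, which is all UCS-2 can express; a character above U+FFFF takes two.

```python
def utf16_code_units(ch):
    """Return the list of 16-bit code units UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+0041 is in the original 16-bit code space: one code unit, same as UCS-2.
bmp_units = utf16_code_units("A")

# U+1F600 lies above U+FFFF: UTF-16 needs two code units, UCS-2 cannot encode it.
astral_units = utf16_code_units("\U0001F600")

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
```

So "one or more code units per code point" describes UTF-16; in UCS-2 the count is always one.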

Then you have a history lesson about East Asian characters in the UTF-16 section. If you're explaining UTF-16, you should explain it as a method of expressing more code points in 16-bit code units. Save the history lesson for another section, talking about how the code space was expanded because the original 65K was deemed inadequate.
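As a rough sketch of what "expressing more code points in 16-bit code units" means, the surrogate pair arithmetic can be written out in a few lines of Python. The function name `to_surrogate_pair` is hypothetical; the constants (0x10000, 0xD800, 0xDC00) are the ones UTF-16 uses.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in 16 bits.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001D11E".encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))
```

Code points at or below U+FFFF skip this step and are stored as a single code unit.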

Okay:

In general, you need to lay this out so that each section introduces a solid concept, then each following section builds on that knowledge. The way I'd do it is:

  • A short history of characters (ASCII, extended ASCII, code pages). Seriously, keep it brief.
  • Unicode invented as a way to encode all characters in all languages. (Define "Unicode character" and "abstract character" here.) Each character is assigned a code point. Code points are stable. Code points are just numbers.
  • Mention code space; original 16-bit, then expanded.
  • Introduce UTFs as a means of encoding code points in binary data. (Mention that UCS-2 was designed for the original, historic 16-bit code space.) Talk about pros and cons of each, get into technical explanations of each one here.

The big mistake people explaining Unicode make is trying to explain UTFs before explaining code points. Code points are just numbers: a stable index of characters from all written languages. This is the heart of Unicode and the most important thing. UTFs are just ways of storing code points in data.
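A small Python sketch of that separation, using the euro sign as an arbitrary example character (my choice, not the primer's): the code point is one number, and each UTF is just a different byte layout for it.

```python
ch = "\u20ac"                 # EURO SIGN
code_point = ord(ch)          # the code point is just a number

# Three UTFs, three different byte sequences for the same code point.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(f"U+{code_point:04X}")
print(utf8.hex(), utf16.hex(), utf32.hex())

# Decoding any of them gives back the same code point.
decoded = [
    ord(utf8.decode("utf-8")),
    ord(utf16.decode("utf-16-be")),
    ord(utf32.decode("utf-32-be")),
]
print(decoded)
```

The bytes differ in length and content, but the number they stand for never changes.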

Labels: bug, good first issue, help wanted, info
