\x85 is an important part of international characters #43

@dreamyguy

\x85 is, unfortunately, a hexadecimal escape sequence for a byte value that appears in the UTF-8 encoding of many international characters. It all comes down to encoding, but since I'm digging into some legacy code, I could not prevent ISO-8859-1 strings from ending up UTF-8-ised.

The example below illustrates that:

// UTF-8-ized Latin-1/ISO-8859-1 strings
var str0 = 'Nguy�n Thái Ng�c Duy',
    str1 = 'Adam Pi�tyszek',
    str2 = '��',
    str3 = '�彦',
    str4 = '����',
    str5 = '���',
    str6 = 'QQ�� ��K� QQ空� QQ',
    str7 = '�亨財���';

// decode UTF-8-ized Latin-1/ISO-8859-1 to UTF-8
var decode = function(str) {
  var s;
  try {
    // if the string is UTF-8, this will work and not throw an error.
    s = decodeURIComponent(escape(str));
  } catch(e) {
    // if it isn't, an error will be thrown, and we can assume that we have an ISO string.
    s = str;
  }
  return s;
};

console.log('str0: ' + decode(str0)); // str0: Nguyễn Thái Ngọc Duy
console.log('str1: ' + decode(str1)); // str1: Adam Piątyszek
console.log('str2: ' + decode(str2)); // str2: 즈눅
console.log('str3: ' + decode(str3)); // str3: 元彦
console.log('str4: ' + decode(str4)); // str4: 入门教程
console.log('str5: ' + decode(str5)); // str5: 陈光远
console.log('str6: ' + decode(str6)); // str6: QQ音乐 全民K歌 QQ空间 QQ
console.log('str7: ' + decode(str7)); // str7: 鉅亨財經新聞

PS: It seems that the (\x85) character is omitted when I enter text in GitHub's editor, so I don't know if the code above will run correctly.
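Since the raw byte does not survive pasting, here is a minimal sketch that spells the bytes out with `\x` escapes instead. It assumes the referenced change amounts to removing `\x85` from the string; the helper names `toMojibake` and `fromMojibake` are made up for this illustration and are not part of the library.

```javascript
// Hypothetical helpers for illustration only; not part of the library.

// Encode a string as UTF-8 and expose each byte as one Latin-1 character
// (this produces the "UTF-8-ized" form described above).
function toMojibake(str) {
  return unescape(encodeURIComponent(str));
}

// Reverse: treat each character as a byte and decode the bytes as UTF-8.
// Falls back to the input when the bytes are not valid UTF-8.
function fromMojibake(str) {
  try {
    return decodeURIComponent(escape(str));
  } catch (e) {
    return str;
  }
}

// "ą" (U+0105) is encoded in UTF-8 as the two bytes C4 85.
var garbled = 'Adam Pi\xC4\x85tyszek';
console.log(fromMojibake(garbled)); // Adam Piątyszek

// Assumption: the change strips \x85. The lead byte C4 is then left without
// its continuation byte, decoding throws, and the string comes back unrepaired.
var stripped = garbled.replace(/\x85/g, '');
console.log(fromMojibake(stripped) === stripped); // true
```

With `\x85` intact the name round-trips; once it is removed, the character cannot be reconstructed from what is left.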

This refers to the change introduced by this line. I'm sticking to v4.2.2 for now, great stuff! 👍
