28 votes

The Unlikely Story of UTF-8: The Text Encoding of the Web

4 comments

  1.
    beon
    I've had mad respect for UTF-8 ever since I had to encode it by hand on paper in an exam. Funny how it all came to be. There is also a more detailed recounting by Rob Pike himself here.

    7 votes
    1.
      Akir
      I also love UTF-8. The web could really suck before everyone settled on UTF-8, especially in Japan, where there were two or three competing text encodings and browsers frequently didn't know what to do with them. I've even seen webpages with mixed encodings! The amount of dumb bugfixes for badly coded webpages is truly mind-boggling, which is part of why I kind of liked that XHTML Strict was a thing.

      The only problem I have with Unicode in general is that we squeezed emojis into it. :P

      5 votes
      1. Moonchild
        The only problem I have with Unicode in general is that we squeezed emojis into it

        I dunno, Han unification? Four different normal forms? :P
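
        (A quick illustration of why the normal forms bite, using Python's standard unicodedata module; the point is just that one rendered character can be several different codepoint sequences:)

        ```python
        import unicodedata

        s = "\u00e9"  # 'é' as a single precomposed codepoint
        for form in ("NFC", "NFD", "NFKC", "NFKD"):
            out = unicodedata.normalize(form, s)
            print(form, [hex(ord(c)) for c in out])

        # NFC / NFKC -> ['0xe9']            (precomposed form)
        # NFD / NFKD -> ['0x65', '0x301']   ('e' plus combining acute accent)
        # Same glyph on screen, different codepoints: naive comparison fails.
        ```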

        1 vote
    2. Moonchild
      UTF-8 has, unfortunately, one serious mistake. It was considered desirable for every codepoint to have a unique encoding, so overlong encodings are illegal. But that's a very easy check to miss when writing a decoder. A minor tweak (add 128 to everything encoded with two bytes, and so on for the longer forms) would have avoided the problem altogether, and would have had the nice side effect of increasing density somewhat, while retaining all of the encoding's other desirable features.
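
      A rough sketch of both the pitfall and the tweak (just the two-byte bit patterns in Python, not a real decoder):

      ```python
      def naive_decode_2byte(b1: int, b2: int) -> int:
          # Blindly follow the 110xxxxx 10xxxxxx bit pattern.
          return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

      # The overlong two-byte encoding of '/' (U+002F) is 0xC0 0xAF. A decoder
      # that forgets the range check returns 0x2F, colliding with the legal
      # one-byte form -- the classic path-traversal smuggling trick.
      assert naive_decode_2byte(0xC0, 0xAF) == 0x2F

      # A conforming decoder must reject those same bytes:
      try:
          b"\xc0\xaf".decode("utf-8")
      except UnicodeDecodeError:
          pass  # correct behaviour per the spec

      def offset_decode_2byte(b1: int, b2: int) -> int:
          # The tweak: bias two-byte sequences by 128 (0x80) so they start
          # exactly where the one-byte values end. Every byte sequence then
          # decodes to a distinct codepoint, so overlong forms cannot exist.
          return naive_decode_2byte(b1, b2) + 0x80

      # 0xC0 0xAF now means U+00AF, not '/', with no range check needed;
      # two bytes reach U+087F instead of U+07FF (the density win).
      assert offset_decode_2byte(0xC0, 0xAF) == 0xAF
      ```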

      The rub is that Unicode ended up not enforcing canonicalisation, so in the end it's a bit of a moot point. But I'm certain there are vulnerabilities lurking out there because two entities at different levels of the stack disagree over what counts as validly encoded.

      1 vote