28 votes

The Unlikely Story of UTF-8: The Text Encoding of the Web

4 comments

  1.
    beon
    I've had mad respect for UTF-8 ever since I had to encode it by hand on paper in an exam. Funny how it all came to be. There is also a more detailed recounting by Rob Pike himself here.

    7 votes
    1.
      Akir
      I also love UTF-8. The web could really suck before everyone settled on UTF-8, especially in Japan, where there were two or three competing text encodings and browsers frequently didn't know what to do with them. I've even seen webpages with mixed encodings! The amount of dumb bugfixes for badly coded webpages is truly mind-boggling, which is part of why I kind of liked that XHTML Strict was a thing.

      The only problem I have with Unicode in general is that we squeezed emojis into it. :P

      5 votes
      1. Moonchild
        The only problem I have with Unicode in general is that we squeezed emojis into it

        I dunno, Han unification? Four different normal forms? :P
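
        (A quick illustration of why the normal forms bite, using Python's standard unicodedata module; the point is just that one rendered character can be several different codepoint sequences:)

        ```python
        import unicodedata

        s = "\u00e9"  # 'é' as a single precomposed codepoint
        for form in ("NFC", "NFD", "NFKC", "NFKD"):
            out = unicodedata.normalize(form, s)
            print(form, [hex(ord(c)) for c in out])

        # NFC / NFKC -> ['0xe9']            (precomposed form)
        # NFD / NFKD -> ['0x65', '0x301']   ('e' plus combining acute accent)
        # Same glyph on screen, different codepoints: naive comparison fails.
        ```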

        1 vote
    2. Moonchild
      UTF-8 has, unfortunately, one serious mistake. It was considered desirable for every codepoint to have a unique encoding, so overlong encodings are illegal. But that's a very easy check to miss when writing a decoder. A minor tweak (add 128 to everything encoded with two bytes, and so on for the longer forms) would have avoided the problem altogether, and would have had the nice side effect of increasing density somewhat, while retaining all of the encoding's other desirable features.
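
      A rough sketch of both the pitfall and the tweak (just the two-byte bit patterns in Python, not a real decoder):

      ```python
      def naive_decode_2byte(b1: int, b2: int) -> int:
          # Blindly follow the 110xxxxx 10xxxxxx bit pattern.
          return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

      # The overlong two-byte encoding of '/' (U+002F) is 0xC0 0xAF. A decoder
      # that forgets the range check returns 0x2F, colliding with the legal
      # one-byte form -- the classic path-traversal smuggling trick.
      assert naive_decode_2byte(0xC0, 0xAF) == 0x2F

      # A conforming decoder must reject those same bytes:
      try:
          b"\xc0\xaf".decode("utf-8")
      except UnicodeDecodeError:
          pass  # correct behaviour per the spec

      def offset_decode_2byte(b1: int, b2: int) -> int:
          # The tweak: bias two-byte sequences by 128 (0x80) so they start
          # exactly where the one-byte values end. Every byte sequence then
          # decodes to a distinct codepoint, so overlong forms cannot exist.
          return naive_decode_2byte(b1, b2) + 0x80

      # 0xC0 0xAF now means U+00AF, not '/', with no range check needed;
      # two bytes reach U+087F instead of U+07FF (the density win).
      assert offset_decode_2byte(0xC0, 0xAF) == 0xAF
      ```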

      The rub is that Unicode ended up not enforcing canonicalisation, so in the end it's a bit of a moot point. But I'm certain there are vulnerabilities lurking out there because two entities at different levels of the stack disagree over what counts as validly encoded.

      1 vote