24 votes

It’s Not Wrong that "🤦🏼‍♂️".length == 7

Posted October 19, 2023 by skybrian

Tags: strings, unicode, language design, long read, emojis

https://hsivonen.fi/string-length/

Link information

This data is scraped automatically and may be incorrect.

Authors: Henri Sivonen
Word count: 8392 words

7 comments

skybrian (OP)
October 19, 2023
This is a ridiculously thorough analysis of the issues involved in designing a good string API, how various programming languages have gotten it wrong, and why some arguments about how it should be done are wrong. (Saved for future reference.)

Text is very complicated.

15 votes
winther
October 19, 2023
Oh man, speaking of string length. I work with SMS, and we have our own fun determining the length of a single SMS. A single message holds 160 characters, but if the total message is longer, each part holds only 153, because some extra bytes are needed to combine the parts. To make it even more complicated, we use a special 7-bit encoding scheme with a very limited character set, called GSM 03.38. On top of that, some special characters like € and [ count as two characters because they take more bits to encode. Always fun to explain that to customers.
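
The counting rules described above can be sketched roughly as follows. This is a simplified illustration, not how any particular SMS gateway does it: the function names are made up, the code assumes every character is encodable in GSM 03.38 (it does no validation), it omits the form feed from the extension table, and it ignores the detail that a two-unit character cannot be split across two parts.

```python
import math

# Printable characters in the GSM 03.38 extension table. Each one is sent as
# an escape unit plus the character itself, so it costs two 7-bit units.
GSM_EXTENDED = set("^{}\\[~]|€")

SINGLE_LIMIT = 160  # 7-bit units in a standalone message
PART_LIMIT = 153    # 7-bit units per part once a concatenation header is needed


def gsm_septets(text: str) -> int:
    """Count 7-bit units, assuming every character is GSM 03.38-encodable."""
    return sum(2 if ch in GSM_EXTENDED else 1 for ch in text)


def sms_parts(text: str) -> int:
    """Estimate how many SMS parts the text needs (simplified, see caveats)."""
    septets = gsm_septets(text)
    if septets <= SINGLE_LIMIT:
        return 1
    return math.ceil(septets / PART_LIMIT)


print(gsm_septets("Price: 5€ [approx]"))  # 18 characters, 3 of them cost double
print(sms_parts("a" * 160), sms_parts("a" * 161))
```

So a message of 80 euro signs still fits in one SMS, while 81 of them spill into a second part, which is the kind of thing customers find surprising.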

12 votes
[2]
arqalite
October 19, 2023
This was immensely informative. I don't work with strings at a low enough level to actually need this information, but I loved learning how complex Unicode really is.

Also, the fact that the Universal Declaration of Human Rights isn't properly translated into a lot of languages is weird.

7 votes
1. Akir
  October 19, 2023
  Language is hard, and Unicode is kind of an abomination because it tries to be the one way to represent every language and idea, and that's why we get "dumb" things like emoji that are actually made up of multiple characters. But the alternatives are just dumb for different reasons. Representing multiple languages is hard and requires some level of compromise to implement, and the nice thing about Unicode is that it basically enforces the use of the same set of compromises across everything that uses it.
  
  6 votes
[3]
xk3
October 19, 2023 (edited October 19, 2023)
Fantastic in-depth article! Thanks for sharing.

Recently I found it funny that len('…'.encode()) == len('...'.encode()). The first one is a single ellipsis character, the second is three dots; both come out as 3 bytes in UTF-8.

This is obvious, but for those who don't know: if you encode a string to bytes with the same encoding (UTF-8 here), you should get the same byte count in every programming language.

e.g. Python will say len('🤦🏼‍♂️'.encode()) == 17
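
To make the different "lengths" concrete, here is a small sketch in Python. The emoji is written with escape sequences so the five code points are visible; the variable names are just for illustration. It counts code points, UTF-16 code units (what JavaScript's .length reports, hence the 7 in the title), and UTF-8 bytes.

```python
# The facepalm emoji from the title: face palm, skin tone modifier,
# zero-width joiner, male sign, variation selector.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = len(facepalm)                           # Python's len() counts code points
utf16_units = len(facepalm.encode("utf-16-le")) // 2  # 2 bytes per UTF-16 code unit
utf8_bytes = len(facepalm.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)  # 5 7 17

# The ellipsis character and three ASCII dots happen to have the same UTF-8 size.
ellipsis_bytes = len("\u2026".encode("utf-8"))
dots_bytes = len("...".encode("utf-8"))
print(ellipsis_bytes, dots_bytes)  # 3 3
```

Only the byte counts are encoding-defined and therefore language-independent; the code point and code unit counts depend on how each language defines string length.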

5 votes
1. [2]
  skybrian (OP)
  October 19, 2023
  In which language?
  
  1 vote
  1. xk3
    October 19, 2023
    As far as I know, both of those snippets should give the same result in all languages.
    
    For example in JavaScript:
    
    const utf8Encoder = new TextEncoder();
    utf8Encoder.encode('…').length === utf8Encoder.encode('...').length;
    utf8Encoder.encode('🤦🏼‍♂️').length === 17;
    
    Both this and the Python above count the length in bytes, not the string length.
    
    3 votes