24 votes

It’s Not Wrong that "🤦🏼‍♂️".length == 7

7 comments

  1. skybrian
    This is a ridiculously thorough analysis of the issues involved in designing a good string API, how various programming languages have gotten it wrong, and why some arguments about how it should be done are wrong. (Saved for future reference.)

    Text is very complicated.

    15 votes
  2. winther
    Oh man, speaking of string length. I work with SMS and we have our own fun with determining the length of a single SMS. A single message is 160 characters, but if the total message is longer, each part holds only 153, because some extra bytes are needed to combine the parts. To make it even more complicated, we use a special 7-bit encoding scheme with a very limited character set called GSM 03.38. Add to that, some special characters like € and [ count as two characters because they use more bits to encode. Always fun to explain that to customers.
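    The counting rules above can be sketched roughly as follows. This is only an illustration, not a real SMS library: the names `gsm_units` and `sms_parts` are made up, the input is assumed to contain only characters GSM 03.38 can encode, and keeping a two-slot character whole within one part is an assumption about how splitting works.

```python
# Printable characters from the GSM 03.38 extension table. Each one is sent
# with an escape prefix, so it occupies two 7-bit slots instead of one.
GSM_EXTENDED = set("^{}\\[~]|€")

SINGLE_LIMIT = 160  # slots available in one standalone message
CONCAT_LIMIT = 153  # slots per part once parts must be joined together


def gsm_units(text: str) -> int:
    """Count 7-bit slots, assuming every character is GSM 03.38 encodable."""
    return sum(2 if ch in GSM_EXTENDED else 1 for ch in text)


def sms_parts(text: str) -> int:
    """Estimate how many SMS parts the text needs (hypothetical helper)."""
    if gsm_units(text) <= SINGLE_LIMIT:
        return 1
    parts, used = 1, 0
    for ch in text:
        cost = 2 if ch in GSM_EXTENDED else 1
        # Assumption: a two-slot character is never split across parts.
        if used + cost > CONCAT_LIMIT:
            parts += 1
            used = 0
        used += cost
    return parts
```

    With this sketch, 160 plain letters fit in one message, 161 need two parts, and a string of 81 euro signs already needs two parts even though it is only 81 characters long.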

    12 votes
  3. [2]
    arqalite
    This was immensely informative. I don't work with strings at such a low level that I actually need this information, but I loved learning how complex Unicode really is.

    Also the fact that the Declaration of Human Rights isn't properly translated in a lot of languages is weird.

    7 votes
    1. Akir
      Language is hard, and Unicode is kind of an abomination because it tries to be the one way to represent every language and idea, and that's why we get "dumb" things like emoji that are actually made up of multiple characters. But the alternatives are just dumb for different reasons. Representing multiple languages is hard and requires some level of compromise to implement, and the nice thing about Unicode is that it basically enforces the use of the same set of compromises across everything that uses it.

      6 votes
  4. [3]
    xk3
    Fantastic in-depth article! Thanks for sharing.

    Recently I found it funny that len('…'.encode()) == len('...'.encode()). The first is a single ellipsis character (U+2026), the second is three separate dots; each encodes to 3 bytes in UTF-8.

    This is obvious, but for those who don't know: if you encode a string to UTF-8 bytes, the byte count should be the same in every programming language.

    e.g. Python will say len('🤦🏼‍♂️'.encode()) == 17
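    To see where the 17 comes from, here is a small sketch that lists the UTF-8 byte count of each code point in the emoji. The helper name `utf8_breakdown` is made up for illustration, and the emoji is written with escapes so the code points are visible; the Unicode names printed depend on the Unicode data shipped with your Python.

```python
import unicodedata

# The facepalm emoji from the title, spelled out as its five code points:
# face palm, skin tone modifier, zero width joiner, male sign, variation selector.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def utf8_breakdown(text):
    """Return (code point, Unicode name, UTF-8 byte count) per code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), len(ch.encode("utf-8")))
        for ch in text
    ]


rows = utf8_breakdown(facepalm)
for row in rows:
    print(row)

# The per-code-point byte counts add up to the total encoded length.
total_bytes = sum(count for _, _, count in rows)
print(total_bytes)
```

    The two code points outside the Basic Multilingual Plane take 4 bytes each and the other three take 3 bytes each, which is how the total reaches 17.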

    5 votes
    1. [2]
      skybrian
      In which language?

      1 vote
      1. xk3
        As far as I know, both those snippets should give the same result in all languages.

        For example in JavaScript:

        const utf8Encoder = new TextEncoder(); 
        utf8Encoder.encode('…').length === utf8Encoder.encode('...').length;
        utf8Encoder.encode('🤦🏼‍♂️').length === 17;
        

        Both this and the snippet above count byte length, not string length.
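        For comparison, a rough Python sketch of three different "lengths" of the same string. The function name `length_measures` is made up; the UTF-16 figure is computed by encoding and dividing by two, which should match what JavaScript's `.length` reports.

```python
def length_measures(text):
    """Return several common notions of string length (illustrative helper)."""
    return {
        # What Python's len() counts: Unicode code points.
        "code_points": len(text),
        # What JavaScript's .length counts: UTF-16 code units (2 bytes each).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # What the snippets above count: bytes after UTF-8 encoding.
        "utf8_bytes": len(text.encode("utf-8")),
    }


facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
ellipsis = "\u2026"

facepalm_lengths = length_measures(facepalm)
ellipsis_lengths = length_measures(ellipsis)
dots_lengths = length_measures("...")

print(facepalm_lengths)
print(ellipsis_lengths)
print(dots_lengths)
```

        The emoji comes out as 5 code points, 7 UTF-16 code units, and 17 UTF-8 bytes, while the ellipsis and the three dots only agree on the byte count.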

        3 votes