20 votes

UTF-8 Everywhere

13 comments

  1. [13]
    onyxleopard
    (edited)

    Dealing with different programming languages’ ideas of what a basic unit of text, such as a 'character', is is rather annoying (esp. if you want to do any textual processing involving character offsets that treat strings as arrays of characters).

    I learned last year that, for strings containing characters outside the BMP (i.e., outside UCS-2), str.__len__ in Python 3.2 and below (including Python 2.x) can give you a different answer than in Python 3.3+, where PEP 393 is implemented and every string has full UCS-4 support. Yes, before 3.3 there was a compilation flag to toggle this (Python builds compiled with this option on/off are called 'wide' vs. 'narrow' builds), and yes, if you use unicode objects in Python 2.x instead of strs, you’ll be OK on a wide build, though a narrow build still counts a surrogate pair as 2.

    Java Strings, for instance, are natively just UTF-16 strings, so '🐨' is 2 code units (thus has length 2).

    Java
    Javascript
    Python 2.7
    Python 3.6
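
    A minimal sketch of the same comparison, assuming Python 3.3+ (the variable names are made up for illustration). It counts the koala three ways: as code points, as UTF-16 code units (what the Java and JavaScript length reports), and as UTF-8 bytes.

```python
# One user-perceived character, three different "lengths".
koala = "\U0001F428"  # the koala emoji, a code point outside the BMP

# What len() counts on Python 3.3+: code points.
code_points = len(koala)

# What Java/JavaScript string length counts: UTF-16 code units (2 bytes each).
utf16_units = len(koala.encode("utf-16-le")) // 2

# What a byte-oriented API counts: UTF-8 bytes.
utf8_bytes = len(koala.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)  # prints: 1 2 4
```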

    While Unicode is a great idea, and we’ve come a long way, we still have a ways to go before text is a solved domain. It’s painfully awkward when a language allows you to index into the middle of a character. And it’s extremely frustrating if you have some program that relies on character offsets when your input comes from another program and you may not even know what language it was written in, or what compilation flags its compiler/interpreter was built with.
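
    For what it's worth, here is a rough sketch of the kind of offset translation this forces on you, assuming the producer counted UTF-16 code units and the consumer is Python 3.3+ (which indexes by code point). The function name is hypothetical, not from any library.

```python
def utf16_offset_to_code_point_offset(text: str, utf16_offset: int) -> int:
    """Convert an offset counted in UTF-16 code units into a code point offset."""
    if utf16_offset < 0:
        raise ValueError("offset must be non-negative")
    units = 0  # UTF-16 code units consumed so far
    for index, char in enumerate(text):
        if units == utf16_offset:
            return index
        # Code points above U+FFFF need a surrogate pair (two code units).
        units += 2 if ord(char) > 0xFFFF else 1
        if units > utf16_offset:
            raise ValueError("offset falls inside a surrogate pair")
    if units == utf16_offset:
        return len(text)  # offset points just past the last character
    raise ValueError("offset is past the end of the text")


text = "a\U0001F428b"  # 'a', koala, 'b'
print(utf16_offset_to_code_point_offset(text, 3))  # prints: 2
```

    Note that this only reconciles code units with code points; it does nothing about grapheme clusters, so an offset can still land inside something a reader would call one character.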

    8 votes
    1. [2]
      Comment deleted by author
      1. onyxleopard

        This sounds like a pretty progressive approach to me. I have not had time to look into Rust, but this makes me want to!

        1 vote
    2. [8]
      unknown user

      Dealing with different programming languages’ ideas of what a basic unit of text, such as a 'character', is is rather annoying

      I think you could probably remove the word "programming" from that sentence, and it would still be perfectly true!

      1 vote
      1. [7]
        onyxleopard

        But, humans engineer programming languages. Natural languages are not engineered, and nobody gets to decide how their writing systems work. The UTC gets to decide how we represent the writing systems for computers, and font designers get to decide how they appear on screen, but the systems themselves are just an emergent representation of the symbols that were found useful. When smart people go about engineering programming languages, we should expect better models of text.

        1. [6]
          Pilgrim

          Natural languages are not engineered, and nobody gets to decide how their writing systems work.

          The Japanese would like a word with you lol

          EDIT: And there are a couple Esperanto speakers in line behind them :)

          1. [5]
            onyxleopard

            Esperanto isn’t a natural language ;P

            1. Pilgrim

              Ha! You got me there :)

            2. [3]
              mrbig

              Esperanto IS a natural language. Its development is the product of natural human historical processes. The fact that this historical process was initiated by a single individual does not change that. 132 years later, according to different researchers, Esperanto has an estimated 63 thousand to 2 million speakers all over the world. Either way, 63 thousand people is enough to consider a language alive and "natural".

              Traditional "natural" languages are not entirely "natural" either.

              1. onyxleopard

                Esperanto IS a natural language.

                You need to provide your definition of 'natural'.

                I am using the term 'natural' to distinguish certain languages from the set of 'constructed' languages (or conlangs). Esperanto is considered a conlang.

                2 votes
              2. onyxleopard

                I didn’t mean to come off as combative. I studied Linguistics, so I am predisposed to use the term 'natural language' as the term of art within that domain.

                Programming languages are subject to natural processes over time as well, so by your definition, Python is a natural language. But programming languages (like Esperanto) are still constructed languages. Another important distinction is orthographic systems vs. languages. A lot of orthographic systems are 'unnatural' in a sense because they were formalized by linguists, or sometimes by lay people, who wanted to document languages they came in contact with whose speakers had no native writing system. Orthographies also undergo natural processes (or unnatural ones). For instance, the set of characters we typically use in English today was heavily influenced by contact with the Normans and then the advent of typesetting books for print, mostly in Germany.

                1 vote
    3. [3]
      Pilgrim

      I haven't got through the entire article yet, but this speaks to me deeply as someone who does ETL engineering. What language do you find is best for working with UTF-8? Our IDE only allows for JavaScript (ECMA) which seems to do a pretty good job.

      I've been tripped up before by FTP clients changing the EOL characters when transferring between Linux and Windows boxes. I had to find a client that would transfer in binary (WinSCP works well for that, CoreFTP can suck it!).

      1 vote
      1. onyxleopard

        Oh, I almost forgot, if you are interested in text, I highly recommend this Python package: https://github.com/alvinlindstam/grapheme

        In [1]: import grapheme                                                                                                                                  
        
        In [2]: rainbow_flag = '🏳️‍🌈'
        
        In [3]: [c for c in rainbow_flag]                                                                                                                        
        Out[3]: ['🏳', '️', '\u200d', '🌈']
        
        In [4]: [c for c in grapheme.graphemes(rainbow_flag)]                                                                                                    
        Out[4]: ['🏳️\u200d🌈']
        
        In [5]: len(rainbow_flag)                                                                                                                                
        Out[5]: 4
        
        In [7]: len(list(grapheme.graphemes(rainbow_flag)))                                                                                                      
        Out[7]: 1
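
        To see why len() reports 4 here, a small standard-library sketch (assuming Python 3.6+; the flag is written with escapes so the invisible code points are not lost in copy/paste) that names each code point in the sequence:

```python
import unicodedata

# Same rainbow flag as above: flag, variation selector, zero width joiner, rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

# Look up the Unicode name of every code point in the sequence.
names = [unicodedata.name(char) for char in rainbow_flag]

for char, name in zip(rainbow_flag, names):
    print(f"U+{ord(char):04X}  {name}")

print(len(rainbow_flag))  # prints: 4
```

        Only two of the four code points draw anything on their own; the other two just tell the renderer how to combine them, which is why a grapheme-aware library reports a single unit.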
        
        3 votes
      2. onyxleopard
        (edited)

        I am very partial to Python, but even it is not ideal. I don’t know of any language that has a great model of text or an ideal string primitive. At least in recent Python 3 versions, you can’t index into the middle of a code point (though you can still split a multi-code-point grapheme). I recently learned that open has a newline parameter for text mode, though. Go have a read of what the default behavior is and reel in horror at how many subtle bugs there must be in the wild because of it. (Though, to be fair, there may be other cases where the default is saving people a lot of headache as well.)
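
        A quick sketch of what I mean, assuming Python 3 and a throwaway temp file (the file name and sample bytes are invented). With the default newline=None, reading in text mode quietly rewrites '\r\n' and '\r' to '\n'; passing newline='' hands the line endings back untouched.

```python
import os
import tempfile

# Three lines, each with a different line ending.
raw = b"unix\nwindows\r\nold-mac\r"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mixed_eol.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Default (newline=None): universal newlines, everything becomes '\n'.
    with open(path, encoding="utf-8") as f:
        translated = f.read()

    # newline='': line endings are passed through as written.
    with open(path, encoding="utf-8", newline="") as f:
        untouched = f.read()

print(repr(translated))  # prints: 'unix\nwindows\nold-mac\n'
print(repr(untouched))   # prints: 'unix\nwindows\r\nold-mac\r'
```

        So any character offsets computed against the raw bytes can drift by one for every '\r\n' that was collapsed on the way in.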

        1 vote