23 votes

The regex [,-.]

Tags: regex

9 comments

  1. [6]
    balooga
    Interesting, I've only seen ranges like [a-z], [A-Z], and [0-9]. I'd wondered in passing how the intermediary chars were determined but never looked into it, assumed letter and number ranges were just hard-coded into regex or something. Turns out it actually uses consecutive ASCII values! Seems like all manner of weird tricks might be possible with that knowledge... Thanks for sharing!
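For anyone who wants to check the consecutive-values claim, here is a minimal sketch using Python 3's `re`. The sample string is made up for illustration.

```python
import re

# The three characters the class ends up matching, with their code points.
chars = [",", "-", "."]
code_points = [ord(c) for c in chars]  # consecutive values

# "," to "." is parsed as a range, so "-" matches only because its
# code point lies between the two endpoints.
matches = re.findall(r"[,-.]", "a,b-c.d/e+f")

print(code_points, matches)
```

Note that "+" and "/" sit just outside the range on either side, which is why they are left out.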

    6 votes
    1. [5]
      onyxleopard
      It may depend on the regex implementation you use, but I don't think it has to do with ASCII in particular, but rather the ordinal value of the Unicode code point you're matching (at least for Unicode-aware regex implementations).

      E.g., for Python 3's re:

      In [1]: import re
      
      In [2]: pattern = re.compile('[🐀-🐬]')
      
      In [3]: pattern.findall('🐙🐯')
      Out[3]: ['🐙']
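The same check can be written with explicit code point escapes, which makes the ordering visible. This is a sketch that assumes the emoji in the session above are U+1F400 (range start), U+1F42C (range end), U+1F419 (the match) and U+1F42F (the non-match); the variable names are only illustrative.

```python
import re

# Assumed code points for the emoji used in the session above.
low, high = "\U0001F400", "\U0001F42C"
inside, outside = "\U0001F419", "\U0001F42F"

# Build the same kind of character class from the two endpoints.
pattern = re.compile(f"[{low}-{high}]")
found = pattern.findall(inside + outside)

# Plain integer comparison on code points gives the same answer.
in_range = [ord(low) <= ord(c) <= ord(high) for c in (inside, outside)]

print(found, in_range)
```

Under those assumptions, the second character falls three code points past the end of the range, so only the first one is returned.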
      
      7 votes
      1. archevel
        My main objection to this is that an octopus is not between a mouse and a dolphin in genetic terms. We definitely need to restructure Unicode so that it accurately reflects the phylogenetic tree of all species that are represented in the standard. 😉

        4 votes
      2. [3]
        Moonchild
        don't think it has to do with ASCII in particular, but rather the ordinal value of the Unicode code point you're matching

        Unicode is a superset of the ASCII character set...

        1 vote
        1. [2]
          onyxleopard
          Right, but it has nothing to do with ASCII in particular. When you use - between two characters in a regex character class, you’re making an inclusive selection within an ordinal range. The fact that ASCII comprises the bottom 128 ordinals in Unicode (for backward compatibility reasons) is incidental. Ranges of ordinals work because there is a total ordering over ℕ (the natural numbers, from which code point ordinals are drawn).
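A rough sketch of the "inclusive selection within an ordinal range" idea in Python 3. `in_class_range` is a hypothetical helper written for this illustration, not part of `re`.

```python
import re

def in_class_range(ch: str, lo: str, hi: str) -> bool:
    """Hypothetical helper: inclusive ordinal comparison, like lo-hi in a class."""
    return ord(lo) <= ord(ch) <= ord(hi)

candidates = "+,-./"
by_ordinal = [c for c in candidates if in_class_range(c, ",", ".")]
by_regex = re.findall(r"[,-.]", candidates)

# Endpoints in descending order do not form a valid range in Python's re.
try:
    re.compile(r"[.-,]")
    reversed_ok = True
except re.error:
    reversed_ok = False

print(by_ordinal, by_regex, reversed_ok)
```

The rejected reversed range is a hint that the implementation really is comparing ordinals rather than consulting a fixed list of known ranges.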

          3 votes
          1. Moonchild
            The unicode code points are a further subset of N. So the unicode characters are totally ordered, yes; as are the ascii characters. So the question becomes: ‘does the range ,-. contain the characters ‘,’, ‘-’, and ‘.’ because those characters are contiguous in the unicode character set or because they are contiguous in the ascii character set?’

            A fairly trivial question if you ask me, but I maintain that ‘ascii’ is the better response. Because:

            1. It is a more general response

            2. Following 1, the identity would still hold in a regex implementation which did not support unicode, but rather ascii, or another unspecified superset of ascii with another unspecified encoding

            3. The characters have the order that they do in unicode because they have that order in ascii
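Point 2 can be tried out in Python 3 by matching on `bytes` instead of `str`, which takes Unicode out of the picture for ASCII-encoded data. This is only a sketch of one implementation's behaviour; the sample text is made up.

```python
import re

text = "a,b-c.d/e"

# Matching on str: the range is interpreted over Unicode code points.
str_hits = re.findall(r"[,-.]", text)

# Matching on ASCII-encoded bytes: the range is interpreted over byte values.
byte_hits = re.findall(rb"[,-.]", text.encode("ascii"))

# For these characters the byte values and the code points coincide.
same_values = [b[0] for b in byte_hits] == [ord(c) for c in str_hits]

print(str_hits, byte_hits, same_values)
```

Both runs pick out the same three characters, which is consistent with either reading of the argument: the order is the same in ASCII and in Unicode.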

            1 vote
  2. [3]
    Pistos
    The main reason you should not do this (or not do it on purpose, at least) is to avoid wasting your teammates' time (going down the rabbit hole).

    4 votes
    1. [2]
      onyxleopard
      My takeaway is a bit different. This is OK as long as one or both of the following hold:

      1. You add comments with sufficient explanation of your intention.
      2. You have sufficient test coverage (though test cases for regexes are a really gnarly rabbit hole in and of themselves).

      (Ideally you do both 1 and 2.)
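As a sketch of what doing both 1 and 2 might look like in Python 3: the constant and function names below are hypothetical, and the test only covers the ASCII range, so it is far from exhaustive.

```python
import re

# Matches ",", "-" and "." via the range U+002C..U+002E.
# The hyphen is NOT a literal here: it matches only because its code point
# sits between the two endpoints of the range.
PUNCT_RANGE = re.compile(r"[,-.]")

# Explicit spelling of the same set, used as the reference in the test.
PUNCT_EXPLICIT = re.compile(r"[,\-.]")

def test_range_equals_explicit() -> None:
    """Check that both spellings agree on every ASCII character."""
    for cp in range(128):
        ch = chr(cp)
        assert bool(PUNCT_RANGE.fullmatch(ch)) == bool(PUNCT_EXPLICIT.fullmatch(ch)), ch

test_range_equals_explicit()
```

If the two spellings are equivalent anyway, the explicit one arguably needs no comment at all.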

      3 votes
      1. Pistos
        Right, though I think my suggestion stands because the comment would be something like one of these:

        • # Isn't it cool that [explanation about ASCII tables, etc.]
        • // See if you can figure this out ;)
        • /* This was Mike in 2018 trying to be clever */

        which differs from the situation of having something difficult to grok, or potentially misconstruable, but which provides tangible value (e.g. improved performance), where the comment elucidates the matter and provides justification for the chosen implementation.

        8 votes