23 votes

The regex [,-.]

Posted May 11, 2022 by onyxleopard

Tags: regex

https://pboyd.io/posts/comma-dash-dot/

Link information

This data is scraped automatically and may be incorrect.

Word count: 226 words

9 comments

[6]
balooga
May 11, 2022
Interesting, I've only seen ranges like [a-z], [A-Z], and [0-9]. I'd wondered in passing how the intermediary chars were determined but never looked into it, assumed letter and number ranges were just hard-coded into regex or something. Turns out it actually uses consecutive ASCII values! Seems like all manner of weird tricks might be possible with that knowledge... Thanks for sharing!
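
To make the "consecutive values" point concrete, here is a minimal sketch, assuming Python 3's `re` module (the sample string is made up for illustration). It shows that the endpoints of `[,-.]` sit at code points 44 and 46, with the hyphen at 45 between them.

```python
import re

# The character class from the linked post: a range from ',' to '.'.
pattern = re.compile(r"[,-.]")

# Code points of the two endpoints and the hyphen that sits between them.
endpoints = {ch: ord(ch) for ch in ",-."}

# Every character whose code point falls inside the range, inclusive.
in_range = [chr(cp) for cp in range(ord(","), ord(".") + 1)]

# '+' (43) and '/' (47) sit just outside the range, so they do not match.
matches = pattern.findall("a,b-c.d+e/f")

print(endpoints)  # {',': 44, '-': 45, '.': 46}
print(in_range)   # [',', '-', '.']
print(matches)    # [',', '-', '.']
```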

6 votes
1. [5]
  onyxleopard (OP)
  May 11, 2022 (edited May 11, 2022)
  It may depend on the regex implementation you use, but I don't think it has to do with ASCII in particular, but rather the ordinal value of the Unicode code point you're matching (at least for Unicode-aware regex implementations).
  
  E.g., for Python 3's re:
  
  In [1]: import re
  In [2]: pattern = re.compile('[🐀-🐬]')
  In [3]: pattern.findall('🐙🐯')
  Out[3]: ['🐙']
  
  7 votes
  1. archevel
    May 12, 2022
    My main objection to this is that an octopus is not between a mouse and a dolphin in genetic terms. We definitely need to restructure Unicode so that it accurately reflects the phylogenetic tree of all the species represented in the standard. 😉
    
    4 votes
  2. [3]
    Moonchild
    May 12, 2022
    don't think it has to do with ASCII in particular, but rather the ordinal value of the Unicode code point you're matching
    
    Unicode is a superset of the ASCII character set...
    
    1 vote
    
    [2]
    onyxleopard (OP)
    May 12, 2022
    Right, but it has nothing to do with ASCII in particular. When you use - between two characters in a regex character class, you’re making an inclusive selection within an ordinal range. The fact that ASCII comprises the bottom 128 ordinals in Unicode (for backward compatibility reasons) is incidental. Ranges of ordinals work because there is a total ordering over ℕ (the natural numbers, which is where code point values live).
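
A minimal sketch of the "inclusive selection within an ordinal range" idea, assuming Python 3's `re`. The helper `in_ordinal_range` is hypothetical, written only for this illustration. It compares code points directly and checks that the result agrees with a regex range whose endpoints lie outside ASCII (Greek small alpha through small omega).

```python
import re

def in_ordinal_range(ch: str, low: str, high: str) -> bool:
    """Hypothetical helper: True if ch's code point lies in [low, high]."""
    return ord(low) <= ord(ch) <= ord(high)

# Endpoints well outside ASCII: U+03B1 (small alpha) and U+03C9 (small omega).
low, high = "\u03b1", "\u03c9"
pattern = re.compile(f"[{low}-{high}]")

# Small alpha, small lambda, small omega, capital alpha (U+0391), 'z', and U+00E9.
sample = "\u03b1\u03bb\u03c9\u0391z\u00e9"

by_regex = [ch for ch in sample if pattern.fullmatch(ch)]
by_ordinal = [ch for ch in sample if in_ordinal_range(ch, low, high)]

# Both approaches select the same three characters: alpha, lambda, omega.
print(by_regex == by_ordinal)
```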
    
    3 votes
    
    Moonchild
    May 12, 2022
    The Unicode code points are a further subset of ℕ. So the Unicode characters are totally ordered, yes; as are the ASCII characters. So the question becomes: ‘does the range ,-. contain the characters ‘,’, ‘-’, and ‘.’ because those characters are contiguous in the Unicode character set or because they are contiguous in the ASCII character set?’
    
    A fairly trivial question if you ask me, but I maintain that ‘ASCII’ is the better response. Because:
    
    1. It is a more general response.
    
    2. Following from 1, the identity would still hold in a regex implementation that supported only ASCII rather than Unicode, or some other unspecified superset of ASCII with some other unspecified encoding.
    
    3. The characters have the order that they do in Unicode because they have that order in ASCII.
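
One way to probe the claim that the identity holds without Unicode support is the rough sketch below, assuming Python 3's `re`, which also accepts bytes patterns. Matching over ASCII-encoded bytes gives the same three hits, and each ASCII byte value equals the corresponding Unicode code point.

```python
import re

text = "a,b-c.d"

# Matching over str: the range is interpreted over Unicode code points.
str_hits = re.findall(r"[,-.]", text)

# Matching over bytes: the range is interpreted over byte values (0-255).
bytes_hits = re.findall(rb"[,-.]", text.encode("ascii"))

# For ASCII characters the byte value and the code point coincide.
same_values = all(ch.encode("ascii")[0] == ord(ch) for ch in ",-.")

print(str_hits)     # [',', '-', '.']
print(bytes_hits)   # [b',', b'-', b'.']
print(same_values)  # True
```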
    
    1 vote
[3]
Pistos
May 11, 2022
The main reason you should not do this (or not do it on purpose, at least) is to avoid wasting your teammates' time (going down the rabbit hole).

4 votes
1. [2]
  onyxleopard (OP)
  May 11, 2022
  My takeaway is a bit different. This is OK as long as one or both of the following hold:
  
  1. You add comments with sufficient explanation of your intention.
  
  2. You have sufficient test coverage (though writing test cases for regexes is a really gnarly rabbit hole in and of itself).
  
  (Ideally you do both 1 and 2.)
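
A small sketch of what doing both might look like, assuming Python 3's `re`. The names `SEPARATORS` and `EXPLICIT` are hypothetical, chosen only for this example. The comment states the intent, and the check compares the range against an explicitly escaped equivalent across all 128 ASCII characters.

```python
import re

# Intentional range: ',' (44) through '.' (46), which also covers '-' (45).
# The hyphen here is a range operator, not a literal.
SEPARATORS = re.compile(r"[,-.]")

# Explicit equivalent with the hyphen escaped, so all three are literals.
EXPLICIT = re.compile(r"[,\-.]")

# Collect every ASCII character on which the two patterns disagree.
disagreements = [
    chr(cp)
    for cp in range(128)
    if bool(SEPARATORS.fullmatch(chr(cp))) != bool(EXPLICIT.fullmatch(chr(cp)))
]

# Collect every ASCII character the range pattern accepts.
accepted = [chr(cp) for cp in range(128) if SEPARATORS.fullmatch(chr(cp))]

print(disagreements)  # []
print(accepted)       # [',', '-', '.']
```

The check only covers ASCII input; extending it to a wider code point range is left open here.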
  
  3 votes
  1. Pistos
    May 12, 2022 (edited May 12, 2022)
    Right, though I think my suggestion stands because the comment would be something like one of these:
    
    # Isn't it cool that [explanation about ASCII tables, etc.]
    
    // See if you can figure this out ;)
    
    /* This was Mike in 2018 trying to be clever */
    
    which differs from the situation of having something difficult to grok, or potentially misconstruable, but which provides tangible value (e.g. improved performance), where the comment elucidates the matter and provides justification for the chosen implementation.
    
    8 votes