Interesting, I've only seen ranges like [a-z], [A-Z], and [0-9]. I'd wondered in passing how the intermediary chars were determined but never looked into it; I assumed letter and number ranges were just hard-coded into regex or something. Turns out it actually uses consecutive ASCII values! Seems like all manner of weird tricks might be possible with that knowledge... Thanks for sharing!
It may depend on the regex implementation you use, but I don't think it has to do with ASCII in particular, but rather the ordinal value of the Unicode code point you're matching (at least for Unicode-aware regex implementations).
E.g., for Python 3's re:
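The commenter's own snippet is not present here. As a stand-in, here is a minimal sketch of what a code-point range looks like in Python 3's re. The choice of characters is an assumption: the ASCII trio `,`, `-`, `.` from the discussion, plus the MOUSE, OCTOPUS and DOLPHIN emoji that the next comment jokes about. The variable names are illustrative only.

```python
import re

# ASCII-range characters: ',' is U+002C, '-' is U+002D, '.' is U+002E,
# so ",-." inside a character class is a three-character range.
punct = re.compile(r"[,-.]")
matched = [c for c in ",-./+" if punct.fullmatch(c)]

# The same mechanism applies above U+007F. Escapes are used so the snippet
# survives copy/paste. Assumed code points: MOUSE (U+1F401),
# OCTOPUS (U+1F419), DOLPHIN (U+1F42C).
MOUSE, OCTOPUS, DOLPHIN = "\U0001F401", "\U0001F419", "\U0001F42C"
animals = re.compile(f"[{MOUSE}-{DOLPHIN}]")
octopus_in_range = animals.fullmatch(OCTOPUS) is not None

# CAT FACE (U+1F431) sits above DOLPHIN, so it falls outside the range.
cat_face = "\U0001F431"
cat_in_range = animals.fullmatch(cat_face) is not None
```

Here `matched` contains only the comma, hyphen and period, and the octopus matches purely because its code point lies between the two endpoints.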
My main objection to this is that an octopus is not between a mouse and a dolphin in genetic terms. We definitely need to restructure Unicode so that it accurately reflects the phylogenetic tree of all species that are represented in the standard.
Unicode is a superset of the ASCII character set...
Right, but it has nothing to do with ASCII in particular. When you use - between two characters in a regex character class, you're making an inclusive selection within an ordinal range. The fact that ASCII comprises the bottom 128 ordinals in Unicode (for backward compatibility reasons) is incidental. Ranges of ordinals work because there is a total ordering over ℕ (aka the set of natural numbers).
The Unicode code points are a further subset of ℕ. So the Unicode characters are totally ordered, yes; as are the ASCII characters. So the question becomes: "does the range ,-. contain the characters ',', '-', and '.' because those characters are contiguous in the Unicode character set or because they are contiguous in the ASCII character set?"
A fairly trivial question if you ask me, but I maintain that "ASCII" is the better response. Because:
1. It is a more general response.
2. Following 1, the identity would still hold in a regex implementation which did not support Unicode, but rather ASCII, or another unspecified superset of ASCII with another unspecified encoding.
3. The characters have the order that they do in Unicode because they have that order in ASCII.
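Both sides of this argument predict the same result, because the numeric values coincide. The sketch below checks that in Python 3. Treating a bytes pattern as a stand-in for "an implementation that only knows ASCII" is an assumption made for illustration, not a claim about any particular engine.

```python
import re

chars = [",", "-", "."]

# Unicode code points and ASCII byte values for the same three characters.
code_points = [ord(c) for c in chars]
ascii_bytes = [c.encode("ascii")[0] for c in chars]

# A str pattern compares code points; a bytes pattern compares byte values.
str_pattern = re.compile(r"[,-.]")
bytes_pattern = re.compile(rb"[,-.]")

str_hits = [c for c in chars if str_pattern.fullmatch(c)]
bytes_hits = [c for c in chars if bytes_pattern.fullmatch(c.encode("ascii"))]
```

Both lists of numbers come out as 44, 45, 46, and both patterns accept all three characters, so the behaviour alone cannot settle which description is "the" reason.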
The main reason you should not do this (or not do it on purpose, at least) is to avoid wasting your teammates' time (going down the rabbit hole).
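My takeaway is a bit different. This is OK as long as one or both of the following hold:
1. You add comments with sufficient explanation of your intention.
2. You have sufficient test coverage (though test cases for regexes are a really gnarly rabbit hole in and of themselves).
(Ideally you do both 1 and 2.)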
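A rough sketch of what "comment plus tests" could look like for a range like this, in Python. The helper `ascii_chars_matched`, the constant `SEPARATORS`, and the test names are all hypothetical, and the checks cover only the 128 ASCII characters.

```python
import re


def ascii_chars_matched(pattern: str) -> str:
    """Return every ASCII character that the pattern matches on its own."""
    compiled = re.compile(pattern)
    return "".join(chr(i) for i in range(128) if compiled.fullmatch(chr(i)))


# Intent: match a comma, a hyphen, or a period.
# This is a range, and it works only because ',', '-', '.' are adjacent
# (44, 45, 46).
SEPARATORS = r"[,-.]"


def test_separators_match_exactly_three_chars():
    assert ascii_chars_matched(SEPARATORS) == ",-."


def test_letter_range_pitfall():
    # [A-z] also spans the six characters between 'Z' (90) and 'a' (97).
    extras = set(ascii_chars_matched(r"[A-z]")) - set(ascii_chars_matched(r"[A-Za-z]"))
    assert extras == set("[\\]^_`")
```

Enumerating the matched characters, rather than testing a handful of sample strings, is one way to keep the test from sharing the author's blind spots about what falls inside a range.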
Right, though I think my suggestion stands because the comment would be something like one of these:
# Isn't it cool that [explanation about ASCII tables, etc.]
// See if you can figure this out ;)
/* This was Mike in 2018 trying to be clever */
which differs from the situation of having something difficult to grok, or potentially misconstruable, but which provides tangible value (e.g. improved performance), where the comment elucidates the matter and provides justification for the chosen implementation.
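For comparison, a small Python sketch of the explicit spelling, which arguably needs no "isn't this clever" comment at all. The names `clever` and `explicit` are illustrative, and the equivalence is checked only over the ASCII range.

```python
import re

clever = re.compile(r"[,-.]")     # relies on ',', '-', '.' being consecutive
explicit = re.compile(r"[,\-.]")  # escaped hyphen: three literal characters

# Compare the two patterns character by character across ASCII.
same_ascii_behaviour = all(
    bool(clever.fullmatch(chr(i))) == bool(explicit.fullmatch(chr(i)))
    for i in range(128)
)
```

If the two behave identically, the range form buys nothing here beyond the puzzle, which is the distinction the comment above draws.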