onyxleopard's recent activity

  1. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    Ok, so this is useful for cloud service providers who want to convince customers to send their data to the provider for processing while it remains encrypted. For individual people, then, this seems uninteresting. It’s really only useful in situations where your data is proprietary and you don’t own a machine big enough to do useful things with it yourself.

    4 votes
  2. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    I guess I’m not convinced that problems that end-to-end models can currently solve can always be decomposed the way you’re aiming for. I don’t disagree that getting reusable components out of a project would be worthwhile. I just think that most problems aren’t necessarily decomposable in a way where the parts would make any sense to the humans who might want to reuse them.

    The domain I’m most familiar with is NLP. The approaches that are currently going out of style are more in line with what you’re asking for: you would train a segmenter (or use rules), you would set up sequence-tagging models like POS taggers, you’d have normalization steps or morphological analyzers (usually FSTs), etc. But these tasks require a lot of linguistic knowledge, and somehow humans seem to understand natural language without knowing how to do any of those subtasks themselves. Thus you have the newer-style end-to-end models that can solve a lot of tasks fairly well even though they aren’t decomposable like the older-style pipelines (unless you train them on the subtasks explicitly). There is a lot of research (“BERTology” etc.) making the case that these transformer models may actually be learning to do something like the pipelines you’re interested in, but it’s really difficult to probe them in the right ways to suss that out.
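
    To make the contrast concrete, here’s a hedged sketch: spaCy stands in for the decomposable pipeline style and the Hugging Face pipeline API for the end-to-end style (the model names are just illustrative defaults, not anything specific):

    # Old style: explicit, inspectable stages (tokenizer, tagger, lemmatizer, ...)
    import spacy
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The tagger and lemmatizer are separate, reusable components.")
    print([(t.text, t.pos_, t.lemma_) for t in doc])

    # New style: one opaque model mapping raw text straight to the final label
    from transformers import pipeline
    classifier = pipeline("sentiment-analysis")  # pulls a default fine-tuned model
    print(classifier("It solves the task end to end, but exposes no subtasks."))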

    Have you seen any work on probing end-to-end models to extract reusable components in other domains? If nobody’s working on it, it’s not likely to materialize in the next 15 years. 😝

    2 votes
  3. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    Based on current research with transformers, they seem to keep scaling predictably with data set size and the number of parameters. That said, they tend to memorize a lot from their training sets, and both training and inference on models with too many parameters become limited by hardware and by the availability of clean data.

    The large language models currently being developed (where petabytes of text are not infeasible to collect) are scaling up with no limits in sight except the dollar cost of compute. Practically speaking, those limits are real and important, but theoretically, with unlimited time/space budgets, the sky’s the limit.
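
    For reference (this comes from the scaling-law literature, not from this thread): the empirical fits of Kaplan et al. (2020) model transformer test loss as smooth power laws in parameter count N and dataset size D, which is roughly what “no limits in sight except compute” looks like on paper:

    % Empirical scaling-law fits (Kaplan et al., 2020)
    L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
    \qquad
    L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}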

    3 votes
  4. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    > For example, I could build a classifier that detects billboards and other depictions within an image.

    That’s just one example within an infinite space of additional classifiers you’d end up requiring in your glue/pseudo-code, though. My point is that trying to build common sense, bottom-up, from if blocks is a fool’s errand. Even people who are intimately familiar with a given problem are usually incapable of unpacking all the potential externalities that they just gloss over, because they are intelligent human beings who possess common sense. What your pseudo-code amounts to is the start of a set of task-specific guidelines. And as someone who has experience writing such guidelines: they usually run at minimum 50 or so pages (sometimes hundreds) to handle the edge cases.

    > The point of all this is abstraction, which imo fundamentally underpins all our advances in hardware and software design, because humans tend to intuitively abstract away unnecessary detail. We should do the same when building AIs, but our current workflows make that prohibitively hard.

    I don’t think the workflows are what makes it prohibitively hard. If you align on reasonable APIs and data formats, Python makes excellent “glue”. The actual formulation of well-defined problems is the prohibitively hard part, from my perspective. I think the actual successes with machine learning end up being those applied to problem spaces where humans have managed to specify tasks with sufficient depth and breadth to make all the detail explicit.
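
    A minimal sketch of what I mean by “glue” (every name here is hypothetical; the point is only that once components agree on a data contract, composing them is the easy part):

    from typing import Protocol

    # Hypothetical shared contract: any detector returns labeled, scored regions.
    class Detector(Protocol):
        def __call__(self, image_path: str) -> list[dict]:  # [{"label": str, "score": float}, ...]
            ...

    def count_label(detector: Detector, image_path: str, label: str, threshold: float = 0.8) -> int:
        """Glue: filter one component's output and hand a count downstream."""
        return sum(1 for d in detector(image_path)
                   if d["label"] == label and d["score"] >= threshold)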

    3 votes
  5. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    The problem is that “summing up some numerical properties over N records” is trivial for computers. You don’t need AI for that; you need an Excel formula, or equivalent. The “how many cars are in this photo, and what is the estimated fuel efficiency of each one” task is a completely different kind of problem, fraught with all kinds of weedy, ill-defined rabbit holes that humans can skip over by virtue of common sense, but single-purpose machine learning models can’t. Like, a human tasked with this isn’t going to count the car on a poster or billboard that happened to be captured in the background of the photo, but a computer vision model very well might. A human who recognizes the photo is a screenshot from a movie is going to reject the input as out of domain; a computer vision model isn’t necessarily capable of doing that (even if it can emit low confidence scores on detection and classification tasks). A human might see a motorcycle in the photo and count it too (or not, based on context). A computer vision model, even if capable of detecting and classifying these vehicle types, won’t have the common sense to include or exclude them in the downstream steps of the task.
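
    To illustrate, here’s a hypothetical sketch using torchvision’s off-the-shelf COCO detector; note that nothing in it can distinguish a drivable car from a car printed on a billboard:

    import torch
    from torchvision.models import detection

    model = detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    COCO_CAR = 3  # "car" in the COCO label indexing torchvision uses

    def count_cars(image: torch.Tensor, threshold: float = 0.8) -> int:
        # Naively counts every confident "car" box: posters, billboards,
        # movie stills, and toy cars all count the same as real cars.
        with torch.no_grad():
            out = model([image])[0]  # expects a float tensor (C, H, W) in [0, 1]
        keep = (out["labels"] == COCO_CAR) & (out["scores"] > threshold)
        return int(keep.sum())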

    What you need for this kind of task is AGI capable of understanding the world and learning common sense, which I think we’re further than 15 years away from. It’s possible there will be end-to-end multimodal models that might be capable of this within 15 years, but I think the sheer amount of resources needed to train them and validate them will not pay off the debt of creating them. Like, it might just be cheaper, on balance, holistically, to pay humans to do this work. Or maybe just invest the money you would have spent in a fund so you can pay more for the fuel. (And I’m of the opinion that if we do achieve AGI it probably won’t be any more interested in this kind of work than humans are, so it would probably be immoral to force an AGI agent to spend all its time on tasks like this.)

    2 votes
  6. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    Why can’t they use regular encryption for that?

    1 vote
  7. Comment on What programming/technical projects have you been working on? in ~comp

    I've been hacking on a method for measuring the similarity of phonemic transcriptions using simhashing (a variety of locality-sensitive hashing, or LSH). The idea is to compare the bitwise similarity of the simhashes of pairs of 2D matrices formed by stacking phonemic feature vectors from the PHOIBLE project's phonological inventories. I have a Jupyter notebook here that shows the progress (you can see the ranked pairs by similarity in the last output cell at the bottom). LSH is nice for this kind of pairwise comparison because there are tricks you can use to avoid running comparisons on the exhaustive set of pairwise combinations: sort the LSH integer values, bitwise-rotate them through every bit position, and only compare adjacent pairs in those sorted lists.
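
    Here's a stripped-down sketch of both pieces, the hash itself and the sort-and-rotate candidate trick (this is a generic simhash over a stacked feature matrix, not the exact encoding the notebook uses):

    import numpy as np

    def simhash(matrix: np.ndarray, n_bits: int = 64, seed: int = 0) -> int:
        """Simhash a 2D matrix (rows = feature vectors) into an n_bits-wide int."""
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((matrix.shape[1], n_bits))
        sums = (matrix @ planes).sum(axis=0)  # aggregate the row projections
        return sum(1 << i for i, s in enumerate(sums) if s > 0)

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")  # differing bits = simhash distance

    def rotl(h: int, n_bits: int) -> int:
        """Rotate an n_bits-wide integer left by one bit."""
        return ((h << 1) | (h >> (n_bits - 1))) & ((1 << n_bits) - 1)

    def candidate_pairs(hashes: list[int], n_bits: int = 64) -> set[tuple[int, int]]:
        # Instead of exhaustive O(n^2) comparisons: rotate every hash through
        # each bit position, sort, and compare only adjacent entries per rotation.
        cands = set()
        rotated = list(enumerate(hashes))
        for _ in range(n_bits):
            rotated = [(i, rotl(h, n_bits)) for i, h in rotated]
            rotated.sort(key=lambda pair: pair[1])
            for (i, _), (j, _) in zip(rotated, rotated[1:]):
                cands.add((min(i, j), max(i, j)))
        return cands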

    There's also a CLI to play with the parameters on a small toy dataset:

    $ ./simphon.py -n 3 -b 128 -w 10 | tabulate -1s $'\t' -F '0.3f'      
    ranking pairs by bitwise similarity: 100%|█████████████████████████████████████████████████| 703/703 [00:00<00:00, 1446098.93pair/s]
         a                                    b                                      simhash difference (in bits)    similarity score
    ---  -----------------------------------  -----------------------------------  ------------------------------  ------------------
      0  (eng) Zach /z æ k/                   (eng) Zak /z æ k/                                                 0               1.000
      1  (eng) Catherine /k æ θ ə ɹ ə n/      (eng) Catherine /k æ θ ə ɹ ɪ n/                                  61               0.762
      2  (eng) Brad /b ɹ æ d/                 (eng) Brett /b ɹ ɛ t/                                            74               0.711
      3  (eng) Jenny /d̠ʒ ɛ n i/               (eng) Johnny /d̠ʒ ɑ n i/                                          78               0.695
      4  (eng) Catherine /k æ θ ə ɹ ɪ n/      (eng) Zachary /z æ k ə ɹ i/                                      79               0.691
      5  (eng) Matt /m æ t/                   (eng) Nate /n eɪ t/                                              84               0.672
      6  (eng) Alexander /æ l ə k z æ n d ɚ/  (eng) Alexis /ə l ɛ k s ɪ s/                                     85               0.668
      7  (eng) Alex /æ l ə k s/               (eng) Alexander /æ l ə k z æ n d ɚ/                              89               0.652
      8  (eng) Catherine /k æ θ ə ɹ ə n/      (eng) Jonathan /d̠ʒ ɑ n ə θ ə n/                                  89               0.652
      9  (eng) Alexi /ə l ɛ k s i/            (eng) Alexis /ə l ɛ k s ɪ s/                                     90               0.648
     10  (eng) Brad /b ɹ æ d/                 (eng) Bradley /b ɹ æ d l i/                                      90               0.648
    

    There are some false positives, but it's neat to see this data-driven, linguistically grounded, information-theoretic approach somewhat work.

    2 votes
  8. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    Yeah, as far as speech synthesis goes, the neural text-to-speech (NTTS) models I've seen, esp. from Microsoft, are making me think there will be a lot more YouTube channels autogenerating content from text and stock photos/videos.

    Some of MS's previews here are really good.

    8 votes
  9. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    There's a company called Colossal Biosciences that is aiming to de-extinct the woolly mammoth by 2027. It was founded by George Church (who has his hand in just about every bio-* related venture under the sun), so take it with a grain of salt. I don't know if it's actually achievable, but it doesn't seem like it should be impossible with sufficient funding and effort from genetic engineers.

    7 votes
  10. Comment on What's an achievable technological, scientific, or computational breakthrough that you're really looking forward to in the next fifteen years? in ~tech

    I've heard about this. IIRC IBM had an implementation they were touting a few years ago. I'm having difficulty understanding what a real-world application that would benefit from this would look like. I.e., I have some personal data, say my health records from my PCP. I homomorphically encrypt this data. Now who do I share it with, and why is it useful for them to perform calculations and processing on this encrypted form? And why should anyone else offer to run such calculations and processing on my data instead of me running them myself? It also raises the question of how you would verify that an application processing such data performed correctly/accurately. Won't you still need unencrypted data to validate this stuff anyway?

    4 votes
  11. Comment on The regex [,-.] in ~comp

    Right, but it has nothing to do with ASCII in particular. When you use - between two characters in a regex character class, you’re making an inclusive selection within an ordinal range. The fact that ASCII comprises the bottom 128 ordinals in Unicode (for backward-compatibility reasons) is incidental. Ranges of ordinals work because there is a total ordering over ℕ (the natural numbers, which is what code point ordinals are).
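
    E.g., in Python, the [,-.] from the title is just the inclusive ordinal range 44..46:

    >>> import re
    >>> ord(','), ord('-'), ord('.')
    (44, 45, 46)
    >>> re.findall('[,-.]', 'a,b-c.d')
    [',', '-', '.']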

    3 votes
  12. Comment on The regex [,-.] in ~comp

    My takeaway is a bit different. This is OK as long as one or both of the following hold:

    1. You add comments with sufficient explanation of your intention.
    2. You have sufficient test coverage (though, test coverage for regexes is a really gnarly rabbit hole in and of itself).

    (Ideally you do both 1 and 2.)
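
    For 1., Python's re.VERBOSE flag lets the explanation live inside the pattern itself; using the class from the title as the example:

    import re

    # ',' is U+002C and '.' is U+002E, so the class spans ordinals 44-46.
    PATTERN = re.compile(
        r"""
        [,-.]   # inclusive ordinal range: comma, hyphen-minus, full stop
        """,
        re.VERBOSE,
    )
    assert PATTERN.findall('a,b-c.d') == [',', '-', '.']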

    3 votes
  13. Comment on The regex [,-.] in ~comp

    It may depend on the regex implementation you use, but I don't think it has to do with ASCII in particular; rather, it's the ordinal value of the Unicode code point you're matching (at least for Unicode-aware regex implementations).

    E.g., for Python 3's re:

    In [1]: import re
    
    In [2]: pattern = re.compile('[🐀-🐬]')
    
    In [3]: pattern.findall('🐙🐯')
    Out[3]: ['🐙']
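
    (🐙 is U+1F419, which falls inside the 🐀 U+1F400 .. 🐬 U+1F42C range; 🐯 is U+1F42F, which falls just outside it.)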
    
    7 votes
  14. Comment on Leaked draft opinion shows the Supreme Court has voted to overturn abortion rights in ~news

    Oh absolutely, it predates Trump’s presidency. Trump just happened to have the presidency at the time. I didn’t mean to insinuate that Trump was in any way pivotal to the trajectory of the SC.

    10 votes
  15. Comment on Leaked draft opinion shows the Supreme Court has voted to overturn abortion rights in ~news

    IMO this is just one more data point on a long trend line that has been vectoring toward scary territory for my entire life. The shamelessness of Thomas has been one thing. The shamelessness of Kavanaugh and Coney Barrett has just driven home the utter politicization of the SC. While McConnell strangles the legislative branch, the judiciary will now assume the mantle of regressing the US into a Dominionist nightmare. Anyone who saw Trump’s appointments to the SC and thought anything different was in store hasn’t been paying attention.

    9 votes
  16. Comment on Twitter accepts buyout, giving Elon Musk total control of the company in ~tech

    The argument I've heard against this is that Twitter had an interest in inflating their active user count. The reasoning is as follows:

    If actually cracking down on bots would make the active user count drop significantly, then cracking down on bots would be bad business for Twitter.

    I don't know if it's sound, but it is a valid argument.

    5 votes
  17. Comment on Plain Text - Dylan Beattie - NDC Oslo 2021 in ~comp

    I think many times the risk of data loss at the source is minor—even if some software can’t represent the data to a human in a way that humans understand, the actual bits are still there on disk. Over the wire or via some protocol, however, sometimes bits may get discarded. I know I’ve seen people get befuddled due to GitHub nuking carriage returns, or naive software that decodes belligerently using the wrong encoding and replaces erroneous data with � U+FFFD.
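
    E.g., in Python, a belligerent decode with errors='replace' is exactly where those � characters come from:

    >>> 'café'.encode('utf-8').decode('ascii', errors='replace')
    'caf��'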

    2 votes
  18. Comment on What programming/technical projects have you been working on? in ~comp

    I did a bit more work on my grammar for parsing IPA transcriptions. It's still pretty naïve, but it's better than my first iteration, and now tries to parse syllables with required nuclei and optional onsets/codas. (It still can't disambiguate syllable boundaries totally correctly or handle any language-specific phonotactic constraints, so there are likely some edge cases I haven't thought of yet.)

    Project lives here.
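
    The core idea, as a drastically simplified hypothetical sketch (using Lark with a toy segment inventory; the real grammar covers far more than this):

    from lark import Lark

    # Toy phoneme classes; the actual grammar handles full IPA, tones, etc.
    grammar = r"""
        word: syllable+
        syllable: onset? nucleus coda?    // nucleus required, margins optional
        onset: CONS+
        nucleus: VOWEL
        coda: CONS+
        CONS: /[pbtdkgmnszrl]/
        VOWEL: /[aeiouæəɛɪ]/
    """

    # Lark's default Earley parser just picks one parse when syllabification
    # is ambiguous (e.g., /t/ could close syllable 1 or open syllable 2).
    parser = Lark(grammar, start="word")
    print(parser.parse("katən").pretty())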

    The test cases I pulled out from some Wikipedia pages make a decent demo (and show how slow the parser is):

    (ipa) $ ./tests/run.zsh 
    /mǎi mài mâi mái/ PASS
    /ˈkatən/ PASS
    [ˈkhætn̩] PASS
    [ˈdʒæk|pɹəˌpɛəɹɪŋ ðə ˈweɪ|wɛnt ˈɒn‖] PASS
    [↑bɪn.ðɛɹ↘|↑dɐn.ðæt↘‖] PASS
    [túrán↑tʃí nè] PASS
    [xɤn˧˥ xaʊ˨˩˦] PASS
    [ˈɹɪðm̩] PASS
    [ˈhuːˀsð̩ɣ] PASS
    [ˈsr̩t͡sɛ] PASS
    [ɹ̝̍] PASS
    [ʙ̞̍] PASS
    èlʊ́kʊ́nyá PASS
    huʔ˩˥ PASS
    mā PASS
    nu.jam.ɬ̩ PASS
    a˩˥˥˩˦˥˩˨˧˦˧ PASS
    [u ↑ˈvẽ.tu ˈnɔ.ɾtɯ ku.mɯˈso.ɐ.suˈpɾaɾ.kõˈmũi.tɐ ˩˧fu.ɾiɐ | mɐʃ ↑ˈku̯ɐ̃.tu.maiʃ.su˩˧pɾa.vɐ | maiz ↑u.viɐ↓ˈʒɐ̃.tɯ.si.ɐk.õʃ↓ˈɡa.va.suɐ ˧˩ka.pɐ | ɐˈtɛ ↑kiu ˈvẽ.tu ˈnɔɾ.tɯ ˧˩d̥z̥ʃtiu ǁ] PASS
    ( while read l; do; echo -n "$l " | tee /dev/stderr | ( ./ipa_grammar.py - > )  5.86s user 0.20s system 99% cpu 6.113 total
    
    3 votes