31 votes

Things we learned about LLMs in 2024

Topic deleted by author

12 comments

  1. [12]
    DawnPaladin

    A drum I’ve been banging for a while is that LLMs are power-user tools—they’re chainsaws disguised as kitchen knives. They look deceptively simple to use—how hard can it be to type messages to a chatbot?—but in reality you need a huge depth of both understanding and experience to make the most of them and avoid their many pitfalls.

    ...

    What are we doing about this? Not much. Most users are thrown in at the deep end. The default LLM chat UI is like taking brand new computer users, dropping them into a Linux terminal and expecting them to figure it all out.

    Meanwhile, it’s increasingly common for end users to develop wildly inaccurate mental models of how these things work and what they are capable of. I’ve seen so many examples of people trying to win an argument with a screenshot from ChatGPT—an inherently ludicrous proposition, given the inherent unreliability of these models crossed with the fact that you can get them to say anything if you prompt them right.

    There’s a flipside to this too: a lot of better informed people have sworn off LLMs entirely because they can’t see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!

    There is so much space for helpful education content here, but we need to do a lot better than outsourcing it all to AI grifters with bombastic Twitter threads.

    23 votes
    1. [11]
      creesch

      That one also really stood out to me as something that is important. In fact, I was talking about it yesterday.

      I think it goes hand in hand with "LLMs need better criticism":

      A lot of people absolutely hate this stuff. In some of the spaces I hang out (Mastodon, Bluesky, Lobste.rs, even Hacker News on occasion) even suggesting that “LLMs are useful” can be enough to kick off a huge fight.

      I get it. There are plenty of reasons to dislike this technology—the environmental impact, the (lack of) ethics of the training data, the lack of reliability, the negative applications, the potential impact on people’s jobs.

      LLMs absolutely warrant criticism. We need to be talking through these problems, finding ways to mitigate them and helping people learn how to use these tools responsibly in ways where the positive applications outweigh the negative.

      I like people who are skeptical of this stuff. The hype has been deafening for more than two years now, and there are enormous quantities of snake oil and misinformation out there. A lot of very bad decisions are being made based on that hype. Being critical is a virtue.

      If we want people with decision-making authority to make good decisions about how to apply these tools we first need to acknowledge that there ARE good applications, and then help explain how to put those into practice while avoiding the many unintuitive traps.

      (If you still don’t think there are any good applications at all I’m not sure why you made it to this point in the article!)

      I think telling people that this whole field is environmentally catastrophic plagiarism machines that constantly make things up is doing those people a disservice, no matter how much truth that represents. **There is genuine value to be had here, but getting to that value is unintuitive and needs guidance.**

      Those of us who understand this stuff have a duty to help everyone else figure it out.

      The part I bolded I would personally have worded slightly differently, but in essence it is still what I think is important in this context.

      14 votes
      1. [10]
        whbboyd

        we first need to acknowledge that there ARE good applications

        Okay.

        What are they?

        One of the reasons I'm critical of GNN tech is that I have yet to hear anyone articulate an actually compelling use case for them.¹ Once you take the bullshit into account, any output either (a) needs to be checked by a reliable agent, or (b) can't matter. For the second case, obviously, any other use of the resources being sunk into GNNs would be preferable. For the first, usually the effort required to check is going to far outweigh any benefits the model could give. (This is certainly the case for code, though the temptation to spew a bunch of bullshit you don't understand into the codebase is apparently overwhelming for some.) It's conceivable these models could be a net benefit as an input into some other reliable process (for example, as a "rubber duck" to aid in debugging or overcoming writer's block), but this is a pretty niche use case that almost certainly does not support the current scope of the industry. (Also, as noted, the temptation for human agents to just serve as bullshit conduits instead of actually thinking is, apparently, incredibly strong.)

        NN-based classifiers are useful! They have clear pitfalls (e.g. vulnerability to adversarial inputs), but it's possible to mitigate those pitfalls (e.g. use them on raw sensor data, which is relatively difficult to engineer adversarially, rather than on arbitrary, potentially-hostile user inputs) and to apply them in situations where the unreliability is less critical (if parts of my photo library are mistagged, that's not the end of the world; they would be anyway). So I don't find it inconceivable that GNNs could have useful applications. But while I've seen a lot of "cautiously critical" writers insist that they do have useful applications (which aren't chatbots), I haven't seen anyone enumerate concretely what those applications are.


        ¹ Okay, this isn't quite true: the use case of "generate bullshit to fill a bullshit field which will never be read but is required for bureaucratic reasons" is reasonable. But, of course, the question we should ask obviously isn't "how do we efficiently generate bullshit" but "why do we require bullshit here to begin with".

        10 votes
        1. [2]
          creesch

          I should have expanded a bit more on what I meant with that phrasing. With the current hype train still going strong, the people who get to make the decisions are going to want to implement AI regardless. Even if you don't think there are any truly useful applications, you can still be informed about which sorts of applications are at least not harmful. Basically, if you shut down every proposal involving AI while management is going to implement something AI-related anyway, you are giving away your voice to the people who fully drank the Kool-Aid, and that most certainly is not going to make things better.

          Having said that, I do think there are useful applications: applications where LLMs are tools complementing professionals who have the knowledge and experience to actually use them in that capacity, not material generators in their own right. It also very much depends on the field of work we are looking at. You get to this yourself later, to a certain degree.

          It's conceivable these models could be a net benefit as an input into some other reliable process (for example, as a "rubber duck" to aid in debugging or overcoming writer's block),

          The rubber duck aspect is where LLMs are quite good in technical IT jobs. It is the one application I am most familiar with.

          • Deciphering spaghetti code: LLMs are generally pretty good at picking apart code blocks and explaining the functional parts. A while ago I was dealing with code that had lots of methods on single lines with tons of conditions. I put it in an LLM chat, asked it to go over it, and it gave me a point-by-point explanation of all the logic in there (see the sketch just below this list). I don't expect it to be perfect here; it doesn't need to be. The way my mind works, once I have the explanation, I can much more easily go back to the single-line mess and follow it along. If the LLM messed up, I will see that, but I will also be much further along with the deciphering than if I had been doing it manually.
          • Before creating a PR, I might let an LLM go over the changes first and see what it comes up with. Very often a bunch of its remarks are not helpful, but it has also caught small errors and other things that I overlooked.
          • Getting a quick answer to "how do I do X again" in a language I haven't used in quite a while.
          • Getting started with a well-established technology I have personally not worked with. LLMs struggle with truly new things, but if it is a well-established framework or language that I simply haven't worked with yet, I often have the documentation open in one window and an LLM chat in another. If I encounter something in the documentation that is unclear, I go over it with the LLM.
          • Setting up boilerplate code, with some feedback, when starting on something new. Something like: "I am going to work on X, with Java and this specific technology, and I am thinking of dividing it up like this. Anything I am forgetting? If not, can you create the empty classes, etc."

          And a lot more like that. Basically, their application is a lot of smaller, low-level actions: places where I previously would need to wade through Google search results, where I would spend a lot of time on tasks I tend to find tedious, and, importantly, where the way I am using them doesn't impact the end product directly.
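
          To make the first bullet concrete, here is an invented Python stand-in (not the actual code I was dealing with) for that kind of one-liner, together with the unpacked version that the LLM's point-by-point explanation effectively gave me:

          ```python
          # Invented example: everything crammed onto one line, the sort of thing
          # I mean by "methods on single lines with tons of conditions".
          def shipping_cost(order): return 0 if order["total"] > 100 else (5 if order["express"] else 2) if order["domestic"] else (25 if order["express"] else 15)

          # The LLM's point-by-point explanation maps onto something like this:
          def shipping_cost_unpacked(order):
              # Orders over 100 always ship free.
              if order["total"] > 100:
                  return 0
              if order["domestic"]:
                  # Domestic: 5 for express shipping, otherwise 2.
                  return 5 if order["express"] else 2
              # International: 25 for express shipping, otherwise 15.
              return 25 if order["express"] else 15
          ```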

          What's really important here, though, is that I never expect perfect output, am constantly critical of what comes back, and don't copy it over blindly. I can do that because I have the experience to do so. So I wouldn't advocate just giving the same tools I use to juniors and expecting them to have the same reservations and critical eye I have.

          Another field where they are actually quite good is translation, on the condition that there is enough training material for both languages in the model. For example, translation between Dutch and English I would trust to be fairly accurate with most models. I would put less trust in them translating between Croatian and English, as I know Croatian isn't as well represented in the training material of most models.

          If these applications seem somewhat underwhelming: yes, compared to the hype they are. But they are applications I am familiar with and feel comfortable being somewhat authoritative about. I can only be that because I did experiment with these applications rather than actively shunning them. And not because I saw value up front, but because I saw the hype starting up (a few years ago at this point) and wanted to become and remain informed.

          13 votes
          1. deathinactthree

            If these applications seem somewhat underwhelming: yes, compared to the hype they are.

            This is the conversation I often find myself in recently with friends and colleagues. I do use LLMs somewhat regularly, for the kinds of examples you mention--basically, tasks that I can do myself just fine but where an LLM speeds up the rote part. It doesn't "think" for me, it doesn't "conceptualize", and it sure as hell can't do my job for me or replace me doing it. But having enough expertise in a field to know exactly what to ask it, and to judge whether the results are usable or not, lets me save enough time to consider it useful.

            It's not sexy at all though, whatever the AI hype needs you to believe. I usually tell my colleagues that an LLM is like a calculator...as in, I can do long division on my own just fine, but a calculator can do it faster. But a calculator is also useless if you don't know how to set up the problem you're trying to solve with it. It can't do anything at all on its own, so it isn't magic, it can't replace you, and it's not going to launch Skynet or whatever.

            That's how I approach it, and they're usually vaguely disappointed in that answer. They see it as underwhelming. I just see it as another tool, lumped in utility-wise with things like regex and bash scripts.

            9 votes
        2. [3]
          updawg

          Well, ChatGPT has a nearly 100% success rate helping me set up and troubleshoot Excel formulas. Basically, the only times it's wrong are because I didn't give it an accurate description of my data. It's far faster and more personal than trying to find generic answers online.

          The other thing where I really have a better-than-100% success rate (it helps me with more than what I ask) is translation. I can specify how formal I want something to be, and it can provide options and tell me how each one is used in different styles of language. It can also correct and teach me. You're literally asking grammar books questions and getting descriptions no normal human could provide.

          7 votes
          1. [2]
            creesch

            The other thing where I really have a better-than-100% success rate (it helps me with more than what I ask) is translation.

            I did mention translation in my comment just now as well. It does depend on the training material and how well both languages were represented in it. Languages like English, German, Dutch and French are all well enough represented in the data sets that I would trust most general LLMs to do fairly well. But the less material was available for training, the more the translation will start to break down.

            Having said that, I know that companies like DeepL are pivoting to using LLMs for some translations instead of the sort of models and methods they used before.

            3 votes
            1. V17

              Languages like English, German, Dutch and French are all well enough represented in the data sets that I would trust most general LLMs to do fairly well. But the less material was available for training, the more the translation will start to break down.

              For Czech it does not work that well out of the box, when the instruction is just "translate the following:", but it dramatically outperforms other automated translation tools if you need it to translate something tricky, like context-dependent idioms, double entendres, etc., and you specifically tell it to explain the tricky part. I've seen it give terrific ideas for translating some very tricky language jokes by Terry Pratchett, for example.

              1 vote
        3. [2]
          Wes

          I use LLMs frequently for all kinds of tasks. Transforming data from one format to another. Editing text to find mistakes or clumsy writing. Writing and understanding code. Summarizing long documents, and allowing me to ask direct questions of the material.

          Other times, I'll use them to give me an introduction to a new topic, or to compare two options I'm uncertain about. More often than I'd like to admit, I'll use them to help me remember a word I can't quite put my finger on (a task basically impossible for Google). They're incredibly good at natural language processing, and can turn vague commands into concrete API calls. Their applications seem endless to me, and I'm still trying to remember to integrate them into my workflow where possible.

          Of course, I understand that LLMs are fallible. I would never ask them a pointed question about an obscure topic, because I know I'm unlikely to get a good answer back. I understand that they read a great deal of meaning into my query, and are likely to repeat what I say back at me (including any incorrect assumptions). Prompts matter a great deal, and using an LLM is a learned skill, just as using Google was 20 years ago.

          I predict that in 5-10 years, knowing how to use LLMs will be considered a computer competency skill for many occupations, like using a browser or Excel. And in 15 years, it won't be, because the systems will be good enough to abstract away the details.

          7 votes
          1. Minori

            I'll use them to give me an introduction to a new topic

            This is where I've personally found the most pitfalls. If it's a field where you're familiar with the subject matter (political science in my case), it can help to compare and contrast a couple of topics.

            The problem is I've found there are still frequent, subtle mistakes that are obvious to any subject matter expert. While the same thing can happen when I try to read a bad Wikipedia article on a new subject, it's rarer, and I know how to evaluate sources better than I know how to figure out if an LLM is hallucinating bullet points.

            As a practical example, I've repeatedly had language learners tell me they use ChatGPT to get Japanese grammar explanations. I've tried the same, and the explanations are wrong or extremely misleading about 75% of the time. There are good automatic tools for breaking down Japanese sentences into their grammatical components, but many learners reach for the "all-knowing" chatbot that teaches them wrong.
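
            For contrast, this is roughly what I mean by a good automatic tool. A minimal sketch using mecab-python3 (assuming the package and the unidic-lite dictionary are installed): it segments a sentence into morphemes with part-of-speech tags deterministically, so nothing is hallucinated.

            ```python
            import MeCab  # pip install mecab-python3 unidic-lite

            # Tagger() picks up the installed dictionary automatically.
            tagger = MeCab.Tagger()

            # parse() returns one morpheme per line with part-of-speech details,
            # here the segmentation of "neko ga suki desu" (I like cats).
            print(tagger.parse("猫が好きです"))
            ```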

            2 votes
        4. tibpoe

          You're holding the technology to an unreasonably high bar if you can't see any useful applications. In my own experience, I frequently use it in one of two ways:

          1. In programming, where I am an expert, LLM-based code completion makes things faster to write. It is especially helpful when switching between projects/languages frequently, where I know exactly what I want but do not know how to express it in a different context.
          2. In fields where I am not an expert, they are very helpful for getting an overview of the field and learning the vocabulary. It doesn't matter if there are mistakes, because I'll refine my understanding as I continue my research. For example, I was recently looking for a sculptable material that is still flexible after it solidifies. It suggested several unsuitable things, but Sugru and polycaprolactone seem worth trying out.
          5 votes
        5. sparksbet

          I'm a data scientist, and I mostly worked on non-generative NN-based text classifiers at my last job. We got quite a lot of utility out of using generative LLMs to help augment our training data. They could be pretty effective at pointing out potentially mislabelled sentences in our training data, or at suggesting examples for edge cases that weren't yet covered. I was unimpressed by all the "AI coding assistants" and other technical tools for general software engineering, but there are definitely places where these generative models can be useful for other NLP projects, especially when the data scientists involved are savvy about their limitations. When you're fine-tuning language models, it can be pretty valuable to have a different, very good language model on hand.
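
          As a minimal sketch of the label-triage idea (with `llm_complete` a hypothetical stand-in for whatever model endpoint you use; the model only nominates suspects, and a human annotator makes the actual call):

          ```python
          def llm_complete(prompt: str) -> str:
              """Hypothetical stand-in for a call to whatever model endpoint you use."""
              raise NotImplementedError

          def flag_suspect_labels(examples: list[tuple[str, str]]) -> list[tuple[str, str]]:
              """Return the (sentence, label) pairs the model disagrees with.

              An occasional wrong answer here costs a bit of review time,
              not label quality, since humans re-check every suspect.
              """
              suspects = []
              for sentence, label in examples:
                  prompt = (
                      f"Sentence: {sentence!r}\n"
                      f"Proposed label: {label}\n"
                      "Answer YES if the label fits the sentence, otherwise NO."
                  )
                  if llm_complete(prompt).strip().upper().startswith("NO"):
                      suspects.append((sentence, label))
              return suspects
          ```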

          I'm pretty conservative when it comes to these models myself, as I think a lot of people are wildly overselling their abilities, pro- and anti-AI crowds alike. I'm very burned out on LLMs at the moment (in no small part because I'm job hunting and I'm sick of seeing listings that demand impossible amounts of experience with them). But I figured that since you can see valuable use cases for non-generative neural networks, you might be interested in how these models can be used to enhance those approaches, since I have firsthand experience there.

          5 votes