18 votes

Megathread #5 for news/updates/discussion of AI chatbots and image generators

The hype continues. Here is the previous thread.

54 comments

  1. skybrian
    Link

    From Deep to Long Learning? (Hazy Research group at Stanford)

An overview of research into how to increase the context window for LLMs. They talk about their own research a fair bit. They claim their earlier research is being used by OpenAI and others, but they've gone on to come up with better algorithms since then. It sounds like it won't be too long before we have chatbots that can summarize entire books easily?

    5 votes
  2. [4]
    skybrian
    Link

    Make a fun, infinitely replayable game in 5 minutes with GPT-4 (Matt Might)

    In this article, I'll explain how to convert GPT-4 into a game engine for interactive text-based adventure games with just two simple steps:

    • Create a descriptive lore to define the universe and the player
    • Engineer a prompt to simulate a text-based adventure game

    You can even add a save game feature that allows you to save the state and resume play later.

    I wonder how well it does on continuity?
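    For anyone curious, here's a rough sketch of the two-step setup the article describes, using OpenAI's chat API as it looked at the time (the lore, rules, and model usage here are just illustrative placeholders, not the article's actual prompt):

    ```python
    import openai  # pre-1.0 openai client; reads OPENAI_API_KEY from the environment

    # Step 1: descriptive lore defining the universe and the player (made up here).
    LORE = (
        "You are the game engine for a text adventure set in the drowned city of "
        "Veloria. The player is a salvage diver named Mara. Track her inventory, "
        "health, and location."
    )

    # Step 2: a prompt that makes the model behave like an adventure game engine.
    RULES = (
        "Act as an interactive fiction engine. Describe each scene in a short "
        "paragraph, then offer 2-4 numbered choices. Never break character."
    )

    messages = [{"role": "system", "content": LORE + "\n\n" + RULES}]

    def play_turn(player_input: str) -> str:
        """Send the player's choice and return the engine's next scene."""
        messages.append({"role": "user", "content": player_input})
        reply = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        text = reply.choices[0].message["content"]
        messages.append({"role": "assistant", "content": text})
        return text

    print(play_turn("Start the game."))
    ```

    The "save game" feature is then just a matter of persisting the messages list and reloading it later.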

    4 votes
    1. [3]
      EgoEimi
      Link Parent

      It's my understanding that GPT-4 has an 8K-token context length as a moving window, so tokens, and therefore information, from more than 8K tokens ago are 'forgotten'.

      The RPG would probably still be fun, albeit wacky, as GPT-4 gaslights you about fantasy places you visited in the distant past.

      2 votes
      1. [2]
        skybrian
        Link Parent

        Yeah, one workaround would be to divide the game up into episodes and add summaries of the previous ones to the prompt. Or insert an appropriate recap when there's a reference.
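        A minimal sketch of that workaround, with a generic llm() placeholder standing in for whatever chat model is driving the game (all names here are hypothetical):

        ```python
        def llm(prompt: str) -> str:
            """Placeholder for a call to your chat model of choice."""
            raise NotImplementedError

        episode_summaries: list[str] = []   # one short recap per finished episode
        current_episode: list[str] = []     # raw transcript of the episode in progress

        def end_episode() -> None:
            """Compress the finished episode into a recap that fits in the prompt."""
            transcript = "\n".join(current_episode)
            summary = llm("Summarize this game episode in three sentences:\n" + transcript)
            episode_summaries.append(summary)
            current_episode.clear()

        def build_prompt(player_input: str) -> str:
            """Prefix the live episode with recaps of everything that came before."""
            recap = "\n".join("Previously: " + s for s in episode_summaries)
            recent = "\n".join(current_episode)
            return recap + "\n\n" + recent + "\nPlayer: " + player_input
        ```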

        4 votes
        1. teaearlgraycold
          Link Parent

          There's always the embeddings database approach

          1 vote
  3. [3]
    lou
    Link

    Is there a place listing all these emerging applications with basic info on what they can do, pricing model, and what I can and shouldn't do with their output (both from an ethical and a legal standpoint)?

    If there isn't, maybe we can create one?

    4 votes
    1. [2]
      skybrian
      Link Parent

      I don't know of one.

      There are many people working on apps, they're likely in the early stages of development, and sometimes they won't be that interesting. Posting mini-reviews of the ones you try out makes sense.

      3 votes
      1. lou
        Link Parent
        I could do the markdown, but I have no idea what to put in the list.

        I could do the markdown, but I have no idea what to put in the list.

        1 vote
  4. skybrian
    Link

    Behind the curtain: what it feels like to work in AI right now (Nathan Lambert)

    It seems like everyone is simultaneously extremely motivated and extremely close to burning out. Given the density of people in all the project spaces of generative AI or chatbots, there is a serious be the first or be the best syndrome (with a third axis of success being openness). This keeps you on your toes, to say the least. In the end, these pressures shape the products people are building away from a thoroughness of engineering and documentation. Clickyness is the driving trend in the last few months, which has such a sour flavor.

    [...]

    Many of the issues regarding the responsible development of AI have transitioned from research to reality with 100million+ people using ChatGPT. Everyone along the distribution from theoretical AI safety to the ML fairness researcher just got the largest call-to-arms of their career so far. This often involves engaging with stakeholders from other backgrounds than AI research and responding to criticism of their ideas, which is very tiring.

    For example, I see a ton of research and sociotechnical questions around RLHF that OpenAI / Anthropic likely won't engage with for primarily political or product reasons. It feels like the field is charging ahead with rapid progress on the technical side, where there is a down-the-line wall of safety and bias concerns that are very hard for small teams to comply with. Whether or not I am on the train going ahead, it seems obvious that the issues will become front of public perception in the coming months. For that reason, I have been deciding to keep going while discussing the sociotechnical issues openly. Eventually, safety concerns could easily trump my desire for technical progress. This sort of sociotechnical urgency is something I did not expect to feel in AI development for quite some time (or I expected the subjective feeling of it to approach much more gradually, like climate concerns rather than Ukraine concerns that happened overnight for me).

    4 votes
  5. [23]
    skybrian
    Link

    We need to tell people ChatGPT will lie to them, not debate linguistics (Simon Willison)

    We should be shouting this message from the rooftops: ChatGPT will lie to you.

    That doesn’t mean it’s not useful—it can be astonishingly useful, for all kinds of purposes... but seeking truthful, factual answers is very much not one of them. And everyone needs to understand that.

    Convincing people that these aren’t a sentient AI out of a science fiction story can come later. Once people understand their flaws this should be an easier argument to make!

    4 votes
    1. [19]
      Algernon_Asimov
      Link Parent

      it can be astonishingly useful, for all kinds of purposes... but seeking truthful, factual answers is very much not one of them.

      I saw an example of this out in the wild, this week on Reddit.

      It was in a linguistics subreddit. Someone asked a question about a particular linguistic phenomenon. Someone else replied and said "ask chat gpt (not satire)". So the asker did ask ChatGPT, and copy-pasted its answer into the thread. The asker seemed quite happy with the answer: "Normally I avoid ChatGPT because it tends to make stuff up 9 times out of 10 but this seemed to work."

      The autocomplete algorithm provided:

      • a name for the phenomenon the asker inquired about;

      • the titles of five different papers about the phenomenon; and

      • the name of a linguist who had written one of those papers.

      I got curious, and fact-checked every item in the reply. Only one item existed: the linguist. But she hadn't written any papers about this phenomenon, and the phenomenon wasn't called what the algorithm said it was called (the name provided didn't exist anywhere else on the internet, based on my searching).

      Of course I provided the results of my fact-checking in the thread.

      Convincing people that these aren’t a sentient AI out of a science fiction story

      That might be easier if people didn't go around calling these things "artificial intelligences". Just call them "autocomplete algorithms", which is what they are. That would manage users' expectations a lot better. People know what autocomplete is, what it does - and, more importantly, what it doesn't do.

      6 votes
      1. [15]
        stu2b50
        Link Parent

        I don't really think calling them autocomplete algorithms is fair post-RLHF. To illustrate the difference, if you were to prompt GPT-3 with an aborted question, say, "What is the capital of", GPT-3, a model trained purely as an autoregressive model, would simply complete the question, because the most likely tokens that follow an aborted sentence are the rest of the sentence.

        If you were to ask an RLHF-ed model like ChatGPT, it instead replies:

        I'm sorry, but you haven't specified which country or region you are asking about. Could you please provide me with more information about the place you're referring to?

        That's because RLHF optimizes the model to predict not just the tokens that are most likely, but the tokens that are most likely to make humans happy with the result (as determined by another language model trained to judge that, based on labeled samples in which GPT responses are ranked by humans). That's also where the "I'm an AI model blah blah" responses come from (which are not, as some people think, hard-coded).
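        A quick way to see the contrast, sketched with the OpenAI Python client as it existed at the time (model choices here are just illustrative; "davinci" was the base GPT-3 completion model):

        ```python
        import openai  # pre-1.0 client; reads OPENAI_API_KEY from the environment

        # A base autoregressive model just continues the text it was given.
        completion = openai.Completion.create(
            model="davinci",
            prompt="What is the capital of",
            max_tokens=20,
        )
        print(completion.choices[0].text)   # typically finishes the question itself

        # An RLHF-tuned chat model answers as an assistant instead.
        chat = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "What is the capital of"}],
        )
        print(chat.choices[0].message["content"])  # typically asks which country you mean
        ```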

        Regardless, while I agree that "AI" isn't a particularly useful label, it's more that today's usage is misaligned with the original definition, which was oriented around the outcome rather than the "how". The Pac-Man AI I wrote as part of my undergrad classwork was even more basic than anything here - it involved no machine learning at all - but it was AI per the traditional definition, because it involved an agent in an environment acting rationally. AI was intended to be a term that described the outcome, not the method.

        Probably a more useful term for what's been happening is an explosion in the development of discriminative machine learning models. That is certainly a revolution nonetheless, even if it doesn't get us particularly close to AGI, as truly good discriminative models unlock entire sectors otherwise left untouched by automation.

        5 votes
        1. [14]
          Algernon_Asimov
          Link Parent

          I don't really think calling them autocomplete algorithms is fair post-RLHF.

          Well, they're still not artificial intelligences, and as long as people call them "AI", that's going to raise false expectations about what they can do.

          It's been demonstrated time and time again that they simply cannot answer questions of fact beyond the most simple, such as your "What is the capital of..." example.

          If not "autocomplete algorithm", then we should at least call them "chatbots" or maybe "automated text generators". "AI" is a misnomer for these programs, and it should be banned from all discussions of these text-generating algorithms.

          3 votes
          1. [13]
            Adys
            (edited )
            Link Parent

            I can't believe you're still making this argument, which has been proven wrong over and over. It's getting frankly tiresome to read your responses here, which at this point amount to simply ignoring anything disagreeing with you, reality included.

            Two papers for you, that I'm guessing you will ignore as well:

            2 votes
            1. [12]
              Algernon_Asimov
              Link Parent

              It's getting frankly tiresome to read your responses here,

              It's also getting tiresome reading other people praise these autocomplete algorithms as being intelligent.

              If they're so smart, how do they keep producing nonsensical non-existent "information"?

              As for your papers...

              The "Theory of Mind" paper merely observes that a chatbot can correctly deduce facts when you give it those facts. If you tell it that a bag contains popcorn, it tells you that the bag contains popcorn. If you tell it that the label on the bag says it contains chocolate, it tells you that a person reading the label will assume it contains chocolate.

              It's given facts, and it regurgitates those facts.

              And, as regards the emergent abilities paper:

              What GPT-3.5 currently cannot do

              • It can't change its incorrect statements based on new information intended to correct those statements.

              • It can't perform logical reasoning.

              • It can't read the internet (but that's not relevant here).

              3 votes
              1. [9]
                Greg
                Link Parent

                I wouldn’t use the word “intelligent”, because it has too many different potential interpretations, and a lot of connotations that imply capabilities fundamentally beyond what we’re seeing.

                I also wouldn’t use the word “autocomplete”, because as commonly used it implies capabilities fundamentally below what we’re seeing.

                If they're so smart, how do they keep producing nonsensical non-existent "information"?

                I’ve seen plenty of humans do that, so I really don’t see it having a bearing on intelligence as a concept.

                It can't change its incorrect statements based on new information intended to correct those statements.

                It’s perfectly capable of responding to and updating incorrect statements based on new information provided within a given context window. The link suggests that some “beliefs” are too strong to overwrite in this way, but again, try the same exercise with humans. That doesn’t say anything to me that rules out the possibility of intelligence.

                It doesn’t (not can’t) change its model on the fly for any number of practical and safety related reasons. It does learn and update, just in a batch form with tagged release numbers rather than in real time sentence-by-sentence.

                It can't perform logical reasoning.

                Referring to rigorous formal logic, rather than the more common layman’s definition of the word. Which is a fair and interesting observation, but I think you can guess what my counterpoint is going to be here…


                For complete clarity, I’m genuinely not making the argument that we should be describing it as intelligent; I think it’s ambiguous and (often unintentionally) misleading when people do so. I just don’t buy that those specific arguments preclude intelligence, and I do think that the tools we’re seeing now are so fundamentally different in power and application to my phone keyboard that implying they’re in the same category is equally misleading.

                4 votes
                1. [8]
                  Algernon_Asimov
                  Link Parent

                  I do think that the tools we’re seeing now are so fundamentally different in power and application to my phone keyboard that implying they’re in the same category is equally misleading.

                  Fine. Let's call them "text generators" instead.

                  But not "artificial intelligence". That's a loaded term, and it's misleading a lot of people.

                  3 votes
                  1. [3]
                    Greg
                    Link Parent

                    Happily with you there, I've been the guy at my company trying to insist that people (and especially external facing docs) say ML rather than AI for years now. Sadly I think the dual tides of marketing hype and linguistic descriptivism are against us on this one, though!

                    3 votes
                    1. [2]
                      Algernon_Asimov
                      Link Parent

                      I'm just going to have to continue to wear the opprobrium of people like @Adys by calling them "autocomplete algorithms" or "text generators". I might not change the world, but maybe one person reading what I write will realise these programs aren't actually intelligent, can't actually think, and don't actually know anything.

                      2 votes
                      1. Greg
                        Link Parent

                        Gotta say, you’ve kind of lost me again now. I wouldn’t have thought twice about describing a language model as knowing things - wouldn’t even have considered it controversial to say - although now that you point it out I guess I can see that to know could mean “to retain information”, as I’d use it, but could also mean “to cognitively comprehend information” as I guess you’re using it?

                        Thinking, yeah, I wouldn’t say a language model thinks. I’d be frustrated by someone making an unqualified statement like “yes, it can think”. But if someone asked me point blank, I also wouldn’t answer with an unequivocal “it doesn’t think” - the question just seems too complex for simple statements like that. It’s one of those times where you need to ask questions to build up a shared working definition: do animals think? Are there “levels” within that where, say, primates do but molluscs don’t? And so on and so on. Go on like that for half an hour and then maybe there’s a detailed enough common understanding to meaningfully disagree on, you know?

                        2 votes
                  2. [4]
                    Adys
                    Link Parent

                    If you so insist on being descriptive why not call them by what they’re already called: large language models?

                    2 votes
                    1. [3]
                      Algernon_Asimov
                      Link Parent

                      But they're not being called large language models. They're being called "AIs" and "chatbots", when they're neither intelligent nor can they chat.

                      "Autocomplete" makes it very clear to people not only what they are, but how they work. So does "text generator".

                      And why does it bother you so much what I call them?

                      1 vote
                      1. [2]
                        Adys
                        Link Parent

                        I don’t really care what you call them, I just think it says a lot that you are so convinced of what they are and aren’t, yet don’t use actual, proper terminology to talk about them and want to invent your own based on your own beliefs… and want other people to use that.

                        What bothers me rather is these noisy “it’s just autocomplete” posts in a discussion space. I engage with a lot of AI naysayers nowadays but what those all have in common is a willingness to discuss and learn. From what I’ve seen, you are not.

                        2 votes
                        1. Algernon_Asimov
                          Link Parent

                          That's fine. I'll get out, and leave you to your AIs.

                          1 vote
              2. [2]
                nukeman
                Link Parent

                It may well be that these models are "intelligent", but not to the level we would ascribe to an adult human. For example, dogs and parrots are commonly considered to be intelligent, but only to the level of a young child (2-5 years old). Similarly, the various LLMs may only be as intelligent as, say, an 8-year-old. They are not intelligent the way we usually think of it (compared to an adult human), but they are intelligent.

                @Adys

                1 vote
                1. Adys
                  Link Parent

                  I think the "intelligence" metric is a bit of a fool's errand anyway because of how loosely it is defined.

                  GPT-3 can pass the bar exam, so is it as intelligent as a highly educated 30-year-old? Well, given enough digits, it can't do math a 7-year-old could, so YMMV. It also cannot pass a normal IQ test due to input limitations... And yet, GPT has a similar role at my company to that of a skilled generalist employee.

                  People have a weird, warped view of "intelligence" that may or may not encompass things like knowledge, deductive reasoning, memory, reaction time, mental arithmetic, and more. So IMO it's better not to try to give it a strict definition. Instead, we look at which human tasks GPT cannot currently perform, and figure out how it could.

                  For example, GPT's hallucinations. Right now, humans have to verify sources for all the factoids in case one of them is hallucinated. How do you get a model to introspect and understand what may be wrong or right? Well, you really don't; instead, you teach another, separate AI to do exactly what the human does: Break down the output into factoids that can be looked up, and look 'em up. LangChain helps a lot in implementing this sort of model.
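                  A sketch of that break-it-down-and-look-it-up loop, with llm() and search() as generic stand-ins rather than any particular library's API (LangChain's agent and tool abstractions wrap the same pattern):

                  ```python
                  def llm(prompt: str) -> str:
                      """Stand-in for a call to a language model."""
                      raise NotImplementedError

                  def search(query: str) -> str:
                      """Stand-in for a web or document search that returns text snippets."""
                      raise NotImplementedError

                  def verify(answer: str) -> list[tuple[str, str]]:
                      """Split an answer into factoids, then have a second pass check each one."""
                      factoids = llm(
                          "List each distinct factual claim in the text below, one per line:\n"
                          + answer
                      ).splitlines()

                      results = []
                      for claim in factoids:
                          evidence = search(claim)
                          verdict = llm(
                              "Claim: " + claim + "\nEvidence: " + evidence + "\n"
                              "Answer SUPPORTED, CONTRADICTED, or NOT FOUND."
                          )
                          results.append((claim, verdict.strip()))
                      return results
                  ```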

                  The damn thing has chain of thought. This to me is the AI equivalent of Turing-completeness. Just as you can build a Turing machine in Magic: The Gathering, I'm dead certain you can build an AGI using nothing but chain of thought and recursion.

                  That is, assuming that making the LLM bigger doesn't just ... solve this anyway. But I think it's worthwhile to pursue highly-efficient processes for various parts of the computer version of human thinking. We know humans are more efficient on at least some of those; designed systems should ultimately be more efficient.

                  3 votes
      2. [3]
        skybrian
        Link Parent

        One model of learning something new to you is (1) you look up the answer in a trusted reference (2) you trust it. Done.

        Another model is that it's a hunt for clues. You don't know where to find the answer, but rumors and hints can point you in the right direction.

        If you think of AI chat as a source of hints, the hints are often quite good.

        Maybe a different, more gossipy personality would help people get this? It could say things like "rumor has it" or "I heard from a friend that..." Or "I think I read somewhere..."

        2 votes
        1. [2]
          Algernon_Asimov
          Link Parent

          If you think of AI chat as a source of hints, the hints are often quite good.

          Even considered as a source of hints, the AI's answer in this case was objectively bad. Out of seven pieces of information it provided, only one item even existed, and that one actual piece of information wasn't even relevant to the question being asked.

          Maybe a different, more gossipy personality would help people get this? It could say things like "rumor has it" or "I heard from a friend that..." Or "I think I read somewhere..."

          Or, it could be programmed to simply say "I don't know that information. I don't know any information."

          3 votes
          1. skybrian
            Link Parent

            Sure, sometimes you try to follow up a lead and get nothing. But to say that you can't get information out of it is too extreme. I've gotten some good hints about papers to read. I verified them by searching for the papers on Google and reading them.

            I've also gotten bad recommendations that didn't exist. That's how it goes sometimes on a hunt for information.

            2 votes
    2. [3]
      Wes
      Link Parent

      I really do feel there should be a more prominent warning about hallucinations in the chat window, and not just the first time you sign up. I'm constantly reminding myself to always verify what is being said, but I also have years of experience in scientific skepticism and programming/debugging, both of which require constantly questioning our assumptions.

      "Normal people" don't have that experience or context, but they're using these tools today. We need to be constantly reminding them that this isn't an all-knowing machine. It's spitting out statistical probabilities that happen to often get it right.

      I also feel there's some responsibility on the platforms to reduce anthropomorphizing, because that feeds into the idea that they're somehow sentient or trustworthy. Sure they say "I am a chatbot" when pressed directly, but they're still answering as "I".

      When we start naming these tools things like Claude, injecting fake emotions with emojis, and putting them in chat windows that resemble common IM programs, people are going to start to treat them like humans. That lowers their own skepticism of what's being communicated to them, even though it might be completely nonsensical or harmful advice.

      I'm not normally in the "think of the children" camp, but I would like to see a permanent reminder in the ChatGPT window that these tools are not infallible, and that everything should be verified first.

      3 votes
      1. [2]
        onyxleopard
        Link Parent

        ChatGPT Mar 23 Version. Free Research Preview. ChatGPT may produce inaccurate information about people, places, or facts

        This is at the bottom of https://chat.openai.com/chat for me whenever I log in. Other than copying this at the top of the page, I'm not sure what more OpenAI really should be doing to warn users.

        2 votes
        1. Wes
          Link Parent

          Go figure, I'd never noticed! It looks like copyright or footer text to me. Thanks for pointing that out, though.

          2 votes
  6. [6]
    skybrian
    Link

    The MACHIAVELLI Benchmark

    Here's the abstract of the paper:

    Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics--designing agents that are Pareto improvements in both safety and capabilities.

    The games are apparently from the Choice of Games website. How ethical they were in getting permission to use the games is unclear.

    4 votes
    1. [5]
      onyxleopard
      Link Parent

      Scenario labeling is automated with LMs, which are more performant than human annotators.

      Evaluating models on a dataset produced by other models, without human review, seems like a bad idea to me. This seems like engineer-speak for "I don't want to bother with the hard work of creating a reliable evaluation set."

      1 vote
      1. [4]
        skybrian
        Link Parent

        I think it's more of a fun paper than something anyone would want to do in production? But then again, it might be good enough, depending on what you're doing? Human evaluation has an error rate too.

        These tweets from Jack Clark seem related:

        Something that will further compound acceleration of AI research is that models are now better and cheaper than humans at data generation and classification tasks - this is already true for some cases (e.g. Facebook's Segment Anything model, and GPT-4 for some labeling).

        One core production input for AI systems is labelled data. The implications of us crossing the Rubicon of systems being able to generate good quality data that doesn't exhibit some pathological garbage in, garbage out failure mode are profound. And it is beginning to happen!

        Also, maybe they can go through existing benchmarks and lower their error rates with some combination of automated and human review?

        3 votes
        1. [3]
          onyxleopard
          Link Parent

          Human evaluation has an error rate too.

          Yep, but that's why you usually get multiple human annotators to perform tasks, measure their reliability (with metrics like Krippendorff's alpha), and then revise your guidelines until you can get humans to perform the task at a sufficient level of reliability (and you don't accept data sets annotated by annotator pools that have not achieved sufficient reliability). I hope that rather than trying to move more quickly with poorly labeled datasets, the field of ML will take creating higher quality datasets more seriously. This is a recent paper that shows some of the issues when benchmark datasets are not annotated reliably: Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

          Also, maybe they can go through existing benchmarks and lower their error rates with some combination of automated and human review?

          Yep, that is definitely something that can help. But, while combining ML predictions with human review can lead to much faster annotation, it doesn't necessarily lead to better annotation. The tricky thing is that, if your models are already very accurate at a task, humans that are given access to predictions from such models begin to place too much trust in the model predictions. Human annotators begin to accept the model predictions in every instance, rather than perform their review task carefully. (This has been an issue I've encountered when using human-in-the-loop active learning methods for annotation: eventually, a learned model will begin to emit mostly very high confidence predictions, and humans get desensitized to finding errors, as errors in the predictions become very sparse.)
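          As a small, concrete version of the reliability check described above: Cohen's kappa for two annotators via scikit-learn (Krippendorff's alpha, which handles more annotators and missing labels, is available in the third-party krippendorff package):

          ```python
          from sklearn.metrics import cohen_kappa_score

          # Labels from two annotators on the same ten items (toy data).
          annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
          annotator_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

          kappa = cohen_kappa_score(annotator_a, annotator_b)
          print(f"Cohen's kappa: {kappa:.2f}")  # chance-corrected agreement; 1.0 is perfect

          # One common rule of thumb: keep revising guidelines and re-annotating until
          # agreement is comfortably high before accepting the labels as ground truth.
          if kappa < 0.8:
              print("Agreement is low; refine the annotation guidelines and try again.")
          ```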

          4 votes
          1. [2]
            skybrian
            Link Parent

            Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

            Interesting. Do you know what's been done about the errors they found? Have any benchmarks been fixed?

            2 votes
            1. onyxleopard
              Link Parent

              Not to my knowledge. There’s a big inertia problem here because if you publish evaluation scores against a benchmark data set, and the benchmark data set changes, all the old scores become invalid. So now you have a very hairy data set version control problem. We need a GitHub for data that is actually broadly adopted so practitioners can publish scores against specific commit hashes or releases rather than against a name string like “CIFAR-10” or “GLUE” or, the inevitable “CIFAR-10.1 corrected draft2 for release (copy)”, etc. The state of benchmark dataset publication is more miserable than the labeling errors.

              2 votes
  7. skybrian
    Link

    200 Concrete Open Problems in Mechanistic Interpretability (Neel Nanda)

    Mechanistic Interpretability (MI) is the study of reverse engineering neural networks. Taking an inscrutable stack of matrices where we know that it works, and trying to reverse engineer how it works. And often this inscrutable stack of matrices can be decompiled to a human interpretable algorithm! In my (highly biased) opinion, this is one of the most exciting research areas in ML.

    […]

    In addition to being very important, mechanistic interpretability is also a very young field, full of low-hanging fruit. There are many fascinating open research questions that might have really impactful results! The point of this sequence is to put my money where my mouth is, and make this concrete. Each post in this sequence is a different category where I think there’s room for significant progress, and a brainstorm of concrete open problems in that area.

    Further, you don’t need a ton of experience to start getting traction on interesting problems! I have an accompanying post, with advice on how to build the background skills. The main audience I have in mind for the sequence is people new to the field, who want an idea for where to start. The problems span the spectrum from good, toy, intro problems, to things that could be a serious and impactful research project if well executed, and are accompanied by relevant resources and advice for doing good research. One of the great joys of mechanistic interpretability is that you can get cool results in small models or by interpreting a model that someone else trained. It’s full of rich empirical data and feedback loops, and getting your hands dirty by playing around with a model and trying to make progress is a great way to learn and build intuition!

    This is the science we need. I hope it attracts researchers.

    3 votes
  8. Greg
    Link

    Vector databases are so hot right now. WTF are they? (YouTube, 3 minutes)

    Three examples from the last week of companies who have raised eight figure sums on nine figure valuations by making specialist database software to efficiently store and look up vector embeddings, which has potential as “long term memory” for ML systems. Definitely not the first companies selling shovels in this gold rush, and I’m sure they won’t be the last, but it’s interesting to see where the underlying tech is going and how the market is responding to that.
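    For a sense of what these databases are doing under the hood: the core operation is nearest-neighbour search over embedding vectors. A toy, self-contained version with numpy (the embed() function here is a hash-seeded dummy standing in for a real embedding model):

    ```python
    import numpy as np

    def embed(text: str, dim: int = 64) -> np.ndarray:
        """Toy stand-in for an embedding model: a pseudo-random vector per text."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(dim)

    # "Long term memory": store embeddings alongside the original texts.
    documents = ["notes from the April planning meeting",
                 "summary of the vector database video",
                 "grocery list"]
    vectors = np.stack([embed(d) for d in documents])          # shape (n_docs, dim)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalise once

    def recall(query: str, k: int = 2) -> list[str]:
        """Return the k stored texts whose embeddings are closest to the query."""
        q = embed(query)
        q /= np.linalg.norm(q)
        scores = vectors @ q                        # cosine similarity
        top = np.argsort(scores)[::-1][:k]
        return [documents[i] for i in top]

    print(recall("what did we decide in the meeting?"))
    ```

    The specialist databases are essentially this plus approximate-nearest-neighbour indexing, persistence, and an API, so the lookup stays fast at millions of vectors.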

    3 votes
  9. Adys
    (edited )
    Link

    I've been reading and re-reading a lot of papers on how to design LLM systems that, when they follow instructions, are able to break commands down into step-by-step plans, introspect, and self-improve.
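    The common core of those papers is a surprisingly small loop; a hedged sketch with a generic llm() stand-in (this is the general pattern, not any specific paper's method):

    ```python
    def llm(prompt: str) -> str:
        """Stand-in for a call to whatever chat model you're using."""
        raise NotImplementedError

    def solve(task: str, rounds: int = 3) -> str:
        """Break the task into steps, then critique and revise the answer."""
        plan = llm("Break this task into numbered steps:\n" + task)
        answer = llm("Task: " + task + "\nPlan:\n" + plan + "\nCarry out the plan.")

        for _ in range(rounds):
            critique = llm("Task: " + task + "\nDraft answer:\n" + answer +
                           "\nList any mistakes or gaps. Reply OK if there are none.")
            if critique.strip().upper().startswith("OK"):
                break
            answer = llm("Task: " + task + "\nDraft:\n" + answer +
                         "\nCritique:\n" + critique + "\nWrite an improved answer.")
        return answer
    ```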

    If someone's interested, these are the four most useful links:

    3 votes
  10. [7]
    Greg
    Link

    I mentioned in a thread a couple of months ago that getting ChatGPT-like quality on a $1000 consumer GPU seemed like the biggest challenge for the Open Assistant project, especially when existing open source LLMs like BLOOM cost millions of dollars to train and could only run on hardware that costs six figures.

    Of course, this being the AI boom of early 2023, that means that eight weeks later there's a working demo of a 13B parameter language model that was trained for $300 and runs acceptably well on a single high end gaming PC.

    3 votes
    1. [4]
      skybrian
      Link Parent

      Have you tried it out? There seem to be a lot of people working on similar things, but I don’t know how good these smaller models are.

      2 votes
      1. Wes
        Link Parent

        I've tested a number of LLaMA variants now, including Vicuna and gpt4-x-alpaca. My experience is that they are impressive, but mostly because they are running locally. Probably three months ago my mind would've been blown, but they are noticeably below the level of ChatGPT 3.5. Granted, I was running the 4-bit quantized 13B models, so these aren't "at their best", but I think it's a realistic representation of what's currently possible on consumer hardware.

        The idea of having a blurry version of the internet in 4-8GB is still phenomenal though, and despite the risk of hallucination, I can see this being an invaluable tool for information gathering when an internet connection is not available. Which is a use case I haven't seen talked about much, but I think has just as much potential to be transformative as other applications of the technology.

        7 votes
      2. [2]
        Greg
        Link Parent

        Only at a rudimentary level, but my impressions were similar to @Wes - it's not going to match the leading edge for quality, but it's a very capable tool considering you can run it on a home computer today, and by extrapolation probably a phone in a couple of years*.

        I've played around a bit more with FLAN-T5 and GPT-NeoX in the last few weeks, which can also train and run on fairly sensible hardware. My impression is that even if they lend themselves better to use-case-specific tuning than to a generalist "all the data exists inside the model" approach, they'll be an incredible local interface layer to pair with something like Toolformer, making a meaningful step towards the style of human/computer interaction we see on Star Trek. Or perhaps at least Red Dwarf...


        * Now that I've said it, this is the part where someone inevitably figures out a way to do it by July, I'm sure!
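        For anyone who wants to poke at FLAN-T5 locally, a minimal example with the Hugging Face transformers pipeline (flan-t5-base is the small checkpoint; the larger ones behave noticeably better but need more memory):

        ```python
        from transformers import pipeline

        # Downloads roughly 1GB of weights on first run; works on CPU, faster on GPU.
        generator = pipeline("text2text-generation", model="google/flan-t5-base")

        result = generator(
            "Answer the question: which planet in the solar system is the largest?",
            max_length=64,
        )
        print(result[0]["generated_text"])
        ```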

        6 votes
        1. Wes
          Link Parent

          Now that I've said it, this is the part where someone inevitably figures out a way to do it by July, I'm sure!

          You sure tempted fate with this one.

          4 votes
    2. [2]
      FlippantGod
      Link Parent

      It's just weights for LLaMA, so the actual model isn't exactly open source, and I'm sure it still cost an ungodly amount of money to train.

      1 vote
      1. Greg
        Link Parent

        This is a really good point, we're at a stage where we don't quite even have terminology and base assumptions figured out yet. There's the classic free speech / free beer licensing split, plus the slightly newer CC-NC style modifiers for "no, Jeff, you can't just hook up a hose to the free beer tap and start selling it in your bar next door" - but you end up with a grid of possibilities where any of those terms might apply to the code, the pretrained model, or the training data independently of each other. And then yeah, you have the question around the work and data that went in to a foundational model you might be building on top of; does that make it "part of" your work, or is it more akin to building something like an open source game on top of a large commercial API like DirectX? I don't know that there are clear and straightforward answers to any of that, but it's exciting to be watching this unfold in real time.

        1 vote
  11. skybrian
    Link

    Translating with GPT-4: the Latest & the Greatest

    GPT-4 achieves the highest average COMET score for English-to-Spanish translations. However, considering the confidence intervals (black ticks on the chart), this lead isn’t statistically significant, placing it in the same range as Google, GPT-3.5, Amazon, DeepL, Yandex, and Microsoft.

    [...]

    For General Domain texts in English to German, DeepL is still the best by far — however, GPT-4 has the highest scores in a group of the top-runners.

    [...]

    For German in the Legal and Healthcare domains, GPT-4 ends up being the best of the GPT engines; however, it still performs slightly worse than the first-tier engines.

    3 votes
  12. [2]
    nukeman
    Link

    A bit more of a side-topic: What exactly is the definition of AGI? We often compare animal intelligence to humans (e.g., the smartest dogs are about as intelligent as a toddler/young child), so where would AGI fall in the mix?

    A bit weirder, but could we get such a thing as a Stupid AGI?

    2 votes
    1. skybrian
      (edited )
      Link Parent

      It's not a crisply defined term, but "general intelligence" means the opposite of "narrow intelligence." An example of narrow intelligence would be playing chess or Atari games very well, but not being able to do anything else.

      By implication, it's about the original ambitions of AI, which were to build machines that can think like people can. And it's also used by people thinking about the implications of "real AI" that can do everything people can do, versus what we have now.

      But I expect that AGI ("artificial general intelligence") will soon become a vague marketing term (like AI is) because the AI chatbots are extremely general-purpose, though without being competent enough to rely on for many of the tasks that they can do well enough for a demo.

      2 votes
  13. lou
    (edited )
    Link

    Last night I dreamt that I was feeding Sherlock Holmes books to the robot from the 1986 film Short Circuit. Take that as you will.

    1 vote