27 votes

British AI startup beats humans in international forecasting

29 comments

  1. [22]
    lynxy
    Link
    Quoting Toby Shevlane, founder of ManticAI: "Some say LLMs just regurgitate their training data, but you can’t predict the future like that. [...] It requires genuine reasoning."

    Just... no. Literally the idea of these agents is to spit out the most likely next token given an input set, and predicting future events is to some extent an effort in extrapolation. If anything, this just lowers my confidence in the human ability to predict future events.

    I'm not exactly impressed by companies that are spun up to middleman AI: they're not making anything novel, they're just employing multiple existing backends to try to fudge a competition.

    29 votes
    1. [7]
      Greg
      Link Parent
      The guy behind the company seems to be for real, so I think this is one of the comparatively rare occasions where they’re more than just a quick cash-in wrapped around a ChatGPT API account.

      7 votes
      1. [3]
        lynxy
        Link Parent
        I'm not certain I understand Google Scholar attribution requirements, but the only articles in that list for which he is actually listed as a contributor are the ones on analysis of the danger of AI, and before that, the privacy implications for track-and-trace solutions? The two actually technical articles that might require some underlying understanding of LLMs to write have very extensive lists of attributions, but he's not in either list. Am I missing something?

        4 votes
        1. [2]
          Greg
          Link Parent
          Google Scholar shows a truncated list of authors, his name is present in the full list if you expand it on the original publication. You have actually stumbled across something that's a whole thing here, and needs a bit of context...

          There's some debate about the trend for tech industry scholarly publications to list 1,000+ authors, which a lot of the headline papers now do, but being part of that thousand is no small achievement. On the one hand, yeah, it does kind of obfuscate the significance of any individual contribution, and it's definitely influenced in part by internal politics (who in the field wouldn't want their name attached to the main publication for a major model, after all?). On the other hand, you don't get research at that scale done without a large team all doing meaningful work towards it, and unlike in academia there's far less likelihood that each subgroup within the team will be working on its own respective publications, so the single big paper with everyone on it is often the only way to really credit everyone who genuinely made the research happen.

          tl;dr he's an author on the big papers, but so are >1,000 other people. Most if not all of them probably did make real contributions, but it's hard to untangle how much is down to any one of them.

          10 votes
          1. lynxy
            Link Parent
            Ah, I see- thanks for the dive into a world I'm unfamiliar with :)

            2 votes
      2. [3]
        arghdos
        Link Parent
        Being an author on Gemini doesn’t make you “for real”; it makes you someone with an insane financial motivation for the general public to believe:

        Some say LLMs just regurgitate their training data, but you can’t predict the future like that. [...] It requires genuine reasoning.

        Further, the first quote of the article:

        Ben Shindel, one of the professional forecasters who found himself behind AI during the contest before finishing above Mantic. “We’ve really come a long way here compared with a year ago when the best bot was at something like rank 300.”

        is wrong based on the first damn sentence on the contest’s webpage:

        Congratulations to the winners of our first ever Metaculus Cup!

        Maybe he meant something else, no clue, but do we really need yet another free marketing puff piece for some AI startup from a journalist who can’t be bothered to read the web page for the contest they’re reporting on so that they know to ask a freaking clarifying question in paragraph two?

        https://www.metaculus.com/notebooks/39990/winners-of-the-summer-2025-metaculus-cup/

        1. unkz
          Link Parent
          Maybe he meant something else, no clue

          They have been doing a quarterly cup competition since 2023. This latest one is just a slightly different format (summer rather than Q1/Q2/Q3/Q4), so it’s the “first” of the new format. There’s absolutely nothing incorrect about any of the statements.

          https://www.metaculus.com/notebooks/17700/quarterly-cup-tournament-q1-2025/

          3 votes
        2. Greg
          Link Parent
          I've worked alongside people from DeepMind on a couple of occasions, and I can tell you with absolute certainty that they employ some of the most ridiculously intelligent and capable scientists I've ever met. They've got a goddamn Nobel Prize under their belt! So you can call this personal bias if you like, or you can call it direct experience, but I actually put a lot of weight on someone being an author on one of their papers.

          Like I said above, it's not a 100% guarantee of meaningful contribution when there are that many people listed and that many complex internal reasons governing who does and doesn't get a mention, and I'm sure not everyone at any large organisation is necessarily part of their A-team even on a good day. But in lieu of any other information it's enough for me to think they're probably actually working on some legitimate tech rather than just repackaging someone else's hosted LLM.

          But yeah, I get it. It is a marketing piece. Given how little I could see about the company itself (that's actually why I ended up looking for the guy's personal CV, because I wanted to know if they were legit and couldn't see anything but copies of this same story when I tried to dig into the company) it's maybe fair to call it a puff piece. And we're being absolutely flooded by buzzword-laden crap under the banner of "AI", so I get the skepticism and the irritation. All of that is why I think it's even more important to sift out the work that could potentially make an actual positive contribution from the crap that almost definitely won't.

          2 votes
    2. [3]
      unkz
      Link Parent
      If anything, this just lowers my confidence in the human ability to predict future events.

      This is like the AI effect in action.

      5 votes
      1. [2]
        lynxy
        Link Parent
        Please can you elaborate on how the AI effect relates to the original article, or my comment, depending on which you meant? I'm not certain I'm catching your meaning.

        2 votes
        1. unkz
          Link Parent
          I think the linked article puts it clearly enough

          The author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'."

          Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"

          Just because an LLM-associated application is doing it doesn’t mean the task is now somehow lesser or doesn’t involve “reasoning”.

          8 votes
    3. Lobachevsky
      (edited )
      Link Parent
      Literally the idea of these agents is to spit out the most likely next token given an input set, and predicting future events is somewhat an effort in extrapolation. If anything, this just lowers my confidence in the human ability to predict future events.

      Then either this isn't the whole story, or the human ability to "reason" is worse at these predictions than "literally the idea to spit out the most likely next token given an input set", and as such isn't as valuable (at these predictions).

      The real take is that finding patterns in the data has a LOT of very solid applications in a variety of fields and machine learning is seemingly an incredibly useful tool that excels at finding patterns in the data. Or at the very least exceeds human ability to do so.

      One of the most straightforward examples of a "dumb" algorithm outperforming human expert predictions is index funds in the stock market. So you don't need some spiritual ability to "reason" or be "intelligent" to get good results from accumulating data (one could argue that by accumulating data from many humans we are already accumulating their reasoning capabilities, and hopefully this way can exceed any individual's).

      Also this quote from the article:

      Warren Hatch, the chief executive of Good Judgment, a forecasting company co-founded by Tetlock, said: “We expect AI will excel in certain categories of questions, like monthly inflation rates. For categories with sparse data that require more judgment, humans retain the edge. The main point for us is that the answer isn’t human or AI, but instead human and AI to get the best forecast possible as quickly as possible.”

      5 votes
    4. teaearlgraycold
      Link Parent
      If humans actually could predict the future we'd have a lot more problems.

      3 votes
    5. [9]
      saturnV
      Link Parent
      the idea of these agents is to spit out the most likely next token given an input set,

      nitpick: with standard sampling and post-training techniques this isn't true

      1 vote
      1. [8]
        mordae
        Link Parent
        It must be, given there is only the model, the sequence so far, and white noise in the working set, and the output is the next token or two.

        1. unkz
          Link Parent
          I think there is some confusion out there stemming from

          • the true fact that an LLM produces a vector that a sampler can interpret as a probability distribution and
          • the mistaken idea that LLMs are effectively massive full context Markov chains that provide a probability distribution that matches the training corpus
          2 votes
        2. [6]
          saturnV
          Link Parent
          the model outputs a probability distribution over tokens, and (when the temp isn't 0) the sampler will often pick a token which isn't the most likely one, to prevent getting stuck in loops / enable "creativity" (creativity is a very loose analogy, don't take it too literally). Even when the temp is 0, models often aren't deterministic, for complicated reasons.
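          A minimal sketch of what a temperature-based sampler does (illustrative only; real inference stacks layer top-k/top-p filtering and batching on top of this):

          ```python
          import math
          import random

          def sample_next_token(logits, temperature=1.0):
              """Pick a token index from a logit vector.

              At temperature 0 this is greedy (most-likely-token) decoding;
              at temperature > 0, less likely tokens can also be chosen.
              """
              if temperature == 0:
                  return max(range(len(logits)), key=lambda i: logits[i])
              # Scale logits by temperature, then softmax into probabilities.
              scaled = [x / temperature for x in logits]
              m = max(scaled)  # subtract the max for numerical stability
              exps = [math.exp(s - m) for s in scaled]
              total = sum(exps)
              probs = [e / total for e in exps]
              # Draw one token index according to that distribution.
              return random.choices(range(len(probs)), weights=probs)[0]

          logits = [2.0, 1.0, 0.1]
          print(sample_next_token(logits, temperature=0))    # always 0 (greedy)
          print(sample_next_token(logits, temperature=1.0))  # usually 0, but not always
          ```

          Raising the temperature flattens the distribution and lowering it sharpens it toward the argmax, which is why "spit out the most likely next token" is only literally true at temperature 0.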

          1 vote
          1. [5]
            mordae
            Link Parent
            So what's your argument? Sampler output is still just a function of input, model and noise.

            What complicated reasons? Correlated noise from some kind of race condition? Memory errors?

            1. [2]
              papasquat
              Link Parent
              It's mostly floating point errors

              1. mordae
                Link Parent
                Those are completely deterministic and repeatable, unless you somehow reorder the operations.
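
                The reordering caveat is doing a lot of work there, though: individual floating point operations are deterministic, but addition is not associative, so a parallel reduction that happens to sum in a different order (common on GPUs with nondeterministic kernel scheduling) can legitimately change the result. A tiny self-contained demonstration:

                ```python
                # Each float op is deterministic, but addition is not associative,
                # so the same three values summed in two orders disagree.
                a, b, c = 1e16, -1e16, 1.0

                left = (a + b) + c   # (0.0) + 1.0 -> 1.0
                right = a + (b + c)  # 1.0 is absorbed into -1e16, so -> 0.0

                print(left, right)  # 1.0 0.0
                ```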

            2. [2]
              saturnV
              Link Parent
              my argument is that "most likely" implies determinism in terms of the text alone; saying it's deterministic as a function of the text plus a source of randomness is basically vacuous

              1. mordae
                Link Parent
                The random bits are part of the input set for the computation. There is no creativity, just a bit of whimsy.

  2. [3]
    unkz
    Link
    I think the main advantage here is being able to look deeply into 60 unrelated topics without getting just plain bored. I’m sure a “mixture of experts” of actual human experts will still outperform for much longer. Interesting stuff though.

    6 votes
    1. [2]
      creesch
      (edited )
      Link Parent
      It's also important to note that while the title seems to imply that it ranked first, that is not the case. It ranked eighth in a competition of 300 competitors, which is still good, but not outperforming all humans.

      15 votes
      1. umbrae
        Link Parent
        I’m generally an optimist in these things, but I was also curious, and couldn’t find, how many AI entrants there were. If 150 of the entrants were AI, for example, this would probably be a less impressive achievement.

        2 votes
  3. [2]
    scarecrw
    Link
    Anyone familiar with the competition know more about the scoring? I'm curious if they had any system to allow contestants to weight their confidence in their predictions.

    3 votes
    1. TemulentTeatotaler
      Link Parent
      Here's their scoring FAQ if you'd like to check it out? More involved than I'd feel like giving an opinion on, but a quick look suggests it does penalize overconfidence:

      One interesting property of the log score: it is much more punitive of extreme wrong predictions than it is rewarding of extreme right predictions.

      It also has scores for things like performance relative to peers and "coverage" (how early you were / the span for which your prediction was correct).
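
      To illustrate that asymmetry with toy numbers (my own sketch of a binary log score, not Metaculus's exact formula): the score is log(p) when the event happens and log(1 - p) when it doesn't, so a confident wrong forecast costs far more than a confident right one earns.

      ```python
      import math

      def log_score(p, outcome):
          """Binary log score: higher is better, with a maximum of 0."""
          return math.log(p) if outcome else math.log(1 - p)

      # Against a "no information" 50% baseline of log(0.5), about -0.693:
      print(log_score(0.99, True))   # about -0.01: a confident hit gains little
      print(log_score(0.99, False))  # about -4.61: a confident miss is punished hard
      ```

      Pushing a probability toward 0 or 1 is therefore only worth it when you are very sure, which is exactly the overconfidence penalty described above.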

      5 votes
  4. [2]
    tanglisha
    Link
    Gah, I was really hoping this was about improving weather forecasting.

    1 vote