18 votes

Evaluating GPT5's reasoning ability using the Only Connect game show

9 comments

  1. [5]
    IudexMiku
    Link

    I agree with some of the discussion on Hacker News. It seems unlikely that the models weren't trained on most of these questions already.

    It'd be interesting if these were performed weekly, as a season aired, to guarantee unseen questions.

    8 votes
    1. [4]
      Adys
      Link Parent

      It’s less about whether they were trained on it and more about whether they were explicitly trained on it.

      I think most people on HN have no idea of the difference between the two, and a lot of nuance is lost there.

      1 vote
      1. [3]
        IudexMiku
        Link Parent

        Would you mind explaining that distinction? I've only studied basic machine learning, so the difference is lost on me. I was under the impression it was a boolean trained vs not trained.

        2 votes
        1. [2]
          Adys
          Link Parent

          The reason many benchmarks become useless in AI is because as soon as they are released, the benchmark itself is included in the training as part of the process. So it's not just "Here's data containing puzzles and solutions", but straight up "Here, train on this until you consistently get good results".

          LLM training is not what people think it is. If the Only Connect games are part of the data the models were trained on (they may or may not be), then they are likely a tiny fraction of it: a microscopic part of a corpus of billions of pages, with little to zero impact on results unless the prompt is extremely close to the original wording of the questions the model saw.

          Training against benchmarks is done either by oversampling the benchmark data (which can negatively impact overall model quality), or through supervised fine-tuning / reinforcement learning (set up a test like we did and train the model to answer it better).
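          To make the oversampling point concrete, here is a toy sketch of what "including a benchmark in the training mix" means mechanically. All names and ratios below are made up for illustration; real pipelines operate on tokenized documents at vastly larger scale:

          ```python
          import random

          def build_training_mix(corpus, benchmark, oversample=50):
              """Sketch of benchmark oversampling: repeat each benchmark item
              `oversample` times so the model sees it far more often than any
              ordinary document in the corpus (ratio is illustrative)."""
              mix = list(corpus) + [item for item in benchmark for _ in range(oversample)]
              random.shuffle(mix)
              return mix

          # Hypothetical corpus: 1000 ordinary pages, 2 benchmark questions.
          corpus = [f"web_page_{i}" for i in range(1000)]
          benchmark = ["only_connect_q1", "only_connect_q2"]
          mix = build_training_mix(corpus, benchmark)
          # Each benchmark question now appears 50x more often than any single page.
          ```

          The contrast with incidental contamination is the ratio: a show transcript scraped once among billions of pages gets seen once; a benchmark deliberately oversampled like this gets drilled.
          
          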

          Candidly, I do not at all believe Only Connect was part of the direct training of any of the models we tested. It's not impossible, but given the results we've seen, it's extremely unlikely, and we genuinely couldn't find any evidence that it was. But HNers sometimes like to pretend they have knowledge about AI so they can get hired, so more often than not you get people commenting nonsense out of their ass :) An unfortunate part of working in this sector. It was always the case in tech, except that now the people who used to boast about MongoDB being web-scale, whose only harm was building bad Rails SaaS, are spouting nonsense about the most impactful advance of the century... /rant

          9 votes
          1. tanglisha
            Link Parent

            "people who used to boast about MongoDB being web-scale"

            Now that you say that I realize I haven't even heard the term "Web scale" in a really long time. I feel like it was mostly thrown down with the intention of it being a debate ender.

            2 votes
  2. [2]
    balooga
    Link

    The other day I was playing Pinpoint, one of LinkedIn's daily mobile puzzle games. With one guess left I was stumped so I decided to see how ChatGPT 5 handled it. Here's the conversation:

    Me:

    You are playing a game where you have to find the thing that several words have in common. It could be a common feature of all of the concepts described by the words, or it could be a feature of the words themselves (like that they fill in the blank in a common phrase, or something similar).

    These are the words:

    Cheese
    Guitar
    Wheat
    Lettuce
    Documents

    Think hard about your guess.

    ChatGPT:

    They can all be shredded.

    • shredded cheese
    • shredding on a guitar
    • Shredded Wheat (the cereal)
    • shredded lettuce
    • shredding documents

    Me:

    🏆

    ChatGPT:

    Nice! 🥳
    I’ll take my trophy and… shred it? For thematic consistency, of course.

    I mean, in hindsight maybe it was obvious but I had a mental block and couldn't get to the answer. GPT-5 nailed it. And the little joke at the end was cute... I actually laughed out loud.

    6 votes
    1. R3qn65
      Link Parent

      I wonder: based on how we think these models work (high-dimensional vectors representing semantics), this kind of game should be what they are absolutely best at. I wonder if that bears out in reality?

      6 votes
  3. Adys
    (edited )
    Link

    This is work I've been spearheading with my team the past few weeks, wanted to share it here as I know we have some Only Connect fans as well. Open to questions :)

    Discussion on HN: https://news.ycombinator.com/item?id=44876205

    5 votes
  4. skybrian
    Link

    Nice work!

    Nit: the use of color in the graphs was puzzling at first. They are grouped by model family: all the GPT-5 results are red, Claude is grey, and so on.

    2 votes