Evaluating GPT-5's reasoning ability using the Only Connect game show
Link information
- Title: Evaluating & Ranking GPT-5 Reasoning Ability
- Authors: Alberto Manzi
- Published: Aug 11, 2025
- Word count: 618 words
I agree with some of the discussion on Hacker News. It seems unlikely that the models weren't trained on most of these questions already.
It'd be interesting if these were performed weekly, as a season aired, to guarantee unseen questions.
It’s less about whether they were trained on it and more about whether they were explicitly trained on it.
I think most people on HN have no idea of the implications of that distinction, and a lot of nuance is lost there.
Would you mind explaining that distinction? I've only studied basic machine learning, so the difference is lost on me. I was under the impression it was a simple boolean: trained vs. not trained.
The reason many AI benchmarks become useless is that, as soon as they are released, the benchmark itself gets included in the training process. So it's not just "Here's data containing puzzles and solutions", but straight up "Here, train on this until you consistently get good results".
LLM training is not what people think it is. If the Only Connect games are part of the data the models were trained on (they may be, they may not), they are likely a tiny amount of it: a microscopic part of a corpus of billions of pages, with little to zero impact on results unless the prompt is extremely close to the original wording of the questions in the training data.
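To put "microscopic" in perspective, here's a rough back-of-envelope in Python. Every number below is a made-up assumption for illustration, not a measurement:

```python
# Back-of-envelope: how big a slice of a pretraining corpus could
# Only Connect plausibly be? All numbers are illustrative guesses.
oc_questions = 500          # rough guess at scraped questions
tokens_per_question = 60    # question + clues + answer
corpus_tokens = 15e12       # order of magnitude for a modern LLM corpus

oc_tokens = oc_questions * tokens_per_question
print(f"Only Connect share of corpus: {oc_tokens / corpus_tokens:.2e}")
# -> roughly 2e-09, i.e. a few parts per billion
```

At that share, memorizing a few exact-wording questions is plausible; a general boost on novel lateral-thinking puzzles is not.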
Training against benchmarks is done either with oversampling of the benchmark data (which can negatively impact the model quality), or through supervised fine-tuning / reinforcement learning (set up a test like we did and get the model to answer better).
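For the oversampling case, here's a minimal sketch of what building such a training mix might look like. The corpus contents, sizes, and repeat factor are all hypothetical:

```python
import random

# A minimal sketch of "training against a benchmark" via oversampling.
# The documents and the repeat factor are invented for illustration.
web_corpus = [f"web document {i}" for i in range(100_000)]
benchmark = [
    "Q: What connects Mercury, Apollo, Gemini? A: NASA programmes",
    "Q: What connects ...? A: ...",
]

OVERSAMPLE = 1_000  # each benchmark item is repeated 1,000x in the mix

training_mix = web_corpus + benchmark * OVERSAMPLE
random.shuffle(training_mix)  # interleave before training

# The benchmark now makes up ~2% of the mix instead of ~0.002%,
# which is what skews scores (and, as noted above, can hurt quality).
share = len(benchmark) * OVERSAMPLE / len(training_mix)
print(f"benchmark share of training mix: {share:.1%}")
```

The point is that deliberate repetition is what moves a benchmark from "somewhere in the corpus" to "something the model was optimized for".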
Candidly, I do not at all believe Only Connect was part of the direct training of any of the models we tested. It's not impossible, but given the results we've seen it's extremely unlikely, and we genuinely couldn't find any evidence that it was. But some HNers like to pretend they have knowledge about AI so they can get hired, so more often than not you get people commenting nonsense out of their ass :) An unfortunate part of working in this sector. It was always the case in tech, except that the people who used to boast about MongoDB being web-scale, whose only harm was building bad Rails SaaS, are now spouting nonsense about the most impactful advance of the century... /rant
Now that you say it, I realize I haven't heard the term "web scale" in a really long time. I feel like it was mostly thrown around as a debate-ender.
The other day I was playing Pinpoint, one of LinkedIn's daily mobile puzzle games. With one guess left I was stumped so I decided to see how ChatGPT 5 handled it. Here's the conversation:
[screenshots of the conversation with ChatGPT]
I mean, in hindsight maybe it was obvious but I had a mental block and couldn't get to the answer. GPT-5 nailed it. And the little joke at the end was cute... I actually laughed out loud.
Based on how we think these models work (high-dimensional vectors representing semantics), this kind of game should be exactly what they're best at. I wonder if that bears out in reality?
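One way to test that intuition outside an LLM: embed the clues and some candidate categories, and check whether plain cosine similarity already ranks the right link first. A minimal sketch using the sentence-transformers library; the puzzle, the candidates, and the model choice are all made up, and whether the ranking comes out right depends on the embedding model:

```python
from sentence_transformers import SentenceTransformer

# Sketch: can plain embedding similarity solve a Pinpoint-style
# "find the common category" puzzle? Puzzle and candidates are invented.
model = SentenceTransformer("all-MiniLM-L6-v2")

clues = ["bat", "plate", "diamond", "pitcher", "shortstop"]
candidates = ["baseball", "kitchenware", "geology", "vampires"]

# normalize_embeddings=True makes dot products equal cosine similarity
clue_vecs = model.encode(clues, normalize_embeddings=True)
cand_vecs = model.encode(candidates, normalize_embeddings=True)

# Score each candidate by its mean similarity to all clues: each clue
# alone is ambiguous (bat -> vampires, diamond -> geology), but the
# average should favor the category that links all of them.
scores = (cand_vecs @ clue_vecs.T).mean(axis=1)
for cand, s in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{cand}: {s:.3f}")
```

If the right category wins only on the averaged score and not on any single clue, that's exactly the "connection" behavior these games probe.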
This is work I've been spearheading with my team over the past few weeks; I wanted to share it here as I know we have some Only Connect fans as well. Open to questions :)
Discussion on HN: https://news.ycombinator.com/item?id=44876205
Nice work!
Nit: the use of color in the graphs was puzzling at first. They're colored by model family: all the GPT-5 results are red, Claude is grey, and so on.