Evaluating GPT-5's reasoning ability using the Only Connect game show
Link information
- Title: Evaluating & Ranking GPT-5 Reasoning Ability
- Authors: Alberto Manzi
- Published: Aug 11, 2025
- Word count: 618 words
I agree with some of the discussion on Hacker News. It seems unlikely that the models weren't trained on most of these questions already.
It'd be interesting if these were performed weekly, as a season aired, to guarantee unseen questions.
It’s less about whether they were trained on it and more about whether they were explicitly trained on it.
I think most people on HN have no idea of the implications of that distinction, and a lot of nuance is lost there.
Would you mind explaining that distinction? I've only studied basic machine learning, so the difference is lost on me. I was under the impression it was a simple boolean: trained vs. not trained.
The reason many AI benchmarks become useless is that, as soon as they are released, the benchmark itself gets included in the training process. So it's not just "Here's data containing puzzles and solutions", but straight up "Here, train on this until you consistently get good results".
LLM training is not what people think it is. If the Only Connect games are part of the data the models were trained on (they may be, they may not), they are likely a tiny amount of it: a microscopic part of a corpus of billions of pages, with little to zero impact on results unless the prompt is extremely close to the original wording of the questions in the training data.
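To put "microscopic" in perspective, here's a rough back-of-envelope in Python. Every number below is a made-up assumption for illustration, not a measurement:

```python
# Back-of-envelope: how big a slice of a pretraining corpus could
# Only Connect plausibly be? All numbers are illustrative guesses.
oc_questions = 500          # rough guess at scraped questions
tokens_per_question = 60    # question + clues + answer
corpus_tokens = 15e12       # order of magnitude for a modern LLM corpus

oc_tokens = oc_questions * tokens_per_question
print(f"Only Connect share of corpus: {oc_tokens / corpus_tokens:.2e}")
# -> roughly 2e-09, i.e. a few parts per billion
```

At that share, memorizing a few exact-wording questions is plausible; a general boost on novel lateral-thinking puzzles is not.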
Training against benchmarks is done either with oversampling of the benchmark data (which can negatively impact the model quality), or through supervised fine-tuning / reinforcement learning (set up a test like we did and get the model to answer better).
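For the oversampling case, here's a minimal sketch of what building such a training mix might look like. The corpus contents, sizes, and repeat factor are all hypothetical:

```python
import random

# A minimal sketch of "training against a benchmark" via oversampling.
# The documents and the repeat factor are invented for illustration.
web_corpus = [f"web document {i}" for i in range(100_000)]
benchmark = [
    "Q: What connects Mercury, Apollo, Gemini? A: NASA programmes",
    "Q: What connects ...? A: ...",
]

OVERSAMPLE = 1_000  # each benchmark item is repeated 1,000x in the mix

training_mix = web_corpus + benchmark * OVERSAMPLE
random.shuffle(training_mix)  # interleave before training

# The benchmark now makes up ~2% of the mix instead of ~0.002%,
# which is what skews scores (and, as noted above, can hurt quality).
share = len(benchmark) * OVERSAMPLE / len(training_mix)
print(f"benchmark share of training mix: {share:.1%}")
```

The point is that deliberate repetition is what moves a benchmark from "somewhere in the corpus" to "something the model was optimized for".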
Candidly, I do not at all believe Only Connect was part of the direct training of any of the models we tested. It's not impossible, but given the results we've seen it's extremely unlikely, and we genuinely couldn't find any evidence that it was. But some HNers like to pretend they have knowledge about AI so they can get hired, so more often than not you get people commenting nonsense out of their ass :) An unfortunate part of working in this sector. It was always the case in tech, except that the people who used to boast about MongoDB being web-scale, whose only harm was building bad Rails SaaS, are now spouting nonsense about the most impactful advance of the century... /rant
Now that you say it, I realize I haven't heard the term "web scale" in a really long time. I feel like it was mostly thrown around as a debate-ender.
The other day I was playing Pinpoint, one of LinkedIn's daily mobile puzzle games. With one guess left I was stumped so I decided to see how ChatGPT 5 handled it. Here's the conversation:
[screenshots of the conversation with ChatGPT]
I mean, in hindsight maybe it was obvious but I had a mental block and couldn't get to the answer. GPT-5 nailed it. And the little joke at the end was cute... I actually laughed out loud.
Based on how we think these models work (high-dimensional vectors representing semantics), this kind of game should be exactly what they're best at. I wonder if that bears out in reality?
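One way to test that intuition outside an LLM: embed the clues and some candidate categories, and check whether plain cosine similarity already ranks the right link first. A minimal sketch using the sentence-transformers library; the puzzle, the candidates, and the model choice are all made up, and whether the ranking comes out right depends on the embedding model:

```python
from sentence_transformers import SentenceTransformer

# Sketch: can plain embedding similarity solve a Pinpoint-style
# "find the common category" puzzle? Puzzle and candidates are invented.
model = SentenceTransformer("all-MiniLM-L6-v2")

clues = ["bat", "plate", "diamond", "pitcher", "shortstop"]
candidates = ["baseball", "kitchenware", "geology", "vampires"]

# normalize_embeddings=True makes dot products equal cosine similarity
clue_vecs = model.encode(clues, normalize_embeddings=True)
cand_vecs = model.encode(candidates, normalize_embeddings=True)

# Score each candidate by its mean similarity to all clues: each clue
# alone is ambiguous (bat -> vampires, diamond -> geology), but the
# average should favor the category that links all of them.
scores = (cand_vecs @ clue_vecs.T).mean(axis=1)
for cand, s in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{cand}: {s:.3f}")
```

If the right category wins only on the averaged score and not on any single clue, that's exactly the "connection" behavior these games probe.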
This is work I've been spearheading with my team over the past few weeks; I wanted to share it here as I know we have some Only Connect fans as well. Open to questions :)
Discussion on HN: https://news.ycombinator.com/item?id=44876205
Nice work!
Nit: the use of color in the graphs was puzzling at first. They're colored by model family: all the GPT-5 results are red, Claude is grey, and so on.