-
4 votes
-
Candlemass – Black Star (2025)
5 votes -
Nintendo President on the new Switch 2, tariffs and what's next for the company
17 votes -
Lights - SURFACE TENSION (2025)
5 votes -
They Might Be Giants - Hovering Sombrero (2001)
5 votes -
Voces Primaverae Childrens Chorus - Lalechet Șevi Acharaich (1992)
6 votes -
The Man-Eating Tree – Seer (2025)
4 votes -
Explaining the Donald Trump tariff in the US
18 votes -
What games have you been playing, and what's your opinion on them?
What have you been playing lately? Discussion of both video games and board games is welcome. Please don't just make a list of titles; give some thoughts about the game(s) as well.
30 votes -
‘A Quiet Place: Day One’ helmer Michael Sarnoski to direct adaptation of popular video game ‘Death Stranding’ for A24 and Kojima Productions
25 votes -
The new US tariffs - weird formulas, risks, and the coming trade war
34 votes -
Mapping Earth, one billion years ago
18 votes -
‘A Minecraft Movie’ at $157m a record opening for videogame pic, toppling ‘Super Mario Bros’; Warner Bros brings the box office back alive
30 votes -
Why US President Donald Trump's tariff chaos actually makes sense (big picture)
13 votes -
Why the island of Bornholm is Danish and not German, Swedish or Polish
7 votes -
How The Beverly Hillbillies changed everything - a retrospective
8 votes -
Volbeat – In the Barn of the Goat Giving Birth to Satan's Spawn in a Dying World of Doom (2025)
15 votes -
Feed Me - One Click Headshot (Grafix Remix) (2012, 2024)
7 votes -
Aerosols: Airborne particles in Earth's atmosphere (2012)
4 votes -
Is Swedish striker Viktor Gyökeres the right fit for Arsenal in the Premier League?
6 votes -
Can you beat Donkey Kong Country without bananas? | VG Myths
7 votes -
I built a fire pit with a hidden cold plunge inside
6 votes -
Heart Aerospace has just revealed its X1 demonstrator aircraft – thirty-seater commercial electric airplane with hybrid capabilities
6 votes -
Arch Enemy – A Million Suns (2025)
4 votes -
Cocoricó - The Story of the Poop (2006)
5 votes -
BABYMETAL - from me to u (feat. Poppy) (2025)
11 votes -
Playstacean - Custom crab shaped PlayStation build
12 votes -
The Hives – Enough Is Enough (2025)
8 votes -
Milkywhale – Breathe In (2025)
3 votes -
Igorrr - ADHD (2025)
19 votes -
Chappell Roan - Pink Pony Club (Live from the 67th Grammy Awards, 2025)
6 votes -
Skrillex - FUCK U SKRILLEX YOU THINK UR ANDY WARHOL BUT UR NOT!! <3 (2025)
30 votes -
The ARC-AGI-2 benchmark could help reframe the conversation about AI performance in a more constructive way
The popular online discourse on Large Language Models’ (LLMs’) capabilities is often polarized in a way I find annoying and tiresome.
On one end of the spectrum, there is nearly complete dismissal of LLMs: an LLM is just a slightly fancier version of the autocomplete on your phone’s keyboard, there’s nothing to see here, move on (dot org).
This dismissive perspective overlooks some genuinely interesting novel capabilities of LLMs. For example, I can come up with a new joke and ask ChatGPT to explain why it’s funny or come up with a new reasoning problem and ask ChatGPT to solve it. My phone’s keyboard can’t do that.
On the other end of the spectrum, there are eschatological predictions: human-level or superhuman artificial general intelligence (AGI) will likely be developed within 10 years or even within 5 years, and skepticism toward such predictions is “AI denialism”, analogous to climate change denial. Just listen to the experts!
There are inconvenient facts for this narrative: in surveys, the majority of AI experts give much more conservative timelines for AGI and disagree with the idea that scaling up LLMs could lead to AGI.
The ARC Prize is an attempt by prominent AI researcher François Chollet (with help from Mike Knoop, who apparently does AI stuff at Zapier) to introduce some scientific rigour into the conversation. There is a monetary prize for open source AI systems that can perform well on a benchmark called ARC-AGI-2, which recently superseded the ARC-AGI benchmark. (“ARC” stands for “Abstraction and Reasoning Corpus”.)
ARC-AGI-2 is not a test of whether an AI is an AGI or not. It’s intended to test whether AI systems are making incremental progress toward AGI. The tasks the AI is asked to complete are colour-coded visual puzzles like you might find in a tricky puzzle game. (Example.) The intention is to design tasks that are easy for humans to solve and hard for AI to solve.
The current frontier AI models score less than 5% on ARC-AGI-2. Humans score 60% on average, and 100% of tasks have been solved by at least two humans in two attempts or fewer.
For me, this helps the conversation about AI capabilities because it gives a rigorous test and quantitative measure to my casual, subjective observations that LLMs routinely fail at tasks that are easy for humans.
François Chollet was impressed when OpenAI’s o3 model scored 75.7% on ARC-AGI (the older version of the benchmark). He emphasizes the concept of “fluid intelligence”, which he seems to define as the ability to adapt to new situations and solve novel problems. Chollet thinks that o3 is the first AI system to demonstrate fluid intelligence, although it’s still a low level of fluid intelligence. (o3 also required thousands of dollars’ worth of computation to achieve this result.)
This is the sort of distinction that can’t be teased out by the polarized popular discourse. It’s the sort of nuanced analysis I’ve been seeking out, but which has been drowned out by extreme positions on LLMs that ignore inconvenient facts.
I would like to see more benchmarks that try to do what ARC-AGI-2 does: find problems that humans can easily solve and frontier AI models can’t solve. These sorts of benchmarks can help us measure AGI progress much more usefully than the typical benchmarks, which play to LLMs’ strengths (e.g. massive-scale memorization) and don’t challenge them on their weaknesses (e.g. reasoning).
I long to see AGI within my lifetime. But the super short timeframes given by some people in the AI industry feel to me like they border on mania or psychosis. The discussion is unrigorous, with people pulling numbers out of thin air based on gut feeling.
It’s clear that there are many things humans are good at doing that AI can’t do at all (where the humans vs. AI success rate is ~100% vs. ~0%). It serves no constructive purpose to ignore this truth and it may serve AI research to develop rigorous benchmarks around it.
Such benchmarks will at least improve the quality of discussion around AI capabilities, insofar as people pay attention to them.
Update (2025-04-11 at 19:16 UTC): François Chollet has a new 20-minute talk on YouTube that I recommend. I've watched a few videos of Chollet talking about ARC-AGI or ARC-AGI-2, and this one is beautifully succinct: https://www.youtube.com/watch?v=TWHezX43I-4
10 votes -
The rise and fall of "The Resistance"
3 votes -
Amy Hakanson shows us the sixteen stringed, thirty-nine keyed nyckelharpa
6 votes -
Tip to tip: Crossing Japan with no map
21 votes -
Five Nights at Freddy’s 2 | Official teaser
3 votes -
How a simple tractor conquered the South Pole
7 votes -
Books are the new luxury
5 votes -
Nintendo Direct: Nintendo Switch 2 – 4.2.2025
52 votes -
Helios Welding Visualization System uses intense pulsed lighting to outshine electric arcs, laser cutters, and even burning magnesium
3 votes -
Henry Kissinger's Moo Goo Gai Pan is real. Is it good?
6 votes -
Spontaneous fractals appear when you pull things apart. Viscous fingering (Saffman–Taylor instability) occurs when a less viscous fluid is pushed into a more viscous fluid.
18 votes -
Counter-Strike: Football — a competitive multiplayer FPS written in... PHP???
6 votes -
Eddy Burback chronicles his month without a phone
22 votes -
The loneliest NPC in Super Mario Odyssey
12 votes -
Bloodred Hourglass – We Should Be Buried Like This (2025)
6 votes -
Blizzard reportedly receiving new StarCraft game pitches from well-known Korean developers
9 votes -
Only fifty-six people have beaten this Pokémon game
17 votes -
What games have you been playing, and what's your opinion on them?
What have you been playing lately? Discussion of both video games and board games is welcome. Please don't just make a list of titles; give some thoughts about the game(s) as well.
18 votes