-
30 votes
-
The ARC-AGI-2 benchmark could help reframe the conversation about AI performance in a more constructive way
The popular online discourse on Large Language Models’ (LLMs’) capabilities is often polarized in a way I find annoying and tiresome.
On one end of the spectrum, there is nearly complete dismissal of LLMs: an LLM is just a slightly fancier version of the autocomplete on your phone’s keyboard, there’s nothing to see here, move on (dot org).
This dismissive perspective overlooks some genuinely interesting novel capabilities of LLMs. For example, I can come up with a new joke and ask ChatGPT to explain why it’s funny or come up with a new reasoning problem and ask ChatGPT to solve it. My phone’s keyboard can’t do that.
On the other end of the spectrum, there are eschatological predictions: human-level or superhuman artificial general intelligence (AGI) will likely be developed within 10 years or even within 5 years, and skepticism toward such predictions is “AI denialism”, analogous to climate change denial. Just listen to the experts!
There are inconvenient facts for this narrative: when asked in surveys, the majority of AI experts give much more conservative timelines for AGI and disagree with the idea that scaling up LLMs could lead to AGI.
The ARC Prize is an attempt by prominent AI researcher François Chollet (with help from Mike Knoop, who apparently does AI stuff at Zapier) to introduce some scientific rigour into the conversation. There is a monetary prize for open-source AI systems that can perform well on a benchmark called ARC-AGI-2, which recently superseded the ARC-AGI benchmark. (“ARC” stands for “Abstraction and Reasoning Corpus”.)
ARC-AGI-2 is not a test of whether an AI is an AGI or not. It’s intended to test whether AI systems are making incremental progress toward AGI. The tasks the AI is asked to complete are colour-coded visual puzzles like you might find in a tricky puzzle game. (Example.) The intention is to design tasks that are easy for humans to solve and hard for AI to solve.
The current frontier AI models score less than 5% on ARC-AGI-2. Humans score 60% on average, and 100% of tasks have been solved by at least two humans in two attempts or fewer.
For me, this helps the conversation about AI capabilities because it gives a rigorous test and quantitative measure to my casual, subjective observations that LLMs routinely fail at tasks that are easy for humans.
François Chollet was impressed when OpenAI’s o3 model scored 75.7% on ARC-AGI (the older version of the benchmark). He emphasizes the concept of “fluid intelligence”, which he seems to define as the ability to adapt to new situations and solve novel problems. Chollet thinks that o3 is the first AI system to demonstrate fluid intelligence, although it’s still a low level of fluid intelligence. (o3 also required thousands of dollars’ worth of computation to achieve this result.)
This is the sort of distinction that can’t be teased out by the polarized popular discourse. It’s the sort of nuanced analysis I’ve been seeking out, but which has been drowned out by extreme positions on LLMs that ignore inconvenient facts.
I would like to see more benchmarks that try to do what ARC-AGI-2 does: find problems that humans can easily solve and frontier AI models can’t solve. These sorts of benchmarks can help us measure AGI progress much more usefully than the typical benchmarks, which play to LLMs’ strengths (e.g. massive-scale memorization) and don’t challenge them on their weaknesses (e.g. reasoning).
I long to see AGI within my lifetime. But the super short timeframes given by some people in the AI industry feel to me like they border on mania or psychosis. The discussion is unrigorous, with people pulling numbers out of thin air based on gut feeling.
It’s clear that there are many things humans are good at doing that AI can’t do at all (where the humans vs. AI success rate is ~100% vs. ~0%). It serves no constructive purpose to ignore this truth and it may serve AI research to develop rigorous benchmarks around it.
Such benchmarks will at least improve the quality of discussion around AI capabilities, insofar as people pay attention to them.
Update (2024-04-11 at 19:16 UTC): François Chollet has a new 20-minute talk on YouTube that I recommend. I've watched a few videos of Chollet talking about ARC-AGI or ARC-AGI-2, and this one is beautifully succinct: https://www.youtube.com/watch?v=TWHezX43I-4
10 votes -
The rise and fall of "The Resistance"
3 votes -
Amy Hakanson shows us the sixteen stringed, thirty-nine keyed nyckelharpa
6 votes -
Tip to tip: Crossing Japan with no map
21 votes -
Five Nights at Freddy’s 2 | Official teaser
3 votes -
How a simple tractor conquered the South Pole
7 votes -
Books are the new luxury
5 votes -
Nintendo Direct: Nintendo Switch 2 – 4.2.2025
52 votes -
Helios Welding Visualization System uses intense pulsed lighting to outshine electric arcs, laser cutters, and even burning magnesium
3 votes -
Henry Kissinger's Moo Goo Gai Pan is real. Is it good?
6 votes -
Spontaneous fractals appear when you pull things apart. Viscous fingering (Saffman–Taylor instability) occurs when a less viscous fluid is pushed into a more viscous fluid.
18 votes -
Counter-Strike: Football — a competitive multiplayer FPS written in... PHP???
6 votes -
Eddy Burback chronicles his month without a phone
22 votes -
The loneliest NPC in Super Mario Odyssey
12 votes -
Bloodred Hourglass – We Should Be Buried Like This (2025)
6 votes -
Blizzard reportedly receiving new StarCraft game pitches from well-known Korean developers
9 votes -
Only fifty-six people have beaten this Pokémon game
17 votes -
What games have you been playing, and what's your opinion on them?
What have you been playing lately? Discussions of video games and board games are both welcome. Please don't just make a list of titles; give some thoughts about the game(s) as well.
18 votes -
Sunna Margrét – Come With Me [Live on KEXP] (2024)
4 votes -
Framing Godland
3 votes -
Do you have games that you play (almost) exclusively?
I was reading the recent post about strategy games, and I'm still astonished to see for how many hours (at least hundreds, often 1000+) people are playing these. I'm guessing that in these cases, all your gaming time is exclusively taken by that single game.
So, do you have (or did you have) games, or series, like that? Do you play solo or multi? What compels you to spend so much time on a single game? How do you feel about it?
39 votes -
Hi everyone! Wanted to share my electronics workbench.
10 votes -
Nestor – Caroline (2024)
3 votes -
Could you rearm Europe without US weapons? - Equipping a unified European military (April 1 special)
9 votes -
Live-action League of Legends series reportedly underway; Vietnam considered as a filming location
6 votes -
Skraeckoedlan – The Vermillion Sky (2023)
2 votes -
Rick Astley - Pink Pony Club (Chappell Roan cover, 2025)
35 votes -
Joe Edelman: "Is anything worth maximizing?", a talk about how tech platforms optimize for metrics
Video: https://www.youtube.com/watch?v=GyVHrGLiTcc (46m20s)
Transcript: https://medium.com/what-to-build/is-anything-worth-maximizing-d11e648eb56f (10,314 words with footnotes and references)
Excerpt:
...for simple maximizers, its choices are just about numbers. That means its choices are in the numbers. Here, the choice between two desserts is just a choice between numbers. We could say its choice is already made. And that it has no responsibility, since it’s just following what the numbers say.
Reason-based maximizers don’t just see numbers, though, they also see values. Here, there’s a choice between two desserts — but it isn’t a choice between two numbers. See, it’s also a choice between two values. One option means being a seize-the-day, intensity kind of person. The other means being a foody, aristocratic, elegance kind of person.
My personal thoughts about this talk: it's a kind of strange, kind of dubious philosophical and multi-disciplinary reflection on metrics for organizations, especially metrics for tech companies, and on the pitfalls of optimizing for metrics in what the speaker argues is too "simple" a way.
I don't entirely trust the speaker or the argument, but there was enough in the talk to stimulate curiosity and reflection that I thought it was worth watching.
18 votes -
Horndal – Blacklisted (2024)
4 votes -
‘Legend of Zelda’ film sets March 2027 theatrical release from Sony Pictures
23 votes -
How to be more productive
6 votes -
Dubai Creek Tower | Abandoned
3 votes -
What game invented jumping on enemies?
16 votes -
Do you take inventory of your hobbies and projects?
Most of my time in any given day is spent sleeping (eight hours), working (nine hours, plus another one or two for commuting), chores (maintaining the home, personal hygiene, etc.), and spending time with my wife (and occasionally with friends and family).
This means that I don’t have a lot of “spare time”. I maybe get one or two hours a day, and a few more on Saturdays and Sundays.
I often feel anxious and depressed about this inescapable reality. I have a lot of projects and hobbies that I would like to fill my spare time with, but not enough time for all of them.
Years ago, I began to try to reframe the circumstances of my life in my mind in order to prevent a complete mental collapse. I tell myself that this life is finite, that I will never be able to have all the experiences that I would like to, and that’s OK. I can live with that reality. Instead, I should focus my energy on the projects and hobbies that I absolutely do not want to miss out on.
I still struggle to stick to just a few of those, because there are so many (especially creative) activities that I enjoy. I regularly go through cycles of taking on too many of these, then becoming overwhelmed because I don’t have enough time for each, then cutting out most of them to focus on the ones that I want to prioritize, and repeating the cycle.
Today, I have reached the part of that cycle where I will cut some of them out.
Whenever I do that, it really helps me to take inventory of what those activities are, so that I can stay focused, and delay taking on more or new ones until I am satisfied with where I got with my current ones.
So, here are the projects and hobbies that I want to spend my spare time on, starting today:
- Reading one hour every morning (been diligently doing that since January 1). Two books I am reading through the year. A third book I read as much as I have time left (have read more than ten this year already). I also occasionally read some blogs on Bear Blog.
- Writing on two blogs (one daily, one occasionally), as well as writing my book.
- Occasionally chatting on a forum, Tildes, and four Discord guilds.
- Taking one daily walk while listening to a podcast.
- Occasionally watching YouTube videos (I am—coincidentally—subscribed to exactly 50 channels, almost all of which have an upload schedule of one video every other week or slower).
What are your activities?
Side notes: The list above is a summary. My list is a lot more precise, to help me focus. Also, I’m currently unemployed, but before I quit my last job, I had actually been working almost without interruption for several years. My day-to-day routine back then was exactly as I described it at the beginning of this post.
19 votes -
The secrets of Minecraft's old textures
10 votes -
We need to be more tech critical
6 votes -
Beverly Kills – Marigold (2024)
2 votes -
Stremio is an impressive program
This post will talk about piracy. I won't provide any links or direct instructions. That said, if a mod or admin thinks there is something inappropriate about talking about that stuff, feel free to mention this in the comments and I will remove any inappropriate details as soon as I can.
Like many Latin Americans, I am a long-term pirate. I have pirated stuff with floppy disks, with CD-ROMs, through IRC, FTP, Kazaa, Napster, Soulseek, websites, and torrents. I have also purchased illegal copies of media from street vendors. The whole idea of traditional piracy is to get the files I want for me to own, which is why I made a Plex server for myself.
Stremio is a challenge to all of this. It is much easier to set up than Plex and requires basically no maintenance. It is a program that allows me to stream video content from a variety of sources, legal or illegal. It took less than 30 minutes to set it up on my computer, and I know that it is available for both of my TVs. I am using it with the Torrentio addon.
Stremio changed my viewing habits in much the same way paid streaming services did. I am more spontaneous in my choices. I have watched Doctor Who from 2005, ER, Tiny Toon Adventures, Animaniacs, The Twilight Zone (original), The Magicians, Blackadder, and Falling Skies (alien TV show with Noah Wyle!). Playback sometimes takes a little while to start, but when it does it rarely stutters, even on old or less popular shows. A paid debrid service should improve on that. I am now considering removing most of our extremely expensive paid streaming services and replacing them with Stremio. Money is tight and, when added up, they make quite a dent in our budget!
One bad thing about Stremio is that it is basically a leech. It does not seed the torrents. I am considering getting Real Debrid as it seemingly reduces the strain on torrents via caching.
Right now, my only concern with changing everything to Stremio is that my wife will probably dislike choosing between multiple sources for an episode, and some episodes come with bad subtitles. That would require minimal effort to solve, but might still be too much for her.
Anyway, I am very impressed by Stremio. It is so good, in fact, that I am half-jokingly worried about the police knocking on my door.
Just kidding, that doesn't happen around here.
66 votes -
Inside Brazil's Belo Horizonte’s food scene (Anthony Bourdain)
10 votes -
The origins of Dwarf Fortress (episode one)
30 votes -
The best game animation of 2024
16 votes -
Race against the regime: The 1936 Olympics, and the Nazi rise to power
7 votes -
Nintendo’s new system for sharing digital Switch games, explained
14 votes -
Black Flower - Particles (ABSession) (2025)
4 votes -
Resilience - Animated short film
5 votes -
Ambient Techno To Zen Out To (2025)
5 votes -
Metroid Prime 4: Beyond – Nintendo Direct 3.27.2025
26 votes -
Virtual Game Card – Nintendo Direct 3.27.2025
20 votes -
Yukimi – Break Me Down (2024)
4 votes