Showing only topics with the tag "language models.large".

The ARC-AGI-2 benchmark could help reframe the conversation about AI performance in a more constructive way

~tech Ask

The popular online discourse on Large Language Models’ (LLMs’) capabilities is often polarized in a way I find annoying and tiresome.

On one end of the spectrum, there is nearly complete dismissal of LLMs: an LLM is just a slightly fancier version of the autocomplete on your phone’s keyboard, there’s nothing to see here, move on (dot org).

This dismissive perspective overlooks some genuinely interesting novel capabilities of LLMs. For example, I can come up with a new joke and ask ChatGPT to explain why it’s funny or come up with a new reasoning problem and ask ChatGPT to solve it. My phone’s keyboard can’t do that.

On the other end of the spectrum, there are eschatological predictions: human-level or superhuman artificial general intelligence (AGI) will likely be developed within 10 years or even within 5 years, and skepticism toward such predictions is “AI denialism”, analogous to climate change denial. Just listen to the experts!

This narrative has to contend with some inconvenient facts: when asked in surveys, the majority of AI experts give much more conservative timelines for AGI and disagree with the idea that scaling up LLMs could lead to AGI.

The ARC Prize is an attempt by prominent AI researcher François Chollet (with help from Mike Knoop, who apparently does AI stuff at Zapier) to introduce some scientific rigour into the conversation. There is a monetary prize for open source AI systems that can perform well on a benchmark called ARC-AGI-2, which recently superseded the ARC-AGI benchmark. (“ARC” stands for “Abstraction and Reasoning Corpus”.)

ARC-AGI-2 is not a test of whether an AI is an AGI or not. It’s intended to test whether AI systems are making incremental progress toward AGI. The tasks the AI is asked to complete are colour-coded visual puzzles like you might find in a tricky puzzle game. (Example.) The intention is to design tasks that are easy for humans to solve and hard for AI to solve.

The current frontier AI models score less than 5% on ARC-AGI-2. Humans score 60% on average, and 100% of tasks have been solved by at least two humans in two attempts or fewer.
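
To make those percentages concrete, here is a minimal sketch of how a score like this could be computed. It assumes each puzzle is a grid of integer colour codes and that a task only counts as solved if one of at most two attempts matches the expected grid exactly. The function names and the tiny example grids are made up for illustration; this is not the ARC Prize's actual tooling.

```javascript
// Hypothetical sketch: a grid is an array of rows, each cell an integer colour code.
function gridsEqual(a, b) {
  if (a.length !== b.length) return false;
  return a.every(
    (row, r) => row.length === b[r].length && row.every((cell, c) => cell === b[r][c])
  );
}

// A task counts as solved if any of the first `maxAttempts` guesses matches exactly.
// (Exact matching is an assumption; the two-attempt limit comes from the post.)
function isSolved(expected, attempts, maxAttempts = 2) {
  return attempts.slice(0, maxAttempts).some((guess) => gridsEqual(guess, expected));
}

// Score = percentage of tasks solved.
function scorePercent(results) {
  if (results.length === 0) return 0;
  const solved = results.filter(({ expected, attempts }) => isSolved(expected, attempts)).length;
  return (100 * solved) / results.length;
}

// Made-up example data, not real benchmark tasks.
const results = [
  // Solved on the second attempt.
  { expected: [[1, 0], [0, 1]], attempts: [[[0, 0], [0, 1]], [[1, 0], [0, 1]]] },
  // Correct only on a third attempt, which does not count.
  { expected: [[2, 2]], attempts: [[[2, 3]], [[3, 2]], [[2, 2]]] },
];

console.log(scorePercent(results)); // 50
```

Under these assumptions there is no partial credit: a guess that gets all but one cell right scores the same as a blank grid.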

For me, this helps the conversation about AI capabilities because it gives a rigorous test and quantitative measure to my casual, subjective observations that LLMs routinely fail at tasks that are easy for humans.

François Chollet was impressed when OpenAI’s o3 model scored 75.7% on ARC-AGI (the older version of the benchmark). He emphasizes the concept of “fluid intelligence”, which he seems to define as the ability to adapt to new situations and solve novel problems. Chollet thinks that o3 is the first AI system to demonstrate fluid intelligence, although it’s still a low level of fluid intelligence. (o3 also required thousands of dollars’ worth of computation to achieve this result.)

This is the sort of distinction that can’t be teased out by the polarized popular discourse. It’s the sort of nuanced analysis I’ve been seeking out, but which has been drowned out by extreme positions on LLMs that ignore inconvenient facts.

I would like to see more benchmarks that try to do what ARC-AGI-2 does: find problems that humans can easily solve and frontier AI models can’t solve. This sort of benchmark can help us measure AGI progress much more usefully than the typical benchmarks, which play to LLMs’ strengths (e.g. massive-scale memorization) and don’t challenge them on their weaknesses (e.g. reasoning).

I long to see AGI within my lifetime. But the super short timeframes given by some people in the AI industry feel to me like they border on mania or psychosis. The discussion is unrigorous, with people pulling numbers out of thin air based on gut feeling.

It’s clear that there are many things humans are good at doing that AI can’t do at all (where the humans vs. AI success rate is ~100% vs. ~0%). It serves no constructive purpose to ignore this truth and it may serve AI research to develop rigorous benchmarks around it.

Such benchmarks will at least improve the quality of discussion around AI capabilities, insofar as people pay attention to them.

0 comments

deepdeeppuddle

11 hours ago

6 votes
Using Claude and undocumented Google Calendar features to automate event creation
~tech
- google.calendar
Article 358 words, published Mar 25 2025
1 comment

mattsayar.com

2 days ago

5 votes
Tracing the thoughts of a large language model

~tech Link

1 comment

anthropic.com

March 28

10 votes
Review: Cræft, by Alexander Langlands

~tech Article 4182 words

1 comment

thepsmiths.com

March 24

4 votes
Please stop externalizing your costs directly into my face

~tech Article 752 words, published Mar 17 2025

56 comments

drewdevault.com

March 20

120 votes
Block AI scrapers with Anubis
~comp
- open source
Article 1617 words, published Jan 19 2025
29 comments

xeiaso.net

March 17

27 votes
FOSS infrastructure is under attack by AI companies

~tech Article 1864 words

8 comments

thelibre.news

March 20

39 votes
The Long Context - Interactive fiction driven by an LLM

~comp Article 5933 words

2 comments

thelongcontext.com

March 17

12 votes
LLM crawlers continue to DDoS SourceHut

~tech Article 427 words

1 comment

sr.ht

March 17

11 votes
Mayo Clinic's secret weapon against AI hallucinations: Reverse RAG in action

~tech Article 1345 words, published Mar 7 2025

1 comment

venturebeat.com

March 11

8 votes
Factorio Learning Environment – a benchmark that tests agents in long-term planning, program synthesis, and resource optimization

~tech Link

2 comments

jackhopkins.github.io

March 11

13 votes
Bartosz Milewski - Understanding Attention in LLMs
~comp
- programming
Article 973 words
0 comments

bartoszmilewski.com

March 8

6 votes
Is it wrong to use AI to fact check and combat the spread of misinformation?
~tech
- social media
Ask

I’ve been wondering about this lately.

Recently, I made a post about Ukraine on another social media site, and someone jumped in with the usual "Ukraine isn't a democracy" right-wing talking point. I wrote out a long, thoughtful reply, only to get the predictable one-liner propaganda responses back. You probably know the type, just regurgitated stuff with no real engagement.

After that, I didn’t really feel like spending my time and energy writing out detailed replies to every canned response. But I also didn’t want to just let it sit there and have people who might be reading the exchange assume there’s no pushback or correction.

So instead, I tried leveraging AI to help me write a fact-checking reply. Not for the person I was arguing with, really, but more as an FYI for anyone else following along. I made sure it stayed factual and based in reality, avoided name-calling, and kept the tone above the usual mudslinging. And of course, I double-checked what it wrote to make sure it matched my understanding and wasn’t just spitting out garbage or hallucinations.

It got me thinking: there’s a lot of fear about AI being used to create and spread misinformation. But do you think there’s also an opportunity to use it as a tool to counter misinformation, without burning ourselves out in the process?

Curious how others see it.

17 comments

Merry

March 6

16 votes
Students and their ChatGPT

~tech Article 713 words, published Feb 26 2025

9 comments

dlay.eu

March 2

27 votes
Algorithmic Complacency: Algorithms are breaking how we think
~tech
Video 37:52
23 comments

YouTube: Technology Connections

February 22

82 votes

Funny output from Claude.ai: a code block in a question about an early Renaissance theological work

~comp

programming

Ask

I asked Claude for references to some ideas the AI had output earlier about a 16th century cardinal (Sadoleto). Its response was, "Let me analyze the letter", and then this:

async function searchForSadoletoLetter() {
  try {
    const files = ['sadoleto_letter.txt', 'sadoleto.txt', 'sadoleto_to_geneva.txt', 'letter_to_geneva.txt'];
    
    for (const filename of files) {
      try {
        const content = await window.fs.readFile(filename, { encoding: 'utf8' });
        console.log(`Found file: ${filename}`);
        console.log("First 500 characters:");
        console.log(content.substring(0, 500));
        return content;
      } catch (err) {
        // File not found, try the next one
      }
    }
    console.log("Could not find any files containing Sadoleto's letter");
    return null;
  } catch (err) {
    console.error("Error searching for files:", err);
    return null;
  }
}

const content = await searchForSadoletoLetter();

// Since we don't have the letter available as a file, let's search for published versions online
console.log("The letter isn't available as a file. We would need to rely on scholarly knowledge about Sadoleto's letter to Geneva (1539).");
console.log("Sadoleto's letter was published in 'A Reformation Debate: Sadoleto's Letter to the Genevans and Calvin's Reply' (ed. John C. Olin).");

This strikes me as odd. Did Claude write a script to generate its own response? Have coders put something like this in as a guardrail?

edit: details about earlier convo

15 votes

Have you altered the way you write to avoid being perceived as AI?

~tech Ask

I recently had an unpleasant experience. Something I wrote entirely myself, without AI generation of any kind, was perceived as, and accused of being, produced by AI. Because I wanted to get everything right in that circumstance, I wrote in my "cold and precise" mode, which admittedly can sound robotic. However, my writing was pointed, perhaps even a little hostile, with a clear point of view. Not the kind of text AI generally produces. After the experience, I started to think of ways to write less like an AI -- which, paradoxically, means forcing my very organic self into adopting "human-like" language I don't necessarily care for. That made me think that AI is probably changing the way a lot of people write, perhaps in subtle ways. Have you noticed this happening with you or those around you?

24 comments

lou

February 17

30 votes
Building a personal, private AI computer on a budget
~comp
- hardware
Article 2933 words
23 comments

ewintr.nl

February 9

24 votes
"The Bullshit Machines" - A free humanities course on LLMs for college freshmen from UW professors
~humanities
- education.higher
Link
9 comments

thebullshitmachines.com

February 9

43 votes
NBC producers deny using AI in new series ‘Detective Fireman Lawyer Chicago Los Angeles Show’

~tv Article 231 words

4 comments

theonion.com

February 6

37 votes
DeepSeek’s safety guardrails failed every test researchers threw at its AI chatbot
~tech
- security
Article 474 words
29 comments

WIRED

February 1

16 votes
Building games with LLMs to help my kid learn math

~tech Article 253 words

1 comment

mattsayar.com

February 3

9 votes
What trustworthy resources are you using for AI/LLMs/ML education?

~tech Ask (recommendations)

Every company is trying to shoehorn AI into every product, and many online materials provide a general snake oil vibe, making it increasingly difficult to parse. So far, my primary sources have been GitHub, Medium, and some YouTube.

My goal is to better understand the underlying technology so that I can manipulate it better, train models, and use it most effectively. This goes beyond just experimenting with prompts and trying to overcome guardrails. It includes running models locally, for example with Ollama on my M1 Max, which I'm not opposed to.

5 comments

GreasyGoose

January 24

8 votes
Are LLMs making Stack Overflow irrelevant?

~tech Article 1099 words

40 comments

pragmaticengineer.com

January 22

23 votes
Nepenthes: a tarpit intended to catch AI web crawlers
~tech
- internet
Article 1509 words
23 comments

zadzmo.org

January 19

33 votes
Task-Specific LLM Evals that Do & Don't Work

~comp Article 6254 words, published Mar 31 2024

2 comments

eugeneyan.com

December 9, 2024

4 votes
Researchers explain that it is easy to redirect LLM-equipped robots, including military and security robots, in dangerous ways
~tech
- security
Article 1003 words, published Nov 11 2024
0 comments

ieee.org

November 24, 2024

15 votes
Project Zero: Using large language models to catch vulnerabilities in real-world code
~tech
- security
- google
Article 1866 words
2 comments

googleprojectzero.blogspot.com

November 2, 2024

7 votes
Gender, race, and intersectional bias in resume screening via language model

~tech Article 834 words

2 comments

geekwire.com

November 1, 2024

14 votes
Anthropic announces New Claude 3.5 Sonnet, Claude 3.5 Haiku and the Computer Use API

~tech Article 2288 words

13 comments

anthropic.com

October 22, 2024

19 votes
How harmful are AI’s biases on diverse student populations?

~tech Link

1 comment

stanford.edu

October 21, 2024

9 votes
GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models

~tech Article 278 words

12 comments

apple.com

October 19, 2024

15 votes
OpenAI is a bad business

~tech Article 7735 words, published Oct 2 2024

61 comments

wheresyoured.at

October 9, 2024

43 votes
How to setup a local LLM ("AI") on Windows
~tech
- microsoft
Video 13:22, published Sep 23 2024
3 comments

YouTube: Dave's Garage

September 29, 2024

12 votes
Covert racism in AI: How language models are reinforcing outdated stereotypes

~tech Link

2 comments

stanford.edu

September 23, 2024

20 votes
Prison inmates in Finland are being employed as data labellers to improve accuracy of AI models

~tech Article 1679 words

7 comments

euronews.com

September 22, 2024

22 votes
OpenAI: Introducing o1

~tech Link

5 comments

openai.com

September 12, 2024

14 votes
AI is here. What now?

~tech Video 46:16

23 comments

YouTube: Eddy Burback

September 2, 2024

18 votes
AI accuses journalist of escaping psych ward, abusing children and widows

~tech Article 455 words

31 comments

futurism.com

September 2, 2024

29 votes
AI makes racist judgement calls when asked to evaluate speakers of African American vernacular English

~tech Link

14 comments

science.org

August 29, 2024

23 votes
The LLMentalist effect: how chat-based large language models replicate the mechanisms of a psychic's con

~tech Article 4404 words, published Jul 4 2023

14 comments

softwarecrisis.dev

August 16, 2024

29 votes
Solving a couple of hard problems with an LLM

~tech Article 868 words, published Jul 21 2024

4 comments

phfactor.net

July 24, 2024

13 votes
How are AI and LLMs used in your company (if at all)?

~tech Ask (survey)

I'm working on an AI chat portal for teams, think Perplexity but trained on a company's knowledgebase (prosgpt dot com for the curious), and I wanted to talk to some people who are successfully using LLMs in their teams or jobs to improve productivity.

Are you using free or paid LLMs? Which ones?

What kind of tasks do you get an LLM to do for you?

What is the workflow for accomplishing those tasks?

Cheers,
nmn

25 comments

nmn

July 14, 2024

12 votes
"Mechanistic interpretability" for LLMs, explained

~comp Article 3670 words

1 comment

Substack: Sean Trott

July 8, 2024

6 votes
Vibe Check - Let AI find you the best things

~tech Link

21 comments

vibecheck.market

June 24, 2024

30 votes
Researchers describe how to tell if ChatGPT is confabulating

~comp Article 522 words

5 comments

Ars Technica

June 21, 2024

24 votes
Experiences using a local voice assistant with LLM with HomeAssistant?

~tech Ask (advice)

Has anyone out there hooked HomeAssistant up to a local LLM? I'm very tempted:
- Alexa integrations fail often. HomeAssistant integrations tend to be rock solid.
- Alexa is rule/pattern matching based. LLMs can understand natural language fairly well. The "magical incantations" required by Alexa are awkward.
Other than the software, the device side seems challenging. There are $50 fully-baked POP devices. I'm less sure on the DIY front.

Also, I desperately want my house to speak to me in the voice of the NCC-1701D computer. I've read enough now to know this should be achievable with a modicum of effort via OSS voice cloning tools or training a new model (same difference except "voice cloning" seems to often refer to doing this without training a whole new model?).

Thoughts? Experiences?

I've seen several pages that have led me to conclude this is tenable:

https://github.com/myshell-ai/OpenVoice

https://github.com/domesticatedviking/TextyMcSpeechy

https://github.com/mezbaul-h/june

https://www.home-assistant.io/voice_control/voice_remote_local_assistant/

https://heywillow.io/hardware/#esp32-s3-box-lite
1 comment

elight

June 21, 2024

14 votes
Detecting hallucinations in large language models using semantic entropy

~tech Article 8513 words

4 comments

Nature

June 21, 2024

17 votes
I will fucking piledrive you if you mention AI again

~comp Article 4269 words

32 comments

mataroa.blog

June 19, 2024

119 votes
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

~comp Article

2 comments

arXiv

June 15, 2024

9 votes