-
54 votes
-
Overworked AI agents turn "marxist"
14 votes -
That one study that proves developers using AI are deluded
I've found myself replying to different people about the early 2025 METR study fairly often, so I thought I'd try posting a top-level thread. Consider it an unsolicited public service announcement.
You might be familiar with the study because it has been showing up alongside discussions about AI and coding for about a year. It found that LLMs actually decreased developer productivity and so people love to use it to suggest that the whole AI coding thing is really a big lie and the people who think it makes them more productive are hallucinating.
Here's the thing about that study... No one seems to have even glanced at it!
First, it's from early 2025, and they used Claude Sonnet 3.5 or 3.7. Those models are in no way comparable to current-gen coding agents. The commonly cited inflection point didn't happen until later in 2025 with, depending on who you ask, Sonnet 4.5 or Opus 4.5.
The study comprised 16 people! If those 16 were even vaguely representative of the developer population at the time, most of them wouldn't have had significant experience with LLMs for coding.
These are not tools that just work out of the box, especially back then. It takes time and experimentation, or instruction, to use them well.
It was cool that they did the study; trying to understand LLMs was a good idea. But it's not what anyone would consider a representative, or even well-thought-out, study. 16 people!
But wait! They did a follow-up study later in 2025.
This time with about 60 people and newer models and tools. In that study they found the opposite effect: AI tools sped developers up (which is a shock to no one who has used these tools long enough to get a feel for them). They also mentioned:
However the true speedup could be much higher among the developers and tasks which are selected out of the experiment.
In addition they had some, kind of entertaining, issues:
Due to the severity of these selection effects, we are working on changes to the design of our study.
Back to the drawing board, because:
Recruitment and retention of developers has become more difficult. An increased share of developers say they would not want to do 50% of their work without AI, even though our study pays them $50/hour to work on tasks of their own choosing. Our study is thus systematically missing developers who have the most optimistic expectations about AI’s value.
And...
Developers have become more selective in which tasks they submit. When surveyed, 30% to 50% of developers told us that they were choosing not to submit some tasks because they did not want to do them without AI. This implies we are systematically missing tasks which have high expected uplift from AI.
And so...
Together, these effects make it likely that our estimate reported above is a lower-bound on the true productivity effects of AI on these developers.
[...]
Some developers were less likely to complete tasks that they submitted if they were assigned to the AI-disallowed condition. One developer did not complete any of the tasks that were assigned to the AI-disallowed condition.
[...]
Altogether, these issues make it challenging to interpret our central estimate, and we believe it is likely a bad proxy for the real productivity impact of AI tools on these developers.
So to summarize, the new study showed a productivity increase, and they estimate it's larger than the ~20% increase the study found. Cheers to them for being honest about the issues they encountered. For my part I know for sure that the increase is significantly more than 20%. The caveat, though, is that this is only true after you've had some experience with the tools.
The truth is that we don't need a study for this. Any experienced engineer can readily see it for themselves, and you can find them talking about it pretty much everywhere. It would be interesting, though, to see a well-designed study that attempted to quantify how big the average productivity increase actually is.
For that, the participants using AI would need to be experienced with it and allowed to use their existing setups.
I want to add that this is not an attempt to evangelize for AI. I find the tools useful but I'm not selling anything. I'm interested in them and I stay up to date on the conversations surrounding them and the underlying technology. I use them frequently both for my own projects and to help less technical people improve their business productivity.
Whether AI agents are a good thing or not, from a larger perspective, is a very different, and complicated, conversation. The important thing is that utility and impact are two different conversations. There isn't a debate anymore about utility.
I know this probably won't stop people from continuing to derail conversations with the claim that developers are wrong about utility, but I had to try. It's just hard to let it pass by when someone claims the sky is green.
I understand that AI makes people angry and I think they have good reason to be angry. There are a lot of aspects of the AI revolution that I'm not thrilled about: the hype foremost, and the FOMO that comes with it. The potential for increased wealth consolidation really sucks, though I lay that at the feet of systems that existed before LLMs came along.
It's messy, but let's consider giving the benefit of the doubt to professionals who say a tool works instead of claiming they're wrong. Let them enjoy it. We can still be angry at AI at the same time.
82 votes -
AI fails at 96% of jobs (new study)
28 votes -
Most parked domains now serving malicious content
32 votes -
Researchers isolate memorization from problem-solving in AI neural networks
12 votes -
The emerging evidence on AI tutoring
20 votes -
AI eroded doctors’ ability to spot cancer within months in study
42 votes -
Social media probably can’t be fixed
38 votes -
AI coding tools make developers slower but they think they're faster, study finds
40 votes -
Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task
54 votes -
On writing, and an MIT study
12 votes -
Large Language Models are more persuasive than incentivized human persuaders
14 votes -
Researchers secretly ran a massive, unauthorized AI persuasion experiment on Reddit users
64 votes -
Time saved by AI offset by new work created, study suggests
23 votes -
AI chatbots are people, too. (Except they’re not.)
10 votes -
Randomized trial shows AI tutoring effective in Nigeria
12 votes -
Study: essay graders rarely detect AI, give higher grades
22 votes -
Gender, race, and intersectional bias in resume screening via language model
14 votes -
Covert racism in AI: How language models are reinforcing outdated stereotypes
20 votes -
AI makes racist judgement calls when asked to evaluate speakers of African American vernacular English
23 votes -
Study shock! AI hinders productivity and makes working worse.
42 votes -
Internet use statistically associated with higher wellbeing, finds new global Oxford study
13 votes -
Why large language models like ChatGPT treat Black- and White-sounding names differently
10 votes -
Doctors receptive to AI collaboration in simulated clinical case without introducing bias
6 votes -
AI models found to show language bias by recommending Black defendants be 'sentenced to death'
28 votes -
What are some interesting machine learning research papers you found?
Here's a place to share machine learning research papers that seem interesting to you. I'm no expert, but sometimes I skim them, and maybe there are some folks on Tilde who know more than I do?
One paper per top-level post, and please link to arXiv (if relevant) and quote a bit of the abstract.
11 votes -
Scientists make breakthrough discovery while experimenting with urine
21 votes -
Study finds emojis are differently interpreted depending on gender, culture, and age of viewer
35 votes -
Online anonymity: study found ‘stable pseudonyms’ created a more civil environment than real user names
68 votes -
Addressing equity and ethics in artificial intelligence
13 votes -
Scientists explain why ‘doing your own research’ leads to believing conspiracies
42 votes -
It’s official: Cars are the worst product category we have ever reviewed for privacy
130 votes -
Estimating the association between Facebook adoption and well-being in seventy-two countries
5 votes -
A cool way to keep things cool: The electro caloric effect
13 votes -
GPT detectors are biased against non-native English writers
41 votes -
Antisemitic tweets soared on Twitter after Musk took over, study finds
6 votes -
How social media shapes our perceptions about crime
7 votes -
Wi-Fi routers used to detect human locations, poses within a room
8 votes -
Does this button work? Investigating YouTube’s ineffective user controls.
12 votes -
Does software piracy mitigate poverty?: Evidence from developing and Latin America countries
12 votes -
The impact of digital media on children’s intelligence
10 votes -
Tech sector job interviews assess anxiety, not software skills
8 votes -
Proposed illegal image detectors on devices are ‘easily fooled’
9 votes -
Evaluating the effectiveness of deplatforming as a moderation strategy on Twitter
6 votes -
New study raises fresh ‘privacy concerns’ about data sharing from Android mobile phones
6 votes -
TikTok's algorithm leads users from transphobic videos to far-right rabbit holes
12 votes -
What does your gaze reveal about you? On the privacy implications of eye tracking
10 votes -
NIST study evaluates effects of race, age, sex on face recognition software - Findings included that many algorithms had false positive rates 10 to 100 times higher for non-Caucasians
7 votes -
Free Internet access should be a basic human right: Study
19 votes