That one study that proves developers using AI are deluded
I've found myself replying to different people about the early 2025 METR study fairly often, so I thought I'd try posting a top-level thread. Consider it an unsolicited public service announcement.
You might be familiar with the study because it has been showing up alongside discussions about AI and coding for about a year. It found that LLMs actually decreased developer productivity, so people love to cite it to suggest that the whole AI coding thing is a big lie and that the people who think it makes them more productive are hallucinating.
Here's the thing about that study... No one seems to have even glanced at it!
First, it's from early 2025; the participants used Claude Sonnet 3.5 or 3.7. Those models are in no way comparable to current-gen coding agents. The commonly cited inflection point didn't happen until later in 2025 with, depending on who you ask, Sonnet 4.5 or Opus 4.5.
The study comprised 16 people! If those 16 were even vaguely representative of the developer population at the time, most of them wouldn't have had significant experience with LLMs for coding.
These are not tools that just work out of the box, especially back then. It takes time and experimentation, or instruction, to use them well.
It was cool that they did the study; trying to understand LLMs was a good idea. But it's not what anyone would consider a representative, or even well-thought-out, study. 16 people!
But wait! They did a follow-up study later in 2025, this time with about 60 people and newer models and tools. In that study they found the opposite effect: AI tools sped developers up (which is a shock to no one who has used these tools long enough to get a feel for them). They also mentioned:
However the true speedup could be much higher among the developers and tasks which are selected out of the experiment.
In addition, they ran into some rather entertaining issues:
Due to the severity of these selection effects, we are working on changes to the design of our study.
Back to the drawing board, because:
Recruitment and retention of developers has become more difficult. An increased share of developers say they would not want to do 50% of their work without AI, even though our study pays them $50/hour to work on tasks of their own choosing. Our study is thus systematically missing developers who have the most optimistic expectations about AI’s value.
And...
Developers have become more selective in which tasks they submit. When surveyed, 30% to 50% of developers told us that they were choosing not to submit some tasks because they did not want to do them without AI. This implies we are systematically missing tasks which have high expected uplift from AI.
And so...
Together, these effects make it likely that our estimate reported above is a lower-bound on the true productivity effects of AI on these developers.
[...]
Some developers were less likely to complete tasks that they submitted if they were assigned to the AI-disallowed condition. One developer did not complete any of the tasks that were assigned to the AI-disallowed condition.
[...]
Altogether, these issues make it challenging to interpret our central estimate, and we believe it is likely a bad proxy for the real productivity impact of AI tools on these developers.
So to summarize: the new study showed a productivity increase, and they estimate the true effect is larger than the ~20% increase the study found. Cheers to them for being honest about the issues they encountered. For my part, I know for sure that the increase is significantly more than 20%. The caveat, though, is that that's only true after you've had some experience with the tools.
The truth is that we don't need a study for this, any experienced engineer can readily see it for themselves and you can find them talking about it pretty much everywhere. It would be interesting, though, to see a well designed study that attempted to quantify how big the average productivity increase actually is.
For that the participants using AI would need to be experienced with it and allowed to use their existing setups.
I want to add that this is not an attempt to evangelize for AI. I find the tools useful but I'm not selling anything. I'm interested in them and I stay up to date on the conversations surrounding them and the underlying technology. I use them frequently both for my own projects and to help less technical people improve their business productivity.
Whether AI agents are a good thing or not, from a larger perspective, is a very different, and complicated, conversation. The important thing is that utility and impact are two different conversations. There isn't a debate anymore about utility.
I know this probably won't stop people from continuing to derail conversations with the claim that developers are wrong about utility, but I had to try. It's just hard to let it pass by when someone claims the sky is green.
I understand that AI makes people angry, and I think they have good reason to be. There are a lot of aspects of the AI revolution that I'm not thrilled about: the hype foremost, the FOMO that comes as part of the hype, and the potential for increased wealth consolidation, which really sucks, though I lay that at the feet of systems that existed before LLMs came along.
It's messy, but let's consider giving the benefit of the doubt to professionals who say a tool works instead of claiming they're wrong. Let them enjoy it. We can still be angry at AI at the same time.
I think it depends on what you are doing. I honestly feel pretty gaslit on AI. FWIW, I am in a professional setting where someone else is paying for Claude, and we use Google Workspace so I have access to Gemini (3.1) as well.
Things they are good at: scanning huge amounts of code and saying general things about it. Generating documentation that needs a light edit.
Things they are ok at: boilerplatey code, teaching me where to look for docs, speeding up a first pass. Doing shell scripts that I don't really care that much about. Helping me push past a lack of energy for stuff I can't muster enthusiasm for.
Things they are bad at: being trusted to write code. They fucking love overkill regexes. For a fixed format that uses two-digit numbers? REGEX. Need to strip a prefix? REGEX. They love duplicating code. They love convoluted branching. They never produce any code that I would just sign off on. I am constantly being told that if I just tuned my system prompt enough, and could foresee with 20/20 vision all the mistakes it would make, I would be more productive.
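To make the regex complaint concrete, here is a hypothetical sketch (in Python, names mine) of the pattern being described: an agent reaching for a regex where a plain string method is clearer.

```python
import re

line = "cfg_timeout"

# The kind of thing an agent tends to produce for "strip a prefix":
stripped_regex = re.sub(r"^cfg_", "", line)

# The boring version a reviewer would rather sign off on (Python 3.9+):
stripped_plain = line.removeprefix("cfg_")

assert stripped_regex == stripped_plain == "timeout"
```

Both lines do the same job; only one of them makes the reader stop and parse a pattern.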
Things they are terrible at: they have no sense of smell, taste, or architecture. Even when trying to get them to plan out something, they just trip over themselves and take any shortcut possible to avoid dealing with the task at hand. For instance: I have a buildroot project wrapped up with Nix. I have split it into a couple of parts: one cached download with all the sources, one build. I wanted to split the download into multiple parts so that changing one dependency doesn't trigger half a gig of downloads. This is a tricky task but not that hard. Good god. Here are some steps, write out a plan, here is a test strategy, off you go. Doom loop. Doom loop of trying to make patches by hand. Adjust prompt, you can do it like this, conversation gets compacted, it forgets, try to write a patch skill. Fail. Just write the patch for the bot. Implement more, adjust the plan, shit shit shit shit shit. It feels productive, because there is whirring and animations and code is flying about. In the end I gave up spoon-feeding Claude and just did it myself. I am confident I could have done it much faster had I not bothered with AI.
I am sure the tools will get better but I swear if one more person at work tries to suggest we can 10x our output I am going to eject myself out a window.
Your comment reflects my feelings.
Generalizing existing code (a hard task, at least for me): the results are just super.
Helping to learn something or investigate some issues: definitely helpful.
Writing code or generating a detailed description of a specific function: it's actually counterproductive.
It makes small, stupid mistakes here and there, and I immediately lose any trust in AI results. If the AI tells me that a function verifies that a = b + c + d, but in reality the function verifies that a <= a + b + c, how can I trust it? If the AI creates a function that reads a value from the DB but closes the connection just before reading the value from the select, how can I continue to trust it?
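For what it's worth, that second failure mode can be sketched in a few lines. This uses sqlite3 purely as a stand-in; the table and column names are made up.

```python
import sqlite3

def read_value_buggy(db_path):
    """The pattern described above: the connection is closed
    before the SELECT result is actually fetched."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute("SELECT value FROM settings LIMIT 1")
    conn.close()           # connection closed too early...
    return cur.fetchone()  # ...so this raises sqlite3.ProgrammingError

def read_value_fixed(db_path):
    """Fetch while the connection is still open, then close it."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute("SELECT value FROM settings LIMIT 1")
        return cur.fetchone()
    finally:
        conn.close()
```

The nasty part is that the buggy version looks plausible at a glance, which is exactly why it erodes trust.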
Heh, maybe I'm too meticulous, but I do not want to fix code that sometimes almost works.
Because it's exactly the sort of code that hides issues which are devious, frustrating and difficult to find. When it doesn't compile, the compiler tells you why, and where the issue is. An error is similarly explicit. But code that mostly runs except when it hits some edge case is pure hell.
Mind you, I've used Claude Opus 4.6 (I don't bother with other models) successfully at work as a dev. But it's best used for general directions about a code base you have no idea about. It recently helped me analyze how a complex component in a front-end worked in which many systems interlocked: a way to filter data, syncing that filtering with the state and sending off database queries with it... I was a little lost, because the code base is vast and frankly speaking trash, and Claude helped there.
But the code with which it proposed to solve the problem I was facing was pretty bad.
My experience is about the same amount of time is taken reviewing and fixing AI code compared to just writing it myself.
One situation I have found it to be actually of any use was when a codebase is outside my area of expertise and I am running into a bug, but even then, I can't completely trust anything it spits out. For the same issue, Claude spit out a different solution for me and my colleague, and my colleague's happened to be wrong after we asked the maintainer.
So if even the state-of-the-art agents aren't super reliable, why am I supposed to run 10 of them and maximize my "output"?
You're not... to put it bluntly, the people saying that are either lying for cynical reasons or they don't actually know what they're talking about.
That said, and this applies to a lot of the issues people run into, it is possible to use agents productively and get high quality results. Two different people, with two different setups, will get different results because LLM agents are not a blunt instrument. You can't just press a button and expect them to "just work", so to speak.
Whether it's worth the time and energy to teach yourself how to use them is a different question.
Fortunately, the other option is to just wait. Every week another strategy that people have taught themselves over the last year gets formalized and built into the harness of one of the SoTA agents. Eventually it will be a lot more like just pushing a button and getting results. Not tomorrow, and not completely, at least for a while. But the tools will get increasingly reliable.
Exactly my experience.
I don't touch "agents" anymore, because having those models expand the context with their dumb mistakes just burns through tokens at an exponential rate with nothing to show for it. The doom loops are real.
What works surprisingly well is getting a first draft of not-overly-complicated constructions, or getting a canned solution from some more specialized sub-field (like DSP). I really don't remember how to calculate the coefficients for a direct form 2 biquad IIR filter, but even a self-hosted Qwen3.5 9B does. With just light prompting to stop it from over-complicating.
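For anyone curious, the kind of canned DSP answer being described looks roughly like this: the standard Audio EQ Cookbook low-pass formulas, which yield normalized coefficients for a direct form 2 biquad (function and parameter names are mine).

```python
import math

def biquad_lowpass(fs, f0, q):
    """Low-pass biquad coefficients via the Audio EQ Cookbook formulas,
    normalized so a0 == 1 as direct form 2 implementations expect."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    cosw0 = math.cos(w0)

    b0 = (1 - cosw0) / 2
    b1 = 1 - cosw0
    b2 = (1 - cosw0) / 2
    a0 = 1 + alpha
    a1 = -2 * cosw0
    a2 = 1 - alpha

    return (b0 / a0, b1 / a0, b2 / a0), (a1 / a0, a2 / a0)

b, a = biquad_lowpass(fs=48_000, f0=1_000, q=0.7071)
```

A quick sanity check on output like this: a low-pass section should have unity gain at DC, i.e. (b0 + b1 + b2) / (1 + a1 + a2) == 1.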
Those models work pretty decently as librarians (for me), but not much else. Not even as sounding boards, those little sycophants. I've actually learned to just write my thoughts into a text file instead of an LLM textbox. It doesn't cost anything and makes me understand what I am doing way better. That has resulted in me coding about twice as fast as before, without constantly falling into the slot-machine loop.
Let me repeat that: I've gotten into the habit of just opening a text file (per project), appending my current thoughts about the task at hand, and pondering "aloud" in text for a bit.
And I can still invoke an LLM if I get stuck and ask what's wrong with the code, only for it to point out that I've made a copy-paste error on line 123, along with four other "absolutely critical bugs" that actually aren't. Or ask it to generate some boilerplate, after which I have to add all the missing arguments and delete half of it because it's useless. Then get back to building the right abstractions for what I have in mind.
All true, I've had similar experiences... To go with the "wow that just saved me hours" moments there are also "what the fuck did it just do?" moments.
Although, it's been a while since I've seen anything as bad as your 'terrible' section. That sounds more like models from a generation or two ago. One possibility: when Anthropic's servers get overloaded, or they're having some other issue where the error rate shoots up, the models sometimes regress in weird ways. They can get extra stupid. There was one day where they were somehow jumping all the way off the tracks and returning inference that appeared to be from another session entirely. Not another local session, a session from a parallel universe.
Weird stuff happens sometimes and when it's at its worst the best answer is to close Claude Code and switch to another model or go back to hand coding. At the end of the day, despite being theoretically deterministic, LLM agents are nondeterministic in actual practice.
From my perspective that's part of the price of admission. I've managed to find ways around the majority of issues I've encountered. They still make stupid mistakes, but the worst annoyances are solvable.
I'm not going to tell you what you need to do to make them work better, in case you thought that was coming. AI agents can be really frustrating. They find ways to remind you that they ultimately have no clue what's going on. Which IMO is important to keep in mind.
I see it as something to account for and work around, and I sort of enjoy the trial and error and problem solving involved with making them suck less. But YMMV. I've never been forced to use them or had anyone pushing me to be more productive because I'm using them. I might genuinely hate them if that was the case.
Ok I take it back, I'll give you one tip that might improve your life: You mentioned compacting. One way the models are reliably dumber is when the context window gets too full. Anywhere over 100k tokens with Claude and the chances of stupidity start to go up. Once you get close to (or over) 200k tokens you're just rolling the dice. Still ok for simple tasks or researching a codebase and creating summaries for later sessions but increasingly useless for many things. In my experience it's best to /clear and start over frequently. Try not to go too far over 100k tokens in context. Opus 4.6 does a bit better with long context but there's still a measurable dropoff.
And yeah, compaction causes them to lose important details, which sometimes leads to hallucinations and other annoying regressions. IMO compaction doesn't really have a place in coding. It works fine for back-and-forth brainstorming or research sessions where a summary can capture the important details, but for real work I try to avoid it completely. I usually have auto-compact turned off and I mostly don't do manual compaction. Better to have it write a handoff file (or use a handoff skill) and start a new session long before you'd hit the auto-compact threshold.
Dude, hosted models at competitive pricing burn through single-digit dollars per minute with context packed this much. If you can actually afford to let it run, it means you are on a huge subsidy from Anthropic. When the subsidy goes away and you pay in full, you'll be paying more than you earn for that code, even at above-average developer compensation.
Thanks for bringing this up, and kudos to the authors for doing a follow-up study highlighting the issues!
I do agree with your points about AI actually increasing productivity and being a messy situation.
One thing I want to add is to consider the implications of long-term use of AI. I have found that many people will accept a lower level of quality when working with agents, only checking that the happy path, and maybe a few edge cases, of the feature they were implementing works.
In that sense, yes you will be much more productive and fly through tickets. However I have found that it takes a lot of effort to truly understand every line of code an AI agent writes. For me reading code (to a high level of understanding) is harder than writing it. When I'm working on something I experiment with a lot of alternatives, write tests, read documentation, and if I'm stuck my brain chews on it while I'm not working until something clicks. This is where learning and understanding happens.
Without this process, I feel it is easier to let technical debt build up, miss edge cases, and, most importantly, not improve your developer skill set as much. This is most prominent in junior developers. When I started my first job, it was around six weeks before I actually shipped something; now juniors can get something that looks correct in a few minutes or hours. Why would they, or the company, be willing to spend such a long time on something now that AI can do it so much faster?
If you do all of your due diligence with AI, spend time thinking through the problem and writing a prompt that takes everything into account, and don't ship any code that you don't fully understand, is it faster than if you wrote it yourself?
AI does increase productivity, but at the cost of understanding and growth.
Agreed, there's a lot of adaptation and learning that will need to happen, and along the way there's going to be a lot of insecure, difficult to maintain and otherwise broken code.
Which I guess isn't anything new, but the volume will be so much larger.
I've heard what you're describing called cognitive debt. It's suddenly possible to code faster than we can build reliable mental models.
I think the problem will be the worst in the corporate world where executives and managers are pushing engineers to produce faster without understanding what the consequences will be.
In smaller environments I think a lot of teams will adapt and come up with strategies to reduce cognitive debt. For my part I plan pretty granularly and then still review any code I don't write myself. I consistently find issues that make me glad I did. That by itself isn't always enough to keep up with the velocity so whenever I feel like I don't have a good mental model of the codebase I'll spend some time going through the code and building a better one.
I think the biggest hype inspired fantasy in coding right now is developers convincing themselves it's possible to do minimal review, or none at all. Opus 4.6 and GPT 5.4 are good but they aren't that good.
The problem is sometimes they seem like they are, so it's easy to fall into a false sense of trust in their abilities. It doesn't help that so many people are working hard to sell the idea that agents can do all your code review for you.
One of the things I've noticed in the AI coding zeitgeist, over and over, is that the latest trending ideas and conversations tend to make absolute claims. Things like skill atrophy or cognitive debt are framed as a new reality we just have to learn to live with. When really what will happen, after a lot of mistakes and some public catastrophes, is that we'll come up with ways to solve those problems.
I don't disagree with you, the velocity does come at a cost, and the industry as a whole is going to pay for it. But over time people who care about quality will figure out how to use the tools more effectively so that they can build stuff that doesn't suck.
Speaking for myself, I've learned more in the past 6 months or so than I'd normally learn in a couple of years. I've had to intentionally slow down to give it all a chance to process.
My company already does this even when using real developers.
Happy path, a few edge cases QAs catch, we have no access to customer or customer-like data so usually customers find their own bugs in production.
Failure path doesn’t exist in most cases we just throw a panic 500 for everything and theres no effort to even surface the correct error cause you’d have to pass it through like 100 downstream components.
We’re just now implementing ai first development and hoping it’ll actually be an improvement cause humans cant develop at the speed we’re going but humans might be able to tweak llm output at this speed.
I got a Claude Pro subscription two weeks ago and it's completely changed my opinion on AI in general. It's mind-blowing what it can do in comparison to what I'd seen from ChatGPT's free models. While I got it to experiment with Claude Code, I just keep being surprised at how many other things I keep finding it helpful for. Yesterday I had it clear up a minor mystery of why one of my favorite bands broke up by researching information that was available only in French. It combed through 481 articles to write that up in about 15 minutes. Before that, I had it help me with a financial/legal issue regarding a surprise inheritance.
Before I had tried Claude, I was under the impression that AI was bad with less popular coding languages and couldn't be trusted with software architecture. I haven't tested it with a huge codebase yet, but my 3,000+ line video game project written in Lua has been jealousy-inducing. I told it that I wasn't happy with the system I devised to handle input, and in a short period of time it came up with a better solution and even managed to replace my implementation by itself. Not only that, but the code it produced was good quality and matched the style and conventions I was using.
I do have to say that it’s actually slowed me down in a way; I’m kind of stuck right now because I want to write this project myself and every time I think about it I just think of how Claude can make it for me. But at the same time, the project had been on hiatus for about a year before I tried this experiment.
I’ve been using Codex (GPT-5.4) extensively for a few weeks and I’m blown away by its sophistication. In just a short time I’ve agent-built…
Being able to accomplish all this in the span of a couple weeks is mind-blowing to me. All of these are monumentally complex architectures for me to be working with solo, and I’m brand new to AssemblyScript and Rust.
Note that this is not a hands-off process, it’s iterative and intentionally prompted. Probably half of my prompts were just trying to understand what it had done so I could steer it in the directions I wanted. In most cases its instincts were honestly better than my own. Overall I feel very in control of the process, pleased with the results, and I’m learning a ton.
I’m actually about to start working with Codex since I’m going to be teaching a class using it. I overheard one of my coworkers saying they thought it was worse than Claude but I’ll be getting the chance to evaluate it soon enough.
Coincidentally we also have a new internal system they are developing and somehow I am only now connecting the dots and realizing they are probably vibe coding. It makes sense since they are going big on AI right now and the program features a ZUI for no particular reason.
In my experience Claude Code is better in a lot of small ways that add up, but GPT 5.4 is the closest they've come. In the majority of cases they're comparable. 5.4 is sometimes better at finding issues/bugs.
Is it a class about agents or are you using Codex to help with the class?
I’d be teaching how to use codex to build websites.
That being said, the curriculum itself seems to be AI-generated and not particularly well thought out, so it feels more like an extended daycare activity…
I would complain about it but the sole student I am teaching this to has some form of intellectual disability combined with a language barrier, so the goal of teaching him is more to support his development in the softer things more so than to add a skill to his repertoire.
That's really interesting, LLMs and intellectual disabilities... they're pretty good at (sometimes) accurately inferring meaning from semi-coherent text. Seems like that could be really useful for some types of disabilities. Voice to text too maybe?
I know that feeling! And that's an impressive list of projects.
Are you working on dialing in agents.md and other context files?
That's a fun question— solving drift has been a real problem. I've tried a bunch of strategies, with mixed results. My main project right now is a green-field new app buildout with some very complicated architectural needs. Codex has been incredible at identifying and defining all of the requirements. I love to have brainstorm sessions with it, presenting a problem I'm having or anticipating, getting its recommendations, iterating on those, optimizing for certain use cases... the problem is that when we finally land on a big, technical design decision, Codex forgets all about it in the next session and starts undermining our plan with naive tacked-on hacks... not good.
So I did what any lunatic would do... I wrote up a giant technical explanation of how the drift problem was impacting the project and asked Codex to think hard and propose a solution. (I also described all the things I had already tried, and some other ideas I was considering but wasn't sure about, and asked it to scour the web for discussions about this category of problem to find solutions that have worked for other people.) What it came up with genuinely surprised me and seems like it might actually have legs.
The actual jumbo-size prompt I gave it, if you're curious:
This project has a real problem... we have been agentically developing a sophisticated, multi-layered architecture of optimized subsystems. The performance gains have been significant. But every time we start work on a new feature, AI drift undermines our prior decisions! New AI sessions never have enough context to understand the way these systems are interconnected, how they're intended to work, the thought processes and long brainstorm sessions that led to their specific implementation details, etc. So the AI ends up just ripping through existing code that was supposed to be locked in and final. Where a task may call for a careful extension of an existing mental model, the AI instead rewrites the whole thing, throwing out our hard-won performance gains because it doesn't know anything about them.
We've tried keeping a formal spec in the project. The AI has been instructed to keep that spec up to date every time it completes work. But try as I might, I can't seem to make the AI do this well. It will add something new but not clean up old irrelevant or obsolete information. It will make changes in one part of the spec but not the others, so different parts are out of sync with each other. Critically, the level of detail it's putting in the spec is not high enough to describe the intricacies of these systems.
The AI has also been instructed to add comments to the code itself, and maintain those comments, so the code is self-documenting. This also has been insufficient to guard against AI drift, and the AI often leaves old comments unchanged even as the code they're annotating has completely been rewritten.
I think one problem is that we have so many interconnected systems, spread among many files in different locations within the project directory. The AI just doesn't know where to look. When it starts to work on something, it defaults to searching the codebase for keywords. When it finds them, it assumes it has identified the correct file to work on, but does not dig deeper to find related systems that interact with it. It's not bothering to follow code paths to discover the totality of the relationships. And it ends up making partial changes that have giant blind spots, and thinking its work is complete.
We've also added a glossary because the AI makes up new terms to describe the same concepts, in every session. The intent there is that if we define our terminology, we can discuss technical concepts with less vagueness or drift over time. And consistently use those defined terms throughout the code itself, so the files will embed a shared contextual nomenclature to keep them on the rails. This is the most recent accommodation we implemented; I don't have enough data yet to know if it will help, but I'm doubtful it will help much either.
I know I can't expect the AI to just... read everything before starting. That worked when the project was in its infancy but there are too many files now, too much complexity. The AI needs an accurate map, and it needs to know how to follow that map, and it needs to ALWAYS keep that map updated and accurate without drifting. Every time I add MORE agent rules, MORE glossary terms, MORE spec detail, MORE docs, MORE comments in files... I'm just increasing the amount of context that the AI has to track, which decreases its accuracy and fills up the available context window too quickly. That's counterintuitive and ends up making the drift problem even worse.
I don't know what to do! At this point, I'm terrified to keep agentically working on this project because of the constant risk of the AI silently destroying previous work.
I think we need to brainstorm a radically new solution for this problem. Here are some things I've been considering:
Replace the existing spec, docs, and glossary with a single, much tighter spec format that follows a rigid schema instead of being a free-form human-readable Markdown document. It would emphasize tracking design decisions, defining terms (as the glossary does now), and mapping out relationships between subsystems and technical concepts. It would also need to be designed to be searchable. I have no idea how maintaining that format over time would be achieved, it seems like just another target for drift.
Introduce a second AI agent that is responsible for tracking all the "meta information" we're afraid of losing, perhaps in a completely different project on disk. This agent would be trusted with remembering system architecture and internal conceptual relationships. I have no idea how it would do that without, itself, drifting. But the idea is that the primary AI agent would not be allowed to figure these things out for itself; it would instead be instructed to consult with the other agent before performing any work. This would keep the primary agent's context uncluttered with exploratory noise so it can stay focused on its engineering tasks.
I need your help. Is there anything worth trying in my ideas above? Do you have any ideas of your own? Research the subject online because I know a lot of other people are wrestling with this exact same problem. See what we can learn from their discussions and recommendations. Put together a concrete, actionable plan that is engineered to overcome the drift problems I described. Your solution must account for LLMs' limited context windows and auto-compaction, and the fact that because all new AI sessions are initialized with no existing context they must discover everything they know from scratch. Be mindful of the way Codex's
AGENTS.mdfile works, including its shortcomings. Prioritize information-dense ways to feed an AI agent complex information without overloading its context.What it ended up building is a repo-wide governance system with a very lean
AGENTS.mdthat basically instructs it about how to use that system and mandates strict adherence to it. The governance system has an API (well, a suite of NPM scripts to be executed at specific times) for tracking all architecture and workflow details, which it stores in checked-in, authoritative artifacts, and gating access to the codebase according to task-scoped permission logic. Honestly it's more complex than I really understand.Here's how Codex itself described the system to me.
- AGENTS.md is intentionally thin, so an agent gets routed into the real sources of truth instead of relying on a huge prompt snapshot. The repo-governance subsystem makes that an invariant in spec/subsystems/repo-governance-context-system.yaml (line 61).
- npm run context:task -- "<task>" generates an active task packet by scoring the task text against subsystem ids, contracts, invariants, and glossary terms, then listing required reads and tests. See docs/context-system/README.md (line 16) and scripts/context-system/shared.mjs (line 813).
- npm run context:check fails if derived artifacts are stale, there is no active task packet, changes spill outside the planned subsystem surface, a locked subsystem is edited without allow-listing, a touched subsystem record was not updated, or a locked change lacks an ADR update. That logic is in scripts/context-system/shared.mjs (lines 1074 and 1145), with related records in ADR-0002 (line 23), spec/subsystems/index.yaml (line 12), and scripts/context-system/shared.mjs (line 972).
- terms:check rejects discouraged glossary synonyms, arch:check rejects forbidden import edges, and drift:review builds a bounded review packet around changed files and subsystem invariants. See scripts/context-system/shared.mjs (lines 1192, 1229, and 1287).
It's too early for me to say whether it will be effective over time, but so far it seems pretty solid. In fact, I just asked Codex to split it out into a boilerplate (TypeScript) repo of its own that people can play around with. It should be trivial for an agent to port the concept to another language if TS isn't your bag. Interested to hear how well it works for others.
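To give a flavor of the task-packet idea, here's an illustrative sketch (my own invented names and shapes, not the actual shared.mjs logic): score the task text against each subsystem's terms and emit the reads an agent must do before touching code.

```typescript
// Hypothetical sketch of "task packet" generation. A subsystem record
// carries the terms it owns and the docs an agent must read first.
type Subsystem = {
  id: string;
  terms: string[];         // glossary terms and contract names
  requiredReads: string[]; // specs/docs that must be read before work
};

// Count how many of the subsystem's terms appear in the task text.
function scoreSubsystem(task: string, sub: Subsystem): number {
  const words = task.toLowerCase().split(/\W+/);
  return sub.terms.filter(t => words.includes(t.toLowerCase())).length;
}

// Rank subsystems by relevance and collect their required reads.
function buildTaskPacket(task: string, subs: Subsystem[]) {
  const ranked = subs
    .map(s => ({ sub: s, score: scoreSubsystem(task, s) }))
    .filter(r => r.score > 0)
    .sort((a, b) => b.score - a.score);
  return {
    task,
    subsystems: ranked.map(r => r.sub.id),
    requiredReads: ranked.flatMap(r => r.sub.requiredReads),
  };
}
```

The real system layers gating and staleness checks on top, but the core trick is just turning a free-text task into a small, mandatory reading list.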
Good call asking the AI to help with the context problem, though they are generally not completely clear on their own limitations and weaknesses.
It looks like Codex went a little overboard on the complexity, which it's infamous for, but as long as it works. Good luck with it!
Hooks are another option (triggering on the edit tool would be one way to go) if you have trouble getting it to reliably keep docs updated.
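For instance, in Claude Code you can register a PostToolUse hook that fires after every edit and reminds the agent about stale docs. This is only a sketch of the general shape (the script path is a placeholder, the schema has changed between versions so check the current hooks docs, and Codex's hook story is different):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "./scripts/check-docs-freshness.sh" }
        ]
      }
    ]
  }
}
```

The hook's output gets surfaced back into the session, which is usually enough of a nudge.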
Usually what I do is have it launch explore agents (still very new in Codex, not sure if they're on par with Claude Code's version) to map out either the whole codebase or specific parts and then create a reference file, or files, that convey the important architecture without eating too much context window.
Speaking of explore agents... if you can get Codex to use them reliably (you could create your own subagents if the native agents don't work the way you want)... The idea is that they're cheaper, faster, but still capable models that the main agent launches en masse to explore the codebase and return detailed summaries with snippets and line references where needed. That keeps all of the file reads out of the main agent's context, so it mostly solves the large-codebase problem. It's still often better to create context files, though.
I often wonder if part of the reason people think agents aren't useful is that they haven't tried a flagship model. The difference really is huge.
Are you using Claude in a harness like Claude Code or Cowork? That's the other dramatic step change.
I haven't used agents much with obscure languages, haven't had a reason to, but I've found that recent generations are surprisingly good at generalizing outside the center of the distribution. Which is to say, solving problems that almost certainly haven't been well covered on Stack Exchange or GitHub. They need more hand-holding for that sort of thing, but they don't hallucinate anywhere near as much as they did last year.
Even a year ago, these models were pretty awful. Try using Gemini 2.5 (released March 2025) for code. It's unusable.
The confusing thing is that a year ago people were lauding Gemini 2.5 every bit as much as Gemini 3.0 and Opus 4.6 today. For the exact tasks you're saying they are/were unusable for.
Ha, you're absolutely right about that. I review a lot of code, and I saw a steep drop in quality from a few devs in the latter half of 2024 and a good part of 2025. I don't think that was a coincidence.
Yes, I used Claude Code for the video game project I was referring to. It would have been very difficult to make those changes if I hadn’t because it reached into many different files. I haven’t used cowork though, since I don’t trust it that much.
It's worth noting, though, that this is my first major video game project and the first time I've used Lua to an appreciable level (outside of tinkering with things like NodeMCU).
One thing I really like about Claude in particular is that it really has a lot of small touches that make it useful in surprising situations. I’ve used the Learning style to help me with complex concepts that I’ve struggled with for almost a year and it’s been a great help. I think it’s going to be great when I resume university classes in the coming weeks, and it almost makes me regret not taking more classes that would challenge me.
I don't know if Perl is obscure yet, but it's quite adept with it. Perl was somewhat dominant for a while, though, so there are a lot of code examples in its corpus, I'm sure. That might be a sort of interesting dynamic: models will probably struggle with upcoming new languages, but language extinction might not be much of an issue, e.g. the problem of finding modern COBOL developers. As long as a language lives long enough to leave a good digital public footprint.
Ah Perl. For a minute it was THE dominant language for servers and the web. No doubt there's a huge amount of it in the training, there are still some good sized perl projects being maintained even now.
LLMs are by far the best when building small tools from scratch. For example, I finished an LED matrix display yesterday. I designed the backing in CAD (with a little help from an LLM writing OpenSCAD), 3D printed it, cut and soldered LED strips, soldered a simple proto board with a Pi Pico 2W and a handful of other components, cut and painted a wood board, then mounted everything. This happened over a couple of days. And then in 90 seconds I had an AI program the Pi to my specifications and the code worked on the first try.
At first I just had it display a rainbow. Then I copied in a font I'd already translated to 2D arrays of booleans for another project. In 60 seconds more I had it displaying scrolling text in rainbow colors. Again, it worked perfectly on the first try. Not a ton of code, but it did work amazingly well.
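The scrolling part is tiny logic. Here's a hedged sketch of just the framing in TypeScript (the actual project runs on a Pi Pico driving LED strips; these names and the layout convention are invented for illustration): lay the boolean glyphs out into one wide bitmap, then slide a fixed-width window across it, one column per frame.

```typescript
// A glyph is a 2D boolean array, indexed [row][col].
type Glyph = boolean[][];

// Concatenate glyphs into one wide bitmap, with a 1-column gap after each.
function layout(glyphs: Glyph[], rows: number): boolean[][] {
  const bitmap: boolean[][] = Array.from({ length: rows }, () => []);
  for (const g of glyphs) {
    for (let r = 0; r < rows; r++) {
      bitmap[r].push(...(g[r] ?? []), false); // trailing gap column
    }
  }
  return bitmap;
}

// The visible window at a given scroll offset; columns past the end are blank,
// so the text scrolls cleanly off the display.
function frame(bitmap: boolean[][], width: number, offset: number): boolean[][] {
  return bitmap.map(row =>
    Array.from({ length: width }, (_, c) => row[offset + c] ?? false)
  );
}
```

Each tick you render `frame(bitmap, matrixWidth, offset++)` to the LEDs; the rainbow is just a per-column color function applied on top.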
I've also vibe-coded a kiosk display for my local community center/hacker/makerspace's meetup.com events a week ago. It took maybe 15 minutes to get the MVP written. I went back a few times to revise the design. Pretty much all of the time on this project was spent on getting it deployed to a Raspberry Pi, fixing thermal issues, mounting the display, etc.
However, when working in an existing codebase things are very different. I had AI update how payments are handled on a site I've made (written with LLM assistance but hardly vibe-coded). It moved from using Stripe's checkout site to Stripe Elements in a modal. Actually verifying everything is extremely important. I have to carefully review the code, update tests, manually test on iOS and Android devices, etc. I wonder if writing this code myself would have been more productive. In the end it's not a terribly large commit. But given the consequences of fucking something up I had to get familiar with many decisions made on my behalf, which is harder to do than just making the decisions myself.
So it's important to know when and where to use LLMs to write code. You will have to use them quite a bit before you understand those conditions. That means losing some productivity in the short term. That means getting over some pride as you struggle to learn. Used correctly LLMs will help you produce a higher quality product in less time. Don't let them direct you into making slop.
100%. And it's also important to be suspicious of yourself when you start to feel overconfident in their abilities. Sometimes you'll have a streak of AI assistant wins and start to think you've got it all figured out. You'll be wrong. Never stop reviewing if it's production code.
The LED matrix is cool btw, converting multi hour low stakes projects in unfamiliar domains (or in that case maybe just unfamiliar to me) into the work of minutes is an area where they really shine.
Thanks. The next step is to put it on the WiFi, host a web server, and allow for remote configuration.
So like, sending messages home? Sounds fun. If you're feeling adventurous put an agent on the server too, secure it with a VPN or tailscale, and have it do all the things.
"Hey claude, create a scheduled task to tell the fam I love them every 4 hours. Also, at 1 am every night, repeatedly flash 'they're coming, get out now' in all caps for one hour"
lol
I'm going to hijack this thread to solicit some recommendations. I currently use Cursor Pro for my coding tasks. For context, I'm a scientist (astro), so my coding requirements are often quite different from a typical software developer's. I currently use agents mainly for a) debugging, b) familiarizing myself with a new codebase, and c) writing classes to package up longer computations for re-usability. I definitely see the benefit, but I can't help but feel I'm using like 5% of these tools' potential. In Cursor I attach context, use plan mode, etc., but I've never dabbled with multiple agents or things like agents.md. I was wondering if anyone had suggested guides detailing workflows that might be useful for me. I want to learn to use these tools more efficiently and really wring all the use I can out of them.
Separately, for people who have used both, would it be worth me switching from Cursor to Claude Code (with a Pro subscription)? I know these are slightly different things (i.e. cursor is basically a VS code overlay with agentic features), but I am wondering if people who have used both would really recommend one over the other.
Sorry this is a bit offtopic, feel free to label as so. I just thought this might be a good place to ask since it is sure to attract the attention of people who use these tools!
I can absolutely recommend switching to Claude code from cursor. I just made the switch myself about two weeks ago. It took me about 2 hours of using Claude code to cancel my cursor subscription.
If I were to characterize each, I would say cursor agents are a good step in the right direction. They always gave me a good starting point, but I would always touch up the code myself. Claude seems to get it how I want it almost every time. Very minimal touch up needed.
Very good to know! I think I was hesitant to switch because I do most of my work in VSCode (or, well, Cursor now), and I wasn't enthusiastic about the switch to the Claude CLI. But now I know there are extensions to integrate it into VSCode. I might make the switch, since I've heard only good things.
Seems on topic to me. I haven't used Cursor in a while and it's probably improved in a lot of ways, but generally speaking Claude Code and Codex are better tools, and you'll usually get more for your money because the labs subsidize direct subscriptions. With Cursor you're paying API rates plus Cursor's markup. They offset that a bit by letting you go over your limits, but you still get more by going to Anthropic or ChatGPT directly.
By the way both Claude Code and Codex have VSCode extensions if you don't like the CLI interface.
About agents.md (or claude.md)... The goal there is to personalize the experience. So if you notice the agent repeatedly doing something you don't like, add something about whatever it is to agents.md. Eventually you can make the agent perform better for your workflow without repeating yourself. It's also a good place to put coding style guidelines and general directives like "before proposing or writing code, familiarize yourself with relevant areas of the codebase and identify existing patterns". Think of it as a working document that you're frequently adding and removing things from. Don't let it get crazy big though.
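As an illustration (contents invented, not from any real project), a lean AGENTS.md might look like:

```markdown
# Project notes for agents

## Style
- TypeScript strict mode; no `any` without a comment justifying it.
- Prefer small, pure functions; colocate tests next to source files.

## Workflow
- Before proposing or writing code, familiarize yourself with the relevant
  modules and identify existing patterns.
- Run `npm test` before declaring a task complete.

## Things you keep getting wrong
- Do not add new dependencies without asking first.
```

The last section is the living part: every time the agent annoys you the same way twice, it earns a line there.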
Create other context files for specific situations. For example you might have a set of instructions for working in a particular language or instructions for working with a particular codebase. When in doubt, create a context file.
Other useful context engineering: Use custom skills or custom slash commands to store prompts or workflows that you use often. For example you could have a skill that outlines the procedure for pushing to production or handling git workflows (though newer agents already understand git pretty well out of the box). You can also put the aforementioned extra context files into skills. Note that skills can also include scripts.
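For example, a Claude Code skill is just a directory containing a SKILL.md whose frontmatter tells the agent when to load it (this one is hypothetical, including the script it mentions):

```markdown
---
name: deploy-prod
description: Procedure for deploying this app to production. Use when asked to deploy, release, or push to prod.
---

1. Run the full test suite; stop if anything fails.
2. Bump the version and tag the commit.
3. Run scripts/deploy.sh (bundled alongside this skill) and verify the
   health endpoint afterwards.
```

Because only the name and description are loaded up front, skills cost almost no context until they're actually needed.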
Also note that the agents are trained to help you with all of the above if you ask for it.
Some other, far from exhaustive, tips:
Brainstorm with the agent before planning (if you don't already). Give it an overview and then have it ask you clarifying questions and offer suggestions.
When creating plans have the agent include proposed code in the plan. That way you can easily review it.
After implementing a plan, have an agent (either a sub-agent or a new session) check the implementation for issues. This shouldn't replace human code review, unless it's a low-stakes script, but agents are really good at finding at least some of their own mistakes from a fresh context window.
Experiment liberally. Ask the agent lots of questions. It's surprising how good they are at coming up with solutions (usually).
If you have more specific questions, like about a specific workflow or a barrier you run into, feel free to ask!
This is super helpful, thanks! I have a pretty basic question about, e.g. agents.md. Let's say I create such a file in my working directory. How do I ensure the agents I am interacting with read that file? I imagine this varies depending on what agent or platform I am using. But I feel like I have had some inconsistency here, so I've ended up starting most of my prompts with "read agents.md" with the path to the file. Is this the right idea?
Name the file AGENTS.md (or CLAUDE.md if you're using Anthropic models), it will be read automatically. If you want to be sure it's getting read, put a conspicuous instruction in it... "Confirm that you've read these instructions by saying I'm a Pony"
Edit: I haven't actually tried the above suggestion, it might not actually work given that agents.md is read before the normal loop starts. Alternatively just ask the agent a question about something present in agents.md. The file gets read reliably so once you're satisfied it read it you can trust it will keep happening.
Thank you! I just got a claude code subscription (trying it for a month to start, I still have a few months left with my Cursor free trial so I can do some direct comparisons). Excited to try it out and see if I can make it work for me.
See my edit above and good luck! What kinds of things are you using it for?
I am a research scientist (cosmology) so I mainly code either for data analysis, or less often, to (essentially) solve various coupled systems of equations. I also use packages written by other cosmologists, which are not always documented or structured extremely well.
So far, using Cursor (and hopefully now claude) has been really helpful for me in moving from exploratory jupyter notebooks for a new topic to packaged classes that I can more easily reuse in between analyses. I essentially get all the building blocks of whatever computation it is I need to do working in a notebook, and then switch over to the agent to help me package it up in something I can either run on a cluster or use in other places. It's also been helpful on occasions where I have to modify some existing programs in languages I am not as strong in like Fortran. I could do it, push comes to shove, but often times I am really just interested in getting it done quickly and seeing what the final result is. In those cases, it usually helps me write way better code than I would on my own. I ultimately am self taught when it comes to programming, so having anything that helps me write more organized and structured code is a win, and often puts me ahead of a lot of other (small) scientific packages that are out there. I'm also interested, of course, in seeing what else I can do with it!
That's awesome. I think accelerating research by removing some of the friction from the coding (and data analysis) part of the process is one of the most exciting applications for LLMs. And you're in an ideal position to keep agents in line if you're already comfortable working in multiple languages.
I read this post, then subsequently read this other post about how AI models are financially unsustainable. Essentially, AI models are much cheaper to use now than they will likely be in the near future (if the companies want a profitable business model, which they currently don't have). Users will get sucked in now while it's accessible and build a reliance on these tools, then have to bear the costs when prices inevitably rise. It will be expensive; AI tools are inherently unaffordable for the masses.
So while I can understand the argument that they are useful tools to some now, you may not be able to actually afford them in the future. This is what I imagine will burst the AI bubble. Will folks be prepared for that?
That other post is a very passionate rant that makes a lot of valid points but also exaggerates and twists the realities liberally, as rants tend to do.
The current prices are definitely unsustainable, no doubt of that, but it's possible that they won't be in the future. The technology is far from settled. Hardware advances could bring prices down. Changes in training, changes in architecture. The push to bring inference prices down is just starting to heat up.
And then of course there are open weights models, which keep getting better. In those cases the price you pay, versus what inference costs, is transparent. That's already sustainable.
In the near term, unless the investment dries up (bubble pop, global recession, totally possible), prices are unlikely to change significantly.
Long term yeah, there could be a rug pull. No one knows, anyone who claims to know is financially or emotionally motivated. It's all pretty unprecedented, everyone's just guessing.
Given that DeepSeek has shown that you can use the outputs of an expensive model to train your own model and achieve enormous gains for a fraction of the cost, I can’t imagine a world where anyone could increase the cost of their offering and not be immediately undercut. I don’t know what the solution long term will be, but unless companies find a way to prevent that from happening, either they keep burning money to remain “winning” in the arms race, or they stop being able to compete against the other firms who still have cash to keep going.
Regardless, I’m fascinated to see how things will play out. Maybe it will be like the space race of the last century — in order to hit the next milestone, you have to spend unsustainable amounts of money, but there won’t be any long term industry for a few decades until general technology/engineering catches up, and you’re no longer a (pardon the pun) moonshot pushing the very frontier of human ability, and instead just riding a bit ahead of the wave.
LLM effectiveness varies so much based on model and usage that I'm not surprised there's a lot of doubters. I'm not reaching for the stars, so I notice an improvement easier. I don't set an agent loose to develop at a high speed that I can't review. I treat my agent like it's a pair programmer.
It writes in a logical, commit-sized chunk ("implement a new endpoint that takes these parameters", "call the database and get these results from a query built on those parameters", "add a Prometheus counter for when we run into this error"). I review the commit and then assign the next task. In between assigning the tasks, I re-review requirements, test code, reply to messages in Slack, whatever. This frees me up mentally so much and is exactly the same development style I had before but now I don't need to deal with the minutiae around coding. I don't need to be checking the standard naming of our Prometheus counters; it just reads from the repo. The agent compiles, fixes any errors, and runs tests, so I don't need to do that either until the very end.
In this way, since it's the same development style, I can confidently say that it's sped me up. I can also say I understand every single line of code that I eventually merge. I'm sure that I'll eventually lean into more hands-off usage of LLMs as they get better and as I learn how to utilize them properly, but I can confidently state that most programmers can do what I'm doing since it's the same paradigm we already had (pair programming) except that it costs magnitudes less.
Exactly right, also add harness and scaffolding to the list.
Your workflow sounds pragmatic. I have a similar philosophy: it's an assistant rather than a wholesale replacement.
I am part of the camp that uses agentic coding, but I keep an eye on it and give it very granular tasks, as if talking to a junior dev, and checking how things are being done before just writing the code.
Actually writing the code is often the easiest part, but depending on your level of laziness, it might take a while to get started once you've already solved the problem. I don't trust the agents to write tests and validate stuff (others do); they're way too eager to say that stuff is working, even when it's not.
That's why I don't think we're anywhere near "developers are not needed anymore" territory.
I have a pretty straightforward test saved in gist at this point:
https://gist.github.com/Eji1700/c96da3622e793deab9b7168988de6f90
Goal was to get it to use F# and Falco.Datastar SSE to create snake with no coder-written JS.
I've done a ton of iterations with various products (not Claude) and so far this is the farthest I've seen. This "works" in that it correctly gets the app running and uses SSE. It does NOT, however, have any input. Simple enough addition, right?
I have yet to see a model handle it without diving into writing its own JS or hallucinating tremendously. I'm pretty sure one could figure out what to do from the output it gives, but it's very, very wrong so far.
That's an interesting test, I really want to see what Opus 4.6 would come up with. Do you have access to it?
If it's paid, no not at the moment.
I picked this test for some personal reasons to start (just wanted to mess around with a different way of doing frontend that might be more performant and kept me out of JS, and hey rather than bash my head against the new corners lets have the AI help).
However, I think it's a VERY good test because it takes an extremely well-established problem (snake, a well-defined game with input) and applies a new library on top of it (Datastar, which is then wrapped in Falco).
Coders do this shit ALL THE TIME. "Oh hey new library X is doing cool thing Y, lets fork our working app and mock it up inside and see how this works.", but so far, it has failed SPECTACULARLY at it.
It seems to think that there are points in the Falco.Datastar library that don't exist, but even when trying to drag it through its mistakes it can't ever seem to get it all together. Most frequently adding input not only doesn't work, but often BREAKS the entire application either stopping it from loading the board at all or forcing you to manually refresh to see any updates.
I get that the number of people fucking around with Falco.Datastar is possibly in the single digits, but that's where most new stuff starts, and while I don't think this is going to be something that takes the world by storm, I do wonder how many other techs are going to hit this same wall as new devs throw them into the AI, get garbage out, and then just shelve them.
It might become that all new libraries come with an "ai integration" spec/model or something that the AI is really really good at looking at and using when it tries to adapt, but that's trying to be optimistic right now.
I have been speculating we'll see a stagnation of new libraries as people depend more and more upon AI to write code. I expect there will be a death spiral:
It's also possible LLMs get so good at reading code from new libraries that they can work with them even without any supporting training data.
If the devs include decent documentation, that would already be a significant step away from this possibility, since that documentation could be used for context. Plus the usual reasons you write documentation for, like, humans.
Claude Code gives subscribers free one week guest passes to hand out, I could send you one?
I kind of want to see if I can get your test to work in either Opus or GPT 5.4 myself. It would be a legitimate test because I know next to nothing about F#
If you want to try yourself feel free, but my next week or so is brutally swamped so i'd hate to waste such a code when I don't really know when my next chance to really kick the tires will be. I believe the gist is public?
Edit-
It's set to secret, which I thought meant "people can see it if they hit the link", so let me know if that doesn't work.
On tooling, you'll need the .NET 10 SDK. To quickly configure the environment:
Make a new console app (this makes the folder as well):
dotnet new console -lang f# -n Snake
Go into the directory and add the packages:
dotnet add package Falco
dotnet add package Falco.Datastar
Finally, copy-paste the code from the gist into Program.fs and run:
dotnet run
It should be on localhost:5000 with a nice link. Hell, if you really want to kick the tires, the starting prompt was along the lines of-
I just knew this was going result in me installing the .NET SDK. Thanks for the details, I'll give it a shot.
Ok finished...
Posted the code to get a timestamp, now editing to add some notes:
Just noticed that the code block below has broken formatting and it looks like it ends early because of multiple double quotes: alternative
Code
```
open System
open Falco
open Falco.Routing
open Falco.Markup
open Falco.Datastar
open Microsoft.AspNetCore.Builder

type Pos = int * int
type SnakeState =
{ Snake : Pos list
Dir : Pos
Food : Pos
Width : int
Height : int
GameOver : bool }
type DirSignal = { dir : string }
let dirFromString (s: string) =
match s with
| "up" -> Some (0, -1)
| "down" -> Some (0, 1)
| "left" -> Some (-1, 0)
| "right" -> Some (1, 0)
| _ -> None
let isOpposite (dx1, dy1) (dx2, dy2) =
dx1 + dx2 = 0 && dy1 + dy2 = 0
let applyDirection current requested =
if isOpposite current requested then current
else requested
let nextPos (w, h) ((x, y), (dx, dy)) =
((x + dx + w) % w, (y + dy + h) % h)
let randomFood (rnd: Random) (w, h) (snake: Pos list) =
let rec loop () =
let p = (rnd.Next(0, w), rnd.Next(0, h))
if List.contains p snake then loop () else p
loop ()
let step (rnd: Random) (state: SnakeState) =
let head = List.head state.Snake
let newHead = nextPos (state.Width, state.Height) (head, state.Dir)
if List.contains newHead state.Snake then
{ state with GameOver = true }
else
let ateFood = newHead = state.Food
let newSnake =
if ateFood then newHead :: state.Snake
else newHead :: (state.Snake |> List.take (state.Snake.Length - 1))
let newFood =
if ateFood then randomFood rnd (state.Width, state.Height) newSnake
else state.Food
{ state with Snake = newSnake; Food = newFood }
let rnd = Random()
let newGame () =
{ Snake = [ (5,5); (4,5); (3,5) ]
Dir = (1, 0)
Food = randomFood rnd (20, 15) [ (5,5); (4,5); (3,5) ]
Width = 20
Height = 15
GameOver = false }
let gameState = ref (newGame ())
let renderCell (snakeSet: Set<Pos>) (food: Pos) (x, y) =
let cls =
if Set.contains (x, y) snakeSet then "cell snake"
elif (x, y) = food then "cell food"
else "cell"
Elem.div [ Attr.class' cls ] []
let renderBoard (state: SnakeState) =
let snakeSet = Set.ofList state.Snake
Elem.div [ Attr.id "board"; Attr.class' "board" ] [
for y in 0 .. state.Height - 1 ->
Elem.div [ Attr.class' "row" ] [
for x in 0 .. state.Width - 1 ->
renderCell snakeSet state.Food (x, y)
]
]
let renderGameOver () =
    // (Markup strings reconstructed: the forum swallowed the raw HTML tags.)
    Elem.div [ Attr.id "board"; Attr.class' "board gameover"; Ds.onClick (Ds.get "/restart") ] [
        Elem.div [ Attr.class' "gameover-msg" ] [
            Text.raw "<h2>Game Over</h2>"
            Text.raw "<p>Click to restart</p>"
        ]
    ]
let handleIndex : HttpHandler =
let css = """<style>
body {
font-family: system-ui, sans-serif;
background: #111; color: #eee;
display: flex; flex-direction: column;
align-items: center; justify-content: flex-start;
height: 100vh; margin: 0; padding-top: 2rem;
}
h1 { margin-bottom: 1rem; }
.board { display: inline-block; background: #222; padding: 4px; }
.row { display: flex; }
.cell {
width: 16px; height: 16px;
box-sizing: border-box;
border: 1px solid #333;
background: #111;
}
.cell.snake { background: #4ade80; }
.cell.food { background: #f97316; }
.gameover {
display: flex; align-items: center; justify-content: center;
min-height: 240px; cursor: pointer;
}
.gameover-msg { text-align: center; }
.gameover-msg h2 { color: #f87171; margin: 0 0 0.5rem; }
.gameover-msg p { color: #888; margin: 0; }
    </style>"""
    // NOTE: the remainder of this handler was mangled when the post was
    // formatted (the triple-quoted CSS string broke the code block). Minimal
    // reconstruction: serve the page shell. The Datastar attribute that polls
    // /tick on an interval did not survive and is omitted here.
    Response.ofHtml (
        Elem.html [] [
            Elem.head [] [ Text.raw css ]
            Elem.body [] [
                Elem.h1 [] [ Text.raw "Snake" ]
                renderBoard gameState.Value
            ]
        ])
let handleTick : HttpHandler =
    fun ctx -> task {
        let state = gameState.Value
        // NOTE: the rest of this handler was lost to the post's formatting;
        // reconstructed minimally to mirror handleRestart below.
        if not state.GameOver then
            gameState.Value <- step rnd state
        let board =
            if gameState.Value.GameOver then renderGameOver ()
            else renderBoard gameState.Value
        do! Response.sseStartResponse ctx
        do! Response.sseHtmlElements ctx board
    }
let handleRestart : HttpHandler =
fun ctx -> task {
gameState.Value <- newGame ()
let board = renderBoard gameState.Value
do! Response.sseStartResponse ctx
do! Response.sseHtmlElements ctx board
}
let wapp = WebApplication.Create()
let endpoints : HttpEndpoint list =
[ get "/" handleIndex
get "/tick" handleTick
get "/restart" handleRestart ]
wapp
.UseRouting()
.UseFalco(endpoints)
.Run()
Edit-
okay yeah that very much works. The wall thing was just because of the initial model that I think copilot came up with.
I'll have to do some more testing with claude when I get a chance. I'd be curious to see the feedback claude gave you on some of the decisions but obviously it worked.