Tildes

Showing only topics with the tag "language models.large".

Vibe coding is just the return of Excel/Access, with more danger
~comp
- programming
Ask

I probably triggered some PTSD right there.

Was just in a meeting at work, where we listed off everything that makes software development hard and slow. An exercise for the thread would be to replicate that list. It turned out that Claude helps with like 1/5th or less of it... especially in a collaborative environment.

So, the situation we're now encountering is that random business areas can vibe code out something, tell nobody, throw it in AWS, have it become a critical part of a business process that fails when they quit, and nobody even has access to look at what was made.

It gives me comfort that in about 5 years there will be a new surge in demand for programmers to rein in all the rogue applications that need to be shut down because of the immense risk to the continued operation of a company, from data leaks to broken payroll.

It'll be Y2K all over again.

39 comments

vord

4 days ago

44 votes
AI: Where in the loop should humans go?
~comp
- programming
Article 2978 words, published Mar 7 2025
3 comments

ferd.ca

4 days ago

16 votes
Static analysis, dynamic analysis, and stochastic analysis
~comp
- programming
Ask

For a long time programmers have had two types of program verification tools, static analysis (like a compiler's checks) and dynamic analysis (running a test suite). I find myself using LLMs to analyze newly written code more and more. Even when they spit out a lot of false positives, I still find them to be a massive help. My workflow is something like this:
1. Commit my changes
2. Ask Claude Opus "Find problems with my latest commit"
3. Look through its list and skip over false positives.
4. Fix the true positives.
5. git add -A && git commit --amend --no-edit
6. Clear Claude's context
7. Back to step 2.
I repeat this loop until all of the issues Claude raises are dismissible. I know there are a lot of startups building a SaaS for things like this (CodeRabbit is one I've seen before; I didn't like it too much), but I feel just doing the above procedure is plenty good enough and catches a lot of issues that could take more time to uncover if raised by manual testing.
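The loop above can be sketched in Python, purely as an illustration. Only the git command in step 5 comes from the post. The `find_problems`, `is_false_positive`, and `fix` callables are hypothetical stand-ins for asking the model, triaging by hand, and editing the code, and `max_rounds` is an assumed safety limit that the original workflow does not mention.

```python
import subprocess
from typing import Callable, List


def amend_commit() -> None:
    """Step 5: stage everything and fold it into the latest commit."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "--amend", "--no-edit"], check=True)


def review_loop(
    find_problems: Callable[[], List[str]],
    is_false_positive: Callable[[str], bool],
    fix: Callable[[str], None],
    amend: Callable[[], None] = amend_commit,
    max_rounds: int = 10,
) -> int:
    """Repeat review rounds until every raised issue is dismissible.

    Returns the number of review rounds that were run.
    """
    for round_number in range(1, max_rounds + 1):
        # Steps 2 and 6: each call is assumed to start from a cleared context.
        findings = find_problems()
        # Step 3: skip over the false positives.
        real_issues = [f for f in findings if not is_false_positive(f)]
        if not real_issues:
            return round_number
        # Step 4: fix the true positives, then step 5: amend the commit.
        for issue in real_issues:
            fix(issue)
        amend()
    raise RuntimeError(f"still finding real issues after {max_rounds} rounds")
```

Passing `amend` as a parameter keeps the sketch testable without touching a real repository; in practice the triage and fix steps stay manual.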

It's also been productive to ask for any problems in an entire repo. It will of course never be able to perform a completely thorough review of even a modestly sized application, but highlighting any problem at all is still useful.

Someone recently mentioned to me that they use vision-capable LLMs to perform "aesthetic tests" in their CI. The model takes screenshots of each page before and after a code change and throws an error if it thinks something is wrong.
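A rough sketch of what such an "aesthetic test" step might look like, assuming screenshots are already captured as bytes per page. The `looks_wrong` callable is a hypothetical placeholder for the vision-capable model call; no particular API is implied, and treating a missing page as a failure is an assumption of this sketch.

```python
from typing import Callable, Dict, List


def aesthetic_check(
    before: Dict[str, bytes],
    after: Dict[str, bytes],
    looks_wrong: Callable[[bytes, bytes], bool],
) -> List[str]:
    """Return the pages whose screenshot after the change gets flagged."""
    flagged = []
    for page, old_shot in before.items():
        new_shot = after.get(page)
        if new_shot is None:
            # Assumption: a page that no longer renders counts as a failure.
            flagged.append(page)
            continue
        if old_shot == new_shot:
            # Byte-identical screenshots need no model call.
            continue
        if looks_wrong(old_shot, new_shot):
            flagged.append(page)
    return flagged


def ci_step(before, after, looks_wrong) -> None:
    """Fail the build if any page is flagged."""
    flagged = aesthetic_check(before, after, looks_wrong)
    if flagged:
        raise SystemExit(f"aesthetic check failed for: {', '.join(flagged)}")
```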
10 comments

teaearlgraycold

6 days ago

9 votes
That one study that proves developers using AI are deluded

~tech Ask


I've found myself replying to different people about the early 2025 METR study kind of often. So I thought I'd try posting a top level thread. Consider it an unsolicited public service announcement.

You might be familiar with the study because it has been showing up alongside discussions about AI and coding for about a year. It found that LLMs actually decreased developer productivity and so people love to use it to suggest that the whole AI coding thing is really a big lie and the people who think it makes them more productive are hallucinating.

Here's the thing about that study... No one seems to have even glanced at it!

First, it's from early 2025, and they used Claude Sonnet 3.5 or 3.7. Those models are in no way comparable to current gen coding agents. The commonly cited inflection point didn't happen until later in 2025 with, depending on who you ask, Sonnet 4.5 or Opus 4.5.

The study consisted of 16 people! If those 16 were even vaguely representative of the developer population at the time, most of them wouldn't have had significant experience with LLMs for coding.

These are not tools that just work out of the box, especially back then. It takes time and experimentation, or instruction, to use them well.

It was cool that they did the study, trying to understand LLMs was a good idea. But it's not what anyone would consider a representative, or even well thought out, study. 16 people!

But wait! They did a follow up study later in 2025.

This time with about 60 people and newer models and tools. In that study they found the opposite effect, AI tools sped developers up (which is a shock to no one who has used these tools long enough to get a feel for them). They also mentioned:

However the true speedup could be much higher among the developers and tasks which are selected out of the experiment.

In addition they had some, kind of entertaining, issues:

Due to the severity of these selection effects, we are working on changes to the design of our study.

Back to the drawing board, because:

Recruitment and retention of developers has become more difficult. An increased share of developers say they would not want to do 50% of their work without AI, even though our study pays them $50/hour to work on tasks of their own choosing. Our study is thus systematically missing developers who have the most optimistic expectations about AI’s value.

And...

Developers have become more selective in which tasks they submit. When surveyed, 30% to 50% of developers told us that they were choosing not to submit some tasks because they did not want to do them without AI. This implies we are systematically missing tasks which have high expected uplift from AI.

And so...

Together, these effects make it likely that our estimate reported above is a lower-bound on the true productivity effects of AI on these developers.

[...]

Some developers were less likely to complete tasks that they submitted if they were assigned to the AI-disallowed condition. One developer did not complete any of the tasks that were assigned to the AI-disallowed condition.

[...]

Altogether, these issues make it challenging to interpret our central estimate, and we believe it is likely a bad proxy for the real productivity impact of AI tools on these developers.

So to summarize, the new study showed a productivity increase and they estimate it's larger than the ~20% increase the study found. Cheers to them for being honest about the issues they encountered. For my part I know for sure that the increase is significantly more than 20%. The caveat, though, is that this is only true after you've had some experience with the tools.
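The selection effect in the quotes above can be illustrated with a toy calculation. All the numbers here are made up for illustration and have nothing to do with the study's data; the only point is that if the tasks with the highest expected uplift never get submitted, the measured average comes out lower than the true one.

```python
from statistics import mean
from typing import List


def observed_tasks(uplifts_percent: List[int], withhold_at: int) -> List[int]:
    """Keep only tasks whose uplift is below the point where developers
    (by assumption) refuse to do them without AI."""
    return [u for u in uplifts_percent if u < withhold_at]


# Hypothetical per-task uplift from AI, in percent.
all_tasks = [0, 10, 20, 30, 100, 200]

# Assumption: tasks with 100% uplift or more are never submitted.
submitted = observed_tasks(all_tasks, withhold_at=100)

true_average = mean(all_tasks)      # average over every task
measured_average = mean(submitted)  # average over what the study sees

print(true_average, measured_average)
```

With these invented numbers the measured average is well below the true one, which is the sense in which the study's estimate is described as a lower bound.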

The truth is that we don't need a study for this, any experienced engineer can readily see it for themselves and you can find them talking about it pretty much everywhere. It would be interesting, though, to see a well designed study that attempted to quantify how big the average productivity increase actually is.

For that the participants using AI would need to be experienced with it and allowed to use their existing setups.

I want to add that this is not an attempt to evangelize for AI. I find the tools useful but I'm not selling anything. I'm interested in them and I stay up to date on the conversations surrounding them and the underlying technology. I use them frequently both for my own projects and to help less technical people improve their business productivity.

Whether AI agents are a good thing or not, from a larger perspective, is a very different, and complicated, conversation. The important thing is that utility and impact are two different conversations. There isn't a debate anymore about utility.

I know this probably won't stop people from continuing to derail conversations with the claim that developers are wrong about utility, but I had to try. It's just hard to let it pass by when someone claims the sky is green.

I understand that AI makes people angry and I think they have good reason to be angry. There are a lot of aspects of the AI revolution that I'm not thrilled about. The hype foremost, the FOMO as part of the hype, the potential for increased wealth consolidation really sucks, though I lay that at the feet of systems that existed before LLMs came along.

It's messy, but let's consider giving the benefit of the doubt to professionals who say a tool works instead of claiming they're wrong. Let them enjoy it. We can still be angry at AI at the same time.

61 comments

post_below

March 19

82 votes
The center has a bias

~tech Article 1531 words

91 comments

pocoo.org

April 13

35 votes
Prototyping with LLMs

~tech Article 362 words, published Apr 6 2026

7 comments

jim-nielsen.com

April 10

19 votes
Anthropic announces deal with Google, Broadcom, says revenue has tripled
~finance
- business
Article 441 words
27 comments

Quartz

April 9

31 votes
AI Coding agents are the opposite of what I want
~comp
- programming
Ask

I've been thinking a lot about LLM assisted development, and in particular why I keep dropping the available tools after a few attempts at using them.

I realized recently that it's taking away the part of software development I enjoy: the creative problem solving that comes with writing code. What's left is code review tasks, testing, security checks, etc. Important tasks, but they all primarily involve heavy concentration, and much less creativity.

Why aren't agents focused on handling the mundane tasks instead? Tell me if I've just introduced a security vulnerability or a runtime bug. Generate realistic test data and give me info on what the likely output would be. Tell me that the algorithm I just wrote is O(n^2).

Those tasks are so much more applicable to matching against existing data, something LLMs should be extremely good at, rather than trying to get them to write something novel, which so far they've been mostly bad at, at least in my experience.

47 comments

karsaroth

April 6

46 votes
Project Glasswing: securing critical software for the AI era
~tech
- security.cyber
Article 1053 words
15 comments

anthropic.com

April 7

25 votes
Claude Mythos preview
~tech
- security.cyber
Article 13,495 words
4 comments

anthropic.com

April 7

25 votes
Harm reduction centered on AI use

~tech Video 1:30:12, published Apr 2 2026

2 comments

YouTube: Dr. Fatima

April 7

9 votes
Gemma needs help

~comp Article 1894 words, published Mar 10 2026

19 comments

lesswrong.com

March 25

31 votes
Designing an agent reading test

~comp Article 3887 words

1 comment

dacharycarey.com

April 6

10 votes
Here’s what the world had to say about the AI economy

~tech Article 674 words

18 comments

windfalltrust.org

April 3

18 votes
Google releases Gemma 4

~comp Article 640 words

18 comments

blog.google

April 2

28 votes
Anticipating a world where LLM use is widespread

~tech Article 1637 words

8 comments

azhdarchid.com

April 1

16 votes
Claude Code's source code leaked

~tech Article 629 words

9 comments

theregister.com

March 31

50 votes
The cognitive dark forest

~tech Article 1036 words

20 comments

ryelang.org

March 29

31 votes
A.T.L.A.S: outperform Claude Sonnet with a 14B local model and RTX 5060 Ti

~tech Link

10 comments

GitHub: itigges22

March 28

43 votes
Sycophantic AI decreases prosocial intentions and promotes dependence
~tech
- google
- social media
Link
9 comments

science.org

March 28

31 votes
Google’s TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x
~tech
- google
Article 345 words
28 comments

Ars Technica

March 27

44 votes
cq: Stack Overflow for agents

~tech Article 1334 words

4 comments

mozilla.ai

March 24

15 votes
Anthropic takes legal action against OpenCode

~tech Link

4 comments

GitHub: anomalyco

March 21

19 votes
I hope you don't use generative AI - an essay about my experience offering an open-source tool

~tech Article 2386 words

95 comments

rmv.fyi

March 18

71 votes
Executing programs inside transformers with exponentially faster inference

~comp Link

6 comments

percepta.ai

March 13

14 votes
Can coding agents relicense open source through a “clean room” implementation of code?
~comp
- open source
- programming
Article 1004 words
39 comments

simonwillison.net

March 6

51 votes
The future of AI

~tech Article 2989 words, published Feb 26 2026

11 comments

lucijagregov.com

March 8

15 votes
GNU and the AI reimplementations

~tech Link

3 comments

antirez.com

March 8

23 votes
A "Real BMO" local AI Agent with a Raspberry Pi and Ollama

~tech Video 23:16, published Feb 2 2026

5 comments

YouTube: brenpoly

February 6

17 votes
Electricity use of AI coding agents
~enviro
- energy
Article 2768 words, published Jan 20 2026
23 comments

simonpcouch.com

March 4

29 votes
Is it worthwhile to run local LLMs for coding today?
~comp
- programming
Ask (advice)

I've made the decision to purchase a new M5 Macbook Air because of the memorypocalypse. My current M1 model is already upgraded to the same amount of memory and storage as the current base model, and I'm wondering if it's worth spending the extra 2-4 hundred dollars on memory upgrades today.

My current computer is more than good enough for today but I figure I should probably future proof just in case. I was thinking the 16GB would be enough, but I also know that I'm kind of falling behind by not embracing AI coding agents. According to my research the maximum 32GB is recommended for most coding-relevant models - almost as a minimum.
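For a rough sanity check on those memory figures, the size of a model's weights is approximately the parameter count times the bits stored per weight. The sketch below is back-of-envelope only: it counts weights alone, the parameter counts and quantization levels are example values, and `reserved_gb` (memory left for the OS, other apps, and the model's context) is an assumed number, not a measured one.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int,
         ram_gb: float, reserved_gb: float = 6.0) -> bool:
    """True if the weights fit in RAM after reserving some for everything else.

    reserved_gb is an assumption for illustration, not a recommendation.
    """
    return weights_gb(params_billion, bits_per_weight) <= ram_gb - reserved_gb


print(weights_gb(14, 4))   # 14B parameters at 4 bits per weight
print(weights_gb(14, 16))  # the same model at 16 bits per weight
print(fits(32, 4, ram_gb=16), fits(32, 4, ram_gb=32))
```

Under these assumptions a 14B model quantized to 4 bits fits comfortably in 16GB, while a 32B model at the same quantization only fits in 32GB.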

I work in education so coding is not actually much of a need, and obviously there are cloud providers I could use if I end up needing them in the future. I also have less than a teacher's salary because I work part time, which is the greatest reason why I'm sticking with the 16GB base for the moment, but other than that I also don't run many memory-intensive programs. But I thought I would get some recommendations before they start shipping.

I'd also be interested in people's opinions on trading in my old one, since it'll only get me ~$275 back. I'm considering reneging on that part and keeping it around to act as a web server, or giving it to my husband, who has a computer that still runs Windows 7 and barely uses it.

40 comments

Akir

March 6

35 votes
Hacker used Anthropic's Claude chatbot to attack multiple government agencies in Mexico

~comp Article 455 words, published Feb 25 2026

8 comments

Engadget

March 6

21 votes
Eval awareness in Claude Opus 4.6’s BrowseComp performance

~tech Article 2084 words

2 comments

anthropic.com

March 7

14 votes
An AI agent published a hit piece on me

~tech Article 1489 words

23 comments

theshamblog.com

February 12

49 votes
LLMs can unmask pseudonymous users at scale with surprising accuracy
~tech
Article 344 words
43 comments

Ars Technica

March 3

44 votes
My personal AI assistant project

~tech Ask


Let me start off by saying that I'm exhausted by AI hype. Being interested in LLM agent technology (AI agent hereafter for brevity) means skimming over a lot of hype for one or two useful, semi reality based, bits of information. Maybe the part that I find the most frustrating is how effective the hype is. I don't know if there's ever been a hype cycle like this. Probably a big part of the reason for that is the internet has already proven, within living memory for most people, that technological revolutions really can change everything. Or mess everything up. Either way they generate a lot of economic activity.

So this post is not that. I'm not going to tell you about how AI agents are the second coming of Christ. I'm not selling anything.

Fairly early into learning about AI agents I wanted a way to connect to the agent remotely without hosting it somewhere or exposing ports to the internet. I settled on tailscale and a remote terminal and moved on, I rarely used it. Somehow the tiny friction of "Turn on tailscale, open terminal app, connect, run agent" was enough to make it not feel worth it.

I know I'm far from the only person who had the same "I want it remote" thought, the best evidence: OpenClaw. It's just one of those things that everyone naturally converges on.

If you're not familiar with OpenClaw, the TLDR is: Former founder with more money than he'll ever need vibecodes a bridge between instant messenger apps and LLM APIs. Nothing about it is technically challenging or requires solving any particularly hard problems. It almost immediately becomes the fastest growing GitHub repo of all time and is currently at number 14 for number of stars. It blew up the (tech) internet like very few things ever have. Within months he was hired by OpenAI.

OpenClaw now does more than just connect messaging and agents, but I believe that one piece is the killer feature. My tailscale terminal solution, combined with a scheduled task or a cron job and some context files could already do all of the things that OpenClaw can do, and countless people had already implemented similar solutions. But I think it was the tiny bit of friction OpenClaw removed that was responsible for a lot of its popularity.

I thought that was interesting but I have no interest in the security nightmare that is OpenClaw, or the "sentience" vibe for that matter, so I built my own tool.

Essentially it's just a light secondary harness combined with a bridge between Signal and Claude Code. It does some other things too, things I wished existing harnesses did, some memory and guidelines, automated prompts and reminders to wake the agent up and have it do stuff, some context to give the agent some level of persistence, make it less LLMy, less annoying. None of that is particularly interesting though.

Once I got it working (MVP took less than a day) and started playing with it, the OpenClaw phenomenon made a lot more sense. Somehow having the agent in a chat interface, with almost zero friction (just open the chat and send something) was cooler than it had any reason to be.

I can't explain it any better than that at the moment. Not only was it kinda fun, it lent itself to a whole range of "what ifs". What if it could do X? What if I wrote a tool that gave it Y capability? I've been experiencing that for some time, but somehow agent in your pocket has a different feeling.

Here's an example of a "what if". What if it could do our grocery shopping? I definitely want that. I already had a custom browser tool that I built for agent coding assistance so I was most of the way there. It was just a matter of teaching the agent to log in and navigate a website, something they're already trained to do. Some hand holding, a few helper scripts, and an evening's worth of hours later and I had it working. The agent can respond to a shopping request by building a shopping list based on our most recent orders, presenting it to us for approval/edits in a Signal group chat, doing searches for any additional product requests and adding the finalized order to the cart. It could also check out the order and schedule the delivery time but I'm doing the last 2 clicks manually for the time being. It's an idiot savant, it seems like a bad idea to give it access to my credit card. Maybe eventually.
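The shape of that flow, with a human approval step in the middle and checkout left out on purpose, might look something like the sketch below. This is not the actual tool: `propose_list`, `fill_cart`, and the `approve` and `add_to_cart` callables are hypothetical names, and merging recent orders in first-seen order is just one assumed way of "building a shopping list based on our most recent orders".

```python
from typing import Callable, Iterable, List


def propose_list(recent_orders: List[List[str]],
                 extra_requests: Iterable[str] = ()) -> List[str]:
    """Merge recent orders and extra requests, dropping duplicates
    while keeping first-seen order."""
    seen = set()
    proposed = []
    for order in list(recent_orders) + [list(extra_requests)]:
        for item in order:
            if item not in seen:
                seen.add(item)
                proposed.append(item)
    return proposed


def fill_cart(proposed: List[str],
              approve: Callable[[List[str]], List[str]],
              add_to_cart: Callable[[str], None]) -> List[str]:
    """Get the human-edited list, then add each item to the cart.

    Checkout is deliberately not automated here; a person finishes the order.
    """
    final = approve(proposed)  # e.g. people editing the list in a group chat
    for item in final:
        add_to_cart(item)
    return final
```

Keeping `approve` as the only path from proposal to cart is what keeps people in the loop, and stopping before payment mirrors the "last 2 clicks" staying manual.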

The fact that I can handle shopping with a couple of signal messages feels effortless in a way that handling shopping by connecting to my PC terminal remotely via tailscale terminal wouldn't have. Especially when I can include people in the loop who have no interest in tailscaling anywhere. Everyone can use messaging apps.

I imagine before long solutions like this will be built in, either in the grocery websites and apps, or into the frontier harnesses themselves. There will probably be agents everywhere, for better or worse. Probably I'll wish that the agents would all fuck off. In the meantime it's exciting how easy it is to get these tools to do useful things.

13 comments

post_below

February 23

33 votes
AI’s memorization crisis

~tech Article 2298 words, published Jan 9 2026

19 comments

The Atlantic

February 28

24 votes
Anthropic rejects latest US Pentagon offer: ‘We cannot in good conscience accede to their request’

~tech Article

39 comments

CNN

February 27

61 votes
New accounts on Hacker News ten times more likely to use em-dashes
~tech
- internet
- social media
Article 255 words
44 comments

marginalia.nu

February 25

54 votes
Updating Eagleson's Law in the age of agentic AI

~comp Ask


Eagleson's Law states

"Any code of your own that you haven't looked at for six or more months might as well have been written by someone else."

I keep reading how fewer and fewer of the brightest developers are writing code, instead letting their AI agents do it all. How do they know what's really happening? Does it matter anymore?

Curious to hear this community's thoughts.

9 comments

hamitosis

February 25

11 votes
Ladybird chooses Rust as its successor language to C++, with help from AI
~comp
Article 609 words
18 comments

ladybird.org

February 23

33 votes
The Claude C Compiler: what it reveals about the future of software

~tech Article 3173 words

9 comments

modular.com

February 23

16 votes
Why doesn’t Anthropic use Claude to make a good Claude desktop app?

~tech Article 591 words

41 comments

manualdousuario.net

February 23

27 votes
The AI disruption has arrived, and it sure is fun

~tech Article

52 comments

The New York Times

February 18

29 votes
AI fails at 96% of jobs (new study)

~tech Video 12:49

16 comments

YouTube: ColdFusion

February 13

28 votes
Something big is happening

~tech Article 4865 words, published Feb 9 2026

51 comments

shumer.dev

February 13

33 votes
Building a C compiler with a team of parallel Claudes

~tech Article 2465 words

12 comments

anthropic.com

February 5

20 votes
Is the detachment in the room? - Agents, cruelty, and empathy
~tech
- social media
Article 1999 words
16 comments

hailey.at

February 7

15 votes
Passing question about LLMs and the Tech Singularity

~tech Ask


I am currently reading my way thru Ted Chiang's guest column in the New Yorker, about why the predicted AI/Tech Singularity will probably never happen (https://www.newyorker.com/culture/annals-of-inquiry/why-computers-wont-make-themselves-smarter). ETA: I just noticed that article is almost 5 years old; the piece is still relevant, but worth noting.

Good read. Still reading, but so far, I find I disagree with his explicit arguments, but at the same time, he is also brushing up very closely to my own reasoning for why "it" might never happen. Regardless, it is thought-provoking.

But, I had a passing thought during the reading.

People who actually use LLMs like Claude Code to help write software, and/or, who pay close attention to LLMs' coding capabilities ... has anyone actually started experimenting with asking Claude Code or other LLMs that are designed for programming, to look at their own source code and help to improve it?

In other words, are we (the humans) already starting to use LLMs to improve their code faster than we humans alone could do?

Wouldn't this be the actual start of the predicted "intelligence explosion"?

Edit to add: To clarify, I am not (necessarily) suggesting that LLMs -- this particular round of AI -- will actually advance to become some kind of true supra-human AGI ... I am only suggesting that they may be the first real tool we've built (beyond Moore's Law itself) that might legitimately speed up the rate at which we approach the Singularity (whatever that ends up meaning).

30 comments

Eric_the_Cerise

February 4

19 votes
llOOPy lOOPs
~comp
- programming.object oriented
Article 1431 words, published Feb 3 2026
4 comments

autonoma.ca

February 6

12 votes