I think it’s notable that there was a preexisting comprehensive test suite. Both that and the coverage tests served as extremely effective feedback loops for the coding agents. This doesn’t really change my priors about the strengths of LLM-assisted coding, though it’s still quite cool to read about how much the agent is capable of iterating wholly on its own. I suspect we’ll see improvements in the generation of test suites, since it appears that correct and intelligent test suites combined with a strong architecture are sufficient for agents to arrive at efficient solutions.
I’m curious to hear whether people think this will make software engineering less interesting. Back when I was studying computer science, the tests were always my least favorite part; they were such a slog. Developing interesting and efficient algorithms is what has always appealed to me, but it seems like that part may get outsourced to the machine now. It doesn’t affect me much, since coding is just a hobby or something I use to supplement my actual work, and I don’t need to use LLMs if I don’t want to. Still, I wonder whether I’d like what software engineering turns into post-LLMs.
I've never particularly enjoyed writing tests either, but I wonder if that's partly because they feel like duplicated effort? You want to be doing the thing, writing the code to make it happen - that's the interesting problem-solving part - but then you have to invest the same amount of time and brainpower all over again writing tests, without the satisfaction of it actually progressing the task you're trying to achieve. I can imagine writing tests feeling like a much more satisfying problem-solving exercise when they are the way you make direct progress towards the end goal.
This is probably the first example I've seen where I can envisage LLM-generated code being used for more serious work, actually. Everything up to now has seemed a bit like the bad old days of FrontPage and Dreamweaver in early web development: they would indeed make something website-shaped exist, even in the hands of someone who otherwise wouldn't know how, but that was pretty much all you could say for them - and the same goes for most of what I've seen about vibe coding so far.
Rigorously test-driven LLM code generation seems a lot more interesting as a concept! Rather than creating something superficially to spec in the hands of someone who doesn't have the understanding to see the problems in the code underneath, we could plausibly end up thinking of code generators a little more like how we think of compilers now. Defining the desired end state in a complete and elegant way becomes the challenge, and reaching that point is abstracted away by one more step, in the same way that most developers don't worry too much about the underlying assembly that their for loop will turn into now.
I don't want to sound too optimistic - commercial incentives tend to end with sloppy, half-baked, superficially functional outcomes that just barely hold together until next quarter - but this is the first time I've seen a somewhat plausible path to skilled and effective use of large-scale code generation even by people who do care about quality and craftsmanship. We're going to need some much more robust approaches to defining tests for race conditions, deadlocks, and memory leaks if that takes off, though...
Writing tests is something that LLMs are a good fit for as well. Tests tend to be easy to verify by reading them - the hard part is the boilerplate in setting up the scenario.
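To make that concrete, here's a hedged illustration (the Cart class and test names are hypothetical, not from any particular project): the assertion at the end is trivial to verify by reading, while most of the typing goes into setting up the scenario - exactly the part that's easy to hand off to an LLM.

    import pytest

    # Hypothetical toy class, just to give the test something to exercise.
    class Cart:
        def __init__(self):
            self.items = []

        def add(self, name, price, qty=1):
            self.items.append((name, price, qty))

        def total(self):
            return sum(price * qty for _, price, qty in self.items)

    # The boilerplate: building a realistic scenario takes most of the lines...
    @pytest.fixture
    def cart_with_two_items():
        cart = Cart()
        cart.add("keyboard", 50.0)
        cart.add("mouse", 25.0, qty=2)
        return cart

    # ...while the actual check is a single line that's easy to verify by eye.
    def test_total_sums_price_times_quantity(cart_with_two_items):
        assert cart_with_two_items.total() == 100.0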
In general, the further up the ladder you go, the less actual coding you do as a software engineer. The joke is that you transition from writing code in VSCode to Google Docs. So in that sense this doesn’t really change anything. Writing unit tests in this case is more akin to finding the product requirements, which is definitely a major part of your job, if not most of it.
It reminds me a bit of linear programming, where you define the constraints and a convex optimizer finds the solution.
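For what it's worth, here's a minimal sketch of that analogy using scipy.optimize.linprog (the objective and constraints are made up for illustration): you only state what you want and what's allowed, and the solver does the searching.

    from scipy.optimize import linprog

    # Maximize x + 2y, i.e. minimize -x - 2y (linprog minimizes by convention).
    c = [-1, -2]

    # Constraints: x + y <= 4 and x <= 3; x, y >= 0 are linprog's default bounds.
    A_ub = [[1, 1],
            [1, 0]]
    b_ub = [4, 3]

    result = linprog(c, A_ub=A_ub, b_ub=b_ub)
    print(result.x)     # the optimum, here [0., 4.]
    print(-result.fun)  # the objective value at the optimum, here 8.0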
It changes my priors a bit on what a capable engineer can accomplish using these tools. Are we going to see more libraries built this way? How about entire compilers?
In some sense this is a port from a Rust library to a pure Python library. Maybe we will see more libraries ported that way?
I think with projects like this, we can confidently throw away the "vibe" part of vibe coding. This is clearly not about vibes, not about whether something feels and looks right; this is a skilled engineer who wants to test an idea and then puts in the work to clean up after the machine. I don't see how that's functionally different from using a debugger (albeit conceptually the other way around, I grant you).
Whatever you call it, it was almost entirely written by the LLM. This part seems sort of vibe-ish:
After writing the parser, I still don't know HTML5 properly. The agent wrote it for me. I guided it when it came to API design and corrected bad decisions at the high level, but it did ALL of the gruntwork and wrote all of the code.
I handled all git commits myself, reviewing code as it went in. I didn't understand all the algorithmic choices, but I understood when it didn't do the right thing.
Eh, jurisdictional. I'd argue that without the programmer's guidance, the LLM would never have done anything, but that's a semantic argument that, while I'd love to have it, won't get us anywhere.
Either way, good code that works is still good code that works even if it was in an LLM's token buffer first.
The start of the blog post:
I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It’s a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming.
Here is Emil’s blog post about how it was written and Willison's summary:
[…] A few highlights:
He hooked in the 9,200 test html5lib-tests conformance suite almost from the start. There’s no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use.
He picked the core API design himself—a TagHandler base class with handle_start() etc. methods—and told the model to implement that.
He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers.
He threw the original code away and started from scratch as a rough port of Servo’s excellent html5ever Rust library.
He built a custom profiler and new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing Pure Python libraries.
He used coverage to identify and remove unnecessary code.
He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them.
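On the TagHandler design mentioned in those highlights: purely as a hedged sketch (this is not JustHTML's actual code - only the TagHandler name and handle_start() come from the post; the signatures and the other methods are guesses for illustration), the shape of such a handler-based API might look something like this.

    class TagHandler:
        """Hypothetical base class: the tree builder dispatches each token to a handler."""

        def handle_start(self, tag, attrs):
            # Called for a start tag such as <div class="x">.
            raise NotImplementedError

        def handle_end(self, tag):
            # Assumed counterpart to handle_start(); not confirmed by the post.
            raise NotImplementedError

        def handle_text(self, text):
            # Assumed text-node hook; not confirmed by the post.
            raise NotImplementedError


    class OutlineHandler(TagHandler):
        """Toy subclass that records which heading tags appear in a document."""

        def __init__(self):
            self.headings = []

        def handle_start(self, tag, attrs):
            if tag in ("h1", "h2", "h3"):
                self.headings.append(tag)

        def handle_end(self, tag):
            pass

        def handle_text(self, text):
            pass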