I think it’s notable that there was a preexisting comprehensive test suite. Both that and the coverage tests served as extremely effective feedback loops for the coding agents. This doesn’t really change my priors about the strengths of LLM-assisted coding, though it’s still quite cool to read about how much the agent is capable of iterating wholly on its own. I suspect we’ll see improvements in the generation of test suites, since it appears that correct and intelligent test suites combined with a strong architecture are sufficient for agents to arrive at efficient solutions.
I’m curious to hear whether people think this will make software engineering less interesting. Back when I was studying computer science, the tests were always my least favorite part; they were such a slog. Developing interesting and efficient algorithms is always what appealed to me, but it seems like that part may get outsourced to the machine now. It does not affect me, since coding is just a hobby or something I use to supplement my actual work, and I do not need to use LLMs if I don’t want to. I still wonder whether I’d even like what software engineering turns into post-LLMs.
I've never particularly enjoyed writing tests either, but I wonder if that's partly because they feel like duplicated effort? You want to be doing the thing, writing the code to make it happen - that's the interesting problem solving part - but then you have to invest the same amount of time and brainpower all over again writing tests without the satisfaction of it actually progressing the task you're trying to achieve. I can imagine writing tests potentially feeling like a much more satisfying problem solving task when they are the way that you make direct progress towards the end goal?
This is probably the first example I've seen where I can envisage LLM generated code being used for more serious work, actually. Everything up to now has seemed a bit like the bad old days of Frontpage and Dreamweaver in early web development: they would indeed make something website-shaped exist, even in the hands of someone who otherwise wouldn't know how, but that was pretty much all you could say for them - and the same goes for most of what I've seen about vibe coding so far.
Rigorously test-driven LLM code generation seems a lot more interesting as a concept! Rather than creating something superficially to spec in the hands of someone who doesn't have the understanding to see the problems in the code underneath, we could plausibly end up thinking of code generators a little more like how we think of compilers now. Defining the desired end state in a complete and elegant way becomes the challenge, and reaching that point is abstracted away by one more step, in the same way that most developers don't worry too much about the underlying assembly that their for loop will turn into now.
I don't want to sound too optimistic - commercial incentives tend to end with sloppy, half baked, superficially functional outcomes that just barely hold together until next quarter - but this is the first time I've seen a somewhat plausible path to skilled and effective use of large scale code generation even by people who do care about quality and craftsmanship. We're going to need some much more robust approaches to defining tests for race conditions, deadlocks, and memory leaks if that takes off, though...
Writing tests is something that LLMs are a good fit for as well. Tests tend to be easy to verify by reading them - the hard part is the boilerplate in setting up the scenario.
In general, the further up the ladder you go, the less actual coding you do as a software engineer. The joke is that you transition from writing code in VSCode to Google Docs. So in that sense that doesn’t really change. Writing unit tests in this case is more akin to finding the product requirements, which is definitely a major part of your job, if not most of it.
It reminds me a bit of linear programming, where you define the constraints and a convex optimizer finds the solution.
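To make the analogy concrete, here is a minimal, hypothetical sketch (using scipy, and nothing to do with the JustHTML project itself): you state the objective and the constraints, and the solver does the searching.

```python
# Hypothetical illustration of "define constraints, let the optimizer find the solution".
# Maximize x + 2y subject to x + y <= 4, x <= 3, x >= 0, y >= 0.
from scipy.optimize import linprog

result = linprog(
    c=[-1, -2],              # linprog minimizes, so negate the objective to maximize x + 2y
    A_ub=[[1, 1], [1, 0]],   # inequality constraints: x + y <= 4, x <= 3
    b_ub=[4, 3],
    bounds=[(0, None), (0, None)],
)
print(result.x, -result.fun)  # optimal point and the maximized objective value
```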
I think it’s notable that there was a preexisting comprehensive test suite. Both that and the coverage tests served as extremely effective feedback loops for the coding agents.
I agree, this is the key part. With the right boundaries and feedback loops, frontier models can get there.
Effective is a relative term here, though; even without looking I feel like I have a pretty good idea what the codebase in question looks like under the hood. It's going to have wildly different coding conventions mixed together with no apparent logic; moving from one section of the code to another will be an adventure. Naming conventions will be all over the place. It won't be even a little bit DRY. There will be efficiency issues, unneeded defense in depth in some places, exploitable holes in others. There will be broken logic that was patched over to pass tests rather than refactored to make actual sense, and so on. In some places it will be beautifully elegant. In short, it will look like something built by an army of at once precocious and idiotic juniors with implacable persistence. Most of it will look nothing like production grade code.
Don't get me wrong, it's a really interesting proof of concept, it just doesn't imply all the things it seems to imply about what LLMs are currently capable of. They can do amazing things, but building nontrivial applications that are secure, efficient and maintainable without significant oversight and human feedback is not among them.
Developing interesting and efficient algorithms is always what appealed to me, but it seems like that part may get outsourced to the machine now
I suppose it depends on how you define interesting. If part of the definition is "hasn't been done lots of times before", humans are still required. The time could of course come when that's not true, but LLM technology as it exists now doesn't have a path to that reality.
They can generalize across languages. So if a pattern is well established in one language, but the problem hasn't frequently been solved in another, and there's enough of that language in the training, then the LLM can port the pattern (so to speak). But if it's something that isn't well represented in the training data, you're going to get deal breaking hallucinations. You might be able to brute force it through iteration but not in a way that's better than doing it yourself.
To answer your question though: No I don't think it makes software development less interesting. I think the opposite is true, LLMs make offloading the uninteresting parts possible. That's both interesting and exciting.
It changes my priors a bit on what a capable engineer can accomplish using these tools. Are we going to see more libraries built this way? How about entire compilers?
In some sense this is a port from a Rust library to a pure Python library. Maybe we will see more libraries ported that way?
I really like writing tests when doing TDD. I see a TDD-like cycle as one possibility for whatever we're going to call AI coding.
Define what you want, then write the code to make it pass, then start again by refining or writing new tests. It gives you many small hits of problem solving satisfaction rather than one big one. I suppose AI would write the test as well.
What I'm not sure about is what AI coding is going to settle on being. I had thought it would be a kind of autocomplete, but that was really just the first iteration. I used Claude to create some Home Assistant scripts and was really happy with the outcome, a script for something I'd been spinning my wheels on for a year or two (look up the high temp, low temp, and rainfall for a specific day). The cycle really was similar to TDD. I told it what I wanted, it created what it thought I meant, then I corrected misunderstandings, added something I'd forgotten to specify, or moved on to new parts of what I wanted.
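As a tiny illustration of that red/green cycle (my own sketch, not anything from the projects discussed here): write the test first and watch it fail, write just enough code to make it pass, then refine or add the next test.

```python
# Step 1: write a test for the behaviour you want (it fails at first, since the
# function does not exist yet).
def test_collapse_whitespace():
    assert collapse_whitespace("a   b\n c") == "a b c"

# Step 2: write just enough code to make it pass.
def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())

# Step 3: refine, or add the next test (e.g. empty input), and repeat.
def test_collapse_whitespace_empty():
    assert collapse_whitespace("") == ""
```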
I think with projects like this, we can confidently throw away the "vibe" part of vibe coding. This is clearly not about vibes, not about whether something feels and looks right. This is a skilled engineer who wants to test an idea and then puts in the work to clean up after the machine. I don't see how that's functionally different from using a debugger (albeit conceptually the other way around, I grant you).
Whatever you call it, it was almost entirely written by the LLM. This part seems sort of vibe-ish:
After writing the parser, I still don't know HTML5 properly. The agent wrote it for me. I guided it when it came to API design and corrected bad decisions at the high level, but it did ALL of the gruntwork and wrote all of the code.
I handled all git commits myself, reviewing code as it went in. I didn't understand all the algorithmic choices, but I understood when it didn't do the right thing.
Eh, jurisdictional. I'd argue that without the programmer's guidance, the LLM would have done nothing ever, but that's a semantic argument that, while I'd love to have it, won't get us anywhere.
Either way, good code that works is still good code that works even if it was in an LLM's token buffer first.
I wasn't sure why I didn't like the article and this is why. I guess the term "vibe" is just becoming a stand-in for LLM-assisted. This is someone who used an LLM to engineer a product. Is "vibe engineering" an oxymoron?
I agree with the findings, though. LLMs are very good for writing code that you can write and verify yourself, and can test afterwards.
Willison coined "vibe engineering" to contrast it with "vibe coding." I don't know if it's going to stick. Maybe a better term for it will win?
This looks like one of those cases where LLMs actually live up to the hype. In my experience they never come close to a tremendous speed-up; only rarely will they make something 2x faster to accomplish. When they aren’t a net negative, they usually seem to give more like a 10-50% improvement.
What the businessmen don’t understand is that you need some rare gifts to make this happen:
Edit: And a good engineer at the helm
A bit of an exaggerated take I saw, but… https://mastodon.social/@glyph/115732687795909357
it's truly amazing what LLMs can achieve. we now know it's possible to produce an html5 parsing library with nothing but the full source code of an existing html5 parsing library, all the source code of all other open source libraries ever, a meticulously maintained and extremely comprehensive test suite written by somebody else, 5 different models, a megawatt-hour of energy, a swimming pool full of water, and a month of spare time of an extremely senior engineer
I very much doubt it took that much energy or water. People overestimate LLM energy usage. Otherwise, seems fair.
Still, if you were to port an html library from one language to another, how would you do it?
I do agree with you the water and energy are probably very exaggerated.
As I told my friend that sent me that mastodon originally, I do think it’s cool that there’s now a new HTML5 validator in another programming language (though I didn’t realize the test lib used was already in Python?).
LLMs do have their use case, and this does seem like one of them that isn’t “all bad” or even “mostly bad”.
The library's README implies it's the only implementation in Python that has 100% compliance with the HTML5 spec. I don't think the use of LLMs is a coincidence.
The start of the blog post:
I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It’s a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming.
Here is Emil’s blog post about how it was written and Willison's summary:
[…] A few highlights:
He hooked in the 9,200 test html5lib-tests conformance suite almost from the start. There’s no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use.
He picked the core API design himself—a TagHandler base class with handle_start() etc. methods—and told the model to implement that.
He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers.
He threw the original code away and started from scratch as a rough port of Servo’s excellent html5ever Rust library.
He built a custom profiler and new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing Pure Python libraries.
He used coverage to identify and remove unnecessary code.
He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them.
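To give a sense of what hooking in that conformance suite can look like in practice, here is a rough, hypothetical harness over the html5lib-tests tree-construction files (each .dat file interleaves #data, #errors and #document sections). parse_to_test_format is a made-up stand-in for whatever parses the HTML and serializes the result into the suite's expected "| ..." tree format; it is not part of any of the libraries discussed.

```python
# Hypothetical harness for html5lib-tests tree-construction cases; not the
# author's actual setup. `parse_to_test_format` is a stand-in for a function
# that parses HTML and serializes the tree into the suite's "| ..." format.
from pathlib import Path

def iter_cases(dat_file: Path):
    """Yield (input_html, expected_tree) pairs from one .dat file. Each case
    starts at a #data line; real files contain more section types (#errors,
    #document-fragment, ...) than this sketch cares about."""
    sections, current = {}, None
    for line in dat_file.read_text(encoding="utf-8").splitlines() + ["#data"]:
        if line.startswith("#"):
            if line == "#data" and "#data" in sections:
                yield "\n".join(sections["#data"]), "\n".join(sections.get("#document", []))
                sections = {}
            current = line
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

def run_suite(parse_to_test_format, test_dir="html5lib-tests/tree-construction"):
    passed = failed = 0
    for dat_file in sorted(Path(test_dir).glob("*.dat")):
        for html, expected in iter_cases(dat_file):
            if parse_to_test_format(html).strip() == expected.strip():
                passed += 1
            else:
                failed += 1
    print(f"{passed} passed, {failed} failed")
```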
Just for fun, Simon Willison ported it to JavaScript. It took less than five hours, while he was doing other things.
The parser's output seems to be plain JavaScript objects. The AST could probably be serialized to JSON? Not sure how useful that is. Someone who uses JavaScript more could perhaps talk about whether that's a good API and compare it to alternatives. For myself, I'd want TypeScript declarations, but they probably wouldn't be hard to add.
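If the nodes really are plain data, serializing to JSON is mostly just a tree walk. A hypothetical sketch (in Python, with made-up tag/attrs/children field names, since I haven't checked the port's actual node shape):

```python
import json

# Hypothetical sketch: assumes each element node exposes tag, attrs and children,
# and that text nodes are plain strings. The real node shape in either
# implementation may well differ.
def node_to_dict(node):
    if isinstance(node, str):  # text node
        return node
    return {
        "tag": node.tag,
        "attrs": dict(node.attrs or {}),
        "children": [node_to_dict(child) for child in node.children],
    }

def ast_to_json(root) -> str:
    return json.dumps(node_to_dict(root), indent=2)
```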
Willison writes about the implications:
We haven’t even begun to unpack the etiquette and ethics around this style of development. Is it responsible and appropriate to churn out a direct port of a library like this in a few hours while watching a movie? What would it take for code built like this to be trusted in production?
I’ll end with some open questions:
Does this library represent a legal violation of copyright of either the Rust library or the Python one?
Even if this is legal, is it ethical to build a library in this way?
Does this format of development hurt the open source ecosystem?
Can I even assert copyright over this, given how much of the work was produced by the LLM?
Is it responsible to publish software libraries built in this way?
How much better would this library be if an expert team hand crafted it over the course of several months?
The US copyright office holds that human authorship is required for copyright. More discussion here. Maybe we'll end up with a situation where an open source license can't be enforced on code that's generated this way?
I wrote elsewhere about my skepticism of the process, but I didn't feel up to reviewing the Python code in much detail. But JS is my jam!
The code is kind of all over the place. The biggest flaw, like you said, is a lack of type annotations. Simon said he wanted the code to run as an ES6 module when pulled into the browser, which is fair enough, but JSDoc is a good alternative in that case. Of course, there are no doc comments anywhere anyway, so guessing the public API, particularly in terms of what options are available on which methods and what they do, seems to require reading the source code.
The problem with the missing types is that they make the rest of the code very difficult to follow. For example, there's a function attrListToDict that converts an array of attributes into an object map. This function is written very defensively — it allows you to input essentially any JS value and it'll attempt to do the right thing. But it's not clear how defensive it really needs to be — if you look at the call sites, it's an internally-used function that, as far as I can tell, is always passed validated data. Most of that defensiveness is pointless.
Even beyond the lack of types, there are lots of other oddities scattered around the codebase. Even in the attrListToDict function, the implementation is already kind of weird. Why is it using Object.prototype.hasOwnProperty? Almost certainly because Python requires things like key in dict or hasattr(obj, key) to avoid throwing exceptions. In JS, in cases like this, you can just do obj.key ?? default. Elsewhere, to pass optional parameters to functions and classes, sometimes you're expected to pass object literals (fairly standard), sometimes different positional parameters (generally discouraged), and other times instances of specialised options classes. (This feels weird enough that I wonder if it's a quirk descended from porting the original codebase to Python, and then porting the Python one to JS.)
Then there's the project structure. Let's go back to attrListToDict and the file it's in — html5lib_serializer.js. This is a test utility — it formats the parsed tree so that it can be used in tests that use the html5lib test data. That's useful, but it's not really part of the source. It's also distinct from the toTestFormat function that exists in the serialize.js file, that does a similar thing but into an internal test format. Both of these are implemented within the src folder, so it's not clear that they're not part of the public API. (toTestFormat is even exported in index.js, which I don't really understand the logic behind — the comments imply that it is not meant to be part of the public API.)
The test files are chaotic, each with their own style of writing tests, and none of them using anything useful like NodeJS's built-in test runner. The different output formats (including the test format) are written separately, and it's not clear how one could write their own output format if one wished. There is a chaotic mix of snake_case and camelCase, plus a bit of hyphen-case in some (but not all) file names. (The Node class explicitly has snake_case aliases of all the camelCase names, but no other class does.) There is a streaming parser that does not interact in any way with the rest of the library except for using the same underlying tokenizer, which means it's duplicating a whole bunch of the work from the main parser, and will be a pain to work with if you need streaming and want to generate some subset of nodes at the same time. There is an autogenerated entities-data.js file but no script to generate it. The whole thing has a low-level chaos to it, as if it's a five-year-old project maintained by a team of ten developers, as opposed to a couple of dozen relatively simple files worth of code.
Like, this isn't the worst code I've ever seen, but it already feels like a legacy project, particularly with the disparate styles and how difficult it is to get much of an overview of what's going on. But I think the key part of that is that Simon doesn't really understand how the code works. In this case, he's been quite explicit about this being a one-shot prompt that ran for a few hours, where he then published the result. That's fair enough. But he also asks the question of whether it's responsible to publish this sort of stuff, presumably to NPM or similar, and I think the answer to that has to be no.
From the link:
“assistive uses that enhance human expression” [do not affect the ability for the human to copyright something]
Based on my reading of this, the author of the library did influence the output enough to be able to copyright it. But my views definitely don’t line up with the law when it comes to intellectual property. It would be very interesting if any use of AI poisoned the ability to copyright, thus making everything public domain from creation. Then we could rebuild IP laws from scratch in a sensible way.
Incidentally I seem to recall you had been working on some sort of unit testing framework. (Forgive me but I don't remember the details.) Just out of curiosity does that have any relevance to what's being discussed here with the vibe engineering? Would it be useful to hook up something like that to an LLM to have it generate tests as a complementary tool? Or would an LLM instead replace a framework like that?
The library I wrote is called repeat-test and it's for property testing, which is quite like fuzz testing. Yes, could be useful, but in this case, they already have plenty of tests. I think even fuzz tests?
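For what it's worth, the same idea in Python looks something like this: a property-based test (here with the hypothesis library) that throws arbitrary, mostly malformed text at a parser and only asserts invariants such as "never raises". parse is a hypothetical stand-in, not JustHTML's actual API.

```python
# Property-style fuzzing sketch using the hypothesis library. `parse` is a
# stand-in for whatever parser you want to harden; the only property asserted
# is that it never crashes and always returns something.
from hypothesis import given, settings, strategies as st

@settings(max_examples=500)
@given(st.text())                 # arbitrary unicode, mostly invalid HTML
def test_parser_never_crashes(raw):
    tree = parse(raw)             # hypothetical entry point
    assert tree is not None
```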
Someone else ported it to OCaml.
Modules and types are a defining feature of OCaml, so I pointed the agent at the WHATWG standard and asked it to introduce explanations directly into the interface files themselves. This is quite different from an informal specification; the guidelines can be browsed directly and navigated around via the odoc HTML output.
...
I'm also extremely uncertain about [ever] releasing this library to the central opam repository, especially as there are excellent HTML5 parsers already available. I haven't checked if those pass the HTML5 test suite, because this is wandering into the agents vs humans territory that I ruled out in my groundrules. Whether or not this agentic code is better is a moot point if releasing it drives away the human maintainers who are the source of creativity in the code!
Still, I wonder if this sort of thing could be an interesting exercise for college students taking a computer science class?
Someone else ported it to Swift.