The defects are finite, and we are entering a world where we can finally find them all.
It's unlikely we'll get to "zero bugs" any time soon, that's pretty hyperbolic. Security vulnerabilities being a subset of "bugs".
We're already at a place, though, where agent automation can find more bugs, faster, and for less money, than a team of humans can. Humans can still find things the agents miss, but Mythos is clearly finding more things that humans have missed than the reverse.
The thing that's interesting to me is that we're watching agent bug testing become pretty much mandatory for large software companies. If there's any chance that an AI agent can find security bugs that humans would miss, then they have no choice but to use agents, because malicious actors will otherwise use them to find the same bugs.
I think the post is right to imply that this shifts the advantage towards creators/defenders. Using agents for vulnerability review is (relatively) cheap and accessible and you have the advantage of free access to the source code. Meanwhile on the black hat side you need expertise, funding and expensive black market tools that AI agents could in theory make partially obsolete. And if black hats want to use frontier models to find vulnerabilities, they need to figure out how to defeat ever improving guardrails against malicious use. They can't switch to different models without guardrails because those models aren't good enough (so far) to be useful for security research against code secured by frontier models.
The arms race will continue, probably forever, but AI agents are good news for the white hat side.
Yes and no. AI agents can also create bugs 100x faster than a human.
One thing I've noticed in my vibecoding experiments is that iteration often leaves a lot of cruft behind. The kind of stuff that will probably pass tests, but can also provide more opportunities for things to go wrong.
Agreed on AI leaving a lot of cruft behind in rapid iteration, it’s definitely something to watch out for; but just like AI pen testing it’s something AI can fix as well.
My coding iterations actually look like this: regular cycles of auditing for maintainability, security, and performance, usually all separately.
It's unlikely we'll get to "zero bugs" any time soon, that's pretty hyperbolic. Security vulnerabilities being a subset of "bugs".
(agreed with you on all points; just wanted to say that Mozilla is going all-in on AI just as the bubble is popping -- which is par for course for that brain dead organization -- hence the glazing of AI in this post. Probably)
Mozilla is also going all-in on LOCAL AI: https://blog.mozilla.ai/ They have multiple pretty cool projects that have first-level support for local LLMs: https://github.com/mozilla-ai
Yeah ... I think I'll believe it more when I see it. I was excited for the built-in feature for asking an LLM questions about the current page. I can't say whether it works well, though, because I needed an about:config change to point it at my local inference server, and in the end it wound up being a glorified floating tab. I'll take a look at the links later in case they've done more compelling work, though; thanks.
edit: had a look, and it seems mostly out of touch, lacking product vision, or trend following. May I ask if anything stood out to you as particularly valuable? It kinda looks like NPO slop to me: stuff that would make donors feel excited, but that serves little to no practical purpose (eg "llamafile helps anyone run an LLM, with no experience!" is not a product, and could be solved technically with fewer tradeoffs and better visibility via other technical approaches).
Not everything has to be a "product" that makes a profit, and nothing there is so super-unique that it couldn't be done better outside their repos. Very few things in the world are.
llamafile has existed for a LONG time (3+ years) and I haven't heard of a replacement yet.
Gotcha. And yeah, no worries -- obviously I'm not critiquing you, it's just a pain to see people's donations blown on half-baked ideas seemingly formed by some senior engineer glancing through the HN front page. The "product" comment from me was reflecting how I feel that, when one is spending someone else's money, it becomes more important to have clear and useful objectives for doing so.
It's unlikely we'll get to "zero bugs" any time soon
Yeah, Firefox is one of those applications that is so complex that I have a hard time believing we will even get close soon. Even more so if we extend the concept to other things, like incomplete implementations causing privacy issues (the tl;dr there is that private browsing and Firefox containers are not entirely sandboxed as far as extensions go). This is just an issue I am familiar with that has been sitting for 8 years; there are so many more like it, many of which will cause other behavior elsewhere once "fixed". So even if they somehow managed to get their hands on a magic LLM that doesn't hallucinate and can handle massive context windows without issues, it would be a long time before they even got close to finding them all, let alone fixing them.
Considering that such an LLM is just a pipe dream, finding all the bugs in a reliable way is also a pipe dream.
To be clear, I do think LLMs can already be used to find bugs. But not in a way that is going to magically fix security across the board.
Uhhhhhuh.
Obviously, any form of testing that can help find bugs and vulnerabilities is good.
Obviously, any form of testing that can do that CAN FIND BUGS AND VULNERABILITIES.
I do actually see AI helping in some areas to make this stuff better in the long run, but at the same time 0-days are vastly more likely to get worse, not better. Real solutions to this kind of problem will always stem from type-safe or even mathematically safe code from the ground up.
AI is serving as an odd abstraction layer that's just willing to do the known tedious work of "hey asshole don't use this thing that's been in the library as a footgun since 1980", but there's so so much out there to get wrong still, and that's BEFORE they start finding common failure points for AI after more and more adoption.
AI is serving as an odd abstraction layer that's just willing to do the known tedious work of "hey asshole don't use this thing that's been in the library as a footgun since 1980" [...]
Just focusing in on this statement. The model and agent harness at play -- Mythos, I guess -- also found a buffer overflow in FreeBSD's NFS server that it could leverage into an RCE vuln. That codebase has been pored over by security-focused software developers and static analysis suite companies for ages, so finding a novel buffer overflow attack seems significant. So it's looking at more than the transitive closure of dependency versions.
Real solutions to this kind of problem will always stem from type-safe or even mathematically safe code from the ground up.
The OWASP top ten are mostly not addressable using type-level logic or proof assistants, unless you really bend over backwards.
Just focusing in on this statement. The model and agent harness at play -- Mythos, I guess -- also found a buffer overflow in FreeBSD's NFS server that it could leverage into an RCE vuln. That codebase has been pored over by security-focused software developers and static analysis suite companies for ages, so finding a novel buffer overflow attack seems significant. So it's looking at more than the transitive closure of dependency versions.
I'm aware of this, but also aware of claims that lighter models found similar vulnerabilities. I suspect we'll absolutely find a bunch of stuff that's been overlooked, but it's also not an end to 0-days in the slightest, as I suspect we'll also find a lot more. Hell, AI model injection attacks are already "a thing" people are exploring, where you can, with the right setup and effort, get a major model to take in bad data and regurgitate it later (hot dog eater coder for example, but there's a lot more malicious stuff being tested as well).
The OWASP top ten are mostly not addressable using type-level logic or proof assistants, unless you really bend over backwards.
Not sure I want to dive into this as it'd get mostly off topic, but I'd either:
A. Disagree,
with the caveat that what I was getting at is some very unreachable "ideal" that requires basically starting from scratch at C and above with a mathematically sound, proof-based language. Idris, for example, is a vague proof of concept of this kind of work, but also totally not ready, and again it would be sorta like "fix IP" as far as levels of dedication and upheaval go.
or
B. Agree but with the caveat they're also not solvable by AI.
For example, a bad security config is something that is either type-solvable by not modeling error states (role X can NEVER have permission Y at the type/mathematical level), catchable in a pipeline review (who the hell exposed our API key, or gave so-and-so these rights?), or about as likely to be passed over by AI. So in regards to the topic at hand I don't really think AI changes much EXCEPT that it's willing to do the tedious work.
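A minimal sketch of that "role X can NEVER have permission Y" idea, in Python (all names here are illustrative, not from any real system): permissions flow only from one immutable table, so an out-of-policy grant isn't merely invalid, it can't be constructed at all.

```python
from enum import Enum, auto
from types import MappingProxyType


class Role(Enum):
    GUEST = auto()
    AUDITOR = auto()
    ADMIN = auto()


class Permission(Enum):
    READ = auto()
    WRITE = auto()
    ROTATE_KEYS = auto()


# Single source of truth: an immutable role-to-permission table.
# There is no setter anywhere, so "GUEST with ROTATE_KEYS" is an
# unrepresentable state rather than a state you must remember to check for.
_GRANTS = MappingProxyType({
    Role.GUEST: frozenset({Permission.READ}),
    Role.AUDITOR: frozenset({Permission.READ}),
    Role.ADMIN: frozenset({Permission.READ, Permission.WRITE,
                           Permission.ROTATE_KEYS}),
})


def can(role: Role, perm: Permission) -> bool:
    """Answer authorization questions from the fixed table only."""
    return perm in _GRANTS[role]
```

Languages with richer type systems can push this check all the way to compile time; in Python it is a convention enforced by immutability, which is the "tedious work" part a reviewer (human or AI) would otherwise have to do by hand.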
And I will say that a large majority of the OWASP 10 are, imo, solvable by just not modeling bad states and following best practices, even now, with any type-driven design. It's just something no one wants to do with every pile of code out there, especially when it comes with other downsides (both real and "but I like my language" type).
Edit-
Ran out of time getting my thoughts together so just pruning the last bit.
I'm aware of this, but also aware of claims that lighter models found similar vulnerabilities.
The rebuttal I made was to the claim that the models are only finding trivial bugs (eg using ancient software dependencies). That other models + harnesses are as capable as, or more capable than, whatever shiny keys Anthropic has jangled today is neither here nor there, but I do happen to agree with you.
[general claims about OWASP and software security]
To be clear: I considered this an area where I needed to lead consistently by example when I was employed full time in software dev. Because you expressed a desire not to discuss this, I'm not going to continue down this conversational rabbit hole, but suffice it to say that I disagree with your position and could write a dissertation on it.
Maybe? Disclaimer that I’m neither a career software engineer nor have any experience in security research.
Obviously, any form of testing that can do that CAN FIND BUGS AND VULNERABILITIES.
Yes, but this advantage is asymmetric. In general, bad actors searching for security gaps can only look in public products and code. Unless there’s a bad actor within an organization (which is an entirely separate issue), defenders get the chance to run new security tests before publicizing new code.
This blog post claims that getting a software product to have 0 vulnerabilities may be possible with AI. I'm heavily skeptical of that claim (even with the "may" conditional), but if it is true, those building a product would get the chance to fix vulnerabilities before attackers get the chance to exploit them.
Real solutions to this kind of problem will always stem from type-safe or even mathematically safe code from the ground up.
I totally agree. While I can see how AI can find security bugs at scale, I'm not at all convinced that it will find even close to all the possible vulnerabilities.
Yes, but this advantage is asymmetric. In general, bad actors searching for security gaps can only look in public products and code.
Less than you think.
No, you usually can't compromise a product before it goes live, so you get that chance to pass over your code, check for issues, and update.
The problem is that the point of a 0-day is that it's something you've known about for... 0 days. So how are you checking for it? Well-known footguns AI will probably be good at pruning out, but so are linters, deployment pipelines, and a million other automated things we already have.
As for post-live: yes, something that's open source could be more vulnerable, but even if it's not, it's all about surface area, what you can access, and what you know they must be using. You know they're using common APIs, libraries, protocols, connections, and languages to build, so you study those tools in your environment to then attack their live one.
I don't need your source code if I found an exploit in, ohh... I dunno, one of the world's most-used logging libraries.
It seems pretty unlikely that a simple bug like log4shell would happen again in a world with widespread AI-driven security reviews? People often skip reviewing their open source dependencies themselves (it's too much work) but I'd guess that a lot of companies will start to do their own automated checking of them? These common dependencies are likely to get a lot more review.
It seems pretty unlikely that a simple bug like log4shell would happen again in a world with widespread AI-driven security reviews?
It should have been unlikely with basic automated security reviews that existed long before then. You mention that people skip it because "it's too much work" but forget that now it's going to cost money to do these reviews, and a substantial amount at that.
We're still at the "every company loses billions" phase of AI. I sincerely doubt, even if AI turns out to be nearly as good as expected, that we won't see the human and business issue of "nah, that'll take too long/cost too much, push it" again and again.
Edit -
And to be clear, in the case of log4j, to my understanding it's EXACTLY the kind of likely candidate for a 0-day, because it was a tiny trusted library that was put in everything and not heavily scrutinized. This is a core problem with the nature of code and including other people's work, because again, now you're trusting that their AI review caught everything (in a hypothetical).
I think it only becomes unlikely for widely used libraries. Sure, most people using a dependency might be careless, but there will be a few companies that are more cautious and decide to spend the money on automated security reviews. For a company, this is considerably cheaper than paying staff to do it.
Nowadays people are talking about using dependency cooldowns rather than upgrading dependencies right away. That’s waiting for someone else to hopefully discover any bugs.
Rather than leaving that to chance, it might be nice if there were a way to publish information about what security reviews have been done on a library version. Then you could wait until there have been multiple outside reviews vouching for a new version.
It would be a more useful signal than the number of downloads or the number of stars on GitHub.
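A rough sketch of what such an adoption gate could look like (the record fields and thresholds here are invented for illustration, not from any existing tool): accept a new library version only once it has both aged past a cooldown and accumulated enough independent review attestations, rather than going by downloads or stars.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ReleaseInfo:
    """Hypothetical published metadata for one library version."""
    version: str
    released: date
    # Independent organizations that published a security review of this version.
    reviewers: set[str] = field(default_factory=set)


def safe_to_adopt(info: ReleaseInfo, today: date,
                  cooldown_days: int = 14, min_reviews: int = 2) -> bool:
    """Gate an upgrade on both release age and outside reviews."""
    aged = today - info.released >= timedelta(days=cooldown_days)
    vouched = len(info.reviewers) >= min_reviews
    return aged and vouched
```

The cooldown alone only waits for someone else to stumble on a bug; requiring attestations on top of it turns the waiting period into an explicit "has anyone actually looked?" check.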
Any idea where I could find more details about these vulnerabilities? I'm looking at this page, from the 150 release notes, and I see 41 vulnerabilities, only 3 mentioning Claude.
Edit: Now I see post says "includes fixes for 271 vulnerabilities identified during this initial evaluation". Maybe they identified 271 but only fixed 3?