31 votes

Project Glasswing: what Mythos showed us

43 comments

  1. [29]
    unkz
    (edited )

    This seems very consistent with what everyone else is saying, and I think it reinforces what Anthropic has been saying about the danger level. The major skill is building the complete working exploits.

    The chief complaint I’ve been seeing from people is that it is largely on par with other frontier models in terms of finding new bugs, and a competent security researcher isn’t going to benefit significantly from Mythos. This misses the point entirely. The danger is any even remotely competent script kiddie can take Mythos and go directly to exploiting live systems, without having to be a competent security researcher.

    The other big danger is from advanced persistent threats, who are targeting a specific site. Now whenever a new patch lands, any applicable exploits are gonna be live in 15 minutes. Let’s say I’ve mapped out the attack surface on my target — I know they run Linux, nginx, wagtail, postfix, for instance.

    Major threats are already monitoring commits to major projects like Linux to reverse engineer exploits, but now this can be done at scale, automatically, for every stupid npm or pip package that is included in the requirements.txt or package.json. You get a bug in some obscure XML library and you cue your exploit harness with the knowledge that a commit just landed in Beautiful Soup and they run Wagtail, and you might get an exploit custom tailored to your victim in 15 minutes.
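The matching step described above is cheap from either side of the fence, which is part of why defenders can run it too. A minimal sketch, assuming a requirements.txt-style file and a hypothetical feed of recently patched package names (`pinned_names`, `affected`, and the feed are illustrative, not any real tool's API):

```python
import re

def _normalize(name: str) -> str:
    # PEP 503 style normalization: lowercase, runs of "-", "_", "." become "-".
    return re.sub(r"[-_.]+", "-", name).lower()

def pinned_names(requirements_text: str) -> set[str]:
    """Return normalized package names from requirements.txt-style text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and options like "-r"
            continue
        # The name is the leading run of characters before any version specifier.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(_normalize(match.group(0)))
    return names

def affected(requirements_text: str, recently_patched: set[str]) -> set[str]:
    """Intersect what a project depends on with what was just patched."""
    return pinned_names(requirements_text) & {_normalize(n) for n in recently_patched}
```

For a file listing `wagtail`, `beautifulsoup4`, and `lxml`, a feed containing `BeautifulSoup4` flags exactly that one dependency. This only handles direct dependencies named in the file; transitive ones would need a resolved lock file.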

    21 votes
    1. [16]
      turnipostrophe

      Perhaps the Mythos AI will change the way we understand software. Right now, we produce computer programs that have many bugs, we assume they have bugs, and we release them anyway. However, perhaps it would be possible to prove that a computer program has no bugs, as in a mathematical proof. Bug repellent. Perhaps this is the incentive to build future programs more cautiously and carefully, to ensure there are no bugs.

      7 votes
      1. [2]
        cutmetal

        What you're describing is called formal verification. It's so onerous to perform that nothing except the most critical software is put through this process.

        I doubt that AI bug finding will result in an increase in formal verification, but maybe! Everything in the world of software development is heavily in flux right now, so the future is very murky.

        But I think it's more likely that, in the future, critical software will add a Mythos-level AI bug checker as a CI step, alongside unit and integration tests and linters. Since the tools are automated, you just use the same tools an attacker would use to uncover the bugs, but before you even ship the buggy code.

        30 votes
        1. Eji1700

          I do like to mention Idris in these conversations as an example of something sane and yet possibly heading in the right direction if we do ever decide we want more robust development.

          https://www.idris-lang.org/pages/example.html

          7 votes
      2. [11]
        teaearlgraycold

        Depends how you define bug. I don’t think it’s possible to do this to an absolute degree of perfection. I’m told it’s not possible to build a mathematical system that is flawless (like how standard math can not answer some questions like 1/0=x). I’ve read a nice blog post about writing code without bugs. They present a simple function to add two numbers together. You think it’s perfect, a function can hardly get more simple. But then consider underflows, overflows. Then consider a whole program, or operating system. You’re doomed.
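To make the add-two-numbers point concrete, here is a minimal sketch (not the blog post's code, which isn't quoted here). Python integers don't overflow, so the 32-bit behavior is simulated; the function names are illustrative. The wrapping version mimics what a language with wrapping 32-bit integers, such as Java's `int`, does silently.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def add_wrapping_i32(a: int, b: int) -> int:
    """Add the way a wrapping 32-bit two's-complement integer would."""
    return (a + b + 2**31) % 2**32 - 2**31

def add_checked_i32(a: int, b: int) -> int:
    """Add, but raise instead of wrapping when the result does not fit."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in 32 bits")
    return result
```

With the wrapping version, adding 1 to the largest 32-bit value yields the smallest one, which is exactly the kind of "perfect" function that surprises you. The checked version trades that surprise for an error the caller has to handle.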

        7 votes
        1. [10]
          skybrian

          Well, floating-point math has a lot of gotchas, but they would usually lead to wrong results rather than a security bug from doing the calculation.

          1 vote
          1. teaearlgraycold

            Unexpected behavior can be used as a starting point for an exploit.

            10 votes
          2. [8]
            unkz

            That’s kind of the interesting thing with Mythos though, right? There are huge classes of bugs that are usually innocuous and so rarely lead to an exploit that they aren’t worth the time for a security researcher to investigate every possible edge case. Mythos doesn’t care, it will happily enumerate every single variant and chase them down to their conclusion.

            4 votes
            1. [7]
              Eji1700

              I mean, it does though. It still costs money. Possibly quite a bit more than getting a bunch of researchers in a room

              3 votes
              1. unkz

                But it doesn't really care. If you go tell your team of security researchers that you want them to go and enumerate every possible edge case of a floating point operation in a piece of code that has no obvious security relevance, they might just refuse. Like, these people have careers and interests that they want to actually improve instead of wasting their time. There's a kind of publish or perish incentive here where they want to efficiently produce exploits. Mythos will just go do it, and it'll keep doing it for as long as you have GPUs.

                5 votes
              2. [5]
                Minori

                Possibly quite a bit more than getting a bunch of researchers in a room

                Mythos is expensive, but I'd be shocked if it's more than contracting a team of skilled researchers.

                1 vote
                1. [4]
                  Eji1700

                  Example case is $100m in tokens.

                  That’s years of skilled researchers.

                  3 votes
                  1. [3]
                    teaearlgraycold

                    That's how much they're gifting to each org. I don't think they've burnt through that yet.

                    4 votes
                    1. Eji1700

                      We don't know, and personally, I find it suspicious that it isn't mentioned. The ceiling is $100m, what's the floor?

                      Further, as with all things AI, how do you know what your potential cost will be? If I've only got 50k in tokens and say "look for X" and it only gets halfway, what do I do? "Oh, there's nothing" or "well, it's probably fine"?

                      There's a similar marketing trick with things like forex markets where they give you 300k or whatever to see how good you are.

                      People will "outsmart" them by saying "well, I'd only spend $3k", think they did well, and then not realize they would've been margin called and closed out on the first swing, because their real account would be maxed out, not sitting on $297k.

                      I see this as very similar. There's no doubt it's interesting, but "we gave a bunch of market leaders $100m worth of compute and they found stuff" is...sort of a non headline? How much would a different model produce with the same prompts and $100m in tokens? How much did anyone testing this spend? How much did they feel comfortable spending? How much did they OVERSPEND after finding one exploit and then searching for another?

                      Given that the comment I was responding to made a comparison to a team of skilled researchers (which, let's say, could cost you upwards of $1m), I think it's fair to ask these kinds of questions when you basically gave them 100x that amount in fun money and said go nuts.

                      8 votes
                    2. vord

                      It's a good point. But I see from my relatively mild usage of Claude Code that I can, with 0 agents, burn through 500k tokens inside an hour.

                      If they're running 50+ agents for hours on end, that's gonna be quite pricey, even if they didn't burn $100m.
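A back-of-envelope version of that estimate, as a sketch. The 500k tokens per hour and the 50 agents come from the comment above; the price per million tokens is a placeholder, since the thread gives no pricing for Mythos, and it assumes every agent burns tokens at the same rate as one interactive session.

```python
def run_cost_usd(agents: int, hours: float,
                 tokens_per_agent_hour: int,
                 usd_per_million_tokens: float) -> float:
    """Rough spend for a fleet of agents all burning tokens at the same rate."""
    total_tokens = agents * hours * tokens_per_agent_hour
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 50 agents for an 8-hour run at 500k tokens/hour each, at a placeholder
# $25 per million tokens (an assumption, not a published price).
estimate = run_cost_usd(agents=50, hours=8,
                        tokens_per_agent_hour=500_000,
                        usd_per_million_tokens=25.0)
```

Under those assumptions a single 8-hour run is 200 million tokens, or about $5,000. Pricey for a hobbyist, but a long way from $100m, so the real number depends heavily on how many runs and how many subagents each hunter spawns.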

                      4 votes
      3. DrStone

        I've never used them, but there are tools like Lean and Rocq (formerly Coq) for verification of software through formal proofs.
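For a sense of what such a proof looks like, here is a toy Lean 4 sketch. The `add` function and the theorem name are made up for illustration, and real verified software involves far larger specifications than this.

```lean
-- A toy "program": addition on natural numbers.
def add (a b : Nat) : Nat := a + b

-- A machine-checked guarantee about it: argument order never matters.
theorem add_is_commutative (a b : Nat) : add a b = add b a := by
  unfold add
  exact Nat.add_comm a b
```

The checker accepts the file only if the proof holds for every possible pair of inputs, which is the difference from testing a handful of cases. Note that `Nat` is unbounded, so this says nothing about fixed-width machine integers.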

        What it really comes down to is finding the appropriate tradeoff between the risk of bugs and the cost of reducing it. The cost in time, money, manpower, and company-wide discipline to produce code that is formally verified to be correct, with defined behavior across all cases, does not make sense for most applications until you get into domains like space exploration and critical medical devices. For everything in between, there's a wide range of methodologies, technologies, tests, probes, monitors, verifications, mitigations, safeguards, and processes to pick and choose a risk-appropriate approach.

        And that doesn't even touch on the potential underlying hardware issues. I don't just mean bad drivers/BIOS/whatever. I'm talking everything from random solar radiation causing bit flips ("single-event upsets") to shouting [at hard drives] in the datacenter affecting performance.

        5 votes
      4. first-must-burn

        There are formal method systems that do this, but 1) they are not very accessible to the "average developer" 2) they don't scale well to larger systems 3) they work well for pure algorithms but typically don't model the I/O integrations well (or rely on assumptions that can't be formally proven).

        I haven't been keeping with all the research in that space, so maybe there's something new I'm not aware of.

        Even if you had these tools at scale, you'd still have to deal with specification bugs. Computers are great at doing what they are told, and humans are pretty bad at knowing exactly what we want to tell them.

        5 votes
    2. [5]
      d32

      The chief complaint I’ve been seeing from people is that it is largely on par with other frontier models in terms of finding new bugs, and a competent security researcher isn’t going to benefit significantly from Mythos.

      You addressed the second part, but ignored the first: the one about overblown Anthropic hype / marketing around Mythos.

      Remember when GPT-2 was too dangerous to release on the world?

      3 votes
      1. [2]
        unkz

        Hmm, didn’t I address that? I thought I was pretty clear about why I think Mythos is a pretty dangerous tool to give the public right now. We are in a phase of massive bug discovery — once the backlog has been cleared and we have integrated LLM based security scanning into most of our critical pipelines it will be less of a concern.

        Also, look how badly LLMs have wrecked the internet with automated disinformation campaigns just like they predicted. OpenAI wasn’t really wrong about that.

        6 votes
        1. d32

          Agreed on both parts, just with a caveat: Mythos is incremental, not unique.

          1 vote
      2. [2]
        balooga

        Remember when GPT-2 was too dangerous to release on the world?

        I think the jury’s still out on whether any LLM is “safe” for the world.

        2 votes
        1. vord

          It's not safe, but not for the reasons they're claiming.

          It's because their marketing hype is helping drive mass psychosis.

    3. [4]
      glesica

      It's an interesting combination, that LLMs seem poised to both write most of the code, and also audit / exploit that code. So one LLM adds a new feature to library X, another LLM pulls the new version into product Y, and yet another LLM is watching and immediately attempts to exploit the change.

      I can't help but wonder if the software ecosystem is about to fundamentally change. Like, where are the costs going to fall? Is it on the library author to run a bug-finding model? Or on whoever pulls it into their project (in which case, that's a lot of duplication of effort)? Can the model writing the code self-audit, eventually, so it just doesn't write exploitable bugs?

      Crazy times.

      2 votes
      1. [3]
        unkz
        (edited )

        I imagine that major players like Microsoft are eventually going to be proactively scanning all public commits to any projects that are in widespread use just for self preservation. Far fewer bugs of known classes will even make it into a release.

        Anthropic will no doubt dedicate $X million per year for open source scanning. You can see how they are positioning themselves to be a de facto requirement for all corporate internal code commits though. I wouldn’t be surprised to find cybercrime insurance policies starting to put Mythos contracts as a requirement for getting said insurance.

        3 votes
        1. glesica

          I wouldn’t be surprised to find cybercrime insurance policies starting to put Mythos contracts as a requirement for getting said insurance.

          That's an interesting point.

          I do wonder if we're heading (back) to a world where developing software is rather expensive.

          2 votes
        2. Omnicrola

          I imagine that major players like Microsoft are eventually going to be proactively scanning all public commits to any projects that are in widespread use just for self preservation.

          Given MS owns GitHub, they will absolutely do this at some point in the next few years. I remember being surprised and then very grateful that GitHub sent me an automated message the minute after I had accidentally committed an AWS API key in a config file. I can see them having a watchdog AI scanning for common types of security vulnerabilities on repos up to a certain size, then offering a more robust version for a monthly fee.
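The pattern-matching kind of secret scanning is simple enough to sketch. This is a common community heuristic, not GitHub's actual implementation: AWS access key IDs are 20 characters, with long-term keys starting `AKIA` and temporary ones `ASIA`. The function name is illustrative, and the sample key is the placeholder AWS uses in its own documentation.

```python
import re

# Heuristic: an AKIA/ASIA prefix followed by 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

def find_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line number, match) for every string that looks like a key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits

config = """[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
region = us-east-1
"""
hits = find_key_ids(config)
```

Here `hits` holds one entry pointing at line 2. A regex like this catches well-known key formats but nothing about logic bugs, which is the gap a model-based watchdog would be filling.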

          2 votes
    4. [3]
      vord

      If script kiddies with Mythos can find exploits in 15 minutes the developer can have Mythos prevent all those exploits in advance.

      If it really is that good, we'll see an initial one-off surge and then back to business as usual inside a year.

      2 votes
      1. tauon

        This assumes that there is sufficient time, money, and willpower on the development side during and after a program has been written.
        And we all know especially in a business context, developers are never cut off from any of those /s

        2 votes
      2. unkz

        If it really is that good, we'll see an initial one-off surge and then back to business as usual inside a year.

        That’s basically what I expect to happen, and what I’m expecting from Anthropic. After all, they don’t have a monopoly on frontier models. Pretty soon OpenAI (and Deepseek, Grok, Meta, etc) will publish similar results and we will get pre-commit scanning on GitHub so bugs almost never land in a release. Then Mythos won’t be much danger and it will be open to the public.

        2 votes
  2. [10]
    balooga

    Great writeup as usual from Cloudflare. I thought the table explaining the vulnerability discovery harness was the most interesting part as it shows how consensus is forming in the industry around the most effective ways to deploy LLMs for real results. Obviously the strength of Mythos is the big headline but I think Cloudflare’s workflow for it is nearly as impressive. Wish I could see the actual prompts they’re using.

    This bit was cool:

    Each task is one attack class paired with a scope hint. Hunters (the agents that actually look for bugs) run concurrently, typically around fifty at once, each fanning out to a handful of exploration subagents. Each hunter has access to tools that compile and run proof-of-concept code in a per-task scratch directory.

    That’s a ton of parallel workers compared to any agentic work I’ve been involved with. I mean, Cloudflare’s a huge enterprise so maybe that’s just a Tuesday for them but I’m still impressed. And having working sandboxes for each one is the icing on the cake. I’m trying to remember, is Glasswing free for preview participants? Because that token count must be astronomical.
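The fan-out in the quoted passage can be sketched roughly like this. It is not Cloudflare's harness: the function names are made up, pairing every attack class with every scope hint is an assumption, and the hunter body is a stub where a model-driven agent would do the real work.

```python
import asyncio
import tempfile
from itertools import product

async def hunt(attack_class: str, scope_hint: str,
               limit: asyncio.Semaphore) -> dict:
    """One task: a single attack class paired with a single scope hint."""
    async with limit:  # cap how many hunters run at once
        with tempfile.TemporaryDirectory(prefix="hunt-") as scratch:
            # Stub: a real hunter would write, compile, and run
            # proof-of-concept code inside its own `scratch` directory.
            await asyncio.sleep(0)
            return {"attack_class": attack_class, "scope": scope_hint,
                    "scratch": scratch, "findings": []}

async def run_hunters(attack_classes, scope_hints, max_concurrent=50):
    """Run every (attack class, scope hint) task, at most 50 at a time."""
    limit = asyncio.Semaphore(max_concurrent)
    tasks = [hunt(a, s, limit) for a, s in product(attack_classes, scope_hints)]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_hunters(["use-after-free", "integer-overflow"],
                                  ["parser/", "net/"]))
```

The semaphore is what keeps "around fifty at once" true no matter how long the task list gets, and the per-task temporary directory keeps one hunter's build artifacts from colliding with another's. Real isolation would want containers or VMs rather than bare directories.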

    16 votes
    1. [9]
      unkz

      Yeah, Anthropic granted participants $100 million in tokens.

      11 votes
      1. [8]
        vord
        (edited )

        "Mythos can find numerous security exploits in a code base if you throw $100 million at it." doesn't have quite the same ring to it.

        Pretty sure if I hired a team of 500+ security researchers full-time for a year they could find as many or more.

        8 votes
        1. [7]
          Diff

          I suspect you could reproduce these results with even bigger swarms of agents and at a fraction of the cost with a different model.

          We haven't seen a model with a substantial, hard intelligence or skill improvement in quite a while. This is similar to what we see with Claude Code as well. The secret sauce isn't in the models, it's in how they're being driven.

          1 vote
          1. [6]
            unkz

            How would you define “quite a while”? The way I see it, it’s only been a few months since models became capable of producing high quality code.

            1 vote
            1. [5]
              Diff

              The way I see it, it’s only been a few months since models became capable of producing high quality code.

              I've been hearing that exact statement for multiple years now. New models are incremental improvements over old. There have been no generational, game-changing releases that have unquestionably dominated. It's difficult to assign numbers to this since benchmarks are both heavily gamed and inadequate.

              What models were we working with a few months ago? Gemini 3 Pro? Claude 4.6? Whatever ChatGPT is doing? My usage of them hasn't significantly changed in the new minor releases since. I feel like I could use 3.1/3.0 and 4.7/4.6/4.5 practically interchangeably. I know many feel 4.7 is a downgrade. ChatGPT's goblin babbling has only gotten worse. What differences are you seeing in your usage?

              1. [4]
                unkz

                Opus 4.5 was Nov 24, 2025. Opus 4.6 was released Feb 5, 2026. In my experience, 4.5 was the first model that was truly useful for writing large pieces of code, so that’s 6 months. 4.6 was a massive upgrade for me when I got access to the 1 million token context, which has been 3 months. I don’t have a very strong opinion on 4.7 — it seems also very good but I’m unsure if it’s materially better than 4.6.

                Is 6 months quite a while?

                2 votes
                1. [3]
                  Diff

                  I wouldn't consider it to be, but I also wouldn't share your assessment of 4.6 or 4.5, outside of the context window buff. The extra context is nice to have but models struggle to actually utilize a full window.

                  1 vote
                  1. [2]
                    unkz
                    (edited )

                    Really? You think 4.1 is a competent coding agent compared to 4.5 or 4.6? In my experience at least, it couldn't do even simple tasks without requiring a ton of manual review and fixing.

                    1 vote
                    1. Diff

                      It was 3.7 Sonnet's watch when the term "vibe coding" was coined, but that's not the hill I'm trying to die on. Things have undeniably improved since then. Incremental improvements eventually add up.

                      The point I was making is that there have been no significant generational leaps with any release. I find Mythos unlikely to have broken that pattern and to have something special in its model that can't be replicated with other modern models given the same harness.

  3. skybrian

    From the article:

    It's a different kind of tool doing a different kind of work, and that makes a clean apples-to-apples comparison to earlier models difficult. So rather than trying to benchmark Mythos Preview against general-purpose frontier models, it's more useful to describe what it can actually do, and two features that stood out across the work we did with Mythos Preview:

    • Exploit chain construction - A real attack rarely uses one bug. It chains several small attack primitives together into a working exploit. For instance, it might turn a use-after-free bug into an arbitrary read and write primitive, hijack the control flow, and use return-oriented programming (ROP) chains to take full control over a system. Mythos Preview can take several of these primitives and reason about how to combine them into a working proof. The reasoning it shows along the way looks like the work of a senior researcher rather than the output of an automated scanner.

    • Proof generation - Finding a bug and proving it's exploitable are two different things, and Mythos Preview can do both. It writes code that would trigger the suspected bug, compiles that code in a scratch environment, and runs it. If the program does what the model expected, that's the proof. If it doesn't, the model reads the failure, adjusts its hypothesis, and tries again. The loop matters as much as the bugs it finds, because a suspected flaw without a working proof is speculation, and Mythos Preview closes that gap on its own.

    Some of what we describe above is not entirely unique to Mythos Preview. When we ran other frontier models through the same harness, they found a fair number of the same underlying bugs, and in some cases they got further than we expected on the reasoning side too. Where they fell short was at the point of stitching the pieces together. A model would identify an interesting bug, write a thoughtful description of why it mattered, and then stop, leaving the actual chain unfinished and the question of exploitability open. What changed with Mythos Preview is that a model can now take those low-severity bugs (which would traditionally sit invisible in a backlog) and chain them into a single, more severe exploit.

    [...]

    Mythos Preview represents a clear improvement here, particularly in its ability to chain primitives - combining multiple vulnerabilities into a working proof of concept rather than reporting them in isolation. A finding that arrives with a PoC is a finding you can act on, and it means far less time spent asking "is this even real?"

    Our harnesses are deliberately tuned to over-report, so we see more (and miss less), which comes with a lot more noise. But at triage time, Mythos Preview's output has noticeably higher quality: fewer hedged findings, clearer reproduction steps, and less work to reach a fix-or-dismiss decision.

    9 votes
  4. [3]
    Eji1700

    I’ll stay out of the main topic here because I don’t have much to add, but I do wonder if Mythos will also fall victim to the same problems as previous advancements.

    Namely that once methods used are identified and built into the model do they keep the gains? Or are other models going to catch up sooner rather than later, or is it just a function of raw compute? What can other models do with $100m in tokens?

    5 votes
    1. [2]
      vord

      Are all of the improvements to Opus since 4.5 just increasing the context window and burning more tokens?

      It certainly feels that way sometimes.

      3 votes
      1. Diff

        It's the tried and true strategy in many areas. CPUs and GPUs as well are being flooded with hundreds of extra watts of power to pull a little extra performance out of a particular architecture's power/efficiency curve. To the point that many high end parts are becoming fire hazards.

        2 votes