21 votes

Project Glasswing: what Mythos showed us

23 comments

  1. [18]
    unkz
    (edited )

    This seems very consistent with what everyone else is saying, and I think it reinforces what Anthropic has been saying about the danger level. The major skill is building the complete working exploits.

    The chief complaint I’ve been seeing from people is that it is largely on par with other frontier models in terms of finding new bugs, and that a competent security researcher isn’t going to benefit significantly from Mythos. This misses the point entirely. The danger is that any even remotely competent script kiddie can take Mythos and go directly to exploiting live systems, without having to be a competent security researcher.

    The other big danger is from advanced persistent threats, who are targeting a specific site. Now whenever a new patch lands, any applicable exploits are gonna be live in 15 minutes. Let’s say I’ve mapped out the attack surface on my target — I know they run Linux, nginx, wagtail, postfix, for instance.

    Major threats are already monitoring commits to major projects like Linux to reverse engineer exploits, but now this can be done at scale, automatically, for every stupid NPM or pip package that is listed in the requirements.txt or package.json. You get a bug in some obscure XML library and you cue your exploit harness with the knowledge that a commit just landed in Beautiful Soup and they run wagtail, and you might get an exploit custom-tailored to your victim in 15 minutes.
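Seen from the defender's side, the first step of that monitoring is just matching freshly patched package names against your own dependency manifest. A minimal sketch, assuming a plain requirements.txt with one requirement per line; the function names, the example pins, and the list of patched packages are all hypothetical, and the name normalization follows the rule described in PEP 503:

```python
import re


def normalize(name: str) -> str:
    """Normalize a package name as PEP 503 describes (case and -_. runs)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirements(text: str) -> set[str]:
    """Extract normalized package names from simple requirements.txt content."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blanks and pip options such as -r or -e
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(normalize(match.group(0)))
    return names


def affected(requirements_text: str, patched: list[str]) -> list[str]:
    """Return the patched packages that also appear in the manifest."""
    deps = parse_requirements(requirements_text)
    return sorted(normalize(p) for p in patched if normalize(p) in deps)


# Hypothetical manifest and hypothetical list of packages with new fixes.
requirements = """\
wagtail==6.0
beautifulsoup4>=4.12  # html parsing
-r base.txt
Django_Extensions
"""
print(affected(requirements, ["BeautifulSoup4", "lxml", "django-extensions"]))
```

This only covers direct dependencies named in the file; transitive dependencies would need a resolved lock file or the installed package list.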

    12 votes
    1. [13]
      turnipostrophe

      Perhaps the Mythos AI will change the way we understand software. Right now, we produce computer programs that have many bugs, assume they have bugs, and release them anyway. However, perhaps it would be possible to prove that a computer program has no bugs, as in a mathematical proof. Bug repellent. Perhaps this is the incentive to build future programs more cautiously and carefully, to ensure they have no bugs.

      2 votes
      1. [2]
        cutmetal

        What you're describing is called formal verification. It's so onerous to perform that nothing except the most critical software is put through this process.

        I doubt that AI bug finding will result in an increase in formal verification, but maybe! Everything in the world of software development is heavily in flux right now, so the future is very murky.

        But I think it's more likely that, in the future, critical software will add a Mythos-level AI bug checker as a CI step, alongside unit and integration tests and linters. Since the tools are automated, just use the same tools an attacker would use to uncover the bugs, but before you even ship the buggy code.
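As a rough illustration of what such a CI step might look like, here is a sketch of a gate that fails the build when a scanner reports a sufficiently severe finding backed by a proof of concept. Everything here is an assumption: the `Finding` shape, the severity labels, and the idea that the scanner emits a list of findings are hypothetical and not taken from any real tool.

```python
from dataclasses import dataclass

# Hypothetical severity scale; a real scanner would define its own.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass
class Finding:
    title: str
    severity: str
    has_poc: bool  # True if the finding came with a working proof of concept


def gate(findings: list[Finding], fail_at: str = "high") -> int:
    """Return a process exit code: 1 if any proven finding meets the threshold."""
    threshold = SEVERITY_ORDER[fail_at]
    blocking = [
        f for f in findings
        if f.has_poc and SEVERITY_ORDER[f.severity] >= threshold
    ]
    for f in blocking:
        print(f"BLOCKING [{f.severity}] {f.title}")
    return 1 if blocking else 0


# Example run with made-up findings.
exit_code = gate([
    Finding("unchecked length in header parser", "critical", has_poc=True),
    Finding("possible race in cache eviction", "critical", has_poc=False),
    Finding("verbose error message", "low", has_poc=True),
])
print("exit code:", exit_code)
```

Requiring a proof of concept before blocking is one possible design choice: it keeps unproven reports from stopping every merge, at the cost of letting speculative findings through to manual triage.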

        15 votes
        1. Eji1700

          I do like to mention Idris in these conversations as an example of something sane and yet possibly heading in the right direction if we do ever decide we want more robust development.

          https://www.idris-lang.org/pages/example.html

          1 vote
      2. [8]
        teaearlgraycold

        Depends how you define bug. I don’t think it’s possible to do this to an absolute degree of perfection. I’m told it’s not possible to build a mathematical system that is flawless (like how standard math cannot answer some questions like 1/0=x). I’ve read a nice blog post about writing code without bugs. They present a simple function to add two numbers together. You think it’s perfect, a function can hardly get simpler. But then consider underflows and overflows. Then consider a whole program, or an operating system. You’re doomed.
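The addition example can be made concrete. Python integers do not overflow, so this sketch emulates 32-bit signed arithmetic by hand; the function names are made up for illustration, and this is not the code from the blog post mentioned above.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def add_wrapping(a: int, b: int) -> int:
    """Add two values the way wrapping 32-bit signed hardware arithmetic would."""
    return (a + b - INT32_MIN) % 2**32 + INT32_MIN


def add_checked(a: int, b: int) -> int:
    """Add two values, refusing any result that does not fit in 32 bits."""
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{a} + {b} does not fit in a signed 32-bit integer")
    return result


print(add_wrapping(2, 3))            # 5
print(add_wrapping(INT32_MAX, 1))    # -2147483648: the sum wrapped around
print(add_wrapping(INT32_MIN, -1))   # 2147483647: wrapped the other way
```

The "simple" version silently returns a negative number for two positive inputs; the checked version turns that into an explicit error the caller has to handle.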

        5 votes
        1. [7]
          skybrian

          Well, floating-point math has a lot of gotchas, but they would usually lead to wrong results rather than a security bug from doing the calculation.
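Two of the usual gotchas, shown with standard IEEE 754 double behavior as Python exposes it:

```python
import math

total = 0.1 + 0.2
print(total == 0.3)              # False: total is 0.30000000000000004
print(math.isclose(total, 0.3))  # True: compare with a tolerance instead

big = float(2**53)
print(big + 1.0 == big)          # True: the +1 is lost above 2**53
```

Both are wrong-answer problems rather than memory-safety problems, which matches the point above.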

          1. teaearlgraycold

            Unexpected behavior can be used as a starting point for an exploit.
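One small, well-known instance of that: NaN compares false against everything, so a range check written as two rejections lets it through. The function names and the limit below are hypothetical, just a sketch of the pattern.

```python
import math


def naive_in_range(amount: float, limit: float) -> bool:
    """Reject values outside [0, limit] by testing each bound separately."""
    if amount < 0:
        return False
    if amount > limit:
        return False
    return True  # NaN reaches this line: both comparisons above are False


def strict_in_range(amount: float, limit: float) -> bool:
    """Accept only finite values that are provably inside [0, limit]."""
    return math.isfinite(amount) and 0 <= amount <= limit


nan = float("nan")
print(naive_in_range(nan, 100.0))   # True
print(strict_in_range(nan, 100.0))  # False
```

On its own this is just a validation bug, but it is the kind of unexpected state that later code may not be prepared for.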

            5 votes
          2. [5]
            unkz

            That’s kind of the interesting thing with Mythos though, right? There are huge classes of bugs that are usually innocuous and so rarely lead to an exploit that it isn’t worth a security researcher’s time to investigate every possible edge case. Mythos doesn’t care; it will happily enumerate every single variant and chase them down to their conclusion.

            1 vote
            1. [4]
              Eji1700

              I mean, it does though. It still costs money. Possibly quite a bit more than getting a bunch of researchers in a room

              1 vote
              1. [2]
                Minori

                Possibly quite a bit more than getting a bunch of researchers in a room

                Mythos is expensive, but I'd be shocked if it's more than contracting a team of skilled researchers.

                1 vote
                1. Eji1700

                  Example case is $100m in tokens.

                  That’s years of skilled researchers.

              2. unkz

                But it doesn't really care. If you go tell your team of security researchers that you want them to go and enumerate every possible edge case of a floating point operation in a piece of code that has no obvious security relevance, they might just refuse. Like, these people have careers and interests that they want to actually improve instead of wasting their time. There's a kind of publish or perish incentive here where they want to efficiently produce exploits. Mythos will just go do it, and it'll keep doing it for as long as you have GPUs.

      3. DrStone

        I've never used them, but there are tools like Lean and Rocq (formerly Coq) for verification of software through formal proofs.

        What it really comes down to is finding the appropriate tradeoff between the risk of bugs and what is required to reduce it. The cost in time, money, manpower, and company-wide discipline to produce code that is formally verified as correct, with defined behavior across all cases, does not make sense for most applications until you get into domains like space exploration and critical medical devices. For everything in between, there's a wide range of methodologies, technologies, tests, probes, monitors, verifications, mitigations, safeguards, and processes to pick and choose a risk-appropriate approach.

        And that doesn't even touch on the potential underlying hardware issues. I don't just mean bad drivers/bios/whatever. I'm talking everything from random solar radiation causing bit flips ("single-event upsets") to shouting [at hard drives] in the datacenter affecting performance.

        2 votes
      4. first-must-burn

        There are formal method systems that do this, but 1) they are not very accessible to the "average developer" 2) they don't scale well to larger systems 3) they work well for pure algorithms but typically don't model the I/O integrations well (or rely on assumptions that can't be formally proven).

        I haven't been keeping up with all the research in that space, so maybe there's something new I'm not aware of.

        Even if you had these tools at scale, you'd still have to deal with specification bugs. Computers are great at doing what they are told, and humans are pretty bad at knowing exactly what we want to tell them.
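A classic toy example of a specification bug, sketched with made-up names: a spec for sorting that only demands a sorted output is satisfied by a function that throws the input away.

```python
from collections import Counter


def is_sorted(xs: list[int]) -> bool:
    """True if every element is <= the one after it."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def weak_spec(inp: list[int], out: list[int]) -> bool:
    """Incomplete spec: only says the output must be in order."""
    return is_sorted(out)


def strong_spec(inp: list[int], out: list[int]) -> bool:
    """Fuller spec: output is in order AND has exactly the input's elements."""
    return is_sorted(out) and Counter(inp) == Counter(out)


def lazy_sort(xs: list[int]) -> list[int]:
    """A 'sort' that provably meets the weak spec and is still useless."""
    return []


data = [3, 1, 2]
print(weak_spec(data, lazy_sort(data)))    # True: the spec was the bug
print(strong_spec(data, lazy_sort(data)))  # False
print(strong_spec(data, sorted(data)))     # True
```

A proof that `lazy_sort` meets the weak spec would be perfectly valid, which is why verification only moves the problem to writing the right spec.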

        2 votes
    2. [3]
      glesica

      It's an interesting combination, that LLMs seem poised to both write most of the code, and also audit / exploit that code. So one LLM adds a new feature to library X, another LLM pulls the new version into product Y, and yet another LLM is watching and immediately attempts to exploit the change.

      I can't help but wonder if the software ecosystem is about to fundamentally change. Like, where are the costs going to fall? Is it on the library author to run a bug-finding model? Or on whoever pulls it into their project (in which case, that's a lot of duplication of effort)? Can the model writing the code self-audit, eventually, so it just doesn't write exploitable bugs?

      Crazy times.

      1. [2]
        unkz
        (edited )

        I imagine that major players like Microsoft are eventually going to be proactively scanning all public commits to any projects that are in widespread use just for self preservation. Far fewer bugs of known classes will even make it into a release.

        Anthropic will no doubt dedicate $X million per year for open source scanning. You can see how they are positioning themselves to be a de facto requirement for all corporate internal code commits though. I wouldn’t be surprised to find cybercrime insurance policies starting to put Mythos contracts as a requirement for getting said insurance.

        1 vote
        1. glesica

          I wouldn’t be surprised to find cybercrime insurance policies starting to put Mythos contracts as a requirement for getting said insurance.

          That's an interesting point.

          I do wonder if we're heading (back) to a world where developing software is rather expensive.

          1 vote
    3. vord

      If script kiddies with Mythos can find exploits in 15 minutes, the developer can have Mythos prevent all those exploits in advance.

      If it really is that good, we'll see an initial one-off surge and then back to business as usual inside a year.

  2. [3]
    balooga

    Great writeup as usual from Cloudflare. I thought the table explaining the vulnerability discovery harness was the most interesting part as it shows how consensus is forming in the industry around the most effective ways to deploy LLMs for real results. Obviously the strength of Mythos is the big headline but I think Cloudflare’s workflow for it is nearly as impressive. Wish I could see the actual prompts they’re using.

    This bit was cool:

    Each task is one attack class paired with a scope hint. Hunters (the agents that actually look for bugs) run concurrently, typically around fifty at once, each fanning out to a handful of exploration subagents. Each hunter has access to tools that compile and run proof-of-concept code in a per-task scratch directory.

    That’s a ton of parallel workers compared to any agentic work I’ve been involved with. I mean, Cloudflare’s a huge enterprise so maybe that’s just a Tuesday for them but I’m still impressed. And having working sandboxes for each one is the icing on the cake. I’m trying to remember, is Glasswing free for preview participants? Because that token count must be astronomical.
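The shape described in that quote (one task per attack class and scope hint, about fifty hunters at once, a scratch directory per task) can be sketched with the standard library. This is only a guess at the structure: `Task`, `hunt`, and `run_hunters` are hypothetical names, the hunter body is a placeholder that writes a note instead of calling a model, and the exploration subagents are left out.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Task:
    attack_class: str  # e.g. "use-after-free" (example value, not from the article)
    scope_hint: str    # e.g. a directory or component name


def hunt(task: Task) -> dict:
    """Placeholder hunter: works inside its own scratch directory."""
    with tempfile.TemporaryDirectory(prefix="hunt-") as scratch:
        notes = Path(scratch) / "notes.txt"
        notes.write_text(f"{task.attack_class} in {task.scope_hint}\n")
        # A real hunter would compile and run proof-of-concept code here.
        files = sorted(p.name for p in Path(scratch).iterdir())
    return {"task": task, "scratch_files": files}


def run_hunters(tasks: list[Task], max_hunters: int = 50) -> list[dict]:
    """Run hunters concurrently, at most max_hunters at a time."""
    with ThreadPoolExecutor(max_workers=max_hunters) as pool:
        return list(pool.map(hunt, tasks))


tasks = [
    Task("use-after-free", "parser/"),
    Task("integer overflow", "codec/"),
]
for result in run_hunters(tasks):
    print(result["task"].attack_class, result["scratch_files"])
```

Threads are enough here because each hunter would spend its time waiting on model calls and subprocesses; isolation between hunters comes from the per-task directory, which is a much weaker boundary than a real sandbox.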

    9 votes
    1. [2]
      unkz

      Yeah, Anthropic granted participants $100 million in tokens.

      6 votes
      1. vord
        (edited )

        "Mythos can find numerous security exploits in a code base if you throw $100 million at it." doesn't have quite the same ring to it.

        Pretty sure if I hired a team of 500+ security researchers full-time for a year they could find as many or more.
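For scale, the arithmetic behind that comparison, using only the two figures mentioned in this thread ($100 million and 500 researchers for one year). Whether the resulting per-person figure covers a fully loaded security researcher is not something the thread establishes.

```python
token_budget_usd = 100_000_000  # figure quoted in the thread
researchers = 500               # headcount from the comment above
years = 1

per_researcher_year = token_budget_usd / (researchers * years)
print(f"${per_researcher_year:,.0f} per researcher per year")  # $200,000
```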

        2 votes
  3. skybrian

    From the article:

    It's a different kind of tool doing a different kind of work, and that makes a clean apples-to-apples comparison to earlier models difficult. So rather than trying to benchmark Mythos Preview against general-purpose frontier models, it's more useful to describe what it can actually do, and two features that stood out across the work we did with Mythos Preview:

    • Exploit chain construction - A real attack rarely uses one bug. It chains several small attack primitives together into a working exploit. For instance, it might turn a use-after-free bug into an arbitrary read and write primitive, hijack the control flow, and use return-oriented programming (ROP) chains to take full control over a system. Mythos Preview can take several of these primitives and reason about how to combine them into a working proof. The reasoning it shows along the way looks like the work of a senior researcher rather than the output of an automated scanner.

    • Proof generation - Finding a bug and proving it's exploitable are two different things, and Mythos Preview can do both. It writes code that would trigger the suspected bug, compiles that code in a scratch environment, and runs it. If the program does what the model expected, that's the proof. If it doesn't, the model reads the failure, adjusts its hypothesis, and tries again. The loop matters as much as the bugs it finds, because a suspected flaw without a working proof is speculation, and Mythos Preview closes that gap on its own.
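The proof-generation loop described in the second bullet (write code, run it in a scratch environment, check the outcome, revise and retry) has roughly this shape. This is a toy sketch, not Cloudflare's harness: the candidates are harmless Python snippets, a fixed list stands in for the model revising its hypothesis, and `run_candidate` and `prove` are made-up names.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(source: str, scratch: Path) -> subprocess.CompletedProcess:
    """Write one candidate script into the scratch directory and run it."""
    path = scratch / "poc.py"
    path.write_text(source)
    return subprocess.run(
        [sys.executable, str(path)],
        capture_output=True, text=True, timeout=30,
    )


def prove(candidates: list[str], expected_marker: str):
    """Try candidates in order until one produces the expected behavior."""
    with tempfile.TemporaryDirectory(prefix="proof-") as scratch:
        for attempt, source in enumerate(candidates, start=1):
            result = run_candidate(source, Path(scratch))
            if expected_marker in result.stdout + result.stderr:
                return attempt, source  # hypothesis confirmed by execution
            # Otherwise a real loop would read the failure and revise.
    return None


# Toy stand-ins: the first "hypothesis" is wrong, the second triggers the error.
candidates = [
    "print('ok')",
    "raise IndexError('out of range')",
]
print(prove(candidates, "IndexError"))
```

The useful property is that the loop only returns something after observing the behavior, which is the difference between a suspected flaw and a demonstrated one.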

    Some of what we describe above is not entirely unique to Mythos Preview. When we ran other frontier models through the same harness, they found a fair number of the same underlying bugs, and in some cases they got further than we expected on the reasoning side too. Where they fell short was at the point of stitching the pieces together. A model would identify an interesting bug, write a thoughtful description of why it mattered, and then stop, leaving the actual chain unfinished and the question of exploitability open. What changed with Mythos Preview is that a model can now take those low-severity bugs (which would traditionally sit invisible in a backlog) and chain them into a single, more severe exploit.

    [...]

    Mythos Preview represents a clear improvement here, particularly in its ability to chain primitives - combining multiple vulnerabilities into a working proof of concept rather than reporting them in isolation. A finding that arrives with a PoC is a finding you can act on, and it means far less time spent asking "is this even real?"

    Our harnesses are deliberately tuned to over-report, so we see more (and miss less), which comes with a lot more noise. But at triage time, Mythos Preview's output has noticeably higher quality: fewer hedged findings, clearer reproduction steps, and less work to reach a fix-or-dismiss decision.

    6 votes
  4. Eji1700

    I’ll stay out of the main topic here because I don’t have much to add, but I do wonder if Mythos will also fall victim to the same problems as previous advancements.

    Namely, once the methods used are identified and built into the model, do they keep the gains? Or are other models going to catch up sooner rather than later, or is it just a function of raw compute? What can other models do with $100m in tokens?