4 votes

Synthesizing multi-agent harnesses for vulnerability discovery

2 comments

  1. [2]
    skybrian

    From the abstract:

    [...] We evaluate AgentFlow on TerminalBench-2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench-2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape vulnerabilities (CVE-2026-5280 and CVE-2026-6297).

    Apparently, if you have the right harness, you can find security bugs with a Chinese LLM. Source code here.

    The prompt that their system automatically generated is here.

    3 votes
    1. tauon
      This is a very cool finding, and/however,

      if you have the right harness, you can [do X for almost any given X]

      does not surprise me much anymore after these two findings in particular from a bit back:

      Cursor boosts model performance versus other harnesses

      There are...

      This has been coming on the horizon for a while now.