From the abstract:
[...] We evaluate AgentFlow on TerminalBench-2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench-2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape vulnerabilities (CVE-2026-5280 and CVE-2026-6297).
Apparently, if you have the right harness, you can find security bugs with a Chinese LLM. Source code here.
The prompt that their system automatically generated is here.
This is a very cool finding, and/however, the claim that "if you have the right harness, you can [do X for almost any given X]" does not surprise me much anymore after these two findings in particular from a bit back:
Cursor boosts model performance versus other harnesses
There are...
This has been on the horizon for a while now.