9 votes

Synthesizing multi-agent harnesses for vulnerability discovery

Posted April 24 by skybrian

Tags: security, artificial intelligence, papers, vulnerabilities, author.hanzhi liu, author.chaofan shou, author.xiaonan liu, author.hongbo wen, author.yanju chen, author.ryan jingyang fang, author.yu feng, source.arxiv, vibecoding, language models.large

https://arxiv.org/abs/2604.20801

Link information

This data is scraped automatically and may be incorrect.

Published: Apr 24 2026

2 comments

skybrian (OP)
April 24

From the abstract:

[...] We evaluate AgentFlow on TerminalBench-2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench-2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape vulnerabilities (CVE-2026-5280 and CVE-2026-6297).

Apparently, if you have the right harness, you can find security bugs with a Chinese LLM. Source code here.

The prompt that their system automatically generated is here.

4 votes
1. tauon
  April 24 (edited April 24)
  
  This is a very cool finding. However,
  
  if you have the right harness, you can [do X for almost any given X]
  
  does not surprise me much anymore after these two realizations in particular from a bit back:
  
  Cursor boosts model performance versus other harnesses
  
  There are ten separate harnesses that use Opus better than Claude Code itself does [on TerminalBench]
  
  IMO, this has been on the horizon for a while now; in hindsight, essentially ever since it was found to be beneficial to send LLMs off in a loop.
  
  2 votes