From the abstract:
We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo 3's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.
At the bottom of the web page, there are a bunch of videos demonstrating Veo 3's ability to solve problems. The tooltip for each video shows the prompt they used.
In the paper, they point out that there is an LLM involved:
According to the Vertex documentation [29], the API uses an LLM-based prompt rewriter. This means that for some tasks, the solution is likely to come from the LLM instead of the video (e.g., Fig. 55: Sudoku). We treat the system (rewriter and video generator) as a single black-box entity. However, to isolate the video model’s reasoning abilities, we verified that a standalone LLM (Gemini 2.5 Pro) could not reliably solve key tasks (Fig. 58: Robot navigation, Sec. 4.5: Maze solving, Sec. 4.6: Visual symmetry) from the input image alone.
Also:
Takeaway 3: Frame-by-frame video generation parallels chain-of-thought in language models. Just like chain-of-thought (CoT) enables language models to reason with symbols, a "chain-of-frames" (CoF) enables video models to reason across time and space.
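If you want to try those tooltip prompts yourself, here's a minimal sketch of calling Veo through the google-genai Python SDK. The model ID and the maze prompt are my own illustrative guesses, and remember from the quote above that the API runs an LLM-based prompt rewriter before the video model ever sees your text:

```python
# Minimal sketch: generate a Veo video from a visual-reasoning-style prompt.
# Assumes the google-genai Python SDK with GOOGLE_API_KEY set in the
# environment; the model ID below is illustrative and may differ.
import time

from google import genai

client = genai.Client()

# Hypothetical prompt in the style of the paper's maze-solving examples.
operation = client.models.generate_videos(
    model="veo-3.0-generate-001",
    prompt="Move the red dot through the maze to the green exit "
           "without crossing any walls.",
)

# Video generation is a long-running operation, so poll until it finishes.
while not operation.done:
    time.sleep(20)
    operation = client.operations.get(operation)

# Download and save the first generated video.
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("maze_solution.mp4")
```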