Cool stuff! It fits the pattern I'm seeing from Meta in this kind of research: not generally the first or flashiest, but well thought out and a little more refined than those that are. It's a diffusion transformer architecture*, pretty much what I'd expect for video synthesis nowadays, but they've tied in the results from pretty much every significant paper on the subject from the last 18 months and put a nice neat bow on it.
What I'm wondering now is what they're planning to do with it. The paper is 92 pages long, and one of the most detailed I've seen on the topic, so that alone is a big win for anyone else working on this - but if they open source the whole thing, that'd be much bigger. I'm seeing a lot of parallels to Llama here, and that's been one of the most impactful open source models I'm aware of, so the question in my mind is: what are their success metrics? What was their business case for open sourcing Llama, and does the same thinking apply here?
*Interestingly, the guy who originally designed that architecture did so for image generation at Meta, then moved to OpenAI and led the research on their video model - that release really was a "holy shit" moment compared to what anyone else working on this stuff had running six months ago
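For anyone who hasn't dug into diffusion transformers: the core trick is a plain transformer over latent patches where the diffusion timestep embedding modulates each block through adaptive layer norm. Here's a toy single block in PyTorch to show the shape of it - this is my own simplification for illustration, not anything from the paper (the real model layers on text conditioning, spatio-temporal attention over video patches, and so on):

```python
# Minimal DiT-style block sketch: self-attention + MLP, with the diffusion
# timestep embedding providing per-block scale/shift/gate via adaptive LayerNorm.
# A simplification for illustration only, not Movie Gen's actual architecture.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        # adaLN: the timestep embedding produces 6 modulation vectors
        # (shift, scale, gate for the attention and MLP sub-blocks).
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) patchified latents; t_emb: (batch, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada_ln(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x
```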
I'm curious what VFX work will look like in 5 years. Humans will still be involved, but I expect much of what people are paid for will be performed trivially by an AI.
I think right now the biggest hurdle for that is continuity. It doesn't matter how good an AI video looks if it doesn't make sense in the context of the larger work.
Give it a few months and I imagine we’ll start seeing the tech make its way into the tools the pros use, the same way Photoshop’s picked up things like smart select and generative fill.
Full text-to-video synthesis is cool, and I’m sure it’ll become its own standalone tool as people get to grips with it, but it’s the most extreme case - which means the capability behind it unlocks a whole lot of video-to-video editing options along the way. You’ve got a lot less to worry about on continuity if you’re modifying camera footage rather than generating from scratch.
> We further post-train the Movie Gen Video model to obtain Personalized Movie Gen Video that can generate personalized videos conditioned on a person’s face.
It's a start, but I think artists will want to give it pictures of each character that appears in the video. When making animated cartoons, animators use a model sheet that shows what a character should look like from different angles. Generated images could be used to make the model sheets too, but I think people will want to get them just right before making the video. It should be a multi-stage process.
Similarly, it would be useful to give it pictures of important objects, as well as background images. Maybe even a map of the scene where you can use waypoints to show the path that each object should take?
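Something like this is what I have in mind - purely hypothetical names and structure, nothing from the paper, just to sketch the shape of the inputs such a tool might take:

```python
# Hypothetical sketch of the inputs to a multi-stage "scene spec" workflow:
# character model sheets, object/background references, and waypoint paths.
# None of these names come from Movie Gen; this is just to pin down the idea.
from dataclasses import dataclass, field

@dataclass
class CharacterSheet:
    name: str
    reference_images: list[str]  # paths to model-sheet views (front/side/back)

@dataclass
class SceneSpec:
    prompt: str                                                   # text description of the shot
    characters: list[CharacterSheet] = field(default_factory=list)
    object_images: dict[str, str] = field(default_factory=dict)   # object name -> reference image
    background_image: str | None = None
    # waypoints per character/object: list of (x, y, time) positions on the scene map
    paths: dict[str, list[tuple[float, float, float]]] = field(default_factory=dict)

scene = SceneSpec(
    prompt="The fox chases the rabbit across the meadow at dusk",
    characters=[CharacterSheet("fox", ["fox_front.png", "fox_side.png"])],
    background_image="meadow_dusk.png",
    paths={"fox": [(0.1, 0.8, 0.0), (0.9, 0.2, 4.0)]},
)
```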
I think it's going to take years to build new video editing tools to do all this.