Absolutely baffled by this one. AI has given me a few moments like the scene in Westworld where Logan is staring in disbelief, insisting that "we're not here yet." And yet here we are. This is way beyond any other video tech demo I've seen thus far in terms of coherency; they must have made a rather fantastic improvement in their denoising to adapt it to motion. Incredible job by the OpenAI team.
EDIT: Here is the link where they provide more detail; I just came across it: https://openai.com/research/video-generation-models-as-world-simulators
I’m stuck in the usual spot with all this.
Is it impressive? Absolutely.
Do I think it’s as impressive as people have made it out to be? Not yet.
There are a lot of uncanny-valley moments and straight-up artifacts in just these demos. It's still a very impressive tool for quick prototyping, or for someone who doesn't totally care about quality, but it's probably expensive enough not to really find that market.
Finally, I never trust the marketing. Until it's in the hands of a large number of professionals, it's concerning that even their BEST examples have issues.
I mean, it's not perfect, for sure - they have a whole section in there where they acknowledge this and point out some flaws that they haven't been able to correct for yet. What makes it impressive is the contrast with the other attempts currently out there. I'm not sure if you've taken a look at the other videos that have been generated with AI, but prior to this even the best of them looked like wobbly garbage; elements in them constantly blurred and shifted across the whole screen, people looked like nightmarish caricatures, and there was no cohesion or direction in the video's structure.
This new tool represents a substantial increase in capability, one that is taking even people working in the field by surprise. Nobody else has anything close to it, and a lot of people would have likely told you that something of this quality was years away - yesterday. It's not about what the tool can do right now; it's about the dot on the progression graph charting a far quicker course than anyone expected.
Consider where this technology was just a year ago, and this is an impressive improvement, far more than you're suggesting. Where will this specific video-creation AI go within another year if this is the starting point?
Fair point and one that most should follow... But still, it looks incredible.
We use image generation models at work. When you give them a prompt and they spit out a pretty image, remember that the range of acceptable outputs is very large in that context. It demos very well, but it's not useful outside of stock image/stock video use cases. What artists and engineers actually do is work under a rigorous set of constraints. Getting these models to do a very specific thing, correctly adhere to those constraints, and still maintain photorealism (or whatever style you need) is very much an unsolved problem. In that case the range of valid outputs is relatively tiny.
You're not just commenting on the ones that are examples of its shortcomings, right?
The one that turned me into a (freaked out) believer is the video of Tokyo from a train. AI should not be able to do that yet!
The very first video of the woman walking gets weirder and weirder as it goes on. Watch her feet.
It doesn't really look that weird to me.
I feel like I'm probably somewhere in the middle here. I did notice several pretty odd artifacts in the woman walking video. If you carefully watch her legs, her gait is pretty uneven and her legs swap sides at least once, maybe twice. People in the background of the video also have occasionally unnatural gaits.
However, I'd say that on the whole unless you're looking for it, these things don't really stand out. At a casual glance, scrolling by that video in a social media feed or seeing it playing on a screen somewhere, I'd almost certainly never notice any of those things.
There are also multiple occasions where her feet "teleport" as the foot that's supposed to be moving backward is suddenly the foot that's moving forward.
I didn't notice that. To me, it looked more like they were possibly clipping through each other and like she wasn't used to walking in heels. Otherwise, the motion wasn't necessarily exactly what I expected, but it seemed pretty acceptable within the bounds of reality.
I find this is true in general with AI-generated images. Things look okay until you notice what's off, and then it's hard to miss.
Holy crap, this looks so smooth and seamless. It has none of the telltale flags I'm used to looking for in AI imagery (mangled hands, bad eyes, glitchy movement). The only glitch I've noticed at first glance is the disconnected perspective shift as the camera moves down in the Lagos, Nigeria shot, followed by the distant traffic moving erratically. But it's composed well enough that you almost don't notice.
Edit: There are some interesting examples further down the page illustrating the model's current weaknesses with object persistence, impossible physical motion, and difficulty with multiple moving entities (like a puppy pile).
The one with the tiny red pandas looked pretty bad; they just kinda popped out of noise. There were also a few that looked uncanny in ways I could not place at all (like the one with the birds sitting on a tree), where I just felt like something unsettling was about to happen even though it never did. But perhaps that was just my expectations, because I know these are AI-generated.
It's fascinating to me the different ways in which this generator breaks now that motion is involved: the cat waking up its owner has three legs, the wolf pups merge and split, the birthday crowd moves naturally and yet still incorrectly.
Unlike previous still-image generator errors, which were often downright horrifying, a lot of these example videos look like someone did it intentionally as a VFX trick for a specific purpose.
Worth mentioning that this one was included to show a current weakness of the model in tracking different actors in a scene. All of them in that row had some sort of flaw they were pointing out, described under the video.
The red panda video is in the third row above the glitch reel, although there's a similar video in the glitch reel depicting wolf puppies spawning and collapsing in a very similar way.
You're quite right! I noticed that a lot of people had missed the light grey text under the glitch reel, but in trying to clarify I'm afraid I just spread more confusion. So my apologies there.
It does look like the red pandas video shows the same "multiple actor" problem they explain, though not quite as badly as in the puppies video. In a weird way it gives me "Conway's Game of Life" vibes, where different automata can interact, combine, or split away at a moment's notice.
Uh, they are all meh. They all have various issues. The best place to look is the feet. For example, in the first one, when the shot opens up and you can see the woman's feet, they are sliding around on the street. You can see similar artifacts with the furry creature looking at the candle. The mammoths appear to have two knees. The bird has weird movements with the red feathers on its chest. The ones that looked good to me were the waves crashing against the cliff with the lighthouse, and the paper fish. I think the latter worked because it was abstract, so my brain didn't really look for weirdness. The astronauts are firmly in the uncanny valley.
This is the first generation of text to video. RunwayML is giving you short clips but nothing compared to this.
If this worries you already, give it 3-5 years of development and it'll be terrifying. As long as OpenAI has the resources to analyse huge quantities of exceptionally well meta-tagged data, their models will be leaps and bounds ahead of other companies'.
All this said, it's bloody amazing. Can you imagine how excited the marketing departments are? Why hire a film crew to make an ad when you can upload an image of the product and have AI generate the whole advert?
Don't panic. If you have serious video skills and learn to use the new AI tools as they develop then you will be light-years ahead of an AI user without video skills. AI isn't going to take your job. If you adapt to it you will be poised to succeed in the future of the field. Just stay calm and keep learning.
It's not my field, but I'm not sure about that. Aren't advertisers going to be picky about having their actual people and/or actual products in the ad, rather than something that sort of looks like them? And there's only so much control you can get with a prompt. I found it took a lot of time to make still images that were even vaguely what I wanted.
It seems like it would be used more for scenes that don't matter as much. (Establishing shot, etc.)
It's those old '90s-era AT&T "You Will" commercials. Just about three decades later than AT&T's marketing department figured on.
This tool from OpenAI is more than just video. Of course, it'll get better. Look at how good it is now. But in a decade (which might as well be more than a century in Computer Years), it'll be basically magic. Right now, the tool needs cloud computing resources. When it can be condensed down to a beefy desktop PC, magic.
It's way more than just video. The really innovative part is how their text analyzer can take plain language input, and give back coherent output. That's moving us closer to the library computer from Star Trek. Which everyone I know has basically always wanted forever. Even now, people expect search engines to read their minds, and not need syntax or modifiers or any of the other code-esque input types on whatever you give the engine to process.
When you have a tool that can process plain language coherently, that's an enormous step. It'll change everything about interacting with computers. People like my mom will be able to use the computer competently. She can only manage the TV remote because I program it for her so it'll find her channels. Sometimes she emails or texts asking me to search stuff for her, and I have to mail her a link to the search results, or the page(s) she wanted the search to find.
She's not a super rare case; lots of people can't comfortably interact with technology.
And as for video ... think about movies. There's another Godzilla flick coming out. Everyone wants Godzilla in the Godzilla movies. They want Kong, they want the monsters, they want the film to be all kaiju all the time. But that's expensive. Staggeringly expensive. Never mind story stuff (separate conversation); every second of screen time for a kaiju is vastly more costly than putting human actors on a dressed set and pointing a camera at them.
And also, never mind that talented movie makers and storytellers can use these tools. What about ordinary people? You love kaiju? Let your pro gamer level desktop computer churn for a few days, maybe a week perhaps, and now you have two hours of nothing but kaiju. No humans filling space, no "boring talk-talk-talk."
Just Godzilla emerging from the waves and spending fifteen minutes literally leveling Tokyo. Not one or two beauty shots, that are only really there for the trailer to sell tickets, before we cut to the terrified people screaming and running and discussing what to do. Just endless "Godzilla knocks a building down, picks up girders and smashes another until it crumbles; uses a freighter to flatten a train station, punts a train the length of the Home Island, etc."
Then the video introduces Mothra, or Kong, or whatever, and the carnage continues. Just wall-to-wall knocking down walls.
I know there are people, a lot of people, around the world, who want that kind of stuff. People who want hours of lightsaber fights, spaceship combat, fantasy armies clashing, undersea explorers building an aquatic city piece by piece, whatever. Anything you can think of, that you wish someone would give a couple hundred million bucks to a professional cinematic effects studio to turn out for you ... this tool will be able to produce in exchange for a bump on your monthly electric bill once it comes down out of the cloud.
Which it will.
Memes are going to be unrecognizable in a few years. Right now it's a handful of cartoonists, and another handful of photoshop experts, who piece together templates for the unwashed masses to use. A few websites that offer text overlays so you can giggle as you put whatever clever BS you think is funny on a screencap or one of those template images.
This tool will produce custom movies, custom movie-quality screencaps, that are unique to your "joke". The internet will have to increase by a couple orders of magnitude to shoulder up under the weight of all the memes. Google will buy Hong Kong just to have the factory muscle to churn out storage and processing space so YouTube doesn't crash under the near-infinite torrent of custom meme stuff people will be uploading every minute.
This is what happens when ones and zeroes can zip around at incomprehensible scales of speed. Magic. Just with a GPU and keyboard instead of a wand and wizard hat.
Why store the rendered output when your client device can simply re-generate the output from the prompt in real-time?
Because that would mean ceding control over the hardware to the client. That's more in line with Apple's vision of personal computing than Google's. It's the "thick" vs. "thin" client models at odds, and it's not clear which model will succeed long term, or whether they can coexist.
Oh, I think both models can absolutely coexist. We've bounced back and forth between mainframes, powerful end devices, and thin clients multiple times now. They do different jobs and each has a place.
Something like Chromebooks will continue to make sense in schools because they're cheap, secure, and connected. Whereas powerful end hardware, even on mobile, continues to allow for gaming, on-device processing, and other advanced applications.
Android in particular has moved to on-device algorithms for voice detection, no longer relying on the cloud. And Google's new Gemini Nano is built specifically for local inference. I think it's likely that this trend will only increase as generative models and techniques improve and AI accelerator chips become more readily accessible in consumer hardware.
So I was telling my brother a couple of years ago that the next election cycle would have seriously good deepfake videos of Biden doing stuff to fuel the right. This comes just in time.
I'm sure we'll see some of Trump, but nothing can convince that cult.
Tbh, my first reaction on seeing this was that it feels like an irresponsible time to release this. But, well, the world can't just wait because of US elections and there probably isn't going to be a "good" time any time soon.
I think the more stuff like this gets around, the better. I follow AI rather closely, and if you had shown me videos of this quality yesterday I would have said there is no way AI could make that right now. Better to present evidence that it can in a responsible, open way than to wait until someone decides to take advantage of that ignorance.
It's not released to the general public, though:
Today, Sora is becoming available to red teamers to assess critical areas for harms or risks. We are also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals.
We're sharing our research progress early to start working with and getting feedback from people outside of OpenAI and to give the public a sense of what AI capabilities are on the horizon.
I assume the compute requirements for this thing are ASTRONOMICAL.
Trump will eventually pass into darkness, yet AI will keep him alive, and his nutter followers will believe it is the second coming of the Messiah.
Isn't that the whole point of QAnon lol
What if there's a deepfake of Trump doing something nice to a Democrat? That would really be something.
Post a video of Trump and a puppy where he's not kicking it, and the base will turn on him in a heartbeat.
i found this this morning by fluke. i’m going to use whisper to (hopefully) pull lyrics from an album — went to the site and was sucked in.
it's so crisp and… real… and dangerous. i love it, but we're fucked as a society if we don't have a quick way for idiots to discern what's real.
I feel the same. It doesn't even matter if videos made by this are clearly marked as fake; many people will not notice (I mean, look at how many people only read news headlines and not the actual content), and the ramifications of politicians or celebrities being shown in fake videos made to look real are potentially severe.
it's insane. even those ads that were floating around with celebs endorsing crypto scams... it's just the tip of the iceberg.
On the other end, that Whisper AI was amazing. I was feeding it music and it was spitting out lyrics. Pushing a podcast or other spoken word through it would most likely be flawless. Eventually (not long) we'll have that as a plugin for Kodi or whatever and it'll spit out accurate subtitles in real time. Pretty handy.
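For anyone curious, that kind of use is only a few lines with the open-source whisper package; here's a minimal sketch (the model size and the file name "track01.mp3" are just placeholders):

```python
# Minimal sketch: transcribing an audio file with OpenAI's open-source Whisper model.
# Assumes `pip install openai-whisper` and ffmpeg available on the PATH.
import whisper

model = whisper.load_model("base")        # larger models ("medium", "large") are more accurate
result = model.transcribe("track01.mp3")  # music works, but accuracy is best on clean speech

print(result["text"])                     # full transcript
for seg in result["segments"]:            # timestamped segments, handy for building subtitles
    print(f"{seg['start']:.1f}-{seg['end']:.1f}: {seg['text']}")
```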
But yeah, for the visual stuff and also the voice cloning, I can't see how anybody will lock that down. We, as a people, probably had the same outrage with the printing press being able to copy signatures and stamps, though.
edit: fuck it -- here's Knight Rider Biden https://i.imgur.com/ItSjgBk.jpeg
edit edit: I wish this didn't look so bad ass https://i.imgur.com/0kIHx3Z.png
Yeah, the thing is that the technology is cool as fuck, so I want to be able to just enjoy it, I'm just terrified at the lack of regulations. AI imagery will be used in the upcoming US election, for example (and I'm sure basically every other national election in every other country, too). Faked imagery might not be used on reputable news sources, but it will for sure influence at least some people.
Your point comparing it to the printing press is interesting and I hadn't thought of that. I agree that there are some similarities there. I do think faked videos can lead to more severe consequences than faked or duplicated documents, though (for example, imagine how much of a shit show it would be if someone generated an AI video of a politician or celebrity being a pedophile), but it's an interesting point nonetheless.
I'd like to say that in 5 years we'll have a handle on AI, but seeing how quickly AI image generators advanced (they went from utter shit to near perfect in just a few years), the possibilities are a bit of an unknown and it scares me. I saw a story the other day where an AI program was used by high schoolers to "nudify" their classmates, so some lines are already being crossed with just images, much less video.
Story link for reference: https://www.404media.co/what-was-she-supposed-to-report-police-report-shows-how-a-high-school-deepfake-nightmare-unfolded/
I’m blown away by this, it’s vastly ahead of the Lumiere demo Google published the other day. I’m very impatiently awaiting the research paper - I know they won’t give the whole game away, and even from the press release it seems like “have an infinite hardware budget, the entire Bing web scraping catalogue of data, and internal access to GPT-4 as a data tagging/prompt handling backbone” is a significant part of the prerequisites, but seeing diffusion transformers (as distinct from ‘normal’ latent diffusion models) suddenly blow past everything else is fascinating.
Patched training is an interesting note for them to drop in there, too - previous state of the art for duration with shared context was about 80 frames, so they’ve blown past that by an order of magnitude; I’m wondering if they’re sharing different patches in both space and time at different places along the time axis to improve consistency, rather than just compressing the axis or iterating full spatial chunks sequentially like others have done. Going to be interesting times for the tech, and even more interesting for the downstream implications when this level of quality becomes commonplace!
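For anyone who hasn't seen the idea before, here's a rough sketch of what spacetime-patch tokenization could look like; the latent layout and the patch sizes below are my own assumptions for illustration, not anything confirmed in the report:

```python
import torch

# Illustrative sketch of spacetime-patch tokenization for a diffusion transformer.
# Assumes a latent video tensor laid out as (C, T, H, W); patch sizes are made up here.
def to_spacetime_patches(latent: torch.Tensor, pt: int = 4, ph: int = 8, pw: int = 8) -> torch.Tensor:
    C, T, H, W = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    # one token per (time, height, width) patch; each token flattens its local block of voxels
    x = x.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * pt * ph * pw)
    return x  # shape: (num_patches, C * pt * ph * pw), fed to the transformer like ViT tokens

# e.g. a 16-channel latent of 32 frames at 64x64 -> 8 * 8 * 8 = 512 tokens of dimension 4096
tokens = to_spacetime_patches(torch.randn(16, 32, 64, 64))
print(tokens.shape)  # torch.Size([512, 4096])
```

The appeal is that, once the video is a flat bag of tokens, the transformer can attend across both space and time in one pass, which is presumably part of why consistency holds up over longer clips.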
I will be interested to see how much copyrighted material can be coaxed from the model. With all these models I still wonder how much it’s just applying filters to different original data. Really impressive filters, don’t get me wrong, but still just transforming previous data, mixing filters (including abstract ones) and then repackaging it all.
On a separate level these fill me with dread. I can’t see a good outcome of AI. All I can see is increasing instability, power concentration, and inevitability of global conflict which ultimately destroys us all. I literally can’t see any stable scenario where humanity lives happily ever after or even mediocre ever after.
These models don't store original data in any sort of retrievable format. They store a large multidimensional matrix of numerical weights that, with clever algorithmic coaxing, can produce images that reflect the aggregate of the source material that went into generating the weights. There's nothing in the process that's analogous to applying a filter over a video or image source to transform it into the output; in every case it is creating the image from scratch via application of its internal weights. This can be easily demonstrated by asking it to output an image that is not similar to anything in the original set, like asking it for a pirate aardvark playing mahjong or something similarly random.
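To make that concrete, here is a stripped-down sketch of a generic DDIM-style sampling loop (not Sora's or Stable Diffusion's actual code; `denoiser` and `alphas_cumprod` are stand-ins for any trained noise-prediction network and its noise schedule). The only inputs are random noise and the learned weights; no stored image is ever looked up.

```python
import torch

# Generic sketch of deterministic (DDIM-style) reverse diffusion sampling.
@torch.no_grad()
def sample(denoiser, shape, alphas_cumprod):
    x = torch.randn(shape)  # start from pure noise -- nothing is retrieved from the training set
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(x, t)  # the learned weights predict the noise present in x
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # current estimate of the clean image
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps   # step toward less noise
    return x0_hat
```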
*and combining different source materials.
There are examples of these models producing nearly perfect replications of the dataset like this.
If you read the article it explicitly says that's not what it's doing.
However, Carlini's results are not as clear-cut as they may first appear. Discovering instances of memorization in Stable Diffusion required 175 million image generations for testing and preexisting knowledge of trained images. Researchers only extracted 94 direct matches and 109 perceptual near-matches out of 350,000 high-probability-of-memorization images they tested (a set of known duplicates in the 160 million-image dataset used to train Stable Diffusion), resulting in a roughly 0.03 percent memorization rate in this particular scenario.
Essentially, they found a few images that it had "over-trained" on because they were present multiple times in the dataset, and found that a fraction of those could be produced very closely.
Also, the researchers note that the "memorization" they've discovered is approximate since the AI model cannot produce identical byte-for-byte copies of the training images. By definition, Stable Diffusion cannot memorize large amounts of data because the size of the 160 million-image training dataset is many orders of magnitude larger than the 2GB Stable Diffusion AI model. That means any memorization that exists in the model is small, rare, and very difficult to accidentally extract.
This is where they state what I said in my earlier comment - that it is producing the image from scratch when it does this, and is just happening to come very close to the source because those images are massively over-represented in the source dataset. Essentially, you can use the AI to produce imperfect approximations of images that are already available in their original format in abundance.
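For scale, the quoted rate checks out as back-of-the-envelope arithmetic (assuming the 0.03 percent figure refers to the 94 direct matches among the 350,000 images tested):

```python
# Rough check of the memorization rate quoted above.
direct_matches = 94
near_matches = 109
tested = 350_000

print(f"direct only:   {direct_matches / tested:.4%}")                     # ~0.0269%, i.e. roughly 0.03%
print(f"direct + near: {(direct_matches + near_matches) / tested:.4%}")    # ~0.058% even counting near-matches
```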
This is why Altman needs a $7 trillion chip investment.
If he keeps putting out this kind of boundary-redefining tech, he might get to see a big chunk of that investment.
This is remarkably dangerous.