I was initially very excited to see there is a GitHub repository. But the code is only for the front-end. All of the AI magic is run through a closed-source web API.
Looks like it’s a wrapper around this distillation of SDXL, if you’re interested in playing with it directly. Published by TikTok’s parent company, interestingly enough - they seem to have some serious public-facing research work going on at the moment.
There’s also LCM and SDXL Turbo that take similar-ish approaches, and stable-fast that basically glues together every performance optimisation out there and puts a nice neat bow on it!
Oh wow, thanks!
We may use this at work in the future if it's good for our use case.
Oh, this is so nice. Lightning seems so much better than Turbo. Faces look like faces!
Note that the "seed" field determines the initial noise pattern that has a big effect on the end result. You can leave it blank to randomize it or increment it on the same prompt to get variations, or you can set it to a fixed seed and then tweak the prompt to do rudimentary text-based image editing.
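For anyone curious about the mechanics: the seed just pins the random number generator that produces the initial noise tensor, which is why the whole generation becomes reproducible. A toy NumPy sketch of that behaviour (not the service's actual code, and the latent shape is made up for illustration):

```python
import numpy as np

def initial_noise(seed=None, shape=(4, 64, 64)):
    # A fixed seed pins the RNG state, so the starting noise -- and
    # therefore the whole denoising trajectory -- is reproducible.
    # A blank seed (None) gives fresh noise, i.e. a random variation.
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

same_a = initial_noise(42)
same_b = initial_noise(42)
different = initial_noise(43)

assert np.array_equal(same_a, same_b)         # same seed, same starting point
assert not np.array_equal(same_a, different)  # new seed, new variation
```

Keeping the seed fixed while tweaking the prompt works because the model denoises the same starting noise toward a slightly different target, so much of the composition carries over between images.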
Maybe I'm not in the loop regarding what a successful image result is, but I tried "a baby ocelot with three heads being held by an old Ronald Reagan" and it really didn't give me something that matches what I think of when I see that phrase. It was decent about getting 4 heads in the shot (three for the ocelot and one for Reagan), but associating those heads with the ocelot was a clear challenge. Is this an example of this being done well, or is the impressive bit that it can be done fast?
https://fastsdxl.ai/share/s1rsrgjecyp4
https://fastsdxl.ai/share/7461gxy2ne8w
https://fastsdxl.ai/share/36oipespv2cw
Odd body configurations are not something AI does well in general, particularly when other terms imply a different configuration. Try asking it for a centaur, and most of the time you'll get a horse of some sort. It's also bad at extra eyes, extra arms, etc. It's part of the sacrifice made to aggressively hone the end result towards something that is a "more proper" result, since in most scenarios having more than one head is a substandard outcome for the ocelot.
Yeah, they used to be happy to do it when you're not asking, which is why they are trained to not do it, and also web services are often fed an invisible negative prompt specifically to prevent deformities.
I added the "three heads" bit when the images were regularly giving me two-headed ocelots when I was asking for one, kind of like this:
https://fastsdxl.ai/share/24xaesanrqxs
I got some really weird results when I gave it extra-long prompts. It's impressive in its own way, but normal-looking hands didn't work out too well.
Yeah, I always test hands first. It's so interesting how difficult the repeating pattern of fingers is for the diffusion models.
For some reason, this model has a very high chance to produce 6 fingers. Not 4, 5 or 7. Almost exclusively 6.
Speedrunning AI image generation is something I wasn't aware I wanted to see at the next AGDQ.
Could be the new Excel games
I have yet to find an image AI that knows what polygonal masonry looks like.
https://fastsdxl.ai/share/p21zqwfcuqoi
I've managed to torture it a bit and come up with https://fastsdxl.ai/share/imiouxwumgu3
It really wants "masonry" to be square. For reference, this is polygonal masonry.
goku wearing a top hat ( https://fastsdxl.ai/share/qgbzs2xo7dl8 ) is a bit bugged out but that's still pretty neat.
Actually, I think this model just has some major shortcuts taken, because "sci-fi eiffel tower" should not produce this piece of shit.
Maybe I'm just missing something, but I don't see the appeal. When I've played with the Bing Image Creator (i.e., DALL·E 3), a lack of speed wasn't my biggest complaint. It's been accuracy and the proper handling of more than one subject.
That said, my D&D group has certainly gotten a kick out of using Bing to generate images of moments from our campaign. Even if they are hilariously inaccurate 75%+ of the time.
The hardware cost of letting everyone use these things for free is suuuper unsustainable at the moment, not to mention the environmental impact of all that power.
From a business perspective, 10x faster is 10x cheaper (same number of users on a fraction of the hardware). From a user perspective, that means less aggressive enshittification when they come to start monetising these things - and a much smaller step to being able to run them locally, which takes that out of the equation entirely.
The research side is where it gets really interesting, though. Similar result quality at 10x speed basically means the model is 10x “better” at predicting (I’m glossing over a ton of details here, but it’s true-ish in a big picture sense). Those training techniques and the theory that underpins them extend to result quality as well as speed, so they’re going to form part of the foundation for the next generation of models, and so on up from there.
That's the thing, though. In playing with it, this isn't similar quality to what I've been seeing with DALL-E 3. It's way worse. Faster, yes. But I'm not sure the ratio of quality to speed is in this thing's favor. If it's 10x faster and 15x worse, we've gained nothing.
Sorry, I meant similar to base SDXL, which it’s created from. There’s some quality loss, depending on style, step count, etc. (quantitatively measured in the paper, if that’s meaningful to anyone), but overall it looks to be more than offset by the speed increase. DALL-E 3 is a different system, with different architecture, different training data, and a very different method of parsing user interaction so it’s a lot more difficult to compare directly.
This isn’t me saying “therefore this is the right model to use”, at all - it probably isn’t, for a lot of users, given that prioritising absolute quality is still the goal in most cases - just that from the perspective of the people building these things it’s a meaningful step forward that has applications for both quality and speed.
Come to think of it, I’d be interested in how PixArt-α LCM results look to you? That’s a similar (but not identical) distillation technique applied to a newer diffusion transformer based model.
I threw some of the same prompts at it that I threw at FastSDXL.AI. It seems… better? But, interestingly, it makes some similar mistakes.
Here are four samples from all three models. They do pretty well with a single subject. Though, inspired by u/Akir, if that single subject is picking his nose, you get weird results from all of them, with the two "small AIs" producing results that are sometimes horrifying. (That sample didn't even include this nightmare fuel I got last night.)
At some point, I need to try again with multiple subjects. But I got some horrifying results last night from fastsdxl.ai when I was testing if it would allow political content. (I'm almost certain DALL-E 3 won't and I'm afraid to test it.)
I actually clicked my own link last night and it gave me a completely different image. Hopefully that means they're improving their model. But it was even more terrifying than the example you gave because it had the man's face stuck right inside a vaguely dog-shape monster.
I could be wrong, but I swear that when I tested link-creation, the seed number changed. So it could just be a bug with the link-creation code.
This is really interesting, I’ve run a few bits and pieces myself now too and the prompt “a man picking his nose” seems to be a surprisingly good stress test even for larger/slower models! I saw a lot of extremely bad results there, but they seemed similarly bad in the distilled versions compared to the slower versions, which suggests it’s a more fundamental issue; I imagine there are a solid few papers to be written on reducing those kind of “body horror” error conditions, and I’m betting that improved loss functions are going to be a significant part of it.
I appreciate you taking the time to play around - I didn’t work on any of these specific models, but I do work in the field, so I feel like I’m in a bit of a “can’t see the forest for the trees” mode a lot of the time! It’s good to see things from an outside perspective.
Most of my other tests were seeing if I could improve on DALL-E 3's renderings of the shenanigans of my D&D group. (We've taken to "illustrating" some of the more ridiculous or memorable moments of our campaign using the Bing Image Creator.) But I didn't save any of the results of these tests on the small/fast models.
Since these prompts almost always involve more than one subject, DALL-E 3 really struggles with them. If I say "a human man and an elf woman," a lot of the time, both of them are elves. Getting an undead/skeletal dragon also seemed beyond all of the AI models. In fact, if you ask the small/fast models for an image of "a human paladin surrounded by skeletal/undead warriors," the human paladin is almost always also rendered as a skeleton.
That said, I'm not a very good "prompt engineer." I know my sister-in-law has sometimes had better luck rendering scenes from our D&D sessions than I have.
One of the really fascinating bits of watching this whole field develop is that the UI needs to get worked out just as much as the underlying tech. I’m not quite old enough to remember the shift from CLI to GUI, but I do remember video games moving from 2D to 3D, and it took a good few years there for the industry to hit on controls and norms that made sense - this feels kind of similar.
The thing that particularly brings it to mind is that pretty much exactly what you’re asking for can be done easily enough with multidiffusion - just scribble in the rough area where you want something and attach a specific prompt to that particular region - but turning that into a meaningful, understandable, and usable UI is a goddamn mess right now. We’ll hopefully be getting past the “everything needs to be a text prompt” world in the coming year; sometimes words alone just aren’t the way to communicate something, either to a computer or to a human, but I don’t blame the big guys for being a bit cautious in how they roll that into their products.
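For the curious, the core idea behind multidiffusion is surprisingly small: run the denoiser once per region prompt, then average the predictions pixel-wise wherever the region masks overlap. A toy sketch of that fusion step (a NumPy stand-in, not any real pipeline's API):

```python
import numpy as np

def fuse_region_predictions(preds, masks, eps=1e-8):
    """Blend per-region denoising predictions into one update.

    preds: list of arrays, one noise prediction per region prompt
    masks: list of same-shaped arrays in [0, 1] marking each region
    Overlapping regions are averaged; uncovered pixels stay zero.
    """
    num = np.zeros_like(preds[0], dtype=float)
    den = np.zeros_like(masks[0], dtype=float)
    for pred, mask in zip(preds, masks):
        num += pred * mask
        den += mask
    return num / np.maximum(den, eps)

# Two overlapping "prompts" on a 1-D canvas of 4 pixels.
left = np.array([1.0, 1.0, 1.0, 0.0])   # mask for prompt A
right = np.array([0.0, 0.0, 1.0, 1.0])  # mask for prompt B
fused = fuse_region_predictions(
    [np.full(4, 2.0), np.full(4, 4.0)], [left, right]
)
# Pixel 2 is covered by both masks, so it averages 2.0 and 4.0,
# giving [2.0, 2.0, 3.0, 4.0].
```

Averaging in noise-prediction space is what keeps the region boundaries consistent with each other, and it's why a scribble-plus-prompt UI is feasible at all; the hard part, as above, is the interface, not the math.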
If I could just isolate each subject in my prompt with parentheses, brackets, or quotes, it would go a long way towards helping the AI know what I'm asking for. Presumably.
As it is, it's like my whole prompt is a series of individual strings (i.e., each word) joined with AND operators. But in my testing with DALL-E 3, there's no difference between "brown-haired human male paladin with a teal-haired Drow female rogue" and "(brown-haired human male paladin) with a (teal-haired Drow female rogue)". It's no surprise it mixes details between subjects if it can't tell which words describe which subject.
Totally makes sense! I will say I've found the chat context handling in Bing's "Designer" mode to be decent - probably better than the compound sentence parsing, as you're seeing - so I had reasonable luck getting it to build up the desired result step by step, waiting for it to generate an intermediate set of images each time:
Please create an image of a brown-haired human male paladin in a D&D setting
...
Add a castle to the background
...
Add a teal-haired Drow female rogue standing next to him
or
Create a digital art image of a mythical dragon
...
Please make the dragon skeletal and undead
It's not perfect, not by a long shot, and you have to pretty much consider the intermediate images as throwaways (it'll use them as a rough style guide, but you definitely can't rely on it retaining details you liked from them in the next iteration), but the results did feel a lot better composed to me than doing it in a single shot.
I'd still say that's not great UX, to be honest: it relies on the user considering the internal state of the system and working around it in a way that is probably less intuitive than giving it as much info as possible up front would be, and the type/wait/iterate loop does feel sluggish and kinda frustrating to me, but with any luck these are the kind of things the product people will be smoothing out as the tech matures.
See, I wasn't even sure that it would be willing to iterate on anything it creates. The UI suggested, to me, that each rendering stood on its own, unlike the chat interface of Google Gemini or even the free version of ChatGPT.
Edit: I just tested it, and it seemed to take my second prompt as if it were a completely unique request. But perhaps I just need to select one of the four images in particular first...? Or perhaps it's the fact that I'm in a mobile browser...? Hmm...
Aaand I've just realised that the UI they're giving in the link you mentioned at the top of this comment thread, branded "Copilot Designer" on a Bing URL, is completely different to the "Copilot Designer" UI that's presented by selecting "Copilot" on the top menu of the Bing homepage and then "Designer" from the sidebar - even though images I've generated in the latter do show up in the history of the former - that certainly doesn't help when it comes to figuring out the subtleties here!
What I'm seeing for context.
Weird.
I experimented with using this other technique, but I'm sad to report that it doesn't seem to have improved its ability to handle this particular prompt.
What I was going for: a paladin is "knighting" a woman who is kneeling before him in a snowy vineyard in the forest. (She is actually swearing a Paladin oath before/to him and his god, but it turned out to be kind of a knighting ceremony.)
Every attempt I have made at this has resulted in both of them kneeling or just generally getting the placement of things wrong.
Interestingly, when I tried to iterate on it, it was as if each rendering was a completely unique prompt anyway. And it was making her much paler than she was supposed to be.
Faster means it's easier to home in on an accurate result.
Right now these tools need a lot of guidance through in-painting to get a desired result. Think about what you're prompting and the result you expect. What I've seen from people interacting with these natural language-guided AIs is similar to the beat-tapping problem. Someone tapping a beat to a song they're thinking of expects a listener to be able to determine which song they're thinking of. But really they haven't given nearly enough information.
In-painting with separate prompts for each in-painted section forces the user of the AI to give more information. You'll start with a canvas and then fix each part that diverged from your expectations one-by-one. It's like programming in that you need to break a big problem into a lot of small ones that you know the computer can handle on its own.
As a tangent to your mention of DALL-E, I had an interesting experience when comparing this model to it. OpenAI refuses to generate images of people picking their noses. This model does not, but it gives you very, uh, interesting results.
Mother of God.
I don't know what happened there, but that is horrifying.
404
Weird. It works for me. Even in a private browsing tab.
Works for me now 🤷
Your link appears to be broken.
Edit: it appears to work if you get rid of the .png at the end of it.
Try it again. It somehow works for me now.