If you're interested in the open source implementation of the DALL-E 2 model, it is going very well. The big hurdle right now is the training dataset. These models are trained on 400 million images, which means that even with a crazy-fast GPU that can process 50 images per second, it would take about 3 months just to train the model and test it. The good news is that these models parallelize easily: with 4 GPUs it takes roughly 1/4 of the time (a little more due to some sync overhead). But if there's a bug in the code that only becomes apparent once you've trained on a lot of data, you have to start everything over...
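For a rough sanity check on those numbers, here's the back-of-the-envelope arithmetic (assuming a single pass over the dataset and near-linear multi-GPU scaling, both of which are simplifications):

```python
# Rough back-of-the-envelope estimate of training time: one pass over the
# dataset, ignoring evaluation, data-loading stalls, and restarts.
DATASET_SIZE = 400_000_000   # images
IMAGES_PER_SEC = 50          # per GPU; already assumes a very fast GPU

seconds = DATASET_SIZE / IMAGES_PER_SEC
days = seconds / 86_400
print(f"1 GPU: {days:.0f} days (~{days / 30:.1f} months)")

# Near-linear scaling across GPUs, minus some synchronization overhead.
for n_gpus in (2, 4, 8):
    print(f"{n_gpus} GPUs: ~{days / n_gpus:.0f} days")
```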
What this means for the end user is that while these models are huge and a bit of a pain to train to the quality stated in the papers, once anyone trains a model and releases it publicly, anyone with a decent GPU can run it. The memory requirements for running a pre-trained model are significantly lower than for training it, roughly half. So any GPU with 12GB of VRAM should be able to run these models, and you need even less memory if you apply techniques that trim down the model (at the cost of slightly lower quality). So if you're shopping for a GPU and plan on running these models on your own hardware, don't skimp on that VRAM!
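As a minimal sketch of what "trimming down" a model can look like in practice, here's half-precision inference in PyTorch; note that the package name, model class, checkpoint path, and generate() call are placeholders for whatever open model ends up being released, not a real API:

```python
import torch

# Everything except the torch calls is a placeholder: "pretrained_text2im",
# TextToImageModel, the checkpoint path, and generate() stand in for whatever
# open model actually gets released; they are not a real package or API.
from pretrained_text2im import TextToImageModel

model = TextToImageModel()
model.load_state_dict(torch.load("text2im_checkpoint.pt", map_location="cpu"))

# Casting the weights to half precision roughly halves the VRAM they need,
# usually with only a small quality hit at inference time.
model = model.half().eval().to("cuda")

with torch.no_grad():
    image = model.generate("a cute mascot for a programming language")
```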
I also want to take this chance to say that DALL-E 2 is not the best text-to-image model around. It's just the one with all the publicity, since it's the only one whose results people can check publicly. Google's Imagen model overtakes it, and it can actually spell text properly. I'm following the open source implementations of these on GitHub and the LAION Discord, and what stands out for me is the simplicity of the Imagen model: it's a simple chain of diffusion models. That makes it easy to fold in improvements from the diffusion field (which is in its early stage, by the way; think of GANs pre-2017). A recent example is speeding up image generation to a small fraction of the diffusion steps (from 1000 steps down to 20-40). A possible future example is using the model to generate video instead of single images. Other models could also be adapted to do this; it's just that Imagen's simplicity makes it much easier to implement.
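To make the "chain of diffusion models" idea concrete, here's a schematic sketch of that cascade; every name in it is a placeholder for the structure described above, not actual Imagen code:

```python
# Schematic sketch of a cascaded ("chain of diffusion models") pipeline in the
# spirit of Imagen. Every object and method name here is a placeholder used to
# show the structure; it is not Google's implementation.

def generate(prompt, text_encoder, base_model, sr_models, steps=40):
    # Frozen text encoder (Imagen conditions on a large pre-trained language model).
    embedding = text_encoder.encode(prompt)

    # The base diffusion model samples a small image, e.g. 64x64.
    image = base_model.sample(embedding, steps=steps)

    # Each super-resolution diffusion model upscales the previous stage's
    # output (e.g. 64 -> 256 -> 1024), conditioned on the same embedding.
    for sr_model in sr_models:
        image = sr_model.sample(image, embedding, steps=steps)

    return image
```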
You might recently have heard of Parti, also made by Google, which performs similarly to Imagen but at a great cost: Parti uses 20B parameters vs Imagen's 3B. That alone puts it out of reach of consumer hardware, and therefore out of reach of most researchers. It's mostly an experiment in scaling up a previous architecture to see what happens, like OpenAI did with GPT-3, but this time I think most researchers consider it a dead end, since it's an outdated architecture and scaling diffusion models will probably yield much better results.
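For a rough sense of why 20B parameters is out of reach while 3B isn't, here's the weight-memory arithmetic, assuming fp16 weights and counting only the parameters themselves:

```python
# Rough VRAM needed just to hold the weights in half precision (2 bytes per
# parameter). Activations, the text encoder, and framework overhead add more.
BYTES_PER_PARAM = 2  # fp16

for name, n_params in [("Imagen, ~3B params", 3e9), ("Parti, ~20B params", 20e9)]:
    gib = n_params * BYTES_PER_PARAM / 2**30
    print(f"{name}: ~{gib:.0f} GiB of weights")
```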
I think this could be very exciting. I like to work on software projects here and there, and whenever you get to publishing one, the same problem comes up: art. It's in vogue to have a cute logo/mascot breaking up the monotony of text on the webpage, or the GitHub readme. I can try my hand at graphic design, but it's a crapshoot. I could commission an artist (and have once), but that's quite expensive in both time and money.
Or another example - you're going to start a blog/personal website. You want something a bit more creative for your header than just text in a nicer, larger font.
You write a blog post. You want some vaguely related visual to break up the text.
This user asked the full DALL-E to make a mascot for the Ruby programming language. I think quite a few of those are great! I would have zero hesitation putting one of those up in a readme.
And with the beta version of DALL-E, you get full commercial rights. Each run of DALL-E costs $0.15. It may not produce something good on the first try, but run it 50 times and cherry-pick the best - that's about $7.50 for low effort, little time, and fantastic results.
I brought this up before, but it'll also be great for stock image-ish things to go with articles. Honestly half the time you don't even care if it's that relevant. And full commercial rights! For 15 cents?
Unlike with GPT-3's text, drawing and design is, uh, quite a bit harder to do yourself, and the output works much better in isolation (it's easy to pair a DALL-E image with something else - what are you going to do with a GPT-3 poem?). I think there are lots of areas where the requirement for coherence with a subject isn't that high, and this is going to be amazing for those.
On a more negative note, I could see this decimating certain freelance art markets, which is somewhat ironic given the recent rhetoric about automation coming for menial jobs like long-haul driving before creative ones.
DALL-E is fun to play with and you can sometimes get usable results depending on your standards, but using it is a bit like coming up with search terms to find what you want using Google Image search. You might find a good enough image if you use the right prompt, or you might not. There’s no guarantee that it will interpret your query the way you want.
Some limitations I’ve seen:
It can’t draw text. Ask it to draw anything with words on it and it will be garbled. (Google’s Parti Project can do text, though, so maybe a future version will improve this.)
It can’t draw an accurate piano keyboard. The black notes won’t be in alternating groups of two and three keys.
It can do well at portraits, but is more likely to garble things when drawing multiple people in one image.
It’s mostly better than MidJourney, but I prefer MidJourney’s output for science fiction and fantasy environments. I think that’s partly because DALL-E has a better sense of scale: it’s easier to get larger-than-life objects, or just assorted sizes, out of MidJourney, while DALL-E is more likely to make things the appropriate size and line them up.
I expect there will be multiple services with different varieties of art styles.
Another thing artists can do that these services can’t do yet is draw multiple images with continuity, like you’d usually want when illustrating a story.
I wonder how many artists' jobs just went poof?
I was pretty surprised by the "full commercial rights" section. OpenAI has a history of being... less than open, but their pricing and availability seem reasonable (for a company). I wonder if open source recreations had an impact on things.
This is the first time I'm reading their list of "safety" measures in full and it's hilarious/sad how this is basically a step-by-step description of how this technology will be abused. Not "would be", will. And probably already has been, who are we kidding.