Discussing AI music - examples and some thoughts
I'm not sure if this would be better for ~music, ~tech, or what, but after messing around with Udio for a bit, I made some stuff I liked and wanted to get folks' thoughts. Imo, it's incredible to be able to get music from a text prompt - it means I, as someone who is mostly ignorant to music production, can have my musical idea and actually render that out as music for someone to hear. I can think "damn that would be cool" and then in kind of a fuzzy way, make it happen then and there. Whether it's good, I don't know. That's not up to me, really, but it is the kind of sound I wanted to happen, so I'm left conflicted on how to feel about it. Figured it would be worthwhile to show folks some of it, and see what they think.
I do enjoy synth and metal, so there's a lot of that in these. Feel free to be as critical as you like. If I can apply your criticism I will try to do it, and if you want to see how that works out, I'll share.
- Cosmoterrestrial
- A Floyd, Pinkly
- Empire's Demise, Foretold
- Metal for Ghosts Bedsheet Edition (the very end of this one is hilariously appropriate)
- Multi-3DS Drifting
And here's a link to my profile, if you would like to browse. It will update too when I put more up.
They're all instrumental. Lyrical music is less appealing to me in general and Udio's voices do sound kinda weird to me more often than not. The way I made the tracks, I would start with a clip combining some genres/moods, and then add to either end of the clip until I had a complete song. Along the way, I could introduce new elements/transitions by using more text/tweaking various settings and flipping "manual mode" on and off. The results were fuzzy; I didn't always get what I wanted, but I could keep trying until I did, or until I got something that sounded "better". I wrote all the titles after the song was finished. The album art is from a text prompt.
I'm not sure what I think, to be honest. On the one hand, a lot of the creative decision-making wasn't mine. On the other, the song would not be what it is without me making decisions about how it came about and what feelings/moods/genres were focused upon/utilized. I think the best I can say is "use the tool and see whether it's enough to count". To me it feels almost 50/50, like I've "collaborated with my computer" rather than "made music". Does it matter? If the sound is the intended sound, the sound I hoped to make and wanted to share, is that enough to say it is "my music"? Is this perhaps just what it looks like to be a beginner in a different paradigm?
When I used Suno, I had a much more rigid opinion. What it produced, I called "computer spit". Because, all I could actually control was telling it to continue, changing the prompt, and giving it structure/genre tags that felt like a coin flip in terms of effectiveness. I had a really hard time trying to get it to keep/recall melody, and my attempts to guide it along felt more like gambling than deliberate decisions. It also couldn't keep enough in context to make the overall song consistent with respect to instrumentation. It's different with Udio, both because you have a lot of additional tools, and because it feels like those tools work more consistently at making the model do what you want. I still call the results "computer spit" where I've shown them off, but I'm unsure now whether the production has enough of myself in it to be something more. Perhaps not on the same level as something someone produced by playing an instrument, or choosing samples/arranging things in software, but also not quite the same as the computer just rolling along, with me going "thumbs up" or "thumbs down". Maybe these distinctions don't actually matter, but I'd be curious if anyone has thoughts along these lines.
I'm intentionally trying to avoid a discussion about the morality of the thing or what political/social ramifications it has, not because I don't care about that but because I'm in the middle of trying to understand the tool and what its results mean. Would you consider what I've posted here work I could claim as my own, or do you think the computer has enough of a role to say it's not? Is my role in the production large enough? Or perhaps you have a stronger position, that nothing the computer can possibly do in this way counts as original music. Does any of this change that position for you? I ask because I've gone through a lot of opinions myself as I've been following things, and one interesting bit is that I have not gotten any copyright notices when I've uploaded the music to YouTube (I did get notices with Suno's music). As far as I can tell, with what is available to me, this is all original.
And of course, the most important one: Did you like it? Is there something you think would make them better? Do they all suffer from something I'm not seeing/hearing? I'm not an expert technician nor a music producer, so perhaps my ignorant ears are leading me astray. Either way, I've had a ton of fun doing this, and the results to my ear are fun to listen to while I'm doing stuff. I wouldn't call any of it the best music I've ever heard, but I can also think of a lot that is worse. I think what I wonder the most is whether it comes off bland/plain. Most of the folks I show things to are a bit too caught up in being astounded/disturbed to really give me much feedback, so perhaps putting the request in this form will work out a bit better - y'all have time to think on it.
As always, your time and attention are greatly appreciated.
Edit: I should clarify. I am not attempting to be a musician. Hence calling it "computer spit" with anything public, and the lack of any effort to pitch it as something I did only on my own. Rather, I recognize the limit of my own understanding, and felt I'd hit a point where my ignorance of production meant I could not judge the results as well as I'd like. That means it's time to engage some folks because folks out there are likely to know what I do not and see things I can't. From that angle, a lot of the discussion is very interesting, and I'll be responding to those in a bit. But there's no need to argue for doing the work - I recognize that. I'm trying to see past my own horizons with a medium I don't put the work into. I'm a consumer of music, not a creator, so getting some perspective from folks more acquainted with creating and with the technology is really what I'm after in sharing the experience.
Edit again: Thank you all for a very interesting discussion. I had a spare evening/morning and this was a good use of it. For the sake of tying a bow on the whole thing, I'll share my takeaways as succinctly as I can manage.
It seems, at present, and at best, the role these tools can play is of a sort of personal noise generator. The output is not of sufficient interest, quality, complexity, etc., to really be regarded the same as human-produced music, is the overall impression I have been left with. And for other reasons, it may be that the fuzziness of it all is a permanent feature, and thus a permanent constraint on how far toward "authentic" the results can ever get. I was trying to avoid a discussion about my own creativity, the value of doing work, societal ramifications, etc., so I'll work on how to present things better. For what it's worth, this has all been part of what I do creatively - my area of study was philosophy, and the goal of that to my mind has always been "achieving clarity". So I am attempting to achieve clarity with things as they develop, as a hobby sort of interest while I'm busy doing completely different stuff and to better protect my own mind against dumb marketing and hype. So once again, I appreciate you all taking the time, and I wish you all well in all the things you do.
As someone who has been creating music since the 90s I can only say that this is the beginning of the end. Not that this generated music is anything good but some of it is getting close to stock background music level.
Of all the things AI could do for us it seems a lot of it is about solving problems that didn't need to be solved. There already is way too much music created by actual humans, more than you can ever listen to in a lifetime. Why does anyone think adding a gazillion AI-created pieces of muzak is something that is needed?
And to the OP. You clearly love music and you are creative, why not spend some time learning how to actually do this yourself and be in full control? Actually creating something is the greatest thing in the world, prompting AI to do this is pure dystopia to my musician's soul.
I don't see the point.
Thanks to the Internet there is an endless supply of music of all varying genres I can listen to, and real actual artists with a reputation that I follow. Same as in visual media. A real artist, with their own style, catalogue, and reputation, is something I can learn about and engage with over time.
"Art" from a machine - even if the prompt has been tuned by a human - is much too uniform across the board to do this. I can't follow Thomas-C, I can only follow Udio. I can't follow any digital painter, I can only follow Stable Diffusion.
And each model is just the combined and reduced average of all the data it was trained on, so naturally the outputs are... average. "Muzak" is apt.
Every so often there is a remarkable piece. Usually obtained with some transformer or particular prompt that narrows the output to some small subset of the training data. All that's done is rip the interesting style from those creators. I'd rather see their art, instead of the AI regurge that's based on it. What good fortune that it's fundamentally impossible for an LPTM to cite sources for a given output!
So, like, what's the point?
Aside from the obvious, that for business the point is to spend less (and we should demand living standards regardless of job status because this will eat everyone's lunch), for individuals the "purpose" is going to be murky.
Honestly there will always be artists and consumers who are more interested in sharing human made things with humans, regardless of profit. Digital painting hasn't killed physical paintings. Nor will AI kill the arts. (It will kill the arts being profitable for most).
Perspective shift: if most humans became expert level physical painters or physical musicians overnight - you'd be in a similar position asking what's the point. If the talent pool is oversaturated, would that stop you from creating?
I'd still be painting.
Strong agree, and it's sad. If we agree that commercial AI art will destroy professional artists, then the only art we're really talking about is that of individual creators.
Do you expect individual creators using AI to be more or less fulfilled than individual creators doing the art themselves?
I am morally opposed to recommending AI as a creative outlet.
Creators - most humans, I believe - have an intrinsic need to create. That won't go away if all other people become expert, and it won't go away if AI becomes expert.
My concern with AI is it gives a false creative outlet to people. They put in a prompt and say, "I made this," then they don't go create anything real. There's a hard limit on the quality of the AI output. There's a hard limit on the joy and satisfaction the person can get from "creating" the thing. There's a hard limit on the innovation that can occur (none). None of those problems are present if all people became experts in a given craft overnight.
The "AI evangelist" culture that's appeared is especially concerning to me, since it ignores or outright denies this. Plug a prompt into the AI. Declare that's the best form of creativity. It's more efficient. It's more accessible. It's more rewarding.
It's just not.
Sure, it reduces cost. And you get what you pay for.
What happens when you and all your competitors are running on AI slop? What's your competitive edge?
What happens when all your marketing materials are identical to all your competitors' marketing materials, and the general public is numb to AI slop?
What happens when people (regulators) notice the privacy implications of centralized models like these? Is every corporation going to self-host? You know how expensive that'll be?
I don't expect these to doom AI adoption, but I expect them to contain it to some extent in the long run. I think any company that goes hard on AI is going to find themselves in hot water in the future once any/all of these things occur. I also somehow doubt that any jobs lost to AI are going to be returned to humans when any of these occur. More likely they'll be shunted to a smaller group, somehow expected to be more efficient.
Liberation from being influenced by advertising sounds heavenly (unless you rely on a pension fund backed by consumer goods corporations).
That's just prompting and using the raw AI output. For someone whose editing skills far surpass their fresh creation skills, AI gives them the pieces to massage together into a collage.
If it's not merely typing in the prompt and selecting your favorite variation, AI allows people to focus on different skills. The canonical example is someone who has written a story and programmed some fun mechanics to make a game. However, they lack the art skills to execute their vision without hiring external help or asking the computer. Hiring for such a project, however, either means paying out a prohibitive upfront cost or altering the game for marketability so that the artist is confident that it'll turn a profit for his points on the package.
I agree, but why would the business pay for AI marketing? Why are businesses paying for AI marketing? I don't think it's wise on their part.
That's a compelling example. The critical part I suppose is the sense of ownership or stake in the work. Writing the prompt and copying the output is trivial, so there's not much stake there and not much fulfillment. The editing - or the programming or whatever other composition - on top of it is where the real work and fulfillment comes from.
Maybe it's semantics, but I'd argue in that case the AI is not the creative outlet; the editing or other composition is. You could do that with other (human-made) assets if you had them. The AI is just a cheap means to get some. Then you start getting into the weeds on copyright and all, which I consider a separate issue and don't really want to discuss here. As far as recommendations - I'd concede it's probably fine to recommend AI as a way to generate assets for some other more specialized creative outlet.
There is something to say that the quality of those assets is lower than if you hired a specialist, same as anything else, but that's probably a fine tradeoff for an independent or noncommercial project.
I also think it's more virtuous to hire a human to do the work, but I recognize that's obviously not possible in many cases as in your example. "Programmer art" is probably more virtuous and authentic, and I'd personally prefer it, but it's certainly not part of the vision so obviously not as attractive to the creator.
I guess that's also not too different from what @Thomas-C did; the workflow they used is a sort of editing process, where they have the machine generate musical snippets and then splice them together. Creating mashups or some other format of editing pre-existing content might feel similar.
I agree with you on that point. As you point out, the AI is a means to generate resources.
Since you and Thomas C mention splicing musical snippets together, that points to another part where AI music may come into play: bypassing the Looney Tunes lawfare involved with clearing your samples legally. Yes, it's blatant copyright laundering, but it also gives your defense attorney an argument of "blame the AI company" in inane sampling suits, where the same waveform with an edit log linking that pitch-shifted bass hit to a real recording would be a slam dunk for the plaintiff. It's not like copyright stops the good folks at /r/mashups, especially since they can monetize live shows & Patreon, but this paves some path for them to openly sell their music.
If humans have the need to create but there's a limit to the quality and personalization that these machine tools will provide, the people who are really passionate about giving their exact voice to the world will only use these tools until they just can't convey the proper voice, and then they'll drop them for tools that give them more control over the creative process. And if the general pool of automated art is so generic, the things that have true voice will be worth paying for because they're different.
Edit: oops, I misread and thought you were the same person I replied to originally. I'll leave the comment here unchanged since it's still relevant, but bear that in mind regarding the tone.
That's more or less what I was trying to get at, without the rambling to try to defend my position. Which brings me back to: what's the point?
As an artist, it's unfulfilling.
As a consumer, it's not engaging.
In a business, I expect it'll underperform and be uneconomical in the long run.
Wouldn't they be better served learning those skills from the start? The AI seems like an unnecessary intermediate tool to make it more likely for the median artist not to pursue manual work. I know that's just speculation, which is why I specifically said I'm morally opposed to recommending AI as a creative outlet. I'm not sure if it's moral or not to use AI as a creative outlet; that's a whole other can of worms with copyright and ownership that I'm not really talking about here.
And that applies to all contexts, not just art. For example using ChatGPT as a learning tool has similar issues: for search, writing, programming, etc. More effective in the medium-long term to do the work for oneself.
So whenever I see something advocating (or which could be perceived to advocate) use of AI for that purpose, and if I have time to engage, I'll pipe up and advise against it.
My honest hope in the end is that people come to value human work more in the long run, including the work of simply making decisions about the minutiae of a creative endeavor. I can say, fooling around as much as I have with these tools has made me more appreciative of the other music I listen to, because their decisions have become more interesting. Even music I once thought was kinda bland, sounds nicer to me, because there are tiny ways they're making more interesting decisions than what the tools did, and the tools being here reminds me constantly of the human's presence in the work. I left this out of the post, because I wasn't trying to discuss myself as much.
I don't care about being in full control because I'm not intending to become a musician. I have creative outlets and one of the interesting things about these tools in particular, is that they utilize one of my outlets for a new purpose and take very little time to use. I like to write, and now writing can make music happen. So what I'm really curious about underneath all of this, is a more philosophical question - if a writer could write in such a way that they made music, would that music "count"? If they know what to say to make a song that is as connected to them, as expressive of what they mean to convey, as "complete" as the thing a musician produces with instruments/software, assuming all that becomes possible, is what the writer produced "lesser"?
I don't think the tools allow precise enough control to really be asking that deep of a question around it, not yet at least. But it certainly seems possible, at least from my admittedly ignorant view, that such a method may one day happen. I can see how I've hit my own horizon, so I would like to try to see beyond it a bit.
Many anti-AI people seem to miss this. Especially for music & images, they're often the means to a larger end. Think of the person who wants AI to make a soundtrack for their movie b/c they don't have the budget to hire a composer. They're not a real musician and do not intend to be.
I'm not sure I read it as "anti-ai" so much as the result of engaging with someone folks don't know. There's not going to be a way for folks to know my attitude until I clarify where things go in different directions, and whether someone is pro- or anti- isn't really my focus. It's part of it, but it's a part I see/hear pretty often, so I try to make it clear I have different questions/am not approaching it in the ways being implied. I sure don't mean to evoke feelings of dystopia or demoralize anyone. I'm not attempting to supplant their talents or usurp anyone's status. The tools are interesting because they make me think of bigger, more complicated things, that might really mean something if they're not impossible.
On the flip side, imagine a musician that wants to make a music video for their new song. They don't want to be a director and don't intend to be.
In this hypothetical scenario, we would get a fully scored movie and a fully shot music video for the labor costs of two people, instead of both of them having to work on the same project and the end result being a single work.
I think that's a positive that a lot of people overlook when they talk about this technology taking everyone's lunch. It can be used to enhance an artist as well.
As someone who dabbles in a few instruments, I find this to be a curious way to think about music composition. I mean, sometimes the process works as you described: I'll have a vibe in mind ahead of time, and then I'll hammer away at the piano until I find something that sounds right.
But other times I'll have nothing in mind at all, and I'll just play until I find something I think sounds nice. Or maybe I'll already have a motif stuck in my head, and then I'll just sit at the piano filling it out. Here's an example of me doing exactly that. You could describe this piece however you'd like, but the piece came first and the description came second.
But in my opinion, the primary creative work in music production comes from composing the melody (followed by composing the harmony, and then from doing any processing work). If you're merely describing instead of composing, then I see your role as more analogous to a film director referencing a temp track. The composer gets the credit for the artistic output, not the director.
So to return to your question: I would say that the writer did not "produce" the song, as they didn't do any of the creative work involved in writing music. However, that doesn't necessarily mean that the song itself is lesser; that's a matter of whether one thinks an AI model can be artistic.
I appreciate your point about melody because it's what I was thinking most about while trying to figure out what this experience was. I was never comfortable calling any of the content produced "my music", and left a bunch of questions open, because I couldn't arrive at the melody I was looking for by just, well, doing it. At best, I could get an approximation, a fuzzy thing similar to but not exactly what it was I wanted to make happen, and I had very limited ability to make detailed/minute alterations. And thus I got curious about where such an experience sits with people, what the results mean to them and what it would mean if the tools went further. Of course, were I producing the things myself, they'd be very different, but that's not a fruitful discussion because I'm not going to be producing any such tracks. I'm glad some folks think I would be capable (I guess those titles landed more than anything else) but I'm serious about having no desire to do that. I'm exploring a weird thing, as part of where I do invest creative energy, which is into writing, discussion, and the philosophical pursuit of achieving clarity. Calling myself a philosopher is just asking for discussions I don't care to have so I left that sort of stuff out of the initial post, knowing it meant I'd have to do some work to clarify when folks interpreted me as an aspiring artist.
The philosophical questions get at trying to understand, is there a point at which a writer could escape the "film director" role, or is there something fundamental here which means such is never going to be possible. Folks have been gracious enough to lay out some of the technological reasons why this might be, but I'm also interested in the less technological thoughts/opinions folks have around it, and so I appreciate just as much what you've shared, too.
It already has been said by others here:
If this software could generate music following extremely precise detailed prompts it basically becomes a clunkier version of a DAW (digital audio workstation). At that point it's you creating the music.
I actually have been working together with people that can't write music themselves but do have a lot of ideas. Basically they were 'prompting' me when sitting together in the studio. A big difference is that I am asking a lot of questions as well in such a situation. Sometimes this becomes a very enjoyable creative process.
I think current music AI cannot go there though as it is not creative itself, it just averages out its initial training data and lets you filter this.
I agree. The kind of music that AI writes is really good at making earworms, but it’s really bad at making highly directed specific musical phrasing. The reason why I have such peculiar tastes in music is that the human element is the most important part to me, and that’s the reason why highly workshopped pop songs are so unappealing to me; they are boring. King Gizzard & the Lizard Wizard have some songs that kinda sound like music you may have already heard, but they are still special because they are filtered through the experiences, expertise, and skills they have grown through the years. You can’t replicate that through AI, and I think most people agree that you shouldn’t be allowed to do it in any case.
Case in point, here is the first full song I made in Udio: https://www.udio.com/songs/v8tkiHyhoeKi31yo4GFdCx
I think it’s a pretty good song, but it’s very much not anything like I would have expected. As a joke I asked it to be about the recent DJT trial, and the lyrics don’t really reflect it. There is just a bit of an allusion to punishment. When it comes to style, it also misses the mark. Admittedly asking for something in the style of Yuki Kajiura is a really broad ask, but this song is closer to something like modern era Do as Infinity or one of many Anime OPs in the late 2010s.
As a musician and music enthusiast, I am not too worried about AI displacing the highest level of musical art. Pretty much every example of AI music I've heard (including the examples provided by the OP - sorry) has been totally unimpressive. It's like muzak, except that it dials into general musical tropes associated with the prompts used to generate it. I could make music of a similar caliber if you gave me those same prompts, and then demanded the music be made with the absolute minimum time and effort possible, perhaps enforcing this rule by periodically giving me painful electric shocks so that I really hurry it up.
I know that AI has been improving very quickly, but I suspect AI music generation will not be able to surmount these flaws. The features which make a piece of music sound really good, inspired, beautiful, etc. lie so deep in its compositional and timbral qualities that attempting to resuscitate or reinvent these features by finding correlations between other pieces in a data set seems impossible. For example, consider the prompts used for the second track provided: 'rhythmic, jazz pop, lush, melodic, passionate, bittersweet'. Some of these are so general they are almost meaningless. 'Rhythmic'? 'Melodic'? What music doesn't have rhythm and melody? 'Jazz pop' is likewise pretty vague, encompassing a wide range of subgenres, concepts, musical forms, etc. that aren't necessarily cross compatible.
But then you have terms like 'lush', 'passionate', and 'bittersweet', which really do seem to hit on something special about music. Music can't be good simply by being 'rhythmic' or 'jazz pop', but it can be good by being 'lush'. That is true artistic value. But what does it mean for music to be lush? I think many would consider dense instrumentation to provide a sense of lushness. But in contrast I would also consider Ralph Towner's solo guitar work to sound very 'lush' even though it's literally a single instrument, and often a single melodic line being played. Or maybe lush indicates a profusion of compositional detail - but then again, you could have an instrument playing a rather simple repeated melody, and render it 'lush' by piling on effects and filters.
So even though these artistic labels - 'lush', 'passionate', 'bittersweet' - can be applied so quickly and obviously to particular pieces of music, they are not specific concepts like 'a tempo of 128 bpm' or 'two tenor and two alto saxophones playing block chords'. They are more like ambiguously defined goals, which can be traveled to and evoked via myriad paths. This makes them essentially useless for creating AI music. The paths may average out to some weaker and less distinctive middle path, or even to something completely useless - like averaging two roads that go past either side of a lake by driving directly through the lake.
It is possible to use prompts which specify more concrete elements of music, and I think this has a chance of coming closer to producing something interesting. But it's a matter of degree, and I suspect that in order to get something really interesting, you would have to furnish so many details that it would be nearly as easy (and infinitely more rewarding) to just write the music yourself.
Of course, this has all been commentary on music purely as art. I think a lot of AI music pieces are probably sufficiently inoffensive to be used for commercial purposes. So there may be economic effects on career musicians, working to produce lowest-common-denominator music for advertisements and theme parks and low-budget indie games, which of course remain to be seen in the future.
This is a fantastic response and I appreciate you offering it. There is no need to apologize. My own assessment is almost exactly the same - it's background music at best, suited really only to situations in which I am not actively listening.
In another comment I mentioned decisions - you've articulated what I was driving at far better than I could. Creative titles and enjoying myself doesn't mean I have much respect for the output, so to speak - it surprised me, so I wanted to see to what extent was that "just me" or actually something impressive. Can't go hunting for that if you're not ready to hear criticism, so I invited it. I think what impresses me is the broader picture - that for free, and then for potentially ten dollars, there is this weird thing that can sort of make what you tell it to, and this one does music.
My opinion of the results is very rigid and harsh - it isn't art. It can't be, because of those minute decisions people make in rendering their expressions - it is the complete set of those decisions that makes art what it is, and the computer cannot do it. No amount of fuckin with tags could produce something akin to what I typically listen to.
The only way I could see that change, is if the tools became complex in such a way that most people considered it a valid way of producing music, but that is practically the same as what you said, the process would be as involved and so, might as well do the traditional. But that argument does make me wonder, about whether that cuts off the possibility of realizing a tool by which one's ability to write could produce increasingly complex renderings to the point that it is equal to traditional production methods. There may be people who today do not make music, but would, if they could directly translate their skill in a different medium to Music, so to speak. I feel like that is jumping too far ahead, so I haven't thought much further on that.
It's also possible that thought rests on an assumption, of tech that isn't actually possible. Part of sharing the experience was hoping to get clarity if that was the case, if anyone out there could explain that end of it. What I hope for is that folks recognize better what makes things what they are, because at least for me that has been a constant thought in using these things. It makes me think much, much more of the people behind the stuff I like, and appreciate some things I did not recognize before. I don't think someone has to get to the point of articulating it that way, to be that way about it, so I'm interested even in a simple "this sucks because [thing]" or "this part was decent because it did [thing]". It's what it means to understand stuff in context, to me at least.
In using the tags, I can't say I was even thinking that much about what they meant, but rather observed the changes they would bring about and attempted to nudge the model with them - you do generate a godawful amount of clips over time, especially when it just won't do anything halfway interesting, which is often. I went with what I considered "pleasant enough" and just tried to carry it through in a consistent way, as I did with Suno, and found the results much nicer. It does make me wonder how far this way of producing musical noise can go, whether it can become a complex enough process to be recognized as "valid", so to speak. If it can become complex enough, produce results interesting enough for most people (like, actually, for real, a majority), does that really mean anything for people who produce now, with "traditional" methods? Like, maybe we're in a phase where folks are stuck in novelty, but in the long run perhaps it is simply a different input method coming into being - that just means more people making more music, in the end. Again though, I can't really go far with that because I'm admittedly not knowledgeable enough on the tech end to understand what the near future looks like, much less "in the end".
This made me think of art in the context of 3D printing. Can a printed piece be considered art? Though they didn't craft it by hand like a carving, it is a direct representation of an artistic vision.
Now you could argue that directly designing in CAD is not analogous to entering prompts for generation, but in some ways I feel CAD is 'generating' geometry points for you. For example you aren't specifying which individual pixels to print around a curve, you have CAD generate the curve for you based on prompts. In the same way I feel AI music generation could advance to a point where it could feel as natural as asking CAD to curve a corner with a set radius. Maybe not to design an entire song for you, but more as a productivity tool within a DAW or something.
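To make the CAD comparison a bit more concrete, here is a minimal sketch of what "curve a corner with a set radius" expands into. The function name `fillet_points` and the `segments` parameter are made up for illustration, and real CAD software does this with far more sophistication; the point is only that one number from the user turns into many coordinates the user never typed.

```python
import math

def fillet_points(radius, segments=8):
    """Return points along a quarter-circle arc centered on the origin.

    The arc runs from (radius, 0) to (0, radius). Hypothetical helper,
    not the API of any real CAD package.
    """
    points = []
    for i in range(segments + 1):
        angle = (math.pi / 2) * i / segments
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points

# The user supplies one radius; the tool fills in every point on the curve.
corner = fillet_points(radius=5.0, segments=8)
print(len(corner), "points generated from a single radius value")
```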
Anyway thanks for the discussion. I haven't considered these ideas before.
After bashing AI music in my other comment, I should probably actually respond to the opener.
This in particular. I've fiddled with "prompting" before. I know it's not trivial and there is nuance to it. It does take effort to get a particular output. And I don't really want to discount that OP put that in, and that effort seems to have come from a real interest in music. That's valuable.
But...
The fact it takes so much effort tells me the tool isn't as powerful or useful as people make it out to be.
The fact that, even with all that effort, all the outputs are so similar to each other and bland tells me the tool isn't as powerful or useful as people make it out to be.
I think that effort would be better spent fiddling with a synthesizer or physical instrument. I think the output will be much more rewarding and lasting. Synthesizer in particular is pretty quick to get a simple melody and beat put together, even if you don't really know what you're doing. In my experience, while it might not leave the same impression as AI muzak, it feels more real and you feel more ownership of the thing.
Critically, there's room to grow and an unlimited skill ceiling when you are the artist. The "skill" ceiling on prompting is low to middling at best.
Work with the AI in a domain in which you have expertise. It's painfully obvious that the AI output is unremarkable at best. Why are we so quick to assume the output is remarkable when it's in a domain where we don't have that expertise?
Software and language are my expertise. I am thoroughly unimpressed with AI's output there. I'm not a painter, and I'm not a musician, but I have fiddled around with AI in those domains. I have no reason to believe AI is any more remarkable there than it is in software or language.
The last paragraph is the most interesting part to me, I'd be interested if you have more to say on that. You're totally welcome to discount my effort, if you feel it should be discounted on the basis of what you understand about how the technology operates and what it produces. I'm more interested in where people are drawing their lines and why, and figured the easiest way to get that going was to just share what I'd encountered and how I experienced it.
So, to reiterate, I'm not trying to discount the effort or the particular direction you gave to the AI. I'm specifically saying that, in general, all AI art in some genre tends to feel more or less the same as all other AI art in that genre. Even when you give the AI specific direction, it's hindered by all the other training data to just be... less, in some vague sense. Less character? Consistency? Intent? I'm not quite sure how to qualify it, and it's in multiple aspects.
The content below grew quite a bit larger than I intended. Sorry. I could probably strip out a lot of the repetition, but I'll leave it all in because I want to emphasize why I think all the different flavors of AI seem to behave the same and have the same limitations.
On software: software development requires precision. For a given problem there are usually only a few reasonable approaches to a solution. There are numerous ways to express that (think variable names and other cosmetic features of the code) but the approach usually only has a couple options.
Similar to asking an AI image generator to show you a blank white screen. There's really only one right answer, and it just fails to do it.
You ask the AI to come up with a solution for any reasonably sized problem, and it fails. It'll get the cosmetics right, it looks passable at a glance, but if you dig deeper the approach is all wrong. Like, it's not even close, even if I specifically ask it to use a certain approach. For simpler problems - especially the "classic" problems used for teaching purposes - it can get away on the numerous examples in its training set from textbooks and internet forums and language documentation etc. For any real-world problem it doesn't have those examples, and it just fails to synthesize any new information.
That "failure to synthesize" is why I always say AI cannot innovate in any domain. It can only produce compositions of things it has already seen. Regression to the mean is also relevant here.
On language: text output of all the AI I've seen is very... formulaic. It reminds me of the mid-tier 3-paragraph persuasive essays I'd write for English class in middle school. I was not a good student, I didn't care about the content. I'd just write filler in the required structure to get a grade. The AI text, even with clear direction to not do this, feels the same. No interest or emotion, no character. Just filler text to pad up a word count.
There are also the "context" limitations. Basically the AI has very little working memory; if you let it write too much or you let the conversation go too long it forgets things and starts repeating itself, contradicting itself, and other hallucinations. I often see people make claims like: "the AI on its own is limited, but in the future the AI will fact check itself, or come up with a plan and then work through the plan" or similar. I don't see that happening, just because it can't keep all the information for any moderate task in memory at once.
The context problem is harder than it might seem due to space complexity. Basically, to increase the context size by some factor, you have to increase the available computer memory by the square of that factor. Say you want to 3x the context size, so the AI can remember three times as much at a time; then you need to 9x the amount of GPU RAM that the computer has available. GPU RAM is expensive, and barring some major shift in the cost of computer memory, I don't see any real improvement there being economically viable even with all the hype and funding available.
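Here is the 3x/9x arithmetic as a rough sketch. It assumes a plain attention implementation that stores a full score for every pair of tokens, and it counts only that one table; the token counts and the two bytes per value are made-up numbers, and other parts of a model's memory use may scale differently.

```python
def attention_matrix_bytes(context_tokens, bytes_per_value=2):
    """Memory for one full n-by-n table of pairwise attention scores.

    Assumption: the whole table is held in memory at once (one head, one layer).
    """
    return context_tokens * context_tokens * bytes_per_value

base = attention_matrix_bytes(8_000)      # hypothetical starting context size
tripled = attention_matrix_bytes(24_000)  # three times as many tokens
growth = tripled / base

print(f"3x the context -> {growth:.0f}x the score-table memory")
```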
So consider those limitations regarding AI music:
"Context" corresponds with the amount of direction, and the length/detail of the music. Generate too much music and it'll start being inconsistent. Ask for many parts playing together and it'll fail. It doesn't understand motif or theme without extreme effort in the prompts. And there's a fundamental limit to how specific you can make your instructions.
"Precision" corresponds with musical fundamentals; keys, chords, time signatures, etc. You can't really give these kinds of directions to the AI because there's really only one right answer. You can only speak in generalities and theme, and the AI will select these things on its own.
And just qualitatively, as I said, all the AI music of a given genre tends to feel the same, just how all the text from ChatGPT feels like filler for a middle school essay. The topic might change but the feeling is still there.
Regarding AI images:
"Context" corresponds with the length of the prompt and the size/detail of the image. There's a limit to the resolution of the image. There's a limit to the amount of detail you can have in the image. There's a limit to how self-consistent the image can be. There's a limit to how many distinct subjects the image can have. And, again, the limit on how specific the instructions can be.
"Precision" corresponds with artistic fundamentals; shape, pose, framing, perspective, etc. You can speak in generalities and the AI will choose versions of these that look passable at a glance, but if you specifically request them or look too close it tends to fail.
And, again, all the output of a certain genre just feels like filler content.
I appreciate you following up.
Do you know if it is possible to achieve higher precision, or is there a limitation of how these models function which means such precision can never be achieved? Is there something in principle which forbids being able to use more precise language and get more consistent results, like an impossible computational limit, or a logical impossibility? I'm trying to expand my knowledge.
For example, allowing myself a moment to think way ahead, I imagine a scenario akin to someone in Star Trek making a song on the Holodeck. They're telling the computer what they want, as the input method for the computer. Everyone involved accepts the results, engages with those results as authentically as anything else; it is "successful" for all intents and purposes. The results there are written to be what they are, but I wonder whether it is actually possible to achieve something like this by way of what is currently developing. As in, pretend there's a model which will consistently do even the most minute "music things", and be consistent enough, that the people hearing the results accept it as "just as good" as things people made. If you can talk about music, you can make it, and what you make gets understood just the same. Ignoring the idea of how it compares to other methods, whether or not it's just as complex or what folks should/shouldn't do, is this level of functionality just not possible? I'm trying to work out where I am technologically ignorant, dispel notions and misconceptions.
So as a disclaimer, this is near the edge of my knowledge so take this as speculation with a grain of salt. I do know this is all active research; I'm not sure anyone knows for certain. If there was an easy way to significantly improve precision we'd already be doing it.
I speculate that there is not. I'll try to explain why I think that in general terms:
My understanding is generally the issues are not hard limits, but diminishing returns. Say we could get 2x performance by 10x the training resources. Is it worth it? What if we could 3x performance by 100x the resources? I don't know what the actual numbers look like, but in any case at some point it's just not economically viable.
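The two hypothetical cases above can be compared with a bit of arithmetic. This is only a sketch using those made-up multipliers, not real measurements from any model.

```python
# (performance multiplier, resource multiplier) pairs from the hypothetical above.
scenarios = [(2, 10), (3, 100)]

def resources_per_performance(performance, resources):
    """How many multiples of the resources each multiple of performance costs."""
    return resources / performance

costs = [resources_per_performance(p, r) for p, r in scenarios]

for (p, r), cost in zip(scenarios, costs):
    print(f"{p}x performance for {r}x resources: {cost:.1f}x resources per 1x performance")
```

In this made-up example the second step up is several times more expensive per unit of improvement than the first, which is the shape of the diminishing-returns worry.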
The basic way the models work is in two phases. First, they train on a huge set of data and categorize concepts in that data into an "embedding space". Imagine there's some coordinate (x, y, z, ...) that corresponds to "red", and another coordinate (a, b, c, ...) that corresponds to "yellow". The training phase runs on a huge set of data to figure out a way to convert from the input to the embedding space, then back to an output.
The second phase is inference, where a particular input is given to the model, then it converts that to the embedding coordinate, does linear algebra on it, and converts the result back to output format. Models that convert between text or sound or image have ways to convert multiple formats to the same embedding space. So you could put text in, convert to embedding space, then get an image out.
The main thing is that the embedding space encompasses all the concepts the AI can deal with. If a concept is missing from embedding space, the AI just cannot understand it. It doesn't interpret it right as input, and it cannot produce it as output.
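A toy version of that pipeline, as a sketch only: the three-number coordinates, the word list, and the function names are all invented here, and real models learn far larger spaces rather than using a hand-written table. It shows the convert-in, do-some-math, convert-out flow, and what happens with a concept that has no coordinate.

```python
import math

# Hypothetical hand-written "embedding space"; real ones are learned in training.
EMBEDDINGS = {
    "red": (1.0, 0.0, 0.0),
    "yellow": (1.0, 1.0, 0.0),
    "orange": (1.0, 0.5, 0.0),
    "blue": (0.0, 0.0, 1.0),
}

def encode(word):
    """Input -> embedding coordinate. Concepts outside the table return None."""
    return EMBEDDINGS.get(word)

def blend(a, b):
    """Stand-in for the 'linear algebra' step: the midpoint of two coordinates."""
    return tuple((x + y) / 2 for x, y in zip(a, b))

def decode(vector):
    """Embedding coordinate -> the nearest concept the space knows about."""
    return min(EMBEDDINGS, key=lambda word: math.dist(EMBEDDINGS[word], vector))

mixed = decode(blend(encode("red"), encode("yellow")))
print(mixed)                 # orange
print(encode("chartreuse"))  # None: this concept has no place in the space
```

Note that `decode` can only ever answer with something already in the table, which is the limitation described above.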
The reason I speculate there's no way to improve precision is the only way for a concept to be added to embedding space is for it to appear frequently enough, consistently enough, in the training data. If some concept is new, or rare, or overly precise, or inconsistent, it won't be in the embedding space at all and the AI simply cannot understand it.
So the "solution" is to add more training data, and hopefully the rare concepts will occur enough to reserve a spot in embedding space so the AI can understand it. Or maybe there's some trick to the training process to do it. Or maybe we can make the conversion between input/embedding/output more complicated. Or maybe maybe maybe... this is where we run into the issues of diminishing returns.
Another problem: it's well known that using AI-generated content to train new AI makes things worse. Unless there's some way to filter it out, all the AI content flooding the internet over the last year might mean the internet in general is just no longer a good source of training data. We might already have the best training data we ever will.... then how will we 10x or 100x the training resources in the future?
The alternative is to somehow let the model create or refine concepts in embedding space while it's in use. My understanding is this is fundamentally incompatible with the separate training/inference stages of the current paradigm. I expect this is another area of active research, but I suspect any improvement here would require a massive breakthrough and a comprehensive change to how these AI operate. I wouldn't hold my breath.
There are obviously other approaches being researched that I'm not aware of, but I suspect any significant improvement would also require a huge breakthrough and be similarly comprehensive. I suspect the current paradigm is close to, or already past, the maximum efficiency in terms of performance-per-dollar. I suppose we just have to wait and see.
I'm not sure I follow. I know the holodeck, but I'm not sure which parts of that experience you're referring to in the thought experiment.
I could certainly imagine a system with that voice interface - you ask it for some music, it starts playing, you ask it to tweak some things, it does, and you keep refining and refining. There's a question of how long you can keep the game up. The model has limits, and at some point you will run into them. Either the AI will start forgetting things from context that the listeners still remember, or your refinements will become so specific that they're incompatible with the embeddings and the AI just can't obey.
The other big thing is just statistics. When you tell it "play jazz" - that could be anything. There are a huge number of sounds it could play that are jazz. So it'll pick one at random. And, on average, you'd expect it to pick one that's average. You wouldn't expect it to reliably generate something remarkable. This is what I was referring to with "regression to the mean".
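A quick simulation of that statistical point, as a sketch under a big assumption: that each generated clip can be given a single "quality" score and that those scores follow a bell curve centered on average. Both the scoring and the cutoff of 2.0 for "remarkable" are invented for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: 10,000 "play jazz" requests, each scored on a made-up scale
# where 0 is average and the spread is 1.
scores = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

average = sum(scores) / len(scores)
remarkable_share = sum(1 for s in scores if s > 2.0) / len(scores)

print(f"average score: {average:.2f}")
print(f"share of clips scoring above 2.0: {remarkable_share:.1%}")
```

Under those assumptions the typical pick sits near the middle, and only a small share of picks lands far above it.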
I think what you explained about the embedding space and limitations/incompatibilities gets at what I was thinking about. That as a consequence of the model's relationship to training data and the nature of how it's producing an output, getting a tool capable of being precise in the way I'm thinking of may well not be possible. I really appreciate you taking the time to write it out. I think what I was driving at with my example was just to illustrate a maximal idea of what I was thinking the technology could be, trying to see how far toward that kind of interfacing this kind of tool could go, if that makes sense.
Total side tangent, but if you tell the holodeck to make a musician playing a violin, will the resulting music be created by the vibrations of the (photonic) violin strings, or from some other simulated source?
I think canonically it’s the former. Holographic objects have real mass with real physical properties. They make noise in the analog way. But I’m pretty sure it’s also canon that because the holodeck has finite dimensions but can simulate vast spaces, it is pulling off all sorts of weird perspective tricks so that each person inside sees things that look correct from their vantage point. Presumably it’s tracking the eye positions of each participant and distorting the holograms with bizarre impossible geometry so that they continue to look correct to all parties.
So that means the violin and its strings may not actually have the shape and physical properties that they appear to. Maybe the holodeck also compensates for that by altering the elasticity of the strings so they will produce a constant tone even as the violin transmogrifies to accommodate Picard and Riker walking in different directions around the room.
I’m not really building up to a point here, it’s just something weird I hadn’t thought about before.
Hmm. I think it would be much more interesting if AI could be used to generate the individual parts used to make a song, or to highlight a particular section and give it directions (“make it pop!” 😜) and have it play the producer role to “fix” the things you don’t like.
This is closest to what I'm coming to think of as a "realistic" future for these things. I still need to think on what all I've been given, but where before I was imagining holodeck shenanigans, now I'm thinking something like "you can click this and write out what you want to get something you can then work with yourself in the editor". That might be the absolute limit of it, a much smaller purpose than what marketing wants folks to think and not quite the devastation a lot of catastrophizing centers upon. We just have to see how it goes. I am attempting to see how it is going by putting myself on the spot a bit, I'll take the criticisms so hopefully I and maybe others can see where the dust is actually settling, is how I've thought about it.
I like to make mods for a VN game I play.
I can write. I can't draw. I can't compose music.
With AI, I can do all of that with writing. People enjoy my mods.
When people ask "What's the point?" it's for stuff like that.
I'm really interested to see what folks can do with tools that "fill in gaps" like that, because the way you put it is why I was interested in using the thing and seeing what it did. Being able to leverage writing to do other stuff feels like a magic power. It might be more like "beige magic" than "black magic" but it's still cool to experience.
I think AI music (and other art by and large) is something I would class as "a fun toy to mess around with, but the ethics surrounding their creation and use really put a damper on enthusiasm or desire to use outside of tinkering." Especially since I have very little artistic or musical talent of my own, it's a fun little widget to generate dumb theme songs for me and my friend's game night.
An ethically trained AI, using only properly licensed or open content, is something that actually sounds quite interesting with a lot of potential as a tool for creation, like open clipart for flyers, without the baggage.
Transparency about the training data imo is paramount, regardless of how little effort is broadly put into it. I'm messing with things because I can, because they're there and I've got a spare 30 minutes now and then, but it is ever-present, the idea that what the tool is producing ultimately comes from stuff it was not meant to have. If I could have assurance the tools were trained on data that was voluntarily given, free to use, etc., I'd feel more comfortable with calling it something that isn't derogatory. I won't call anything I show folks something nicer than "computer spit" until I have that assurance, I think.
I think that Udio is the first tool where I really felt like I had some creative direction with the music, since not only does it give you the ability to edit in a very granular fashion, it is also much better than others (e.g. Suno) at taking direction for which way the music should turn at any given moment.
The musical composition is eh. Like, don't get me wrong, it's way better than other options and is surprisingly decent improvisationally at times, but the end result is merely average. That said, it's fantastic having the ability to translate your musical shitposts into reality on relatively short timescales with minimal effort, and to end up with an "average" result.
I think it really shines when you write your own lyrics. I've been noodling around with it for that purpose.
The Ballad of Billy Balls - Nautical Folk, humorous, mildly NSFW.
The Author's Thinly-Disguised Something Something - 1950s Big Band, female vocalist
Wizard Brawl - Irish Drinking Song
Thunder Hooves - Mongolian throat singing power metal
The way that it matches the music to the lyrics (particularly in Billy Balls and Wizard Brawl) is very impressive to me, although almost certainly due to my lyrics adhering to common trope structures as much as any actual musical insight on the AI's part. This tool isn't going to make anyone a musician, but it certainly does make music if given proper inputs.
Glad to discuss with someone who used it. I avoided lyrical music both out of preference and a lack of talent so I'm glad to see some examples that involve it. I also tend to think about it as you wrote - musical shitposting. I can say, in a way I had not felt much before, doing all of this did get me really interested in learning more about producing music. I won't do it because my time is demanded elsewhere/I have other pursuits I'd rather go after, but as a lifetime, excessive consumer of music the ease of use and ability to get a "functional" result is appealing as a method of discovery. I especially enjoyed being able to plug in really different genres and just see what it did, because that gave me a sort of "template sound" I could use to find more human music that really suited my taste.
In doing all this I also came across a site, everynoise. If anything the ten dollars was worth the experience that got me to finding that site, because that site led me to a ton of new music I will be buying up and listening to.
I thought the first one was fairly catchy. The rest didn't really click with me. Something about the song structure/build up felt off.
The discussion here and your comments in your post reminded me of a video I saw a couple months ago. "Nobody and the Computer"'s short AI film: Robort. Touches a bit on AI and music. Amusing watch and less than 10 min.
I really do not like to think about a future in which economic incentives/pressures mean larger firms jump the gun and just worsen everything unnecessarily. I don't consider increased profit necessary either so it's a real pickle for them I suppose. The news bits in that video had me imagining a scenario where a worsening economic situation means folks just go and do AI shit and not care that it looks scary/weird, I hope something like that doesn't happen. If it got people to stop watching the news though I might not sweat it.
They all sound surprisingly good, just low-bitrate, which comes with the territory with anything AI-generated. We've come a long way from Dadabots' first stream, and even they're sounding better now. I'd even say they're on-target for what the names seem to imply.
I'm of two minds as a musician about AI art:
The creative process is about creation, being in tune with your piece of music, so that you are driven to create it, and the act of creation drives you to create it further.
Seeing what computers can do is absolutely amazing.
I think a certain element of artistic gatekeeping is acceptable. I will never consider an AI prompt engineer an artist, but am not opposed to considering them a craftsman in a sense, because it is a skill to create a prompt and get a desired output. It's not the same process, requires a different, lesser (but still significant) dedication to one's craft, and I don't think one should claim to do something they basically told another entity to do.
The primary concern I have with AI art in general is how people don't generally appreciate art, and so will be blown away by AI generated works with no care for how they were created. My artistic doomsday scenario is that people who listen to human-made music will be considered like people who consume craft beer or specialty coffee, just snobs of the highest order. I think this is very unlikely to happen as a whole, but will exist to some degree.
I feel the need to say I'm not a particularly skilled musician, but make music because it is fun to do so. I don't take issue with people making or even publishing prompted AI art as long as they are transparent about the process.
I'd also be pretty bummed with a future in which simply being into music means having to swat away impressions of snobbishness. I feel like I got a taste of that as digital platforms/streaming overtook physical media. The distinction you drew between "artist" and "craftsman" is an interesting one to me, because what it makes me wonder is whether the complexity of the prompting tools could flip the opinion later in time. That depends on things I don't know and from what I picked up in the topic, may not be a realistic possibility, but I am nevertheless curious about that kind of an inflection point. Such a thing will be a bit different person-to-person, so the only real way I'm going to understand it is by just talking to folks about where it is for them.
I used to teach beginner workshops on a variety of software. Courses where you're mostly just learning where things are, their basic function, how the software interacts with the filesystem, not really courses in Doing the Thing if that makes sense, geared around creative stuff like video editing, photo editing, music, visual art, etc. What career I had doing shit with computers was always oriented around being the first step toward other things, if that makes sense, because that world tends to be full of overpromising and marketing trickery. I wanted to be more "legit" and actually get folks started on doing cool things, because I knew enough about the computer to be helpful in that way. Anyway, the workshops were just "here's where things are, here's what you do with them, here's where you can learn more if you're feeling like this could work for you". Doing that sometimes left participants with a weird feeling, that music they once enjoyed was now "lesser" because they realized the extent to which it could have been assembled very quickly. I wasn't totally sure how to talk about it sometimes, but mostly stuck to telling folks that perhaps it was more important, what the music did for them, than how it was put together. I never really settled on how I would think about that, because I was already a voracious consumer of music and just kept going where taste led me. Doing the prompting brought me back to some of those conversations so I wanted to explore a bit and see what I could find.
I do agree with your final piece. Setting appropriate expectations I think is where folks' personal responsibility enters the picture, and part of me posting this was an effort to understand what those expectations might/could/should be given what is presently doable. The talk of what counts as art is fiercely interesting to me personally, but is absolutely not a topic of interest to people I know outside this forum. For the everyday person, the person who is not much concerned with artistic merit nor with being a musician, what I wonder is whether these tools could be presented differently and achieve something positive. Being able to produce your own, personal, pleasant noise could be nothing but fun if the output is kept in that personal realm - a thing you make for you, not really to be shared, not to position yourself as an artist/professional, but purely something you do for your own enjoyment/to enhance something else you do. I think part of the problem is a broader culture of everything being understood as potential pathways to Success, a tendency to think about the maximal and assume people intend to Be the Guy, so to speak. Art doesn't have to hit the height of taste, it doesn't have to be a labor of deep emotion, you can just fuck around with text and have fun listening to the muzak while you mow the yard and that be the entire experience of it, and that seems to be the role these tools can fill decently at this time based on what all has been discussed here/what I've seen elsewhere.
I think the real value will come from integrating these ML tools in a DAW format that gives you proper creative controls (like Stable Diffusion for Krita). https://wavtool.com does this a little bit, but it's not quite there. Imagine a drumkit generator where you can automate the latent space, automatic transcription that works, a humanizer, etc.
Another user mentioned the idea of what it would mean for the software to become a "clunky DAW" and given their full response around that, it makes me very interested in whether folks' impressions will change if that becomes a real thing. I'll check out wavtool a bit and see what I think, I haven't messed with it yet.
What I wonder is if there is a way to bridge things a bit. Like having the text model as a sort of "fuzzy fallback", a way to generate something that you can then use the traditional tools to shape and change. How that compares to the traditional process isn't my concern, I'm thinking more in terms of whether the generative tools could represent a new accessibility option. So, I am trying to understand where their results currently sit with people, whether what is here today really is the beginning of that kind of a path or if there are reasons why that won't be, either technologically, socially, creatively, etc. Whatever folks think is what I'm after. I've appreciated everybody engaging with it and sharing what they've had, the whole topic has been very enjoyable to engage with.
I’ve been playing around with Udio a ton, and I still don’t understand how music discovery is supposed to work. I want to find stuff I like and follow the creators and make playlists of my favorite songs. I know there is a lot of music published on the site but clicking around just leads me to the same handful of tracks. The search feature is basically useless. It feels like if a song doesn’t end up featured in the Staff Picks box, it’s basically invisible. Am I doing it wrong? Or is this just what the “beta” label means?
Imo it is the "beta" aspect of it - the site is a wreck for trying to explore and discover folks. Some of the UI/features even changed as I was doing stuff, so I mostly stuck to making my own and occasionally seeing what was topping their metrics.
Ah what a shame. I made a downtempo chill track I’m really proud of and I was hoping for a way to get it in front of others who are into that sort of thing. And to find similar songs I could curate a playlist of, since I haven’t come across anything else in this style on the site.
I’m not above sharing it here of course, lol. It’d be nice if Udio were better at this so we didn’t feel compelled to spam Tildes to find ears. I like that generative AI can create endless personalized media for individuals, but on the other hand it’s rather depressing that the experience is such a solitary one.
I enjoyed your track. And yeah I'm not really sure how to go about getting things in front of folks on there, my interest was a little more academic in nature too so it wasn't a huge priority/I probably missed out on details there. I figured one post would be fine, but don't intend to share more from this tool again - when another catches my interest, maybe, if I can come up with a decent discussion around it.
The notion of personalized media is an interesting one to me too. I have a lot of that, because a lot of the creative stuff I do isn't really intended to be shared outside a very small group. I did some things to learn about them, to enjoy myself, to have something for a very specific context, etc. Being able to talk more precisely is something I value too, and my approach tends to be to just Do the Thing and see what happens. An example with this tool - when I'm exercising I'm not actively listening. If anything I just put on something I know will be energetic and loud because that enhances the exercise a little. I'm using the music, not really considering it the same as when I'm actively listening. I wouldn't want to bother a musician with something that plain and I'd expect a few would bristle at the idea of their creations being deliberate background noise/filler. The tool suffices there. No one is meant to be experiencing that result except me; maybe a friend might like it for a similar purpose, but that's about as far as it goes in my mind. That's not nearly the kind of stuff being pitched from the world of marketing, and it's also not quite the same as what folks have come to dread about it. I wanted to put something out there to gather up some real folks' thoughts and come to a more nuanced/realistic opinion of it, to clear the fog for myself a bit, because I knew I'd hit a point where my own ability to judge things wasn't as solid as it could be.