75 votes

GPT-4o

62 comments

  1. [17]
    arqalite
    (edited )
    Link

    I'm not able to watch the demo, but commenters on The Verge said it's borderline revolutionary (again?).

    I want to see it for myself once I get home. I'm more skeptical of AI as time goes on (especially scared of model collapse), but if this turns out to be what Google Assistant and Siri failed at becoming, it would be nice.

    Also can someone confirm OpenAI just destroyed Rabbit and Humane's entire reason for being?

    (Let me know how wrong/right I am, please.)

    EDIT: Holy fuck, that's uncannily human. You can still tell it's a bit artificial sometimes, and it's very enthusiastic at all times, which a human is not, but it's ridiculously good. I didn't expect such a huge leap from GPT-4. Please watch all the demos; the description doesn't do it justice.

    I'm thoroughly terrified, but I'm also excited.

    30 votes
    1. [3]
      skybrian
      Link Parent

      I guess it will be a few weeks before we hear from early reviewers about their experiences. I'll be interested in hearing from people who use it as an interpreter.

      In the meantime, even an excellent tech demo is still just a tech demo - we should treat it with the same skepticism we have for all tech demos.

      28 votes
      1. [2]
        terr
        Link Parent

        I expect it could well be sooner than that. I'm a random nobody and have access to the new update. I tried it out a little bit yesterday and (other than being slower than in the video, presumably because everyone with access was trying it out at the same time) it worked fairly similarly to what's shown in the various videos.

        My co-worker and I managed to confuse it briefly by having it switch back and forth between English and Croatian (while I was talking to it in English and my co-worker Croatian), but it caught up after a moment.

        Overall, I was surprised at how well it worked, but find it frustrating how overly enthusiastic it is about everything. I've yet to try out the video function, but I'll play around with it some more when I manage to find some time.

        10 votes
        1. adorac
          Link Parent

          They haven't rolled out the speech-to-speech features yet, so it might've been slower because it was going through the speech → transcription → text response → speech pipeline.
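
          For the curious, here's roughly what that three-stage pipeline looks like against the public API. A minimal sketch, assuming the openai Python SDK (v1.x); the model and voice names are illustrative, not necessarily what the app itself uses:

          from openai import OpenAI

          client = OpenAI()  # reads OPENAI_API_KEY from the environment

          def voice_reply(audio_path: str) -> bytes:
              # 1. Speech -> text: a transcription model turns the audio into text.
              with open(audio_path, "rb") as f:
                  transcript = client.audio.transcriptions.create(
                      model="whisper-1", file=f
                  )

              # 2. Text -> text: the LLM answers, seeing only the transcript
              #    (tone, background noise, and multiple speakers are lost here).
              chat = client.chat.completions.create(
                  model="gpt-4",
                  messages=[{"role": "user", "content": transcript.text}],
              )

              # 3. Text -> speech: a TTS model reads the answer back aloud.
              speech = client.audio.speech.create(
                  model="tts-1",
                  voice="alloy",
                  input=chat.choices[0].message.content,
              )
              return speech.content  # audio bytes; each hop adds latency

          Each stage is a separate model and a separate network round trip, which is why collapsing them into one end-to-end model cuts the lag so dramatically.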

          8 votes
    2. [6]
      unkz
      Link Parent

      Yes, Rabbit and Humane are entirely obsoleted by this. I'm thoroughly impressed by this tech demo, it's a major leap forward.

      19 votes
      1. [5]
        Hollow
        Link Parent

        But are voice actors obsoleted by this?

        2 votes
        1. DavesWorld
          (edited )
          Link Parent
          • Exemplary

          Way back, when CGI was still new, there was this whole segment of "movie lovers" who were vehemently opposed to it. Now, by new CGI, I mean Spielberg dinosaurs new.

          I eventually decided the anti-CGI folks were basically in favor of randomness. They called it “organic.” They wanted the “organic nature” of how the light would only be just right for that brief window of time where the Earth and Sun were aligned just so. Never mind that the whole film crew (from director on down) would race to set up a shot to take advantage of that “organic” moment while praying for good weather; that was still better apparently. To have to roll the dice to get what you want on film. To have to sometimes give up and try again the next day, after the universe rotated back into that "organic" position they liked so much.

          It turns out, CGI's main advantage was that it gave directors more control over the frame. Allowing the director to decide just what she wants the light to be, to use this take even though there was some issue in the background (which they can erase with CGI and thus save the take for use), and so on. Sure, CGI lets empty-headed directors say crap like “big explosion, right here” or “three eyed alien waving its arms while running after the car” or whatever, but it also enables all those things Youtubers desperate for content will mine from DVD behind-the-scenes discs and throw up as a revelation. Stuff where you see “oh, Wolf of Wall Street had CGI in basically every frame we were given.”

          Control is a big thing mature creative AI technology will give creators. CGI gave that control only by sitting dozens upon dozens of humans in front of bank after bank of computers to twiddle and coax it out of the aether. AI is going to open up control in a way that changes everything about storytelling presentation.

          The tech is continuing to mature. At some point, you’ll be able to shape an AI voice the way a director can shape a human actor’s performance. “More passion, more fire, let’s go again.” Having a conversation with the (human) actor, while hoping they understand you and do what you ask, only to find what you thought was clear was different from what they took from it, and you all have to keep trying until you get there. Versus tapping in some commands and letting it run again to see how that works.

          There’s a semi-famous story from Wrath of Khan. Nicholas Meyer was the director, and he wanted William Shatner to stop doing Shatner-Kirk stuff with the performance. He wanted some acting, not the Shatner shtick. But he felt having that conversation with Shatner wouldn’t work, would backfire. And then he noticed that if Shatner got bored or tired with the scene, the shtick wore off and some actual acting came out.

          So Meyer had to “wear” Shatner down.

          Take after take after take, Meyer knowing full well they were all bullshit wasted takes. Whole cast and crew, everyone on the film set, just spinning through the motions while Meyer had to play this little unspoken game with Shatner to eventually get the performance he wanted.

          Meyer talked about Ricardo Montalban too, whom he was even more intimidated by, since Meyer was a very new director and Montalban had a huge filmography of acting credits. Meyer was afraid to try to direct Montalban, and when he decided he had no choice but to broach the subject and go into how exactly he wanted Montalban to perform, he was happily surprised to find Montalban was a creative professional quite willing to work with the director to find the performance Meyer wanted.

          Not every Montalban is going to be kind and generous and approachable. How many "big stars" are above taking any comments from anyone, least of all a director? Ego is a thing, and ego fucks shit up all the time.

          So much of storytelling when you move past writing is collaborative. And that’s fine. Not only is nothing wrong with it, but it can even produce magic when the right people get together and collaborate on a project. However, just because multiple folks are involved doesn’t automatically make it better than a solo project; cinema is rife with tales of meddling that screwed stories up quite badly.

          Kind of how “organic,” by not using CGI to just dictate whatever elements you decided were important, isn’t automatically better. What matters is what ends up on screen. Sometimes organic gives a better result, sometimes control does.

          Many of the key “storytelling skills” once you get past writing don’t actually involve storytelling. They involve people manipulation. Charm, charisma. Being able to interact with folks. Being able to give orders when you’re in charge and have them followed, rather than dismissed as the cast and crew ignore you because they hate you or think you’re (insert any number of things people think about each other) and so on.

          James Cameron famously went over to England to shoot Aliens and had a hell of a time dealing with the English crew. They didn’t know who this American kid was. Cameron was young at the time, and Terminator hadn’t hit England yet, so as far as they were concerned he was a clueless nobody. Worse, a filthy American. -Edit- Even though he's Canadian, people forget since he lives and works in America.

          Never mind that he was the director; they just weren’t very interested in working with him. So they were difficult, and there were problems and delays and all that, simply because they weren’t willing to work collaboratively, for personal reasons. He finally had to enlist others to help him convince the local cast and crew that he did know what he was doing and to cooperate with him.

          Cameron isn’t a fuzzy-feelings people person, it turns out. That part’s quite famous too. Loads of people who’ve worked with Cameron have decided he’s not a people person. But you can’t argue he’s a bad storyteller or bad filmmaker; the man’s proven he’s a cinematic genius.

          But what shows up in any story about him and his body of work? “Cast and crew from his projects say he’s difficult to work with.” Just because he’s not fuzzy-feelie with people, he’s “difficult.” And further, only the fact that he is that good at filmmaking and storytelling allows him to overcome this difficulty. This handicap. Which, again, is nothing more than not being charming.

          If you took everything Cameron is except the people part, and dropped him into a Brad Pitt or Tom Cruise type, people would fall all over themselves at every opportunity to herald his boundless genius. “Wow, he knows so much about film and story and how to create an amazing, breathtaking piece of cinema. He’s so easy to talk to, so wonderful on set, it just lets everything flow so well.”

          That’d be the story, the line. Because he’d be charming, the only difference. He’d have that people skill thingy, instead of “only” being a cinematic genius.

          One of the things a mature creative AI is going to do is remove things like that from the equation. Or, at least, from some of the equations. Some projects, some of them student or newcomer projects, while others might be big-time major projects, are going to come about because someone similar to Cameron won’t have to successfully pass skill checks to charm cast and crew.

          That charmless director will be able to directly order his AI actors, his AI voices, his AI everything, while creating. He could do all of it on a live set with real humans, but he’d have to pass those skill checks he can’t. Because people always place supreme importance on people skills, and will hold it against you hard if you don’t have them.

          How many amazing storytellers have we missed out on simply because they didn’t give good meeting, didn’t have the ability to charm and dazzle in person while sitting down with a financier or actor or crew lead? How many people were just ordinary people who didn’t have the knack of manipulating others successfully who were told off and made to get out of the whole process simply over that lack and no other? How many newcomers were scared away by veterans determined to lord it over the Johnny-come-latelys?

          How many stories have we missed due to people hating on people who aren't good with people? Reference Neil Gaiman's Library of Dream as a hint.

          So much is going to change with mature creative AI technology. You can coax and beg and plead and try to figure out the secret sauce in how to convince an actor, a camera operator, anyone in the project, to listen to you … or you can push buttons and get the exact performance you want, you need, to complete a project and be able to show it to others.

          Right now that sounds silly because the tech isn’t mature so of course you have to have humans involved. Only a human can give a human performance.

          Now.

          What about next year? Next decade? At some point the tech will mature, first with the voice and then later with the visuals tied to the voice, and you’ll have AI actors audiences are perfectly content with. Who they’ll respond to, react to.

          Does that mean we don’t need human actors? No. It’s just another option, another tool in the box. We use CGI stunt people right now for stunts that are too dangerous to do live, for example. Student directors, low-budget directors, socially inept directors, time-pressed directors, and more will all have a lot of use for a cinematic toolbox that lets them shape their ideas into finished form.

          Another possibility: lots of modern audiences feel the visual is permanently tied to the character. Meaning, whichever actor was cast is who should always own that role. Actors age, actors retire, actors die. It didn't use to be a thing that you couldn't replace an actor when they moved on for whatever reason, but these days social media melts down just because a key actor dies and "it's disrespectful to replace them."

          The show must go on isn't a modern sentiment, apparently. So what if we end-run around it? A human actor "wearing" an AI-generated mask on screen. Sure, we cast actors, but we didn't cast their likenesses; we cast their performance. We, the project, own the likeness because the project creates it. James Bond could always look like Bond, for example; extend that example across however many other franchises or characters as tickles your fancy. Sure, actors would come and go, and you'd still have "well, I like the Craig Bond better" and all that, but at least Bond would still look like Bond, which seems to be pretty important to a lot of folks.

          Plus it would let actors focus on being actors, and let productions focus on great actors. Instead of beauty. When you create the likeness for all the same cost in time and effort as you'd spend on hair and makeup, you can cast whoever the hell you feel is best. Irrespective of what they look like. That popular complaint about looks and beauty outweighing talent would be nipped right in the bud, and it'd be lovely because it's quite tiresome to put up with in my opinion.

          Bottom line, anyone who puts in the time and considerable effort to learn storytelling can probably get to a point where they can write a story. Or a script. But right now, that’s where the fun stops. You have to be a people person of some form to be able to take it further. Even selling it off to someone who is a people person who’ll take the story to the next form requires people skills.

          Back in the day, CGI was this scary thing that was ruining movies. AI, from images to voices to even text, is the new CGI; it’s scary and not understood and it’s “ruining everything.” But it’s going to open up the cinematic palette in new ways. Same as how CGI opened up the filmmaking palette in new ways.

          19 votes
        2. [2]
          LukeZaz
          Link Parent

          I'm no expert on this, but I'd say almost definitely not. There's a lot of nuance to voice acting that this absolutely cannot replicate yet, and even once it can, I have my doubts that the AI or its operator will be able to reliably adjust the character of the voice enough to adapt to the requirements of a given performance.

          For comparison, I'm reminded of an (admittedly unverifiable) tweet that went around a while ago about how a company tried to hire prompt engineers to make AI art for their project, and how while some of the initial art was alright, the "engineers" were entirely incapable of adjusting the art in response to change requests, and ended up all being fired.

          19 votes
          1. Minty
            Link Parent

            This story is extremely iffy, as competent prompters are entirely capable of such adjustments. Inpainting, outpainting, transfers, all that. This is equivalent to hiring artists that can't draw. I'd say it actually kind of encourages company execs to invest in generative AI if they know that.

            Regarding voice acting, you can instruct adjustments somewhat (I figure significantly more with GPT-4o), and then generate with redundancy to pick the best-fitting performance. That's how I do it when generating lullabies (with XTTSv2), and that's how people do it with ElevenLabs. This would mean that a company would still need to hire someone to listen through it all very patiently, with an ear for this, and possibly still require a voice actor to provide and license voice samples for cloning.
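
            To make the "generate with redundancy" part concrete, here's a rough sketch of that loop, assuming Coqui's TTS package with the XTTSv2 model; pick_best is a stand-in for the patient human listener (or an automated scorer):

            from TTS.api import TTS

            # XTTSv2 sampling is stochastic, so every take comes out different.
            tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

            def pick_best(paths):
                # Stand-in: in practice, someone with an ear for this listens
                # through all the takes and keeps the best-fitting one.
                return paths[0]

            line = "Hush now, the stars are out and the moon is bright."
            takes = []
            for i in range(5):
                path = f"take_{i}.wav"
                tts.tts_to_file(
                    text=line,
                    speaker_wav="voice_sample.wav",  # cloned from a licensed sample
                    language="en",
                    file_path=path,
                )
                takes.append(path)

            best = pick_best(takes)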

            11 votes
        3. unkz
          Link Parent

          By this, no, but some of the voice cloning technology that is available now is pretty close. In combination with that, it's getting much closer -- being able to provide feedback like in these videos (speak slower, speak faster, more whispery, more melodic) to an AI equipped with a cloned voice? Already, voice clones are taking jobs from voice actors. I would say it's going to be a few years at most before real voice actors are only ever used because of contractual obligations and "artistic integrity".

          3 votes
    3. spidicaballero
      Link Parent

      They never had a reason to be in the first place; such unnecessary devices, those are.

      10 votes
    4. [2]
      Goodtoknow
      Link Parent

      The voice inflection input and output is revolutionary to me: it can understand emotion and output simulated emotion or different types of voices. Until we see action multi-modality built into the model, though, it won't be useful as a full personal assistant.

      10 votes
      1. tanglisha
        Link Parent

        It would be so nice to have that quality of voice output read me web pages. Sometimes reading is hard on my eyes.

        3 votes
    5. [4]
      derekiscool
      Link Parent

      This looks like what Google tried to fake a few months back with their "demo" on YouTube - wondering if Google had some insider info and tried to steal some of OpenAI's thunder.

      They tried the exact same thing with Bing AI previously.

      3 votes
      1. teaearlgraycold
        Link Parent

        wondering if Google had some insider info and tried to steal some of OpenAI's thunder

        I think they’re all just working on the same next steps. What’s funny is many major inventions like the telephone got invented by multiple people simultaneously. IIRC someone got to their local patent office a day too late and lost to Mr. Bell.

        10 votes
      2. [2]
        Jedi
        Link Parent

        They didn’t “fake it”; the demo explicitly stated that the sequences were shortened. Anyway, if anything it’s the exact opposite. It’s not a coincidence that OpenAI announced this the day before Google I/O, where they announced Project Astra, which is exactly this.

        Now I’m not saying they had insider info (I agree with teaearlgraycold, this is the direction they’ve all been going), but this was definitely timed to try to steal thunder away from Google’s event where they were obviously going to be announcing any advancements in their AI.

        4 votes
        1. derekiscool
          Link Parent

          They did indeed fake it. They stated in the video description, "latency has been reduced...for the sake of brevity" - which is a wild understatement. They also very heavily implied that the AI was responding to the user's voice and photo inputs in real time, when in reality the whole thing was done manually and "reconstructed" into a smooth video.

          Instead, the real demo was constructed from “still image frames from the footage, and prompting via text.” 

          https://www.techradar.com/computing/artificial-intelligence/that-mind-blowing-gemini-ai-demo-was-staged-google-admits

          Whether or not you consider this "faking it," it was undeniably intended to mislead viewers. They were just using ordinary text and image inputs, but implying real-time conversation with an AI.

          6 votes
  2. [12]
    balooga
    Link

    I wish OpenAI's branding wasn't so imprecise. I have a paid ChatGPT account and I'm able to select "ChatGPT 4o" as the model, but the experience looks about identical to ChatGPT 4. It is noticeably faster. I haven't done a deep dive yet on the quality of the output; I'm largely interested in coding assistance, and I doubt the multimodal emphasis will help with that. I'm a bit concerned it could actually be weaker in that area. There's no audio/video input, so I can't do anything like what's shown in the demos. So... it's cool that I guess I can use the model today, but what I have access to feels wholly different from the sizzle reel.

    From a tech standpoint this (the demos, not what's up on ChatGPT right now) feels like a major breakthrough. People have been comparing it to Her and that seems fair. As a human though, I don't think I'll have much tolerance for the personality and voices they've given it. Too bubbly and eager to pepper in one-liners, and at the same time so corporate vanilla. If I'm going to be having long conversations with an AI in the future, it'll need to come off a little more phlegmatic or risk being just exhaustingly tryhard. I guess this is officially the uncanny valley of voice synthesis.

    Also the response time is super fast, but still slow enough that interruption seems to be a frequent occurrence. Looks like it handles that about as gracefully as one would hope, but on a more reflexive level I really bristle at crosstalk. I don't want that to become the norm.

    It does feel like we're on the cusp of some pretty radical changes in the way we use our technology. Five years ago I would've never guessed this was coming so quickly. The demo where they put two AIs side-by-side to have a CSR conversation gave me shivers, though I'm not sure if they were good shivers or bad shivers.

    23 votes
    1. [2]
      honzabe
      Link Parent

      As a human though, I don't think I'll have much tolerance for the personality and voices they've given it. Too bubbly and eager to pepper in one-liners, and at the same time so corporate vanilla. If I'm going to be having long conversations with an AI in the future, it'll need to come off a little more phlegmatic or risk being just exhaustingly tryhard. I guess this is officially the uncanny valley of voice synthesis.

      You described very well my first impressions from that demo. I use ChatGPT a lot, and I like how factual and "human, but not too much" it feels. The AI complimenting a guy on his hoodie gives me the creeps. Do I detect a slight hint of flirtiness?

      18 votes
    2. Jambo
      Link Parent

      They should be more upfront with the 'try it now' links, but at the bottom of the page they describe their plans; pasting here for anyone interested:

      GPT-4o is our latest step in pushing the boundaries of deep learning, this time in the direction of practical usability. We spent a lot of effort over the last two years working on efficiency improvements at every layer of the stack. As a first fruit of this research, we’re able to make a GPT-4 level model available much more broadly. GPT-4o’s capabilities will be rolled out iteratively (with extended red team access starting today).

      GPT-4o’s text and image capabilities are starting to roll out today in ChatGPT. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. We'll roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.

      Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.
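
      For the API side they mention, a text-plus-vision call looks something like this. A minimal sketch, assuming the openai Python SDK; the image URL is a placeholder:

      from openai import OpenAI

      client = OpenAI()

      # Text and an image in one request; the new audio/video modalities
      # aren't exposed through this endpoint yet.
      response = client.chat.completions.create(
          model="gpt-4o",
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": "What's happening in this picture?"},
                  {"type": "image_url",
                   "image_url": {"url": "https://example.com/photo.jpg"}},
              ],
          }],
      )
      print(response.choices[0].message.content)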

      9 votes
    3. [8]
      creesch
      Link Parent

      Just as an FYI, the voice capabilities as shown in the demo are only available in the mobile app, as far as I know. I just checked; it has not rolled out for me, it seems. I can select it in chat, but for voice it seems to still act like the "old" GPT-4. Which actually is already pretty decent, but not as good as the demo they showed.

      As a human though, I don't think I'll have much tolerance for the personality and voices they've given it.

      Maybe it is possible to adjust for that a bit in the custom prompt options. But yeah, I largely agree. Also, with the current models, both in text and voice, I feel that a lot of the responses are way over the top. Part of it might just be a cultural difference, as ChatGPT is largely trained on US-style communication. Part of it also might be that a lot of the training data is the sort of thing where this over-the-top style is often used, like marketing.
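
      For what it's worth, the API equivalent of those custom prompt options is a system message. A sketch of the idea (the wording is just illustrative, and whether the voice persona fully honors it is an open question):

      from openai import OpenAI

      client = OpenAI()

      response = client.chat.completions.create(
          model="gpt-4o",
          messages=[
              # The API analogue of ChatGPT's custom instructions.
              {"role": "system",
               "content": "Be matter-of-fact and concise. No exclamation "
                          "points, no compliments, no marketing-style "
                          "enthusiasm."},
              {"role": "user", "content": "What can you do in voice mode?"},
          ],
      )
      print(response.choices[0].message.content)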

      5 votes
      1. [7]
        skybrian
        Link Parent

        I don’t think it’s available yet at all. From the announcement: “We'll roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.”

        7 votes
        1. [6]
          terr
          Link Parent

          It is there (at least for me), I was playing around with it yesterday. In the mobile app I just have to select ChatGPT 4o as my model and then tap the headphone button next to the text entry bar to get it going.

          3 votes
          1. [5]
            Jordan117
            Link Parent

            That's probably the existing voice mode from last year that transcribes your question to text for the model to process and then reads out the answer with text-to-speech, with a 5-10 second lag time. The new native-audio model is real-time and has a much more realistic voice.

            10 votes
            1. [4]
              terr
              Link Parent

               I do have the realistic voice option; there were 5 or 6 different voices to choose from when I first started it up, definitely including the voices from the promo videos. At least, it sounded that way to me. Unfortunately, I didn't use last year's voice model, so I don't currently have a baseline to compare it against.

              3 votes
              1. updawg
                Link Parent

                It's still the old version. The new version isn't just the calling portion; it's everything else that goes along with it. It's using 4o, but it's not the part that they were demonstrating in the videos.

                8 votes
              2. [2]
                Jordan117
                Link Parent

                How's the responsiveness? Is it like talking to a person, or do you have to wait a few moments for it to process each answer? You might also try to ask it to modulate its voice (whisper, sing, talk fast or like a robot, etc.), which the basic text-to-speech engine can't do.

                1 vote
                1. terr
                  Link Parent

                   The responsiveness is OK unless you're asking it to do something a little more computationally heavy. Typically I'm experiencing a pause of 1-2 seconds between when it detects that I've stopped speaking and when it responds.

                   That being said, while the voice tone it's using is quite human and conversational, the responses are still very much what I'd expect from any AI text generator. It definitely doesn't have the same personality that was on display in the demo videos, so I'm not sure if that means I'm just getting the new voices but am still getting responses from plain ol' GPT-4, or if they gave it some custom instructions that give the various voices distinct personalities.

                   Edit: Just tested having it sing to me, and it's definitely still the old version; it just recited "Twinkle, Twinkle, Little Star" to me. Gave it some nice intonation, but definitely not sung. Seems I was just overexcited!

                  1 vote
  3. [3]
    LukeZaz
    Link

    Am I the only one who felt like the female voices were sounding a little too friendly? Nobody I know talks like that in anything resembling normal conversation, and I can't help but wonder if the space being very white-male-tech-bro is affecting the outcome again.

    21 votes
    1. creesch
      Link Parent

      No, that certainly is a thing. It already is with the current voice models as well. It might be the "tech-bros training it" issue. Though it also just might be that they used a lot of available voice material, which often is marketing material.

      7 votes
    2. nothis
      Link Parent

      It’s clear they watched Her. Like, it’s basically an off-brand Scarlett Johansson. Can’t blame them, my mind is blown. No way in hell would I have expected to have a phone with that kind of conversational AI by 2024. I’m sure they’ll do more neutral voices as well.

      5 votes
  4. [2]
    Thomas-C
    Link

    Those demos sound entirely too close to a manager I loathed talking to, lol; it's impressive to me that it's good enough to make me think in that direction. This kind of functionality is what I was looking forward to; it's a huge step closer to having a computer I can just talk to for doing stuff. The moonshot hope is running such a model locally one day, so I can have a Star Trek computer without it depending on some outside service/connectivity. If I could get a Majel Barrett voice for it, I'd be over the moon.

    16 votes
    1. teaearlgraycold
      Link Parent

      The whole time I was thinking how I already hate when humans talk like that - and now we've got computers doing it too?

      19 votes
  5. [9]
    Comment deleted by author
    Link
    1. [8]
      LukeZaz
      Link Parent

      Can I ask why you’re scared? I’m STEM-adjacent but I don’t find it very exciting either. More irritating, really, since I have no trust in OpenAI to do their work ethically.

      9 votes
      1. [8]
        Comment deleted by author
        Link Parent
        1. [6]
          fxgn
          Link Parent

          I think they like the AI as their creation, in the same way that a writer can love a book they created or a painter can love their painting. Why does it become disturbing if it's about AI?

          13 votes
          1. [6]
            Comment deleted by author
            Link Parent
            1. [4]
              unkz
              Link Parent

               This is oddly dismissive of the intellectual capabilities of engineers, like they are borderline idiots. Is it really likely that AI engineers do not spend an enormous amount of time thinking about the consequences of this technology? And that people in the humanities, who for the most part haven't got the faintest idea of how any of this works, would have a better understanding?

              20 votes
              1. [4]
                Comment deleted by author
                Link Parent
                1. [3]
                  winther
                  Link Parent

                   This is pretty clear from the various "mishaps" of these generative models spewing out stereotypes and biased responses. It's something people with expertise in other areas tried to raise awareness of and warn about for years, back when it was just called machine learning. Yet they were largely ignored by the tech companies, who then acted surprised when exactly that thing happened.

                  3 votes
                  1. [2]
                    Wes
                    Link Parent

                     I don't think that's true. Bias and safety seem to be among the biggest considerations when companies release AI models. Tech companies have spoken at length about this issue in papers, alongside product releases, and in their discussions with governments. It really seems like every new release spends half the page explaining safety issues before discussing the actual technical details.

                    The models have clearly been tweaked to try to minimize bias when possible. Google recently received heat for this by overcompensating and introducing people of color where it didn't make sense to.

                    Very few companies are willing to release a base model that hasn't had safety tuning applied. Unfortunately, this does harm the model's abilities, so a balance needs to be struck.

                     Prompts can also be modified for safety reasons; some tools will automatically inject variations into prompts to introduce more diversity.

                    Really it seems like an area of active consideration by all parties involved, whether they be developers, red team members, or independent researchers.

                    5 votes
                    1. winther
                      Link Parent

                       It becomes one of their biggest considerations after the public notices it and the media goes the rounds on it, even though there were warnings that these things would happen years before these models were released. So clearly, they aren't in practice doing a very good job of it. At the least, they take pretty big risks and prioritize getting stuff released fast and early, hoping they can fix the problematic stuff later. It is not their top priority.

                      4 votes
            2. fxgn
              Link Parent

              Books can be read.

              And GPT can predict the next word in a sentence. We understand "what it does" just as well as we understand books. And books can have a huge impact on humankind, both good and bad.

              6 votes
        2. winther
          Link Parent

          That is already happening. There are various AI companion apps out there, virtual girl/boyfriend type of things, and communities around them where people say they spend hours each day with their personal AI chatbot. I also fear the worst: that this is just going to make even more people lonely as they seek out virtual social interactions instead.

          6 votes
  6. [2]
    updawg
    Link

    I really enjoy that you can tell in the demos that the people are annoyed by how the AI won't shut up, lol. They're a bit verbose, but they're still pretty cool!

    8 votes
    1. nothis
      Link Parent

      I think it’s clever they kept the demos “messy”. My first question would be whether I can just talk over it and tell it to shut up and get to the point if it’s not going anywhere, and they answer that question within seconds, lol.

      8 votes
  7. [2]
    winther
    Link

    I might already be old fashioned with regards to technology, but I don't see myself using something like this regularly. It would be weird having random conversations full of tasks and commands with a computer at home, even less so in public or at work. Typing stuff on a device has some basic level of privacy. But maybe social norms will change around that in a generation or less.

    5 votes
    1. updawg
      Link Parent

      You can still use it while typing. They're just demonstrating it that way because it's more engaging and it actually shows off the advances better. It's a lot harder to show off that it's typing better than it is to show off that it can now do something that it couldn't do before.

      5 votes
  8. [4]
    aetherious
    (edited )
    Link

    I still haven't tested the new voice models in GPT-4o yet, but for me, the most 'human' voices (and responses in general) from AI have come from pi.ai for quite some time. They also have many more variations in their voices than GPT-4 had. Pi doesn't sound corporate and overly friendly like ChatGPT still does, and its responses sound much more like a person. ChatGPT is useful for me in processing data and doing 'work' since it can read documents, but Pi has been much better for a while now.

    Edit: I watched more of the demo, and it seems like ChatGPT still doesn't offer the option to use voice output with text input, which is what's kept me from using the voice capabilities in the first place.

    3 votes
    1. Jordan117
      Link Parent

      Note that the new voice mode won't be available for at least a few more weeks; the app has a speech-to-text-to-speech conversation mode from last year but it's very laggy in comparison and the voices are far less realistic.

      5 votes
    2. balooga
      Link Parent

      I like Pi too, as a conversational AI. It lacks a lot of the utility that ChatGPT has but the voices and personality are very good. Though it’s worth noting that the GPT-4o demos show a lot of vocal flexibility that Pi doesn’t have: adjusting speed, pitch, emotional inflection, whispering, singing, harmonizing, etc. It wasn’t perfect but Pi is quite one-dimensional in comparison. That said, I think I still prefer Pi’s default tone.

      1 vote
    3. Kind_of_Ben
      Link Parent

      Wow, Pi is seriously impressive. As someone who doesn't use GPT, I didn't expect to be able to play with stuff like this so soon. Thanks.

      1 vote
  9. [7]
    ackables
    Link

    Does anyone know if the AI is transcribing your speech and inputting the text into the LLM, or if it is directly listening to your speech?

    2 votes
    1. [4]
      mantrid
      Link Parent

      It's speech to speech. From the linked page:

      Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

      With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

      21 votes
      1. [3]
        balooga
        Link Parent

        speech to speech

        A thought I just had... does this mean that your tone will also be an input parameter? Will you get a different response based not only on what you say, but how you say it? Will the AI mirror your emotional presentation? More concerningly, will accent also be a factor in how it responds?

        4 votes
        1. [2]
          mantrid
          Link Parent

          They say that using the speaker's tone of voice and similar factors as input was one of the reasons for using a speech-to-speech model.

           Now that you mention it, I am also concerned about how it might respond to certain accents or speech patterns.

          4 votes
          1. Wes
            Link Parent

             It's definitely an interesting consideration. You certainly don't want to accidentally build a racist AI. I do wonder, though, if it might also have positive effects, such as responding with the appropriate language and cultural norms that the speaker would expect.

            It had never even occurred to me that a speech-to-speech model might be possible. I'll need to give this one some consideration to better understand it. The early demos do seem awfully promising though.

            3 votes
    2. [2]
      derekiscool
      Link Parent

      According to OpenAI's site, it uses 3 different models in a pipeline:

      - The first model is a simple model that transcribes your voice to text.
      - The second model takes the input text and replies with output text.
      - The third model takes the output text and converts it to audio.

      Scratch this - I was reading what they used to do. Apparently it is speech to speech.

      1 vote
      1. terr
        Link Parent
        It does, however, transcribe the conversation and provide that once you've closed the speech function.

        It does, however, transcribe the conversation and provide that once you've closed the speech function.

  10. [3]
    jcrash
    Link

    Gee, after they came out and said specifically they weren't releasing anything today ... why lie?

    2 votes
    1. unkz
      Link Parent

      Didn't they say they weren't releasing a search engine, and that something else was in store?

      https://www.indiatoday.in/technology/news/story/sam-altman-denies-reports-of-openai-launching-google-search-competitor-says-something-else-is-in-store-2538231-2024-05-12

      In Short

      Reports of OpenAI planning a Google Search competitor had surfaced online
      Sam Altman has denied these reports
      He added that new announcements regarding ChatGPT are in store
      
      23 votes
    2. Eji1700
      Link Parent

      Because it’s run by marketers not engineers

      3 votes