16 votes

Introducing Whisper (OpenAI speech recognition model)

16 comments

  1. [3]
    DataWraith
    (edited )
    Link

    I'm amazed at how well it works in my testing so far.

    For example, I ran a scene from the V for Vendetta movie through the medium model. I specifically kept the clip around because the alliterations are fun and difficult to say (and presumably to recognize), and the following was the result:

    Detecting language using up to the first 30 seconds. Use `--language` to specify the language
    Detected language: english
    [00:00.000 --> 00:09.000]  I can assure you I mean you no harm. Who are you? Who? Who is but the form following the function of what and what I am is a man in a mask.
    [00:09.000 --> 00:18.000]  Well I can see that. Of course you can. I'm not questioning your powers of observation, I'm merely remarking upon the paradox of asking a masked man who he is.
    [00:18.000 --> 00:20.000]  Oh, right.
    [00:20.000 --> 00:30.000]  But on this most auspicious of nights, permit me then, in lieu of the more commonplace soubriquet, to suggest the character of this dramatic persona.
    [00:30.000 --> 00:32.000]  Voila!
    [00:32.000 --> 00:45.000]  In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of fate, this visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished.
    [00:45.000 --> 01:03.000]  However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin, vanguarding vice and vouchsafing the violently vicious and voracious violation of volition.
    [01:03.000 --> 01:17.000]  The only verdict is vengeance, a vendetta, held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous.
    [01:17.000 --> 01:27.000]  Verily, this vicissoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you, and you may call me V.
    [01:27.000 --> 01:30.000]  Are you like a crazy person?
    [01:30.000 --> 01:33.000]  I am quite sure they will say so.
    

    🤯

    Edit: The actual clip I used can be found on YouTube here. The transcription by the model is not perfect (e.g. 'dramatic persona' instead of 'dramatis persona'), and the punctuation has small flaws, but I'm still hella impressed.
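
    If anyone wants to post-process output like the above, the CLI's segment lines follow a simple `[MM:SS.mmm --> MM:SS.mmm] text` pattern. Here's a minimal Python sketch (assuming that format holds for clips under an hour; the helper name is mine):

```python
import re

# Matches Whisper CLI segment lines, e.g.
# "[00:30.000 --> 00:32.000]  Voila!"
SEGMENT_RE = re.compile(
    r"\[(\d{2}):(\d{2}\.\d{3}) --> (\d{2}):(\d{2}\.\d{3})\]\s+(.*)"
)

def parse_segments(transcript):
    """Yield (start_seconds, end_seconds, text) tuples from CLI output."""
    for line in transcript.splitlines():
        m = SEGMENT_RE.match(line.strip())
        if m:
            start = int(m.group(1)) * 60 + float(m.group(2))
            end = int(m.group(3)) * 60 + float(m.group(4))
            yield start, end, m.group(5)

print(list(parse_segments("[00:30.000 --> 00:32.000]  Voila!")))
# [(30.0, 32.0, 'Voila!')]
```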

    9 votes
    1. [2]
      Diff
      (edited )
      Link Parent

      For reference, here's Google's take.

      I can assure you I mean you know how you are who who is but the form following the function of what and what I am is a man in a mask
      well I can see that of course you can I'm not questioning your powers of observation I'm merely remarking upon the paradox of asking a masked man who he is
      but on this most auspicious of nights permit me then in lieu of the more commonplace soubriquet to suggest the character of this tremendous persona
      in humble vaudevillian veteran cost vicariously as both victim and villain by the vicissitudes of fate this desires no mere veneer of vanity is a vestige of the Vox Populi now vacant to vanished
      however this valorous visitation of a bygone vexation stands vivified and has vowed to vanquish these venal and virulent vermin fan guarding vice and vouchsafing the violently vicious and voracious violation of
      the only verdict is vengeance a vendetta held as a votive not in vain for the value and veracity
      of such shall one day vindicate the vigilant and the virtuous
      verily this vichyssoise of verbiage veers most verbose so let me simply add that it's my very good honor to meet you and you may call me V
      are you like a crazy person
      I am quite sure they will say so

      Just in terms of words it's close, but there are a lot of completely missing or incorrect words, far from Whisper and its seriously impressive formatting. I was tempted to add formatting to Google's transcript just to make it more readable, but even that's difficult when words are missing and some words run together in the transcript.

      AFAICT the only error Whisper made (edit: +1 with the error DataWraith mentions) is "vicissoise," and I find that impressive in its own right: it made a typo rather than substituting an entirely different word. If this were a Turing test, I would have passed it on nearly that alone.

      6 votes
      1. Macil
        Link Parent

        It's been agonizing reading auto-generated subtitles on YouTube and TikTok videos, because they so often produce distractingly bad text just like this. I'd be so excited if Whisper improves auto-generated subtitles across the internet as much as shown here.

        5 votes
  2. FlippantGod
    Link

    "Moreover, it enables transcription in multiple languages, as well as translation from those languages into English."

    Excuse me? Now this I have to try.

    5 votes
  3. [10]
    Wes
    Link

    The demos are extremely impressive. I assume these transcriptions work in real time, and don't require specific voice training to build a model of the speaker.

    I've tried using speech recognition through tools like VoiceAttack in the past. I never had much luck. Maybe it's my condenser microphone but the accuracy has always been extremely poor. I'd love to try with a model like this instead, and see if voice commands and text input finally become viable methods of interaction.

    3 votes
    1. [5]
      Greg
      Link Parent

      I ran a quick test against the first chapter of A Study in Scarlet from Gutenberg (9556-1601): using the large model with CUDA enabled on a 3080, it took just under seven minutes to crunch a 16-minute audio file, so more than twice as fast as real time. That's audiobook-quality audio, though, so I'm not sure whether background noise and/or poor encoding would force it to work harder.

      According to the GitHub repo, the speeds of the smaller models increase roughly as powers of two, meaning the fastest (lowest-accuracy) model would be ~64x faster than real time on the hardware I'm using, and presumably a lot more on server-grade GPU cores.
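
      As a back-of-envelope check (the ~32x relative speed for the tiny model versus large is an assumption taken from the model table in the Whisper README; the 16-minute/7-minute figures are from the test above):

```python
# Rough real-time-factor arithmetic for the run described above.
audio_minutes = 16
large_runtime_minutes = 7

large_rtf = audio_minutes / large_runtime_minutes  # real-time factor
print(f"large: ~{large_rtf:.1f}x real time")       # ~2.3x

tiny_relative_speed = 32  # assumed, from the README's model table
print(f"tiny:  ~{large_rtf * tiny_relative_speed:.0f}x real time")  # ~73x
```

      That lands in the same ballpark as the ~64x estimate, given how rough both numbers are.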

      6 votes
      1. [4]
        Greg
        Link Parent

        I've done a run with the smaller models to compare speed and quality. In this very limited test, the relative speeds I saw were around half of what the docs suggest relative to the large model, but I'm sure the dataset they generated those estimates from is far more rigorous than my three runs here!

        Full output, all from the same MP3 file linked above:

        Initial reaction is that the large model is fast enough and 'better enough' to be the right choice for anything a human will be reading, but the smaller models could be very useful for feeding into further machine processing (e.g. a locally hosted voice assistant).

        I find it interesting that the large model was the only one to filter out the preamble, as well. It's read in a different voice from the main text, so I'm not sure whether it was deliberately rejected for that reason, deliberately rejected by some kind of context inference, or accidentally rejected.

        1 vote
        1. [3]
          FlippantGod
          (edited )
          Link Parent

          I thought the authors claimed the large model has the best removal of background noise and extraneous voices. The timestamp starts at the very beginning, rather than at the start of the main voice.

          I'll try concatenating different "voices" from the same speakers and running those through Large later, and see if any of those are removed. Also, sticking a different language in the first 30 seconds.

          heavily edited my comment

          1. [2]
            Greg
            Link Parent

            Those sound like some interesting tests for sure!

            I just re-checked the audio against the text above and noticed another nuance: only the line

            File 1. No Adverts from audiobooksforfree.com

            is in the "preamble voice", and that only shows up in the tiny.en run. The second part of what I would consider the preamble

            A Study in Scarlet by Arthur Conan Doyle. Part 1, being a reprint from the reminiscences of John H. Watson, M.D., late of the Army Medical Department

            is actually read by the main narrator, and shows up in both of the other two runs but not in the large. Either there's some more context sensitivity going on in the large model, or it happened to hit a bug here that made it look smarter than it is! Given that the line does show up in the other two models, I'd lean towards the former, but it's too small a pool to be certain.

            [Edit] And if it is indeed context sensitivity, it’s going to depend very much on your use case whether that’s a good or a bad thing.

            1 vote
            1. FlippantGod
              Link Parent

              I've observed interesting differences between the medium and large runs. Both skip some (different) valid lines for no apparent reason (and all laughter). Also, while I would say large is generally better, especially at non-English sentence structures, some of its translations, while correct, aren't quite as good as the medium model's.

              1 vote
    2. [4]
      FlippantGod
      Link Parent

      Certainly not real time/interactive running the medium model on my CPU (the default device). I'll try smaller models (and the GPU if possible) momentarily. No adjustments to the models are required.

      In my limited testing, the results are pretty good. Maybe transcription is faster than translation, but my 5-minute clip is still going so I can't confirm right now.

      I've noticed errors with non-English sentence structures, which I expected, and no detection of nonverbal vocalizations (laughter is not detected), which I did not expect.

      The formatting and punctuation are generally amazing. Foreign names are transcribed seemingly as pronounced.

      1 vote
      1. [3]
        DataWraith
        (edited )
        Link Parent

        Certainly not real time/interactive

        From what I understand, the model works on 30s snippets of audio in order to achieve the quality it gets, so it won't run in real-time interactively no matter what GPU you throw at it.

        Translation does not seem to be noticeably slower than transcription in my testing, though the quality of the translation seems to be worse than if you transcribe it first and then run the text through a separate translator.
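
          A naive sketch of that fixed-window behavior (this ignores Whisper's actual timestamp-conditioned window shifting; it's just to illustrate why latency is bounded below by the window size):

```python
# Slices an audio duration into the 30-second windows the encoder
# consumes; boundaries only, no actual decoding.
WINDOW_SECONDS = 30

def window_bounds(duration_seconds):
    """Return (start, end) pairs covering the audio in 30 s windows."""
    bounds = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + WINDOW_SECONDS, duration_seconds)
        bounds.append((start, end))
        start = end
    return bounds

print(window_bounds(75.0))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```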

        2 votes
        1. FlippantGod
          Link Parent

          Theoretically it could still be real time for offline work, since the audio already exists. In this case it matters a lot less, but it does enable certain workflows.

          2 votes
        2. FlippantGod
          Link Parent

          The transcription is looking to be more accurate than the translation! However, the impressive formatting (for English) has regressed, and I would not expect any machine translation tools to improve on that front.

          Interesting.

  4. FlippantGod
    (edited )
    Link

    An update on translation quality†:

    Manual Transcription & Manual Translation (A): 1 missing filler

    (everything else used the results of the medium-sized Whisper transcription model)

    Whisper Translate Medium (A-): 1 incorrect subject, 1 missing filler
    DeepL (A-): 1 incorrect tense, 1 less-than-ideal translation, several acceptable but incorrect tenses
    Custom model (B+): 1 incorrect sentence, 1 missing filler, 1 acceptable but incorrect subject, 1 translation win over DeepL and Whisper
    Papago (B-): 2 incorrect sentences, 1 incorrect filler, 1 incorrect subject, 1 less-than-ideal translation, several acceptable but incorrect tenses, 1 translation win over DeepL and Whisper
    Google (C): 2 incorrect subjects, 2 incorrect tenses, 1 incorrect sentence, 2 less-than-ideal translations
    Lingvanex (C-): 2 incorrect subjects, 2 incorrect tenses, 2 incorrect sentences

    Others (F)

    † my sample size is really small because I've been busy and just comparing results rather than processing more samples

    2 votes
  5. teaearlgraycold
    Link

    Would love to see someone make this into a VLC plugin

    1 vote