OpenAI's WebRTC problem

2 comments

  1. skybrian

    From the article:

    WebRTC is a poor fit for Voice AI.

    But that seems counter-intuitive? WebRTC is for conferencing, and that involves speaking? And robots can speak, right?

    [...]

    WebRTC aggressively drops audio packets to keep latency low. If you’ve ever heard distorted audio on a conference call, that’s WebRTC baybee. The idea is that conference calls depend on rapid back-and-forth, so pausing to wait for audio is unacceptable.

    …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate. After all, I’m paying good money to boil the ocean, and a garbage prompt means a garbage response. It’s not like LLMs are particularly responsive anyway.

    [...]

    Let’s say it takes 2s of GPU time to generate 8s of audio. In an ideal world, we would stream the audio as it’s being generated (over 2s) and the client would start playing it back (over 8s). That way, if there’s a network blip, some audio is buffered locally. The user might not even notice the network blip.

    But nope, WebRTC has no buffering and renders based on arrival time. Like seriously, timestamps are just suggestions. It’s even more annoying when video enters the picture.
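
    (Not from the article: here’s a minimal sketch of the buffered playback it’s describing, assuming 16-bit mono PCM chunks arriving over some reliable stream. The sample rate and chunk framing are made up.)

    ```ts
    // Schedule incoming PCM chunks back-to-back so playback drains a local
    // buffer even if the network stalls briefly.
    // Assumes 16-bit little-endian mono PCM at 24 kHz (hypothetical format).
    const ctx = new AudioContext({ sampleRate: 24_000 });
    let playhead = 0; // absolute time where the next chunk should start

    function enqueueChunk(pcm: ArrayBuffer): void {
      const samples = new Int16Array(pcm);
      const floats = new Float32Array(samples.length);
      for (let i = 0; i < samples.length; i++) floats[i] = samples[i] / 32768;

      const buffer = ctx.createBuffer(1, floats.length, ctx.sampleRate);
      buffer.copyToChannel(floats, 0);

      const source = ctx.createBufferSource();
      source.buffer = buffer;
      source.connect(ctx.destination);

      // Never schedule in the past; leave a small margin when (re)starting.
      playhead = Math.max(playhead, ctx.currentTime + 0.05);
      source.start(playhead);
      playhead += buffer.duration; // the next chunk plays seamlessly after
    }
    ```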

    [...]

    It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection. While we try to run CDN edge nodes close enough to every user to minimize RTT, it adds up.

    [...]

    All of this nonsense is because WebRTC needs to support P2P. It doesn’t matter if you have a server with a static IP address, you still need to do this dance.
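
    (Also not from the article: even client-to-server, the “dance” looks roughly like this. The endpoint and STUN server are hypothetical, WHIP/WHEP-style signaling over HTTP.)

    ```ts
    // The offer/answer + ICE "dance", even when the other side is a fixed
    // server with a static IP. URLs below are placeholders.
    async function connect(): Promise<RTCPeerConnection> {
      const pc = new RTCPeerConnection({
        iceServers: [{ urls: "stun:stun.example.com" }], // placeholder
      });
      pc.addTransceiver("audio", { direction: "recvonly" });

      const offer = await pc.createOffer();
      await pc.setLocalDescription(offer);

      // One of several round trips: ship the SDP offer, get the answer back.
      const res = await fetch("https://example.com/webrtc/offer", {
        method: "POST",
        headers: { "Content-Type": "application/sdp" },
        body: offer.sdp,
      });
      await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });

      // ...then ICE connectivity checks, DTLS handshake, SRTP key exchange.
      return pc;
    }
    ```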

    [...]

    WebRTC practically encourages you to fork the protocol. There are so many limitations that I’ve barely scratched the surface. The browser implementation is owned by Google and tailor-made for Google Meet, so it’s also an existential threat for conferencing apps.

    Sad Fact: That’s why every conferencing app (except Google Meet) tries to shove a native app down your throat. It’s the only way to avoid using WebRTC.

    [...]

    Honestly, if I was working at OpenAI, I’d start by streaming audio over WebSockets. You can leverage existing TCP/HTTP infrastructure instead of inventing a custom WebRTC load balancer. It makes for a boring blog post, but it’s simple, works with Kubernetes, and SCALES.
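
    (Sketch of the boring WebSocket version, not from the article; the endpoint URL and framing are made up.)

    ```ts
    // Binary audio frames over a plain WebSocket. TCP handles ordering and
    // retransmits; the client just buffers and plays.
    const ws = new WebSocket("wss://example.com/realtime-audio");
    ws.binaryType = "arraybuffer";

    ws.onmessage = (event: MessageEvent) => {
      // One message = one PCM chunk; feed it to a buffered player like the
      // enqueueChunk() sketch above.
      enqueueChunk(event.data as ArrayBuffer);
    };
    ```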

    I think head-of-line blocking is a desirable user experience, not a liability. But the fated day will come and dropping/prioritizing some packets will be necessary. Then I think OpenAI should copy MoQ and utilize WebTransport, because…

    QUIC FIXES THIS
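
    (And the QUIC-flavored direction, roughly: a WebTransport receive loop with one unidirectional stream per audio chunk, so a lost packet only stalls its own stream instead of everything behind it. The URL is hypothetical and this is MoQ-ish, not actual MoQ. It reuses the enqueueChunk() sketch from earlier.)

    ```ts
    // Run as an ES module (top-level await). Each incoming unidirectional
    // stream carries one complete audio chunk.
    const transport = new WebTransport("https://example.com/audio");
    await transport.ready;

    const streams = transport.incomingUnidirectionalStreams.getReader();
    for (;;) {
      const { value: stream, done } = await streams.read();
      if (done || !stream) break;

      // Drain this stream into one contiguous chunk.
      const parts: Uint8Array[] = [];
      const reader = stream.getReader();
      for (;;) {
        const { value, done: end } = await reader.read();
        if (end || !value) break;
        parts.push(value);
      }
      const chunk = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
      let offset = 0;
      for (const p of parts) { chunk.set(p, offset); offset += p.length; }

      enqueueChunk(chunk.buffer); // buffered player from the earlier sketch
    }
    ```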

  2. kacey

    I'll try to add more later, but if they hate the audio transport so much, why not push audio buffers over RTCDataChannel? It's how you do arbitrary data transfer with WebRTC. Maybe it gets QoS'd badly by ISPs ...?
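
    (For concreteness, the DataChannel version would look something like this. The channel options are standard WebRTC; the signaling, framing, and the buffered player are hypothetical.)

    ```ts
    // Pushing audio buffers over an RTCDataChannel instead of WebRTC's media
    // path. Unordered + no retransmits approximates RTP-style behavior.
    // Offer/answer signaling is omitted here for brevity.
    const pc = new RTCPeerConnection();
    const channel = pc.createDataChannel("audio", {
      ordered: false,     // don't stall playback behind one lost chunk
      maxRetransmits: 0,  // drop late data instead of retransmitting
    });
    channel.binaryType = "arraybuffer";

    channel.onmessage = (event: MessageEvent) => {
      enqueueChunk(event.data as ArrayBuffer); // buffered player (assumed)
    };
    ```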

    Also, I don't follow how Discord implements a bunch of extra protocols for their web clients... I guess they're saying that they wrote their own WebRTC server and, instead of putting in placeholders for the features they didn't like, implemented the full stack...?

    (the author probably can't say much due to NDAs, so some imprecision is likely understandable)