5 votes

Nvidia's AI puts video calls on steroids

5 comments

  1. [2]
    Greg
    (edited)
    Link
    So if I'm understanding correctly, it's effectively deepfaking your motions onto a photo of yourself. You're trading bandwidth for compute, so it probably needs fairly significant hardware at both ends (no doubt something Nvidia will be happy to help with!), but the results do seem to be an extremely impressive jump from the status quo.

    I'm excited for the VR implications - presumably this technique would work with a face tracker like the one HTC released the other day in order to cleanly generate a photorealistic avatar. That's a big step towards real, immersive telepresence, which I find very cool!

    [Edit] I quickly skimmed the actual paper, and it looks like they trained the model on a DGX A100, but I didn't see a mention of the runtime hardware they used.
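
    A minimal sketch of the split Greg describes, using hypothetical function names and ballpark numbers rather than Nvidia's actual API: the sender spends compute extracting a tiny keypoint packet per frame, only that packet crosses the network, and the receiver spends compute re-animating a single reference photo from it.

    ```python
    # Toy illustration of the bandwidth-for-compute trade described above.
    # Keypoint count, packet layout, and bitrates are assumptions for scale only.
    from dataclasses import dataclass
    import random

    NUM_KEYPOINTS = 20          # the paper learns a small keypoint set; ~20 assumed here
    BYTES_PER_KEYPOINT = 3 * 4  # x, y, z as 32-bit floats (assumed layout)

    @dataclass
    class KeypointPacket:
        """What actually crosses the network each frame: keypoints, not pixels."""
        keypoints: list  # [(x, y, z), ...]

    def extract_keypoints(camera_frame) -> KeypointPacket:
        # Sender-side compute: a pose/expression encoder would run here.
        # Stubbed with random values so the sketch runs without a model.
        return KeypointPacket([(random.random(), random.random(), random.random())
                               for _ in range(NUM_KEYPOINTS)])

    def synthesize_frame(reference_photo: str, packet: KeypointPacket) -> str:
        # Receiver-side compute: a generative model would warp and repaint the
        # reference photo to match the received keypoints. Also stubbed.
        return f"{reference_photo} re-posed from {len(packet.keypoints)} keypoints"

    packet = extract_keypoints(camera_frame=None)
    print(synthesize_frame("reference_selfie.png", packet))

    # Rough per-stream payload at 30 fps versus a typical H.264 call (~1 Mbps assumed).
    keypoint_bps = NUM_KEYPOINTS * BYTES_PER_KEYPOINT * 8 * 30
    print(f"keypoints: ~{keypoint_bps / 1000:.0f} kbps vs H.264: ~1000 kbps")
    ```

    The stub split is the point: both ends run a neural network, which is where the "fairly significant hardware at both ends" concern comes from.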

    4 votes
    1. Amarok
      Link Parent
      Usually you won't need fancy hardware for playback, but yes, I can imagine it'll take some crunching to do the compression in the first place. It depends on how clever they can make the algorithms, and it needs to happen in real time to hit its full potential. Probably not the sort of thing a mobile phone can do easily, at least at present - though hardware will catch up eventually. It might be doable with a custom chip dedicated to this purpose if the compute load doesn't run well on generic processors - and having this sort of video conferencing would justify the expense of those chips.
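
      A quick back-of-the-envelope on that real-time requirement, with assumed per-frame inference times rather than measured ones: at 30 fps, whichever end runs the neural model gets roughly 33 ms per frame.

      ```python
      # Frame-budget arithmetic for the real-time point above.
      # The per-device latencies are illustrative assumptions, not benchmarks.
      fps = 30
      frame_budget_ms = 1000 / fps              # ~33 ms per frame at 30 fps

      assumed_latency_ms = {
          "desktop GPU": 15,    # assumption: comfortably within budget
          "current phone": 80,  # assumption: likely over budget today
      }

      for device, latency in assumed_latency_ms.items():
          verdict = "fits" if latency <= frame_budget_ms else "misses"
          print(f"{device}: {latency} ms/frame {verdict} the {frame_budget_ms:.0f} ms budget")
      ```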

      3 votes
  2. Amarok
    Link
    Link to the paper. (PDF)

    Abstract

    We propose a neural talking-head video synthesis model and demonstrate its application to video conferencing. Our model learns to synthesize a talking-head video using a source image containing the target person’s appearance and a driving video that dictates the motion in the output. Our motion is encoded based on a novel keypoint representation, where the identity-specific and motion-related information is decomposed unsupervisedly. Extensive experimental validation shows that our model outperforms competing methods on benchmark datasets. Moreover, our compact keypoint representation enables a video conferencing system that achieves the same visual quality as the commercial H.264 standard while only using one-tenth of the bandwidth. Besides, we show our keypoint representation allows the user to rotate the head during synthesis, which is useful for simulating a face-to-face video conferencing experience. Our project page can be found at https://nvlabs.github.io/face-vid2vid.

    Cutting the total internet bandwidth used by video calls down to a tenth of what it is now is a pretty big deal. Pied Piper can eat their hearts out; this is more powerful than the fictional compression technology from HBO's Silicon Valley. This will also have similar implications for the size of stored video - I'm sure YouTube's servers wouldn't mind getting 10x more out of their existing storage.
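
    To put that one-tenth figure in concrete terms, here's a rough calculation against an assumed 1.5 Mbps H.264 baseline for a 720p call (the baseline is an assumption for illustration, not a number from the paper):

    ```python
    # Rough arithmetic on the abstract's "one-tenth of the bandwidth" claim.
    # The H.264 baseline bitrate is an assumed, typical 720p call figure.
    h264_kbps = 1500
    neural_kbps = h264_kbps / 10   # same visual quality claimed at a tenth the rate

    def gigabytes_per_hour(kbps: float) -> float:
        return kbps * 1000 / 8 * 3600 / 1e9

    print(f"H.264 : {gigabytes_per_hour(h264_kbps):.2f} GB per hour of call")
    print(f"neural: {gigabytes_per_hour(neural_kbps):.2f} GB per hour of call")
    ```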

    3 votes
  3. [2]
    Wes
    Link
    This is mighty impressive. I can only imagine how much bandwidth this might have saved if integrated with Zoom about a year ago.

    AI is both very scary and very exciting. Nvidia continues to show they have some of the best engineers in this space. I could see them licensing this to video chat applications.

    2 votes
    1. Amarok
      Link Parent
      The presentation blew right by the implications for low-bandwidth areas too. This will make video conferencing possible in places that have very poor, slow connections.

      2 votes