33 votes

What we discovered on ‘Deep YouTube’ – The video site isn’t just a platform. It’s infrastructure.

7 comments

  1. skybrian
    (edited )
    Link

    https://archive.is/mjWAz

    Despite its global popularity, YouTube (which is owned by Google) veils its inner workings. When someone studies, for example, the proliferation of extreme speech on YouTube, they can tell us about a specific sample of videos—their content, view count, what other videos they link to, and so on. But that information exists in isolation; they cannot tell us how popular those videos are relative to the rest of YouTube. To make claims about YouTube in its entirety, we either need key information from YouTube’s databases, which isn’t realistic, or the ability to produce a big-enough, random sample of videos to represent the website.

    That is what we did. We used a complicated process that boils down to making billions upon billions of random guesses at YouTube IDs (the identifiers you see in the URL when watching videos). We call it “dialing for videos,” inspired by the “random digit dialing” used in polling. It took a sophisticated cluster of powerful computers at the University of Massachusetts months to collect a representative sample; we then spent another few months analyzing those videos to paint what we think is the best portrait to date of YouTube as a whole. (We use a related, slightly faster method at this website to keep regularly updated data.)

    The paper is here. Here's the abstract:

    YouTube is one of the largest, most important communication platforms in the world, but while there is a great deal of research about the site, many of its fundamental characteristics remain unknown. To better understand YouTube as a whole, we created a random sample of videos using a new method. Through a description of the sample’s metadata, we provide answers to many essential questions about, for example, the distribution of views, comments, likes, subscribers, and categories. Our method also allows us to estimate the total number of publicly visible videos on YouTube and its growth over time. To learn more about video content, we hand-coded a subsample to answer questions like how many are primarily music, video games, or still images. Finally, we processed the videos’ audio using language detection software to determine the distribution of spoken languages. In providing basic information about YouTube as a whole, we not only learn more about an influential platform, but also provide baseline context against which samples in more focused studies can be compared.

    A baseline study isn't so exciting, but it's useful to know what the denominator is.
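As a rough illustration, a single "dial" boils down to sampling an 11-character ID at random. This is a sketch, not the authors' pipeline: the base64url alphabet is my assumption, and the sixteen-character constraint on the last position comes from the paper's description quoted below (an 11-character ID encodes 64 bits, so the final character carries only 4 of its 6 bits):

```python
import random
import string

# base64url alphabet assumed for YouTube video IDs
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"

# Only every 4th base64 character (16 total) is valid in position 11 --
# the "sixteen characters allowed in the eleventh position" from the paper.
LAST_CHARS = B64[0::4]

def random_video_id(rng=random):
    """One random guess at a YouTube video ID -- "dialing for videos"."""
    return "".join(rng.choices(B64, k=10)) + rng.choice(LAST_CHARS)

print(random_video_id())  # almost certainly not a real video
```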

    33 votes
  2. [6]
    Minty
    Link

    I do wonder if YT engineers noticed a huge cluster "dialing for videos" non stop for mont—

    10,016 videos.

    oh.

    10 votes
    1. Greg
      Link Parent

      In fairness to them, it does look like the hit rate was about 1 in 1.8 billion, so even with the five-order-of-magnitude optimisations they mentioned that's about 150 requests per second sustained for three months. I imagine a decent chunk of the timeline was just balancing how many machines (and IP addresses) they had to allocate to work that's basically idling vs how picky the YouTube security systems are about obvious bot requests like that.

      A few hundred (or thousand) requests per second isn't going to show up even as a rounding error for a site the size of YouTube, but they'll be set up to detect and throttle obvious non-human behaviour anyway because it tends to be good practice above and beyond any worries about traffic volume.

      16 votes
    2. [4]
      Grzmot
      (edited )
      Link Parent

      Considering how they stress that it only took a supercomputer moments, they could've definitely let this run for longer.

      I made an oof and judged the article without reading it based on not just a quote, but an incomplete quote. Shame on me.

      3 votes
      1. [3]
        Minty
        Link Parent

Hm? Both the paper and the pop article say "It took a sophisticated cluster of powerful computers at the University of Massachusetts months to collect a representative sample". I was only surprised by how few videos they actually found.

When you think about it, that was inevitable. The ID-space is extremely sparse. It says ~14 × 10⁹ videos total, and I'd eyeball the case-insensitive ID-space at (26+10+2)¹¹, so uhhh 17 million requests per successful hit on average.
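A quick check of that eyeball estimate (taking the 38-character case-insensitive alphabet and the ~14 billion video count above as given):

```python
# Case-insensitive ID space: 26 letters + 10 digits + 2 symbols, 11 positions.
id_space = (26 + 10 + 2) ** 11
total_videos = 14e9  # ~14 billion public videos, per the paper

guesses_per_hit = id_space / total_videos
print(f"{guesses_per_hit:,.0f}")  # about 17 million
```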

        The download and processing of just 10⁴ packets of audio+metadata was the easy part.

        9 votes
        1. [2]
          PleasantlyAverage
          Link Parent

If I'm not mistaken, YouTube uses base64 IDs, which would theoretically mean a hit rate of 1 in 5 billion requests.
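That works out as follows, using the full 64-character case-sensitive space and the same ~14 billion denominator from above:

```python
# Full case-sensitive base64 space: 64 characters, 11 positions.
guesses_per_hit = 64 ** 11 / 14e9
print(f"1 in {guesses_per_hit:.2e}")  # roughly 1 in 5.3 billion
```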

          2 votes
          1. Greg
            Link Parent

…we generated our random sample by randomly guessing YouTube IDs, aided by the case insensitivity of YouTube’s search engine. Each of our searches generates a string of ten random alphabetical characters followed by one random character from the set of sixteen characters allowed in the eleventh position. Each query is thus 2¹⁰ possible video guesses. Producing our set of 10,016 videos required generating 18,260,259,669 case-insensitive IDs, equivalent to 18,698,505,901,056 case-sensitive guesses. Dividing the number of video IDs we tested by the number of hits we found, there were 1,866,863,608 guesses in between hits.

Although now I’m re-reading it I’m not actually sure whether that means 1.8×10⁹ notional keys searched per hit after expanding their optimisations (they mention some other statistical tricks further up), or if that’s the actual number of calls they executed to find each hit. I was assuming the former, but I can definitely see them needing a good chunk of IP addresses if it’s the latter.
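For what it's worth, the quoted figures multiply out exactly, which reads (to me) like the 1.87 billion is notional case-sensitive guesses; the actual query count per hit would then be about 2¹⁰ times smaller. A quick check, with the numbers taken from the quote above:

```python
queries = 18_260_259_669   # case-insensitive IDs actually tested
hits = 10_016              # videos found
per_query = 2 ** 10        # case variants covered by each query

case_sensitive = queries * per_query
print(f"{case_sensitive:,}")          # 18,698,505,901,056 -- matches the paper
print(f"{case_sensitive // hits:,}")  # 1,866,863,608 -- the quoted per-hit figure
print(f"{queries // hits:,}")         # ~1.8 million actual queries per hit
```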

            5 votes