33 votes
What we discovered on ‘Deep YouTube’ – The video site isn’t just a platform. It’s infrastructure.
Link information
This data is scraped automatically and may be incorrect.
- Title: What We Discovered on 'Deep YouTube'
- Authors: Ryan McGrady
- Published: Jan 26 2024
- Word count: 1973 words
https://archive.is/mjWAz
The paper is here. Here's the abstract:
A baseline study isn't so exciting, but it's useful to know what the denominator is.
I do wonder if YT engineers noticed a huge cluster "dialing for videos" non-stop for mont—
oh.
In fairness to them, it does look like the hit rate was about 1 in 1.8 billion, so even with the five-order-of-magnitude optimisations they mentioned that's about 150 requests per second sustained for three months. I imagine a decent chunk of the timeline was just balancing how many machines (and IP addresses) they had to allocate to work that's basically idling vs how picky the YouTube security systems are about obvious bot requests like that.
A few hundred (or thousand) requests per second isn't going to show up even as a rounding error for a site the size of YouTube, but they'll be set up to detect and throttle obvious non-human behaviour anyway because it tends to be good practice above and beyond any worries about traffic volume.
Considering how they stress that it only took a supercomputer moments, they could've definitely let this run for longer.

I made an oof: I judged the article without reading it, based on not just a quote, but an incomplete quote. Shame on me.
Hm? Both the paper and the pop article say "It took a sophisticated cluster of powerful computers at the University of Massachusetts months to collect a representative sample". I only got surprised by how few videos they have actually found.
When you think about it, that was inevitable. The ID-space is extremely sparse. It says 14 × 10⁹ videos total, and I'd eyeball the ID-space at (26+10+2)¹¹, so uhhh 17 million requests per successful hit on average.
The download and processing of just 10⁴ packets of audio+metadata was the easy part.
If I'm not mistaken, YouTube uses base64, which would theoretically mean a hit rate of about 1 in 5 billion requests.
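Easy to sanity-check with round numbers. A quick sketch, assuming 11-character IDs over the 64-symbol base64url alphabet (A–Z, a–z, 0–9, '-', '_') and the ~14 billion videos quoted above:

```python
# Rough hit-rate estimate for guessing random YouTube video IDs.
# Assumptions: 11-char IDs over a 64-symbol (base64url) alphabet,
# ~14 billion extant videos — round figures from this thread, not the paper.
ALPHABET = 26 + 26 + 10 + 2          # upper, lower, digits, '-' and '_' = 64
ID_SPACE = ALPHABET ** 11            # ~7.4e19 possible IDs
VIDEOS = 14e9

requests_per_hit = ID_SPACE / VIDEOS
print(f"~{requests_per_hit:.1e} guesses per hit")
```

Which lands at roughly 5 × 10⁹, consistent with "1 in 5 billion"; the earlier 17-million figure comes from forgetting that the alphabet is case-sensitive (38¹¹ instead of 64¹¹).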
Although now I'm re-reading it, I'm not actually sure whether that means 1.8×10⁹ notional keys searched per hit after expanding their optimisations (they mention some other statistical tricks further up), or the actual number of calls they executed to find each hit. I was assuming the former, but I can definitely see them needing a good chunk of IP addresses if it's the latter.
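For what it's worth, the two readings imply wildly different traffic. A back-of-the-envelope sketch using round numbers from this thread (10⁴ videos found, three months of wall clock, the five-order-of-magnitude speedup), not anything from the paper itself:

```python
# What "1 in 1.8 billion" implies under each reading of the figure.
# All inputs are round numbers from the thread, not from the paper.
HITS = 1e4                  # videos actually found (~10^4)
SECONDS = 90 * 24 * 3600    # "three months" of wall-clock time
SPEEDUP = 1e5               # the five-order-of-magnitude optimisations

# Reading 1: 1.8e9 is notional keyspace covered per hit,
# so actual API calls per hit are SPEEDUP times fewer.
calls_notional = HITS * 1.8e9 / SPEEDUP
print(f"reading 1: {calls_notional / SECONDS:.0f} req/s")

# Reading 2: 1.8e9 is the actual number of calls per hit.
calls_actual = HITS * 1.8e9
print(f"reading 2: {calls_actual / SECONDS:.1e} req/s")
```

Reading 1 works out to a couple of dozen requests per second, easily done from a handful of machines; reading 2 is on the order of millions per second, which would indeed take a serious pool of IP addresses.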