14 votes

What's in a link? A recipe for using the web to find a spectacular amount of information about music submissions.

Tags: theory

This discussion is old hat for the l2t mods, but I'd like to get it written down here on tildes so when the time comes to develop these features we've got a record of all the tricks ready to help whoever wants to code it all. It's surprisingly easy to do this now.

First, we're only going to concern ourselves with legal, legit streaming links. That limits the number of sites we need to support to the following...

  1. Youtube
  2. Bandcamp
  3. Soundcloud
  4. Spotify
  5. Google Play

Sure, there are others, but they don't offer free streaming, so they aren't particularly useful for widespread music sharing on social media sites like reddit and tildes. Even on reddit, very little of the music shared comes from pay-for services - it's almost entirely youtube, bandcamp, and soundcloud, in that order. So those are the APIs we need to deal with in order to extract useful information. It's also worth noting that over time some of these will die, new ones will arise to take their place, and they'll change their APIs from time to time, breaking services built on top of them.

Yes, sometimes youtube has pirate streams of music. That's their problem to solve, not ours. The closest we could come to 'helping' in that regard would be verifying that the video posted is on the artist's and/or label's official channels. This is not easy, but it is possible. Frankly, I don't think it's worth the effort - it's hard to code and will have a messy false positive rate. A lazier solution we've used in listentothis for years is simply keeping a blacklist of channels that spam/rip/repost artists' music without permission - and we can get you a copy of our blacklist and whitelist if you like, so you wouldn't be starting from scratch.

Getting all the music information about an artist is a two-step process.

The first step is querying the metadata provided by the sites listed above through their API calls. The relevant information we need for this is simply the name of the artist and the name of the track (or album, if it's an album link). There's plenty of other information available (some of which we will want, like the youtube views and various popularity metrics such as plays, scrobbles, listens, monthly listeners, heat indexes) but that information isn't needed unless you intend to start dividing up the music into sub-categories using other ~tilds or #tags. Eventually we will want to do that (subs like listentothis can't really exist without the popularity numbers) but that's a problem for further in the future, once tildes is a lot more active. For now, let's just concentrate on making the sidebar of a music submission take people's breath away.
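To make step one concrete: once a site's API hands back a raw title, it still has to be split into artist and track. Here's a minimal sketch of that split - purely illustrative, since the "Artist - Track" convention is an assumption and real-world titles are messier:

```python
import re

def split_title(title):
    """Best-effort split of a raw video/track title into (artist, track).

    Assumes the common "Artist - Track" convention used on most music
    uploads; returns None when the title doesn't match, so the caller
    can fall back to asking the submitter.
    """
    match = re.match(r"\s*(.+?)\s+-\s+(.+?)\s*$", title)
    if not match:
        return None
    artist, track = match.groups()
    # Strip trailing bracketed noise like [Official Video] or (HD).
    track = re.sub(r"\s*[\[(][^\])]*[\])]\s*$", "", track)
    return artist, track
```

Anything this can't parse would just get kicked back to the submitter to fill in by hand.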

The second step is using the name of the artist and the name of the track (or album) to look up information about that artist in public databases.

The motherlode of music data resides in Musicbrainz. It has become the de-facto open-source database of record - you might remember its humble beginnings back in the cddb/freedb era, when it was embedded in most cd-ripping tools to provide lookups of artist/track information. It's grown into a wikipedia-esque monster since then. It knows almost everything there is to know about every artist who has ever released so much as a single or an EP, does well even for obscure and new independent artists, and is updated by-the-minute with new artist information.

Musicbrainz has a public API, and they allow dozens of queries per minute, so it would be possible to use their free service - but I think that's the wrong way to do this. Musicbrainz also lets you set up your own copy of their database, and provides scripts to download nightly updates of the data, so it's possible to run the whole thing locally.

For a hassle-free setup, they provide a virtual machine that's ready to go - just download it and boot it up on your network. The VM includes their full API and web services (it looks exactly like their official site), so you can query it locally through the API just like you would the remote site (and you could fail over between the local copy and the official site). If you run just a local copy of the database rather than the VM, you won't get the API, but a local copy does let you query their SQL directly instead. The database is around 30GB right now and grows very slowly.
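As a rough sketch of what step two looks like against the Musicbrainz web service (it works the same against musicbrainz.org or the local VM, since the VM exposes the identical API; the response handling here is trimmed to just the fields this example needs):

```python
from urllib.parse import quote

MB_ROOT = "https://musicbrainz.org/ws/2"  # or point this at the local VM

def artist_search_url(name, limit=5):
    """Build a Musicbrainz artist-search URL with JSON output."""
    return f"{MB_ROOT}/artist/?query=artist:{quote(name)}&fmt=json&limit={limit}"

def best_match(response, threshold=90):
    """Pick the top-scoring artist from a search response dict.

    Musicbrainz attaches a 0-100 relevance score to each search result;
    below the threshold we return None and let a human confirm the match
    rather than guess.
    """
    artists = response.get("artists", [])
    if not artists:
        return None
    top = max(artists, key=lambda a: a.get("score", 0))
    return top if top.get("score", 0) >= threshold else None
```

The threshold is the knob that decides how aggressive the auto-fill gets - anything below it falls back to manual entry.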

What data can we get out of this monster database?

  1. Artist search with a confidence rating for best match
  2. Complete discography and artist bio in excruciating detail
  3. A fantastic collection of every relevant link to other sites
  4. The most relevant collection of 'genre tags' available anywhere

Let's also not forget they come with an army of developers and a great support forum. I think the case for using musicbrainz as tildes' prime music authority kinda makes itself. :P

There's really no need for another data source. Musicbrainz doesn't do popularity numbers yet, but they're planning to add them soon. The Listenbrainz project is, basically, an attempt to reinvent last.fm as an open-source service. Last.fm itself isn't likely to survive - they've been struggling financially for several years - and Listenbrainz hopes to let people import their data before the site goes under.

So what do we build out of this mountain of data? Easy - the laziest submission process for music anywhere on the internet.

I think the goal for users is simple: paste a music link into the submission form, and let tildes do all of the rest of the work for them. The tags and the title can be auto-populated by the lookup, then tweaked by the user. That'll give a sense of uniformity to the titles, and it makes submitting on mobile almost effortless.

Once the submission is created, the sidebar can be populated with the musicbrainz information. I think a good start would be to show the name of the artist, the name of the album, the name of the track (if applicable), and the release year - possibly even the record label and genre tags (big-bucket generic tags like rock, jazz, folk, nothing overly specific). I'd follow that up with the relevant artist links to their own website, their official bandcamp/youtube/twitter/facebook, and possibly links to Discogs and lyric wiki if present. I'd close it with the artist's bio - just a blurb that ends with a 'read more on wikipedia' link.

If/when we have all of this working, we can worry about the next step - finding a way to determine the relative popularity of any given submission. That's a far, far harder problem to solve.

Here's a quick link listing to all of the relevant APIs and their documentation for easy reference.

  1. Musicbrainz XML API
  2. Youtube API
  3. Bandcamp API
  4. Soundcloud API
  5. Spotify / Echonest API (second only to musicbrainz for raw amount of data)
  6. Google Play API

Edit: Also, we can run a local copy of the Discogs database which will give us even more. If we have Musicbrainz and Discogs local, that's almost everything without the need to have Tildes connecting to other sites.

9 comments

  1. [9]
    arghdos
    Link

    Holy shit, did bandcamp decide to have an API again while I wasn't looking????

    > If/when we have all of this working, we can worry about the next step - finding a way to determine the relative popularity of any given submission. That's a far, far harder problem to solve.

    FWIW, I would start with the goal of having a last-posted date for artists / songs, and go from there.

    1 vote
    1. [8]
      Amarok
      Link Parent

      Yeah, that's important for functions like repost-filtering. Without some kind of repost filter, when new albums from popular artists get released there will be an avalanche of submissions, often the same tracks from the same album multiple times each. We've traditionally blocked it outright instead of taking the megathread approach used in other discussion subs. It'd be nice if the people submitting all of that could get redirected to the comments of the first submission instead.

      3 votes
      1. [7]
        cfabbro
        (edited)
        Link Parent

        One of the things I think needs to be added to ~ that will help with this issue, that reddit lacks, is better contextual URL parsing.

        E.g. if someone submits http://youtube.com/watch?v=INcwU-Xk0M4 and someone else later submits https://www.youtube.com/v/INcwU-Xk0M4, reddit doesn't recognize that they are the same video even though they are. Oftentimes the YouTube URL also includes additional tracking and metadata info, e.g. https://www.youtube.com/watch?v=INcwU-Xk0M4&s=twitter.com, which reddit also doesn't recognize as being the same as the plain ?v=INcwU-Xk0M4 submission.

        IMO this is often why there are so many repeat submissions in a row on reddit when a new "hot" video gets posted to YouTube, and why there is so much reposting of content in general. It's not necessarily that people are submitting at the same time, hoping they can usurp the first poster's spot, or even trying to "karmawhore" by reposting something that was front page the previous day... it's that they literally do not know it's already been posted, because reddit doesn't recognize it as the same video and so doesn't notify them it's been submitted already. That, and reddit search sucks balls, so you can't always manually verify whether something has been submitted already.
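        The YouTube case, at least, is mechanical to solve, since everything besides the video ID can be thrown away. A rough sketch (the hostname list is illustrative, not exhaustive):

```python
from urllib.parse import urlparse, parse_qs

def youtube_video_id(url):
    """Extract the video ID from the common YouTube URL shapes.

    Handles watch?v=, /v/, /embed/ and youtu.be forms, and ignores any
    extra tracking parameters; returns None for unrecognized URLs.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "m.youtube.com", "music.youtube.com"):
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        for prefix in ("/v/", "/embed/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None
    return None

def canonical_youtube_url(url):
    """Reduce any recognized YouTube link to a single canonical form."""
    vid = youtube_video_id(url)
    return f"https://www.youtube.com/watch?v={vid}" if vid else None
```

        With that, all three of the URLs above collapse to the same submission key.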

        2 votes
        1. [6]
          Deimos
          Link Parent

          In general, this can usually be handled as long as the site has set up a "canonical link" in their HTML (YouTube doesn't seem to have it, which is annoying, but we could probably do some custom handling).

          For an example of a site that actually does it properly, I can use Wikipedia. Say that someone submits the mobile link because they're on their phone: https://en.m.wikipedia.org/wiki/PHOX

          In their HTML is:

          <link rel="canonical" href="https://en.wikipedia.org/wiki/PHOX"/>
          

          So that can be used to change the link to the "proper" link for the content.
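          Pulling that tag out doesn't need a full scraping framework - the stdlib HTML parser is enough. A minimal sketch (assumes the page HTML has already been fetched):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attr_map = dict(attrs)
            if attr_map.get("rel", "").lower() == "canonical":
                self.canonical = attr_map.get("href")

def find_canonical(html):
    """Return the canonical URL declared in an HTML page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

          If this returns None (as it would for YouTube), that's the signal to fall back to per-site custom handling.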

          3 votes
          1. [3]
            cfabbro
            (edited)
            Link Parent

            Speaking of custom handling, can we please please please auto-convert submitted mobile links to their canonical/desktop version, or better yet simply take people to the appropriate place depending on their device & user-agent?

            E.g. browsing ~ on desktop: always take me to desktop versions of a submitted link's site, browsing on iOS: take me to mobile version when available.

            I despise being automatically taken to mobile versions of sites when clicking submitted links while I'm on desktop, simply because the mobile URL is what the user submitted. Oftentimes it's not easy to get back to the desktop version, either, especially if the site's URLs for mobile and desktop aren't similar. With Wikipedia it's not so bad, since you just need to remove the .m, but on most newspaper websites mobile and desktop URLs are in completely different formats, so it's impossible to manually edit them to take you to the correct version of the site. They often don't respond to useragent overrides like "Request Desktop Site" either, which makes it even more annoying.
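            For the hostname-only cases like Wikipedia, the rewrite is at least mechanical. A sketch that handles only the '.m' subdomain pattern - anything fancier needs per-site rules or the canonical-link lookup:

```python
from urllib.parse import urlparse, urlunparse

def drop_mobile_subdomain(url):
    """Remove a bare 'm' subdomain label, e.g. en.m.wikipedia.org -> en.wikipedia.org.

    Narrow on purpose: it only helps sites whose mobile URLs differ by
    hostname alone, and leaves everything else untouched.
    """
    parts = urlparse(url)
    labels = parts.netloc.split(".")
    if "m" in labels[:-2]:  # ignore the registered domain itself
        labels.remove("m")
        return urlunparse(parts._replace(netloc=".".join(labels)))
    return url
```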

            p.s. Also, fuck Google AMP links. Why would anyone ever submit that shit? ... and yet they do.

            1 vote
            1. [2]
              Deimos
              Link Parent

              Definitely, yes. What I'd like to happen is: whenever someone submits a link, it triggers some scraping to do a few things:

              • Check to see if the link redirects anywhere, or if the destination page has a canonical URL that's different than the one submitted. If either of those is true, "edit" the link to the canonical/destination (and leave a record of what it was originally and what it was changed to). This also means that link-shorteners can't really be used - they'll just be replaced with whatever they point to (which is a good thing).
              • Scrape the site's favicon if we don't already have it (so we don't get those blue dashed squares that some link topics still have).
              • Do any other relevant scraping to get the kind of metadata @Amarok's describing in this post.
              3 votes
              1. Amarok
                Link Parent

                > This also means that link-shorteners can't really be used, they'll just be replaced with whatever they point to (which is a good thing).

                I really love that idea. It'll also protect the user's privacy - external redirectors won't have the opportunity to grab user information if they are cut out of the equation this way.

                1 vote
          2. [2]
            Amarok
            Link Parent

            With youtube, it's the video ID that you really want. No matter what messy URL links back to their site, that video ID is unique to each video, and can be reliably used as the canonical element in a URL-matching scheme. It has to appear in the URL for youtube to function.

            The other most relevant bit is the channel ID, which comes with any video ID lookup. The channel ID is useful for identifying spammers and for blocking/banning problematic channels at the source. Most (not all) spammers and freebooters rely on their channel attracting subscribers, so if the channel itself is blocked, they can't just uproot and create a new channel at the drop of a hat. Youtube itself is making that sort of channel-cycling harder all the time.

            The channel ID also has a lot of potential for grouping content. If you have ~rec.arts.basketweaving and you tally up the youtube channel IDs frequently posted there, you've got a good bead on what's popular on youtube for basketweavers. That can be useful if you ever intend to have tildes suggest related content.

            1 vote
            1. arghdos
              Link Parent

              I was about to suggest -- for certain media sources (youtube, spotify, soundcloud, etc.) they have internal representations (video / track / channel ID) that could be used as the "canonical" URL that @Deimos discussed above.

              1 vote