15 votes

How do Reddit, Lemmy, Tildes, etc. process, store, and serve ranked threads/links?

I'm familiar with how the ranking algorithms work at a high level. What I'm curious about is how those algorithms are actually applied.

How do these platforms actually apply the ranking algorithms so that the user sees the threads appropriately ordered? My knowledge is limited to PHP and MySQL, so I'm looking at it through the lens of those systems. I've thought of a few possible ways, but all of them seem pretty resource-intensive.

  1. Maintain a table of threads with all relevant information required to calculate ranking, as well as the ranking itself. A server-side script executing on a routine basis every X minutes (cron job?) updates the rankings on all the threads, so they can be easily ordered. However, people most likely don't care about threads >Y days old, so those can be excluded or automatically deranked somehow.

  2. Maintain a table of threads with all relevant information required to calculate ranking. When a user visits, the last X threads (again, users probably don't care about really old stuff) are pulled out of the database and run through a ranking and sorting algorithm, then reordered and displayed to the user. This seems the most resource-intensive?
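A minimal sketch of a third possibility, for comparison with the two options above: if the score depends only on votes and creation time (not on the current time), it can be recomputed for a single thread whenever that thread receives a vote, stored in an indexed column, and read back with a plain ORDER BY. The table layout, function names, and the use of SQLite are assumptions made for illustration. The constants follow the commonly cited "hot" formula from Reddit's old open-source code and should be treated as illustrative rather than as what any site runs today.

```python
import math
import sqlite3

# Reference timestamp and time divisor: illustrative constants (assumption).
EPOCH_OFFSET = 1134028003
SECONDS_PER_POINT = 45000


def hot_score(ups, downs, created_ts):
    """Score from net votes and creation time only; never needs 'now'."""
    net = ups - downs
    magnitude = math.log10(max(abs(net), 1))
    sign = (net > 0) - (net < 0)
    return round(sign * magnitude + (created_ts - EPOCH_OFFSET) / SECONDS_PER_POINT, 7)


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE threads (
           id INTEGER PRIMARY KEY,
           title TEXT NOT NULL,
           ups INTEGER NOT NULL DEFAULT 0,
           downs INTEGER NOT NULL DEFAULT 0,
           created_ts INTEGER NOT NULL,
           score REAL NOT NULL
       )"""
)
# The index lets the listing query read rows already in rank order.
conn.execute("CREATE INDEX threads_by_score ON threads (score DESC)")


def create_thread(title, created_ts):
    cur = conn.execute(
        "INSERT INTO threads (title, created_ts, score) VALUES (?, ?, ?)",
        (title, created_ts, hot_score(0, 0, created_ts)),
    )
    return cur.lastrowid


def vote(thread_id, up=True):
    """Record one vote and refresh the stored score of that one thread."""
    column = "ups" if up else "downs"
    conn.execute(f"UPDATE threads SET {column} = {column} + 1 WHERE id = ?", (thread_id,))
    ups, downs, created_ts = conn.execute(
        "SELECT ups, downs, created_ts FROM threads WHERE id = ?", (thread_id,)
    ).fetchone()
    conn.execute(
        "UPDATE threads SET score = ? WHERE id = ?",
        (hot_score(ups, downs, created_ts), thread_id),
    )


def front_page(limit=25):
    rows = conn.execute("SELECT title FROM threads ORDER BY score DESC LIMIT ?", (limit,))
    return [row[0] for row in rows]
```

With this shape, no cron job touches every row and no per-request sorting happens in application code: old threads simply sink because newer ones start with a higher time component.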

I am by no means a professional developer, but I've been dabbling recently and the concept of how these large quantities of data are ranked both perplexes and interests me.

12 comments

  1. [2]
    Bauke
    Link

    The way Tildes does it is by having a last_interesting_activity_time column for the Activity sort (and a last_activity_time column for the All Activity sort). Then when you go to the topic listing, the query orders by that column (depending on which sort you're using).

    To actually set that last_interesting_activity_time, Tildes has a script that runs as a systemd service. In that script you can see a process_message function that consumes messages from the Redis event stream. So any time something happens in Redis, that script sees it and does its magic.

    The All Activity sort works a little differently: it uses a PostgreSQL trigger, so any time a comment is created, deleted, or removed, the topic's last_activity_time column is set accordingly.

    I just quickly figured this out from browsing the source code, so there are definitely still missing pieces, but for an overview I think it's good enough.
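A rough sketch of the trigger idea described above, assuming a simplified, hypothetical schema rather than Tildes' actual one. SQLite stands in for PostgreSQL here so the example runs on its own; the table names, trigger names, and integer timestamps are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE topics (
        topic_id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        created_time INTEGER NOT NULL,
        last_activity_time INTEGER NOT NULL
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        topic_id INTEGER NOT NULL REFERENCES topics(topic_id),
        created_time INTEGER NOT NULL
    );
    -- The listing query can walk this index instead of sorting.
    CREATE INDEX topics_by_activity ON topics (last_activity_time DESC);

    -- A new comment bumps its topic to the comment's timestamp.
    CREATE TRIGGER comment_added AFTER INSERT ON comments
    BEGIN
        UPDATE topics SET last_activity_time = NEW.created_time
        WHERE topic_id = NEW.topic_id;
    END;

    -- Deleting a comment falls back to the newest remaining comment,
    -- or to the topic's own creation time if none are left.
    CREATE TRIGGER comment_deleted AFTER DELETE ON comments
    BEGIN
        UPDATE topics SET last_activity_time = COALESCE(
            (SELECT MAX(created_time) FROM comments WHERE topic_id = OLD.topic_id),
            created_time)
        WHERE topic_id = OLD.topic_id;
    END;
    """
)


def add_topic(title, created_time):
    cur = conn.execute(
        "INSERT INTO topics (title, created_time, last_activity_time) VALUES (?, ?, ?)",
        (title, created_time, created_time),
    )
    return cur.lastrowid


def add_comment(topic_id, created_time):
    cur = conn.execute(
        "INSERT INTO comments (topic_id, created_time) VALUES (?, ?)",
        (topic_id, created_time),
    )
    return cur.lastrowid


def delete_comment(comment_id):
    conn.execute("DELETE FROM comments WHERE comment_id = ?", (comment_id,))


def all_activity_listing():
    rows = conn.execute("SELECT title FROM topics ORDER BY last_activity_time DESC")
    return [row[0] for row in rows]
```

The point of the design is that each write touches exactly one topic row, and the listing is just an ordered read of an indexed column.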

    13 votes
    1. switchgear
      Link Parent

      This is exactly what I was looking for, thank you. Seems like a very clever and efficient way to do it. It's not updating all threads in the database in bulk to rank them, it's more or less just stopping the ranking of each thread when it becomes irrelevant.

      3 votes
  2. [2]
    freestylesno
    Link

    I'm sure they all handle it differently, but Tildes' source code is available. This may be a good place to start.

    https://gitlab.com/tildes/tildes/-/tree/master/tildes/sql/init/triggers/topic_votes

    3 votes
    1. Bipolar
      (edited )
      Link Parent

      You can also find the old Reddit source code online from when they were open source. I don’t know about Lemmy (horrible name).

      6 votes
  3. [7]
    g33kphr33k
    Link

    Once you scale, you have a data table for everything, all cross-linked and indexed.

    Tags? A table. A link? A table. Keywords? Table. You name it, table.

    A user then shows interests and gets tagged with keywords of interest. Those keywords are matched against the tables, and you feed the user posts that use those words. It becomes a really big and complex algorithm.

    The data lakes these folks are running are humongous. They have every scrap of data on you, stored. The code behind it is complex, but hopefully my stupidly simplified version helps.
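The keyword-overlap idea above can be sketched in a few lines. This is a toy version under obvious assumptions: keywords are plain sets held in memory, the score is just the number of shared keywords, and the function and variable names are hypothetical. A real system would do the matching in indexed tables and weigh many more signals.

```python
def rank_feed(user_keywords, posts):
    """Order post ids by how many keywords they share with the user.

    user_keywords: set of keywords the user has been tagged with.
    posts: dict mapping post id -> set of keywords on that post.
    Posts with no overlap are left out of the feed.
    """
    scored = []
    for post_id, keywords in posts.items():
        overlap = len(user_keywords & keywords)
        if overlap:
            scored.append((overlap, post_id))
    # Highest overlap first; ties broken by post id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [post_id for _, post_id in scored]


user = {"python", "databases"}
posts = {
    "p1": {"python", "databases", "redis"},
    "p2": {"cooking"},
    "p3": {"databases"},
}
feed = rank_feed(user, posts)
```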

    2 votes
    1. [6]
      DiggWasCool
      Link Parent

      You get a table! You get a table! You get a table! I get a table! EVERYBODY GETS A TABLE!!!

      /Sorry for a reddit type of joke!

      2 votes
      1. [5]
        g33kphr33k
        Link Parent

        Too many tables and not enough chairs in my opinion.

        All joking aside, the more you store, the more you can customise and give people what they want. It's how you'll curate accurate feeds to keep people engaged. Once you get that right, you can get the users on a death-scroll and feed them the correct ads.

        1 vote
        1. [4]
          vord
          Link Parent

          It's how you'll curate accurate feeds

          But it's also not. The one thing all of the algorithmic feeds have taught me is that the algorithms are only good for showing me things I already like. Discovery is much less good, and while it can be satisfied to some degree with 'others like you like this,' it's utterly incapable of filtering new content.

          I think trying to precisely target falls prey to the same sort of issues that not targeting at all does.

          1. [3]
            g33kphr33k
            Link Parent

            That's why they scatter in other content, capture what you hover on for longer than something else or click through to, then also add that to your profile of "feed them this shit, they like it". Stuff you zip past starts to fall to the side. It's very very clever and just human nature.

            They also do this with paid stuff and try to feed it to you. At least Google Ads lets you check a box that says an ad is not relevant or that you see it too often, but they will keep trying.

            1. [2]
              vord
              Link Parent

              Yeah, the point is, though, that it doesn't work, at least from a proper enjoyment perspective. The best algorithms on the planet won't outcompete the hedonic treadmill. It's why clickbait rules. It works with the data-driven 'eyecatch' framework. And to an algorithm, that's the same as proper enjoyment and fulfillment.

              No algorithm to date is capable of sending fulfillment over cheap bait.

              Throwing random crap works till it doesn't. It can completely kill a vibe. Autogenerated playlists in particular suffer from this.

              2 votes
              1. g33kphr33k
                Link Parent

                I definitely cannot argue against this.

                I wish people would get more exhausted and realise what's feeding them and pull away for a change. That's a different discussion though.

                1 vote
  4. devilized
    Link

    I remember reading about this for Reddit. They used PostgreSQL for their actual database, but had a tiny number of tables. They were mostly storing JSON in it, which Postgres can handle natively.

    SQL databases provide great data consistency, but can be slow at scale. So they use Cassandra as a persistent cache. It's a NoSQL database that can handle sharding at a global scale. When you're making API calls to or from Reddit, you're more likely hitting their Cassandra DB instead of Postgres.
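The general shape of "a fast store in front of the source-of-truth database" can be sketched as a read-through cache. This is only an illustration of the pattern, not a description of how Reddit wires Postgres and Cassandra together: two in-memory dictionaries stand in for the two databases, and the class and method names are hypothetical.

```python
class CachedThreadStore:
    """Read-through cache: reads try the cache first, writes go to the primary."""

    def __init__(self):
        self.primary = {}       # stand-in for the consistent SQL database
        self.cache = {}         # stand-in for the fast, sharded cache layer
        self.primary_reads = 0  # counts how often a read reached the primary

    def save(self, thread_id, thread):
        # Write to the source of truth, then drop any stale cached copy.
        self.primary[thread_id] = dict(thread)
        self.cache.pop(thread_id, None)

    def load(self, thread_id):
        # Serve from the cache when possible.
        if thread_id in self.cache:
            return self.cache[thread_id]
        # Cache miss: read the primary (raises KeyError for unknown ids)
        # and remember the result for later reads.
        self.primary_reads += 1
        thread = self.primary[thread_id]
        self.cache[thread_id] = thread
        return thread


store = CachedThreadStore()
store.save("t1", {"title": "Hello"})
first = store.load("t1")   # miss: goes to the primary
second = store.load("t1")  # hit: served from the cache
```

Invalidating on write, as done here, is one simple choice; updating the cached copy in place on every write is another, with different consistency trade-offs.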

    2 votes