121 votes

The voting on topics and comments now ends when they're 30 days old and all individual vote records are deleted, retaining only the count

This is a privacy-related update that I've always intended to implement on Tildes, and I finally spent some time on it this week.

Eternal records of everything that every user ever voted on are some of the most sensitive data that sites with a voting system have. Your voting history says a huge amount about you, your interests, and your opinions, and can even serve as a decent proxy for what times you were active on the site, what posts you were reading, and how long you spent reading the comments on each of them. In exchange for these major privacy implications, you get the tiny benefit of being able to tell which old posts you voted on (if you even go back to old posts).

So now, to match up with Tildes's general approach of deleting as much sensitive data as possible after 30 days, the voting on posts closes when they're 30 days old. After a post's voting is closed, the records of which individual users voted on that post are deleted, but the count of how many votes there were is kept. So old posts will continue showing exactly the same "scores" as before, but there will be no record of which individual users cast those votes.

However, this isn't a purely positive update: the main downside is that the voting does need to be closed (otherwise there would be no way to prevent people from voting again after their first vote is deleted), which prevents the occasionally useful ability to vote on old topics or comments. Overall though, voting on older posts is extremely rare, with less than 1% of the votes on Tildes ever made on something that was over 30 days old at the time of voting.
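
As a rough illustration of how a figure like "less than 1%" could be derived from vote timestamps, here is a small sketch. The function name `late_vote_share` and the input shape (pairs of post-creation time and vote time) are assumptions for the example, not how the number was actually computed on Tildes.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def late_vote_share(
    votes: Iterable[Tuple[datetime, datetime]],
    window: timedelta = timedelta(days=30),
) -> float:
    """Return the fraction of votes cast when the post was at least `window` old.

    Each item is a (post_created, vote_cast) pair.
    """
    total = 0
    late = 0
    for post_created, vote_cast in votes:
        total += 1
        if vote_cast - post_created >= window:
            late += 1
    return late / total if total else 0.0
```

A result below 0.01 would correspond to the "less than 1%" described above.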

When the "delete old sensitive data" job runs for the first time after this update later today, 97% of the voting data in the database will be deleted. That's a massive decrease in the amount of sensitive data the site is retaining, and something that most sites would never consider doing, because of the value of that data for behavior analysis and ad-targeting.

35 comments

  1. [2]
    ThatFanficGuy

    Good thinking, and good on you for taking the step.

    Just to be clear: if I make a comment on a post that's two months old, it will still have the 30-day vote life, correct?

    Also: is there a performance benefit to cleaning up this big a part of the database?

    24 votes
    1. Deimos

      Yes, you can still comment on posts of any age (and I have no intention to change that), and each comment gets its own 30-day voting window. It's always based on the comment's age, not the age of the topic it's in.
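
      A minimal sketch of that per-item window, assuming a hypothetical `Post` type that stands in for either a topic or a comment (the names here are illustrative, not Tildes's own):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

VOTING_WINDOW = timedelta(days=30)


@dataclass
class Post:
    """Hypothetical stand-in for a topic or a comment."""
    created: datetime


def is_voting_open(post: Post, now: datetime) -> bool:
    """Voting depends only on this item's own age, never on its parent topic's."""
    return now - post.created < VOTING_WINDOW
```

      With this check, a comment posted yesterday on a two-month-old topic is still open for voting even though the topic itself is closed.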

      There's probably not any significant performance benefit right now, but over the long term it'll definitely help not to have massive tables full of voting data growing infinitely. That should help both with performance and with keeping the total size of the database down.

      28 votes
  2. dredmorbius

    @Deimos: Thank you so much for this.

    I've worked on a few social media sites and projects over the years. One of the first, around the turn of the millennium, had me realising that the tracking of individual voting / rating activity presented a hugely sensitive dataset that could give rise to tremendously intrusive profiling.

    My master scheme to prevent this from happening by boycotting working for such projects over the next decade ... failed miserably.

    I'm glad to see awareness rising, and that you've implemented cleansing of personal voting data on Tildes.

    This despite many regrets over the potential utility of such data in coming up with a potentially better collaborative filtering / rating / ranking system. I no longer think those benefits are worth the costs.

    Regarding the 30-day voting window: so far as rating/ranking of comments goes, this is all but certainly sufficient. The majority of value from moderation comes from the early votes, up or down. With time, the tendency is for negatives to accrue. The potential value of using votes (or likes, stars, etc., depending on site mechanics) as a sort of bookmarking system also really doesn't merit the risks (though the mechanism can be useful). But the loss is small.

    I'm pretty sure there's an exponential decay in voting activity after the first few hours of a post in any regard.

    15 votes
  3. [6]
    Bauke

    Would this mean this recent MR now becomes impossible (for lack of a better word) to implement? I guess if that were implemented it would only show 30 days worth of votes you've made?

    12 votes
    1. Deimos

      That MR is actually what pushed me to do this now. People shouldn't be using that upcoming "what I've voted on" listing to keep track of things over the long term, and I didn't want anyone to start getting into that habit. If they want to keep track of a post for a long time, that's what Bookmark is for.

      So yes, I'm still planning to implement it (hopefully this week), but it will only show posts from the last 30 days that you've voted on.

      28 votes
    2. [4]
      emdash

      I guess if a post/comment is important to you, enough to warrant wanting to keep it later, either the bookmarking functionality on Tildes or even in your browser suffices? That raises the question though, are you intending this to also apply to Tildes bookmarks at some point as well @Deimos? Or are those going to be exempt from this?

      I appreciate the privacy aspect but I also would occasionally like the usability of going back to a post/comment I consider worth revisiting in the future too.

      11 votes
      1. Deimos

        Bookmarks will definitely never be deleted without the user specifically choosing to.

        17 votes
      2. [2]
        Bauke

        I think Deimos answered your questions in his comment. :P

        If they want to keep track of a post for a long time, that's what Bookmark is for.

        6 votes
        1. emdash

          Yeah, he posted his comment while I finished typing mine (our comments are separated by a mere 80 seconds)—it happens, oh well!

          7 votes
  4. [4]
    ainar-g
    (edited )

    I am not against this feature, but 30 days sounds like an interval that is too small for my taste. Just as a data point, I mostly don't have the time to listen to every track in ~music, so every once in a while I sort by most votes from last 14, 30, or even 60 days, listen to it, and vote. Probably not a lot of people do that, but still. 90 days would be more comfortable to me.

    12 votes
    1. [3]
      AugustusFerdinand

      To be fair at that point your vote no longer matters. Your vote isn't going to move the post up or down the sorting. There, thankfully, isn't a score being kept here, no karma total to accrue, no measuring stick to compare your profile to another.

      10 votes
      1. [2]
        ainar-g
        To be fair at that point your vote no longer matters. Your vote isn't going to move the post up or down the sorting. (…)

        It does though. E.g. another person looking through the posts of the week or month later and only checking out stuff that has at least N votes. It's a pattern I used to use on Reddit, and somebody else may use it here.

        5 votes
        1. AugustusFerdinand

          Privacy of all vs a few users that set an arbitrary vote minimum for viewing older content

          7 votes
  5. [5]
    Zeerph

    Potential confusing use case

    If using all-time activity sort and a thread comes up that the user has yet to see, it's not immediately obvious why the user can't vote on it until they click on the thread and scroll all the way down to the bottom. While there is a date that the topic was posted, without foreknowledge a user could not know that voting on the topic or post was expired.

    e.g. This post was bumped back to the top of the activity feed, and having missed it the first time around, I wanted to vote on it, but couldn't.

    Solution(s)?

    Perhaps making the "vote" button show differently on hover to indicate that the post is beyond the time limit. Maybe an "X" or a strikethrough would be useful?

    Show a box on hover that says that voting has expired? Or show the expiry date for all votes?
    Examples include: "Voting expired", "## days left to vote", or "Best voted on by YYYY-MM-DD" (not that anyone can actually agree on a dating format)

    Something like these labels on hover might, potentially, make the initial encounter of an out-of-date post less confusing, and it would be a good reminder if the lack of ability to vote is encountered again.

    Addendum

    On a slightly unrelated note, it's a little hard to know when a topic is locked without going to the bottom of said topic (unless something has changed since I last saw a locked topic). Perhaps changing the background of the topic to diagonal stripes would be a good visual cue.

    11 votes
    1. [2]
      Eylrid

      That example post brings up something else. It was edited after voting was locked. In this case it was a benign edit to update video links. But someone could abuse this by editing in a message that the original voters don't agree with. The same thing can be done on reddit, and doesn't seem to be much of a problem there (that I'm aware of). There are some differences between there and here, though. Posts on reddit take six months to be archived, by which time the vast majority of posts will rarely be seen again. The 30 day limit here is fresher, and because of the activity sort old posts can come back to the front.

      10 votes
      1. Deimos
        (edited )

        Yeah, it's definitely possible, but if someone edits a post in a malicious way, that's its own problem that wouldn't really be solved by leaving voting open anyway (a lot of the original voters will never see the change or update their votes). If necessary, the post can be removed and/or other actions taken, such as banning the user if it's egregious.

        4 votes
    2. [2]
      Deimos

      Thanks, that's all good feedback. I'll see what I can figure out for indicating that voting is closed (added an issue: https://gitlab.com/tildes/tildes/issues/608)

      And the notification about a topic being locked is quite large/bright and at the top of its comments.

      4 votes
      1. Zeerph

        And the notification about a topic being locked is quite large/bright and at the top of its comments.

        Somehow I missed that last time I looked at a locked topic. Maybe it's because I'm using the Dracula theme and the colours do not stand out so much? Although I'm curious, why not display that a thread is locked before the user goes into the locked topic?

        1 vote
  6. [9]
    HoolaBoola

    How big a percentage of the database was the voting history in total, approximately? I'd imagine there's a lot of data which now won't be needed.

    And kudos to you, it does make sense from both the user and the service's point of view (assuming that, like Tildes, the service doesn't want to profit off people's voting history)

    8 votes
    1. [8]
      Deimos

      It's not a huge chunk of the data overall, since each individual row is very small (just a user ID, post ID, and time).

      Currently, the comment votes table and its indexes are about 90 MB total, and topic votes and indexes are about 35 MB. The comments table alone without indexes is 400 MB.

      8 votes
      1. HoolaBoola

        Ah, gotcha. So not that much, but not an irrelevant amount either.

        2 votes
      2. [6]
        MetArtScroll

        Thanks a lot for this change, but I have a question.

        It's not a huge chunk of the data overall, since each individual row is very small (just a user ID, post ID, and time).

        Is there a reason for saving the vote time on Tildes at all? For the fact of voting, the IDs should suffice.

        1 vote
        1. Deimos
          (edited )

          It's useful data for a lot of reasons. As one example, I wouldn't have been able to figure out that less than 1% of votes have ever been cast on posts older than 30 days without it. The upcoming page that will let you view posts you voted on uses it for ordering, so that the posts are displayed in the order that you voted on them, instead of the order they were posted.

          Like @emdash says, it's also very important for looking into voting patterns for issues like vote-cheating. If there's a post with suspected vote-cheating but it isn't being investigated until it's hours or days old and a lot of people have voted by that point, it's essential to be able to look and see "these 5 accounts voted on the post immediately after it was posted", instead of just having a huge list of users with no information about timing.
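
          To illustrate why the timestamp matters for that kind of check, here is a small sketch. The function `early_voters`, its input shape, and the 5-minute threshold are assumptions chosen for the example, not a description of how Tildes investigates vote-cheating.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def early_voters(
    post_created: datetime,
    votes: Iterable[Tuple[str, datetime]],
    within: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return the users who voted within `within` of posting, earliest first.

    Each vote is a (user_id, voted_at) pair. Without `voted_at`, all that
    would be available is an unordered list of everyone who ever voted.
    """
    early = [
        (voted_at, user_id)
        for user_id, voted_at in votes
        if timedelta(0) <= voted_at - post_created <= within
    ]
    return [user_id for _, user_id in sorted(early)]
```

          A list of accounts that repeatedly shows up in this result across one author's posts is the sort of pattern that would be invisible without vote times.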

          10 votes
        2. [4]
          emdash

          Could be quite useful in determining or correlating spam activity or sockpuppet accounts I guess—i.e. an unusually elevated level of activity across multiple users on consistent occasions would be a good indicator of some link between the accounts.

          4 votes
          1. [3]
            MetArtScroll

            Yes, but this looks like that infamous “to protect from bad behaviour, we will track everything everyone does” approach. I believe the administration benefits are more than outweighed by the privacy costs.

            1 vote
            1. [2]
              emdash

              Well that’s a far swing to the extreme. I don’t think recording one snippet of information like that is remotely comparable to what you describe.

              It doesn’t have to be a case of “let’s record every bit of information”, nor does it need to be “let’s destroy usability and platform quality by recording as little as possible”. There are many nuanced middle grounds along that road, and I don’t think you can infer intent (malicious or otherwise) from the recording of a vote timestamp.

              9 votes
              1. MetArtScroll

                I don’t think you can infer intent (malicious or otherwise) from the recording of a vote timestamp.

                Which seems to support my point that there is no need to record vote timestamps. By Tildes' ideology, if certain private data are not absolutely necessary, then they should rather not be recorded.

                However, I agree that the issue is minor. If Tildes is not going to pass its data to other entities, then the worst thing is that someone would be able to infer the most recent 30 days' activity patterns.

                1 vote
  7. [2]
    Adys

    Is there a way to opt out of this, at least for posts?

    On HN, I can go to https://news.ycombinator.com/upvoted?id=<myusername> to view all the submissions I've upvoted. On Reddit as well. This is a pretty big deal for me because these pages build up "bookmarks of interesting things" I can go back to.

    For example, 4 months ago, I upvoted this submission: http://www.zachtronics.com/zachademics/ -- via https://news.ycombinator.com/item?id=20350125.

    Edit: I see Deimos has already replied about this concern. I get that there's a bookmarks feature, but that would push me to actually have to bookmark every single item I upvote, which is both impractical and unreliable (I might forget)…

    7 votes
    1. Deimos
      (edited )

      No, there won't be any way to opt out. Looking at your past votes isn't even currently a feature that exists on Tildes, but once it does, if keeping a full voting history is important to you, I'd suggest setting up a script that scrapes/saves your votes page every day or so. Then you can keep that data locally if you want to have a longer history of it.

      Edit: I intend to add RSS support to various pages on Tildes eventually, so maybe using an RSS reader to keep track of your votes long-term might be a reasonable approach too.

      13 votes
  8. [2]
    Eylrid

    Are you going to do the same thing with labels?

    6 votes
    1. Deimos

      Probably not exactly the same thing, but I'd like to do something with them eventually, yes.

      Labels are far less common (there were 550 comment labels this month, and over 22,000 comment votes), so it's not nearly as much sensitive data. However, there also aren't many reasons to need to keep info about who applied the labels over the long term, so I think it would be best to anonymize them too.

      It's probably important to allow at least Exemplary and Malice labels to still be applied regardless of the post's age though, so they might need to be treated differently if we ever add restrictions.

      5 votes
  9. balooga

    Thank you! This is actually something I'd been wondering about, and I appreciate your explanation of the change.

    5 votes
  10. [2]
    umbrae

    This information is often useful for anti abuse purposes. Specifically I am thinking about the potential for political or state sponsored manipulation and analysis of that.

    I imagine this is not a problem on tildes now but did you consider the risk of not having this info for anti evil reasons anymore? What makes the trade worth it for you?

    4 votes
    1. Deimos
      (edited )

      Yeah, I've definitely been thinking about it. Overall, it's nice data to have for retroactive analysis, but I don't think it's very effective for actual prevention. There are other approaches that I think can probably be similarly effective that don't require keeping detailed, sensitive behavioral data on every single user indefinitely.

      And honestly, Tildes will probably never be large enough to need to worry about anything on the level of state-sponsored campaigns. I think those will mostly stick to the extremely mainstream platforms, where it's easier to hide in the crowd and the target audiences are larger.

      9 votes
  11. timo

    It's smart and sensible! I really appreciate the focus on privacy!

    3 votes