34 votes

Seems like all socials are being scraped for AI and personal/aggregate data. Is Tildes?

I was just reminded of that again when going back and looking at some of my old posts on Reddit, which is openly selling user data. That prompted me to use Redact, which erases and overwrites comments before deleting them. But that got me wondering if the same is true of Tildes? And how would we know?

21 comments

  1. Moonchild

    same is true of Tildes?

    Selling? No, I don't believe Deimos is doing that. Scraping? Obviously.

    how would we know?

    Think real hard about the world.

    56 votes
  2. json

    Scraping for data has been happening on every website for at least 25 years because of search engine web crawlers.

    53 votes
  3. [2]
    Grumble4681

    I think a safe assumption is anything you put online that is publicly accessible is being scraped for AI or can be. The alternative is to close off your community. I think at one point in the past Tildes was set up so that only those who were signed up could see what was posted, but it's set up now so that anyone can read it. Ultimately you just have to accept some loss of control over things you say or put out there when you do it in public. To close off the community makes it harder for the community to grow. I'm sure there are some old discussions on here somewhere, from around the time that change was made, that probably discuss the pros and cons of that decision in more detail. I don't think any selling of data is occurring on Deimos's part; I'm merely discussing the fact that the data is publicly available and anyone can access it.

    I don't say anything here specific enough that should tie back to me personally, and I used a randomly generated username so that it couldn't be associated with data I posted on other sites. I suppose it could be possible for something to identify a writing pattern to be able to identify me across different platforms, though I'm not sure that is realistic currently.

    33 votes
    1. Nsutdwa

      I suspect your last point about being identifiable on the basis of your writing characteristics is very possible. I also suspect that with the right expertise, it's almost trivial. That said, I don't think the knowledge is widespread or easily available online (yet, and I stand open to correction on either point). I have personally toyed with the idea of using an LLM to act as an intermediary to obfuscate my own writing style. You can easily imagine an input box that could pop up as an overlay to the Tildes textbox, take in all my guff and transform it, thus cloaking my identity. You lose your own voice, for sure, which means it probably wouldn't be useful for places where that's important, but if you wanted to say controversial things online, I suppose it could be a useful tool - like using a VPN but for your writing identity!
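
      As a rough illustration of how writing-pattern matching can work in principle, here is a minimal sketch that compares two texts by their character-trigram frequencies using cosine similarity. This is a toy under stated assumptions, not a real attribution method: the function names are made up for this example, and serious stylometry uses much richer features and far more text.

```python
import math
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count overlapping 3-character sequences in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(text_a: str, text_b: str) -> float:
    """Hypothetical helper: higher means the two texts share more trigram habits."""
    return cosine_similarity(trigram_profile(text_a), trigram_profile(text_b))
```

      Running text through a paraphrasing step before posting would change these trigram counts, which is the intuition behind using an intermediary to cloak a writing style.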

      8 votes
  4. [5]
    vord

    Just remember, as I tell my kids:

    If it touches the internet, assume that the world can see and it's there forever. Nowadays that should probably just be extended to 'If it's on a screen'.

    Even E2E encrypted messaging doesn't stop the recipient from snapping a screenshot and sharing on social media.

    21 votes
    1. [3]
      Comment deleted by author
      1. [2]
        CptBluebear

        there's only so many people in the EU who have worked alongside dogs in the military, AND seen Dune II.

        Oof, yeah the military dogs is a hard sell, but I've seen Dune II as a citizen in the EU! Partial hit!

        4 votes
        1. [2]
          Comment deleted by author
          1. RheingoldRiver

            Ah well I wasn't sure, but now that you mention Karl...

            4 votes
    2. [2]
      Moonchild

      I think the mention of E2EE weakens the point, and that even without that it's far too strong. Cryptographically you can, at least to a first approximation, send somebody an end-to-end encrypted message in such a way that they can be assured that it was you that sent it, and yet they cannot prove to anybody else that you actually sent it. Practically speaking, trust is obviously something you have to reason about, both in real life and computer security—obviously, somebody you tell a secret of yours to can spill it. (How's it go?—'two can keep a secret, if one is dead'?)

      More generally, we are faced with signals whose samples are noisy, imperfect, and doctored; and we have to reason about them. (Ohh, there is some juicy preliminary research into conceptual models for this at my work, which I sadly cannot share yet.) You can't have it both ways. It can't simultaneously be all noise ('nothing on the internet can be trusted') and no noise ('everything you do, everybody will know about it, with certainty, forever'). Certainly, there are asymmetries between actors (this is what the research at my work is into)—a large corporation or government has a lot more signal than you do—but 1) those asymmetries are not absolute, and they are not that large, and 2) extremely-resourced actors are not the only ones you need to think about.

      If you see a picture depicting two people having sex, what do you make of it? Did somebody feeling horny punch those two celebrities' names into this month's 'ai image generator'? Is a trusted friend telling you they think your partner is cheating on you? There are higher-order considerations, too. Did they intentionally ask the image generator to make a less ... appealing picture, to make it seem more realistic? And isn't celebrity gossip a waste of your time anyway?

      In the case of the large actors (since we are talking about them right now), the operative thing for them is scale. If tildes were invite-only, it would be a complete waste of google's time to send somebody to infiltrate and scrape it. (Jilted ex? Suspected of terrorism by the nsa? Different story.) Which means that, at a personal level, it's as 'easy' as opting out of scale. Structurally it's harder: you have to deal with the scale, and that requires game theory and information theory. 4chan succeeded in making google's ocr see racial slurs where there were none once, but has so far failed to repeat the feat.

      Incidentally, the internet is neither here nor there considering how many 'security' cameras there are all over the place. Fun gait analysis research. Don't want to be tracked? Put rocks in your shoes. Or wear heels instead of flats (or vice versa). But then, you probably know if there are security cameras at your friend's house, and there almost certainly aren't any in the middle of the woods.

      8 votes
      1. vord
        • Exemplary

        You make fine points, but I don't think any negate mine.

        I'm not talking about the specifics of encryption, I'm talking about trust. Of course trust is essential. It's the most essential. And while you can have more trust in people you know...people you know are also more likely to rape you than a stranger, because they abuse the trust.

        Yes, I know it won't actually be there forever to anybody but highly motivated nation-state actors. But for the most part it's practical advice: There's no taking back what you send out. Even if there's a delete button, or supposed 'self destruct,' you have to trust that every single recipient of every single thing you send didn't archive it.

        For publicly facing things, that's an impossible assumption. And while you can dump the ex and file a lawsuit for re-sharing the nudes you sent them in confidence, it doesn't change that once that trust is broken, anybody on the planet could have seen, and you have no way of knowing. And you can't know who will break your trust ahead of time.

        And yes, I have major problems with the mass internet-connected surveillance network controlled by a tiny handful of big players. Because I don't trust every single person working at Amazon, the local police, the state police, the federal police, the various three-letter agencies, random hackers, corporations that would like to mine surveillance data to make a buck, and the various international agencies that may also seek to exploit this info.

        But people are terrified of getting an easily-replaceable Amazon package stolen off their porch, so mass surveillance it is.

        And that's ultimately why you should use handles instead of real names on the internet. It serves as a first-order privacy abstraction. Even if it can be discovered easily, it at least rules out 90% or more of the population that wouldn't dive deeper.

        5 votes
  5. [6]
    winther

    The robots.txt blocks a number of known scrapers, but that only stops those that play nice.

    16 votes
    1. [5]
      Ganymede

      Which means none of them.

      3 votes
      1. Wes

        Most large companies respect robots.txt, including OpenAI and Google. I can't find official information for Facebook or Claude, but I do see a number of lists include both Claude-Web and anthropic-ai.

        Scraping is legal in the US where most of these companies operate, but it is generally more polite to respect robots.txt anyway.
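
        For anyone curious what "respecting robots.txt" looks like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and URLs below are made up for illustration and are not the actual Tildes robots.txt; the point is only that a polite crawler checks before fetching, and nothing forces an impolite one to do so.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not any real site's file).
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /user/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """A well-behaved crawler calls this before every request."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
# The named bot is blocked everywhere; other bots are only blocked under /user/.
print(may_fetch(parser, "GPTBot", "https://example.com/topic/123"))
print(may_fetch(parser, "SomeOtherBot", "https://example.com/topic/123"))
print(may_fetch(parser, "SomeOtherBot", "https://example.com/user/alice"))
```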

        16 votes
      2. [3]
        skybrian

        I would expect Google to drop a website from their index if they're excluded in robots.txt. (Although, some search results will still appear due to links pointing to the website.)

        Is there any evidence that they don't do that? It doesn't seem like it would be hard to test.

        4 votes
        1. Minty

          Not evidence, but I've seen plenty of websites with robots.txt blocking everything beyond the front page, and google showing only the front page in results, blind about everything deeper. Looks nice to me.

          Because anyone can check.

          As for the ones scraping content for machine learning or to sell personal data, I don't quite see why they would bother playing nice.

          1 vote
  6. unkz

    Yes, probably, since by googling for site:tildes.net there are results. If google is scraping it, they are probably using it for their model at least.

    13 votes
  7. [2]
    xk3
    • Exemplary

    One thing I noticed when backing up my personal data via scraping Tildes is that if you don't have a cookie you can only scrape the first page of my user history. I think Tildes is pretty small, so any custom script to scrape data is probably not worth it to companies.

    8 votes
    1. Amarok

      That's right. User histories stop at the first page of comments unless you are logged in, then you can go further back on a user history page. No reason we have to make it easy for a search engine to archive all of a user's own comment history in a way that someone can search through it all later. Threads here can be scraped easily enough, but that's keeping all the comments from all the users in-context of the thread.

      5 votes
  8. NoblePath

    I don’t know the answer, but I say, I hope so! We are a shining example of how awesome internet discussion should be. If I were emperor, they’d train only on Tildes and Winnie the Pooh books. (For English AIs)

    4 votes
  9. raze2012

    how would we know?

    In the grand scheme of things, we can't. But dumb scrapers can lead to higher server load or erratic loading behavior from the same IP or other similar oddities while overall not increasing the amount of posts/comments.

    These can be mitigated on other sites with stuff like captchas, but it's a game of cat and mouse.
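
    To make the "erratic loading behavior from the same IP" idea concrete, here is a minimal sketch of a sliding-window check over an access log. Everything here is an assumption for illustration: the `(timestamp, ip)` input format, the function name, and the thresholds are invented, and it would only catch the dumb scrapers, not ones that rotate IPs or throttle themselves.

```python
from collections import defaultdict, deque


def flag_busy_ips(requests, window_seconds=60.0, max_requests=30):
    """Return the set of IPs that exceed max_requests within any sliding window.

    requests: iterable of (timestamp_in_seconds, ip) pairs, sorted by timestamp.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for timestamp, ip in requests:
        window = recent[ip]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged
```

    Feeding it a log where one address makes a hundred requests in ten seconds and another makes five over several minutes would flag only the first.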

    3 votes
  10. boxer_dogs_dance

    I think I remember seeing in a discussion that people not logged in can only see recent content...

    1 vote