19 votes

Is it OK to scrape Tildes?

I wanted to keep the title, and the question for that matter, generic, but my use case is this: I want to make a backup of my posts on Tildes, and I'd like to automate that with a script that curls my user page and periodically downloads anything new from it. So for my personal case, the question is: is this an allowed / welcome practice?

The generic question: is it welcome to scrape Tildes' public pages in general?

10 comments

  1. [5]
    Deimos
    Sure, that's definitely fine. There are already quite a few bots (mostly from search engines and similar services) scraping the site regularly. The site is public now, there's not really any way to prevent scraping (or anything wrong with it).

    22 votes
    1. [3]
      cfabbro
      "(or anything wrong with it)"

      It should probably be noted there are a few exceptions in terms of the application of said scraping though:

      Do not maliciously attempt to counteract other users' attempts to delete or edit their content, such as by deliberately re-posting content they want to be deleted.

      Do not post anyone's sensitive personal information (related to either their real world or online identity) with malicious intent.

      https://docs.tildes.net/code-of-conduct

      10 votes
      1. [2]
        alexandria
        That doesn't say don't archive other people's posts, just don't repost it if they've deleted it.

        1 vote
        1. cfabbro
          Exactly, which is why I specifically said "there are a few exceptions in terms of the application of said scraping"... scraping is not against the rules, but using scraped data to do either of those things is.

          3 votes
    2. unknown user
      Thanks!

      5 votes
  2. clerical_terrors
    People have already done something like this if you're interested. Deimos seems largely ok with it but sending him a message couldn't hurt.

    5 votes
  3. [5]
    Comment deleted by author
    1. unknown user
      Thanks!

      "By the way, how exactly do you plan on scraping your profile? You should share your methods when you get something put together, if you don't mind."

      Well, off the top of my head, I'd fetch my page, see if there is new stuff, and just write the inner HTML of each new item to a file whose name is the comment ID. If the most recent comment is already downloaded, there's nothing to do, so the script terminates immediately; no need for hashing or anything. The initial scrape would be a big batch, where I'd have to follow the "Next" links until I hit the bottom, with a delay of a minute or more between fetches so as not to overwhelm the server. I'd only invoke this script manually, after I make 5-10 comments. IDK if that'd count as a heavy load; I do have a considerable backlog of comments since last July, but it shouldn't be much different from me browsing my own comments.

      Because Tildes is a Good Website, nothing fancy would be needed; all it'd probably take is cURL + something that can parse HTML, so it'd probably end up being a small Bash script or a Python 3 script with urllib3. I'll definitely share it here if I ever do it.
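
A minimal sketch of the plan described above, not a finished tool and not the poster's actual script. It uses only the Python standard library (`urllib.request` and `html.parser`) rather than cURL or urllib3. The markup it looks for is an assumption: each comment as an `<article id="comment-ID">` element and pagination as an `<a>` whose text is "Next". The names `PageParser`, `fetch`, and `backup` are hypothetical.

```python
import html
import time
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects comment HTML and the "Next" link from one listing page.

    Assumed markup (not confirmed by the thread): every comment is an
    <article id="comment-ID">, and pagination is an <a> with text "Next".
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.comments = []     # (comment_id, inner_html) in page order
        self.next_href = None
        self._id = None        # id of the comment currently being captured
        self._depth = 0        # nesting depth of <article> tags
        self._buf = []
        self._href = None      # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._id is not None:
            self._buf.append(self.get_starttag_text())
            if tag == "article":
                self._depth += 1
        elif tag == "article" and (attrs.get("id") or "").startswith("comment-"):
            self._id = attrs["id"][len("comment-"):]
            self._depth, self._buf = 1, []
        elif tag == "a":
            self._href = attrs.get("href")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no end tag.
        if self._id is not None:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._id is None:
            if tag == "a":
                self._href = None
            return
        if tag == "article":
            self._depth -= 1
            if self._depth == 0:
                self.comments.append((self._id, "".join(self._buf).strip()))
                self._id = None
                return
        self._buf.append(f"</{tag}>")

    def handle_data(self, data):
        if self._id is not None:
            # Text arrives unescaped, so escape it again before saving.
            self._buf.append(html.escape(data, quote=False))
        elif self._href is not None and data.strip() == "Next":
            self.next_href = self._href


def fetch(url):
    """Download one page; the User-Agent string is a placeholder."""
    req = urllib.request.Request(url, headers={"User-Agent": "personal-backup-sketch"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def backup(start_url, out_dir, fetch=fetch, delay=60):
    """Save each new comment to <out_dir>/<id>.html; return how many were saved."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    url, saved = start_url, 0
    while url:
        parser = PageParser()
        parser.feed(fetch(url))
        parser.close()
        for comment_id, inner in parser.comments:
            path = out_dir / f"{comment_id}.html"
            if path.exists():
                return saved   # newest-first listing: the rest is already saved
            path.write_text(inner, encoding="utf-8")
            saved += 1
        url = urljoin(url, parser.next_href) if parser.next_href else None
        if url:
            time.sleep(delay)  # be gentle with the server between pages
    return saved
```

Something like `backup("https://tildes.net/user/USERNAME", "tildes-backup")` would be the entry point, with the URL adjusted to whatever the user page actually is. One caveat of stopping at the first already-saved comment: if the initial big batch is interrupted halfway, a later run will not go back for the older comments, so the first run would need to be restarted from scratch.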

      BTW I have something similar for HN which uses the Firebase API; if anybody's interested, I can publish it somewhere for you to use. It's a couple of pages of Python.
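
For readers curious what an API-based backup might look like, here is a rough sketch. It is not the script mentioned above, only a guess at the general shape. It relies on the public Hacker News Firebase endpoints `/v0/user/<name>.json` (whose `submitted` field lists the user's item IDs) and `/v0/item/<id>.json`. The function names, the one-file-per-item layout, and the one-second delay are assumptions made for illustration.

```python
import json
import time
import urllib.request
from pathlib import Path

API = "https://hacker-news.firebaseio.com/v0"


def get_json(url):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def backup_hn(username, out_dir, get_json=get_json, delay=1.0):
    """Save every item the user submitted as <out_dir>/<id>.json.

    Items already on disk are skipped, so repeated runs only fetch
    what is new. Returns the number of items saved in this run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # The API answers with null for unknown users, hence the "or {}".
    user = get_json(f"{API}/user/{username}.json") or {}
    saved = 0
    for item_id in user.get("submitted", []):
        path = out_dir / f"{item_id}.json"
        if path.exists():
            continue
        item = get_json(f"{API}/item/{item_id}.json")
        path.write_text(json.dumps(item, indent=2), encoding="utf-8")
        saved += 1
        time.sleep(delay)  # small pause between requests
    return saved
```

Because the API returns structured JSON, no HTML parsing is needed, which is presumably why such a script can stay short.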

      (I have a scraper that I used to generate RSS from my uni's horrible websites, and that messy pile of JS shit was such a pain to deal with. I used PhantomJS and Perl's RSS library to make something work, and it helped me so much back then; thanks to it I didn't need a Facebook account just to follow my department's FB group.)

      5 votes
    2. [3]
      kfwyre
      I'll add my voice to the chorus and say that I'm also interested in having a way to back up my profile. I'd love for this kind of thing to be built into the site--a kind of "Tildes Takeout" option.

      Also, this is somewhat off-topic, but I figure the crowd here would be a good one to ask: does such a backup tool exist for reddit? I have an old account over there I would like to delete, but I haven't done so yet because I want to save my posts and comments from it first. I've looked up a couple of different options, but they're either too techy for me to get off the ground, or too untrustworthy for me to plug in my account info.

      3 votes
      1. heady
        I used Power Delete Suite.

        You don't have to delete anything to use the backup feature, just untick the boxes for deleting/editing comments and submissions when you run it.

        3 votes
      2. Algernon_Asimov
        "I figure the crowd here would be a good one to ask: does such a backup tool exist for reddit?"

        Why not ask the people on Reddit themselves? I know from personal experience that the folks at /r/Help are... well... helpful.