Is it OK to scrape Tildes?
I wanted to keep the title (and the question, for that matter) generic, but my use case is that I want to make a backup of my posts on Tildes, and I'd fancy automating that with a script that curls my user page and downloads fresh stuff from there periodically. So for my personal case, the question is whether this is allowed / welcome practice.
The generic question is: is it welcome to scrape Tildes' public pages in general?
Sure, that's definitely fine. There are already quite a few bots (mostly from search engines and similar services) scraping the site regularly. The site is public now, so there's not really any way to prevent scraping (or anything wrong with it).
It should probably be noted that there are a few exceptions in terms of the application of said scraping, though:
https://docs.tildes.net/code-of-conduct
That doesn't say not to archive other people's posts, just not to repost them if they've been deleted.
Exactly, which is why I specifically said "there are a few exceptions in terms of the application of said scraping"... scraping is not against the rules, but using scraped data to do either of those things is.
Thanks!
People have already done something like this, if you're interested. Deimos seems largely OK with it, but sending him a message couldn't hurt.
Thanks!
Well, off the top of my head, I'd fetch my page, see if there's any new stuff, and just write the inner HTML of that stuff to a file whose name is the comment id. If the most recent comment is already downloaded, there's nothing to do, so the script terminates immediately; no need for hashing or anything. The initial scrape would be a big batch, where I'd have to follow the "Next" links until I hit the bottom, with delays of a minute or more between fetches so as not to overwhelm the server. I'd only invoke this script manually, after I make 5-10 comments. IDK if that'd count as a heavy load; I do have a considerable backlog of comments since last July, but it shouldn't be much different from me browsing my own comments.
Because Tildes is a Good Website, nothing fancy would be needed; all it'd take is cURL + something that can parse HTML, so it'd probably end up being a small Bash script or a Python 3 script with urllib3. I'll definitely share it here if I ever do it.
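For the curious, here's a rough sketch of what I have in mind, in Python 3 with urllib3 (plus BeautifulSoup for the parsing part, as one possible choice). The Tildes markup details here are assumptions I'd still have to verify against the actual pages: I'm guessing that comments show up as `<article>` elements with id attributes like `comment-...` and that pagination is a link labelled "Next".

```python
# Minimal sketch of the incremental backup idea described above.
# The Tildes selectors (article ids, "Next" link text) are guesses,
# not verified against the real site.
import os
import time

import urllib3
from bs4 import BeautifulSoup  # one possible HTML parser; lxml etc. would also work

USER = "your_username"  # placeholder
OUT_DIR = "tildes-backup"
BASE = "https://tildes.net"

http = urllib3.PoolManager(headers={"User-Agent": "personal-backup-script"})

def save_new_comments(url):
    """Fetch one page of the user's history; return the next-page URL,
    or None once we hit an already-saved comment (or the last page)."""
    resp = http.request("GET", url)
    soup = BeautifulSoup(resp.data, "html.parser")
    # Assumption: each comment is an <article> whose id attribute
    # contains the comment's unique id, e.g. id="comment-abc123".
    for article in soup.find_all("article", id=True):
        path = os.path.join(OUT_DIR, article["id"] + ".html")
        if os.path.exists(path):
            return None  # most recent saved comment reached; nothing to do
        with open(path, "w", encoding="utf-8") as f:
            f.write(article.decode_contents())  # inner HTML only
    # Assumption: pagination is a link whose text is exactly "Next".
    next_link = soup.find("a", string="Next")
    return BASE + next_link["href"] if next_link else None

os.makedirs(OUT_DIR, exist_ok=True)
url = f"{BASE}/user/{USER}"
while url:
    url = save_new_comments(url)
    if url:
        time.sleep(60)  # generous delay between pages, per the above
```

After the initial batch run, subsequent runs would stop at the first already-downloaded comment, so invoking it after every handful of comments stays cheap.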
BTW, I have something similar for HN that uses the Firebase API; if anybody's interested, I'll publish it somewhere for you to use. It's a couple pages of Python.
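To give an idea of how simple the HN side is, a stripped-down sketch along the same lines, using the public Firebase endpoints (`user/<id>.json` and `item/<id>.json` are the documented API; the username, file layout, and delay are just my own choices):

```python
# Rough sketch: back up an HN account via the public Firebase API.
import json
import time
import urllib.request

USER = "your_hn_username"  # placeholder
API = "https://hacker-news.firebaseio.com/v0"

def get_json(path):
    with urllib.request.urlopen(f"{API}/{path}.json") as resp:
        return json.load(resp)

profile = get_json(f"user/{USER}")
items = []
for item_id in profile.get("submitted", []):
    item = get_json(f"item/{item_id}")
    if item:  # deleted items come back as null
        items.append(item)
    time.sleep(0.5)  # be polite to the API

with open(f"{USER}-hn-backup.json", "w", encoding="utf-8") as f:
    json.dump(items, f, indent=2)
```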
(I have a scraper that I used to generate RSS feeds from my uni's horrible websites, and it was such a mess to deal with that pile of JS shit. I used PhantomJS and Perl's RSS library to make something work, and it helped me so much back then; thanks to it I could avoid needing a Facebook account to follow the FB group of my department.)
I'll add my voice to this and say that I'm also interested in having a way to back up my profile. I'd love for this kind of thing to be built into the site: a kind of "Tildes Takeout" option.
Also, this is somewhat off-topic, but I figure the crowd here would be a good one to ask: does such a backup tool exist for reddit? I have an old account over there I would like to delete, but I haven't done so yet because I want to save my posts and comments from it first. I've looked up a couple of different options, but they're either too techy for me to get off the ground, or too untrustworthy for me to plug in my account info.
I used Power Delete Suite.
You don't have to delete anything to use the backup feature; just untick the boxes for deleting/editing comments and submissions when you run it.
Why not ask the people on Reddit themselves? I know from personal experience that the folks at /r/Help are... well... helpful.