Is it OK to scrape Tildes?
I wanted to keep the title (and the question, for that matter) generic, but my use case is that I want to make a backup of my posts on Tildes, and I'd fancy automating that with a script that curls my user page and downloads fresh stuff from there periodically. So for my personal case, the question is whether this is allowed / welcome practice.
The generic question is: is it welcome to scrape Tildes' public pages in general?
Sure, that's definitely fine. There are already quite a few bots (mostly from search engines and similar services) scraping the site regularly. The site is public now, so there's not really any way to prevent scraping (or anything wrong with it).
It should probably be noted that there are a few exceptions in terms of the application of said scraping, though:
https://docs.tildes.net/code-of-conduct
That doesn't say not to archive other people's posts, just not to repost them if they've been deleted.
Exactly, which is why I specifically said "there are a few exceptions in terms of the application of said scraping"... scraping is not against the rules, but using scraped data to do either of those things is.
Thanks!
People have already done something like this, if you're interested. Deimos seems largely OK with it, but sending him a message couldn't hurt.
Thanks!
Well, off the top of my head, I'd fetch my page, see if there's any new stuff, and just write the inner HTML of that stuff to a file whose name is the comment id. If the most recent comment is already downloaded, there's nothing to do, so the script terminates immediately; no need for hashing or anything. The initial scrape would be a big batch, where I'd have to follow the "Next" links until I hit the bottom, with delays of a minute or more between fetches so as not to overwhelm the server. I'd only invoke this script manually, after I make 5-10 comments. IDK if that'd count as a heavy load; I do have a considerable backlog of comments since last July, but it shouldn't be much different from me browsing my own comments.
Because Tildes is a Good Website, nothing fancy would be needed; all it'd take is cURL + something that can parse HTML, so it'd probably end up being a small Bash script or a Python 3 script with urllib3. I'll definitely share it here if I ever do it.
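For the curious, here's a rough sketch of what I have in mind, in Python 3 with urllib3 (plus BeautifulSoup for the parsing part, as one possible choice). The Tildes markup details here are assumptions I'd still have to verify against the actual pages: I'm guessing that comments show up as `<article>` elements with id attributes like `comment-...` and that pagination is a link labelled "Next".

```python
# Minimal sketch of the incremental backup idea described above.
# The Tildes selectors (article ids, "Next" link text) are guesses,
# not verified against the real site.
import os
import time

import urllib3
from bs4 import BeautifulSoup  # one possible HTML parser; lxml etc. would also work

USER = "your_username"  # placeholder
OUT_DIR = "tildes-backup"
BASE = "https://tildes.net"

http = urllib3.PoolManager(headers={"User-Agent": "personal-backup-script"})

def save_new_comments(url):
    """Fetch one page of the user's history; return the next-page URL,
    or None once we hit an already-saved comment (or the last page)."""
    resp = http.request("GET", url)
    soup = BeautifulSoup(resp.data, "html.parser")
    # Assumption: each comment is an <article> whose id attribute
    # contains the comment's unique id, e.g. id="comment-abc123".
    for article in soup.find_all("article", id=True):
        path = os.path.join(OUT_DIR, article["id"] + ".html")
        if os.path.exists(path):
            return None  # most recent saved comment reached; nothing to do
        with open(path, "w", encoding="utf-8") as f:
            f.write(article.decode_contents())  # inner HTML only
    # Assumption: pagination is a link whose text is exactly "Next".
    next_link = soup.find("a", string="Next")
    return BASE + next_link["href"] if next_link else None

os.makedirs(OUT_DIR, exist_ok=True)
url = f"{BASE}/user/{USER}"
while url:
    url = save_new_comments(url)
    if url:
        time.sleep(60)  # generous delay between pages, per the above
```

After the initial batch run, subsequent runs would stop at the first already-downloaded comment, so invoking it after every handful of comments stays cheap.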
BTW, I have something similar for HN that uses the Firebase API; if anybody's interested, I'll publish it somewhere for you to use. It's a couple pages of Python.
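To give an idea of how simple the HN side is, a stripped-down sketch along the same lines, using the public Firebase endpoints (`user/<id>.json` and `item/<id>.json` are the documented API; the username, file layout, and delay are just my own choices):

```python
# Rough sketch: back up an HN account via the public Firebase API.
import json
import time
import urllib.request

USER = "your_hn_username"  # placeholder
API = "https://hacker-news.firebaseio.com/v0"

def get_json(path):
    with urllib.request.urlopen(f"{API}/{path}.json") as resp:
        return json.load(resp)

profile = get_json(f"user/{USER}")
items = []
for item_id in profile.get("submitted", []):
    item = get_json(f"item/{item_id}")
    if item:  # deleted items come back as null
        items.append(item)
    time.sleep(0.5)  # be polite to the API

with open(f"{USER}-hn-backup.json", "w", encoding="utf-8") as f:
    json.dump(items, f, indent=2)
```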
(I have a scraper that I used to generate RSS feeds from my uni's horrible websites, and it was such a mess to deal with that pile of JS shit. I used PhantomJS and Perl's RSS library to make something work, and it helped me so much back then; thanks to it I could avoid needing a Facebook account to follow the FB group of my department.)
I'll add my voice to this and say that I'm also interested in having a way to back up my profile. I'd love for this kind of thing to be built into the site: a kind of "Tildes Takeout" option.
Also, this is somewhat off-topic, but I figure the crowd here would be a good one to ask: does such a backup tool exist for reddit? I have an old account over there I would like to delete, but I haven't done so yet because I want to save my posts and comments from it first. I've looked up a couple of different options, but they're either too techy for me to get off the ground, or too untrustworthy for me to plug in my account info.
I used Power Delete Suite.
You don't have to delete anything to use the backup feature; just untick the boxes for deleting/editing comments and submissions when you run it.
Why not ask the people on Reddit themselves? I know from personal experience that the folks at /r/Help are... well... helpful.