The great LLM scrape
24 votes -
Pay up or stop scraping: Cloudflare program charges bots for each crawl
46 votes -
As consumers switch from Google Search to ChatGPT, a new kind of bot is scraping data for AI
28 votes -
Anubis works
35 votes -
Please stop externalizing your costs directly into my face
121 votes -
Block AI scrapers with Anubis
27 votes -
FOSS infrastructure is under attack by AI companies
39 votes -
LLM crawlers continue to DDoS SourceHut
11 votes -
Nepenthes: a tarpit intended to catch AI web crawlers
33 votes -
wordfreq will no longer be updated partly due to AI polluting the data
74 votes -
Websites are blocking the wrong AI scrapers (because AI companies keep making new ones)
18 votes -
Looking for help scraping and deleting a Reddit account
I have a couple of old Reddit accounts I’d like to delete as fully as possible. However, one of them dates back to my teenage years, and it’s some of the only writing I have from that time. Any recommendations on good, simple ways to scrape all the comments off of it and save them? And then, what’s the best way to completely erase a Reddit footprint these days?
Looking for as simple a solution as possible; I’m not tech illiterate by any means, but this also isn’t a strong suit of mine.
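One low-effort approach, sketched here under the assumption that Reddit’s public `.json` listing endpoints still behave as they historically have (the username and output path below are placeholders):

```python
import json
import time
import urllib.request

def parse_listing(payload):
    """Extract (id, body, created_utc) tuples and the next-page cursor
    from one already-decoded Reddit listing payload."""
    children = payload["data"]["children"]
    comments = [
        (c["data"]["id"], c["data"]["body"], c["data"]["created_utc"])
        for c in children
    ]
    return comments, payload["data"]["after"]

def dump_comments(username, out_path):
    """Page through a user's comment listing and save everything as JSON."""
    after, all_comments = None, []
    while True:
        url = f"https://www.reddit.com/user/{username}/comments.json?limit=100"
        if after:
            url += f"&after={after}"
        req = urllib.request.Request(
            url, headers={"User-Agent": "personal-backup-script"}
        )
        with urllib.request.urlopen(req) as resp:
            payload = json.load(resp)
        comments, after = parse_listing(payload)
        all_comments.extend(comments)
        if not after:  # no cursor means we've reached the end of the listing
            break
        time.sleep(2)  # be polite; unauthenticated requests are rate-limited
    with open(out_path, "w") as f:
        json.dump(all_comments, f, indent=2)
```

One caveat worth knowing: Reddit listings have historically only gone back around 1,000 items, so for very old comments a full account data export from Reddit itself may be the only complete option.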
18 votes -
Chrome/Firefox Plugin to locally scrape data from multiple URLs
As the title suggests, I am looking for a free chrome or firefox plugin that can locally scrape data from multiple URLs. To be a bit more precise, what I mean by it:
- A free chrome or firefox plugin
- Local scraping: it runs in the browser itself. No cloud computing or "credits" required to run
- Scrape data: Collects predefined data from certain data fields within a website such as https://www.dastelefonbuch.de/Suche/Test
- Infinite scroll: to load data that only loads once the browser scrolls down (kind of like in the page I linked above)
I am not looking to program my own scraper using Python or anything similar. I have found plugins that "kind of" do what I am describing above, and about two weeks ago I found one that pretty much perfectly does what is described ("DataGrab"), but it starts asking to buy credits after running it a few times.
My own list:
- DataGrab: Excellent, apart from asking to buy credits after a while
- SimpleScraper: Excellent, but asks to buy credits pretty much immediately
- Easy Scraper: Works well for single pages, but there's no way to feed in multiple URLs to crawl
- Instant Data Scraper: Works well for single pages and infinite scroll pages, but there's no way to feed in multiple URLs to crawl
- "Data Scraper - Easy Web Scraping" / dataminer.io: Doesn't work well
- Scrapy.org: Too much programming, but looks quite neat and well documented
Any suggestions are highly welcome!
Edit: A locally run executable or cmd-line based program would be fine too, as long as it just needs to be configured (e.g., creating a list of URLs stored in a .txt or .csv file) instead of coded (e.g., coding an infinite scroll function from scratch).
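For anyone following along who'd accept a ready-made script to configure rather than write, here is a minimal stdlib-only Python sketch along the lines of that edit: it reads URLs from a plain `urls.txt`, pulls the text of every element with a given CSS class, and writes a CSV. File names and the class name are placeholders, the parser assumes reasonably well-formed HTML without nested matches, and it does not handle infinite scroll (content that only appears on scrolling needs a browser-driving tool instead):

```python
import csv
import urllib.request
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collect the text content of every element carrying a given class."""
    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self._active_tag = None
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self._active_tag is None and self.css_class in classes:
            self._active_tag = tag
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == self._active_tag:
            self._active_tag = None

    def handle_data(self, data):
        if self._active_tag is not None:
            self.values[-1] += data

def scrape(url_file, css_class, out_csv):
    """Fetch every URL listed in url_file (one per line) and write the
    extracted field values to out_csv."""
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "value"])
        for url in urls:
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req) as resp:
                html = resp.read().decode("utf-8", "replace")
            parser = FieldExtractor(css_class)
            parser.feed(html)
            for value in parser.values:
                writer.writerow([url, value.strip()])
```

The configuration lives entirely in `urls.txt` and the class name passed to `scrape()`, so nothing needs coding beyond editing those.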
8 votes -
Web scraping for me, but not for thee
19 votes -
Report: Potential New York Times lawsuit could force OpenAI to wipe ChatGPT and start over
75 votes -
‘Not for machines to harvest’: Data revolts break out against AI
40 votes -
The shady world of Brave selling copyrighted data for AI training
59 votes -
Google updates its privacy policy to clarify it can use public data for training AI models
44 votes -
Mastodon's dubious crawler exemption
4 votes -
Common Crawl: an open repository of web crawl data
9 votes -
Python web scraping with virtual private networks
3 votes -
I made a (very, very) basic Tildes scraper and CLI browser ruby gem
Here's the ruby gem page and here's the github. Right now it comes with a command line browser that can browse the front page and group pages with no sorting options, and you can view the contents of a topic (link or text) as well as the comments. The methods defined in lib/tilde-scraper/api.rb can be used to scrape tildes pages into Group, Page, Topic, and Comment objects.
Right now it's super basic and messy, but I figured if anyone was interested in it it would be the people here.
9 votes -
Web scraping doesn’t violate anti-hacking law, appeals court rules
12 votes -
Google open-sources their robots.txt parser and releases an RFC for formalizing the Robots Exclusion Protocol specification
10 votes -
Is it OK to scrape Tildes?
I wanted to keep the title (and the question, for that matter) generic, but my use case is that I want to make a backup of my posts on Tildes, and I'd fancy automating that with a script that curls my user page and periodically downloads fresh content from it. So for my personal case, the question is: is this an allowed / welcome practice?
The generic question: is it welcome to scrape Tildes' public pages in general?
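For the personal-backup case described above, the whole script can stay very small. A minimal sketch, assuming the user page lives at `https://tildes.net/user/<username>` (an assumption worth checking) and using an identifying User-Agent, which is generally considered good manners for this kind of scraping; the periodic part is best left to cron or a systemd timer:

```python
import datetime
import pathlib
import urllib.request

def snapshot_name(username, when):
    """Filename for one dated snapshot of a user page."""
    return f"{username}-{when:%Y-%m-%d}.html"

def backup_user_page(username, out_dir="tildes-backup"):
    """Save today's copy of the user page to disk."""
    url = f"https://tildes.net/user/{username}"  # assumed URL scheme
    req = urllib.request.Request(
        url, headers={"User-Agent": "personal-backup (your contact here)"}
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / snapshot_name(username, datetime.date.today())).write_bytes(html)
```

Dating each snapshot keeps old copies around, so nothing is lost if a later fetch returns an error page.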
19 votes -
Extract clean(er), readable text from web pages via the Mercury Web Parser
8 votes -
Uber, statistics, and a chrome extension
5 votes -
Ryanair, Berlin, and Hamiltonian cycles - finding a travel route using graph theory
8 votes