The great LLM scrape
24 votes -
Pay up or stop scraping: Cloudflare program charges bots for each crawl
46 votes -
As consumers switch from Google Search to ChatGPT, a new kind of bot is scraping data for AI
28 votes -
Anubis works
35 votes -
Please stop externalizing your costs directly into my face
121 votes -
Block AI scrapers with Anubis
27 votes -
FOSS infrastructure is under attack by AI companies
39 votes -
LLM crawlers continue to DDoS SourceHut
11 votes -
Nepenthes: a tarpit intended to catch AI web crawlers
33 votes -
wordfreq will no longer be updated partly due to AI polluting the data
74 votes -
Websites are blocking the wrong AI scrapers (because AI companies keep making new ones)
18 votes -
Looking for help scraping and deleting a Reddit account
I have a couple of old Reddit accounts I’d like to delete as fully as possible. However, one of them dates back to my teenage years, and it’s some of the only writing I have from that time. Any recommendations on good, simple ways to scrape all the comments off of it and save them? And then, what’s the best way to completely erase a Reddit footprint these days?
Looking for as simple a solution as possible; I’m not tech illiterate by any means, but this also isn’t a strong suit of mine.
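One low-effort approach, sketched here under the assumption that Reddit’s public `.json` listing endpoints still behave as they historically have (the username and output path below are placeholders):

```python
import json
import time
import urllib.request

def parse_listing(payload):
    """Extract (id, body, created_utc) tuples and the next-page cursor
    from one already-decoded Reddit listing payload."""
    children = payload["data"]["children"]
    comments = [
        (c["data"]["id"], c["data"]["body"], c["data"]["created_utc"])
        for c in children
    ]
    return comments, payload["data"]["after"]

def dump_comments(username, out_path):
    """Page through a user's comment listing and save everything as JSON."""
    after, all_comments = None, []
    while True:
        url = f"https://www.reddit.com/user/{username}/comments.json?limit=100"
        if after:
            url += f"&after={after}"
        req = urllib.request.Request(
            url, headers={"User-Agent": "personal-backup-script"}
        )
        with urllib.request.urlopen(req) as resp:
            payload = json.load(resp)
        comments, after = parse_listing(payload)
        all_comments.extend(comments)
        if not after:  # no cursor means we've reached the end of the listing
            break
        time.sleep(2)  # be polite; unauthenticated requests are rate-limited
    with open(out_path, "w") as f:
        json.dump(all_comments, f, indent=2)
```

One caveat worth knowing: Reddit listings have historically only gone back around 1,000 items, so for very old comments a full account data export from Reddit itself may be the only complete option.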
18 votes -
Chrome/Firefox Plugin to locally scrape data from multiple URLs
As the title suggests, I am looking for a free chrome or firefox plugin that can locally scrape data from multiple URLs. To be a bit more precise, what I mean by it:
- A free chrome or firefox plugin
- Local scraping: it runs in the browser itself. No cloud computing or "credits" required to run
- Scrape data: Collects predefined data from certain data fields within a website such as https://www.dastelefonbuch.de/Suche/Test
- Infinite scroll: to load data that only loads once the browser scrolls down (kind of like in the page I linked above)
I am not looking to program my own scraper using Python or anything similar. I have found plugins that "kind of" do what I am describing above, and about two weeks ago I found one that pretty much perfectly does what is described ("DataGrab"), but it starts asking to buy credits after running it a few times.
My own list:
- DataGrab: Excellent, apart from asking to buy credits after a while
- SimpleScraper: Excellent, but asks to buy credits pretty much immediately
- Easy Scraper: Works well for single pages, but there's no way to feed in multiple URLs to crawl
- Instant Data Scraper: Works well for single pages and infinite scroll pages, but there's no way to feed in multiple URLs to crawl
- "Data Scraper - Easy Web Scraping" / dataminer.io: Doesn't work well
- Scrapy.org: Too much programming, but looks quite neat and well documented
Any suggestions are highly welcome!
Edit: A locally run executable or cmd-line based program would be fine too, as long as it just needs to be configured (e.g., creating a list of URLs stored in a .txt or .csv file) instead of coded (e.g., coding an infinite scroll function from scratch).
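For anyone following along who'd accept a ready-made script to configure rather than write, here is a minimal stdlib-only Python sketch along the lines of that edit: it reads URLs from a plain `urls.txt`, pulls the text of every element with a given CSS class, and writes a CSV. File names and the class name are placeholders, the parser assumes reasonably well-formed HTML without nested matches, and it does not handle infinite scroll (content that only appears on scrolling needs a browser-driving tool instead):

```python
import csv
import urllib.request
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collect the text content of every element carrying a given class."""
    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self._active_tag = None
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self._active_tag is None and self.css_class in classes:
            self._active_tag = tag
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == self._active_tag:
            self._active_tag = None

    def handle_data(self, data):
        if self._active_tag is not None:
            self.values[-1] += data

def scrape(url_file, css_class, out_csv):
    """Fetch every URL listed in url_file (one per line) and write the
    extracted field values to out_csv."""
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "value"])
        for url in urls:
            req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
            with urllib.request.urlopen(req) as resp:
                html = resp.read().decode("utf-8", "replace")
            parser = FieldExtractor(css_class)
            parser.feed(html)
            for value in parser.values:
                writer.writerow([url, value.strip()])
```

The configuration lives entirely in `urls.txt` and the class name passed to `scrape()`, so nothing needs coding beyond editing those.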
8 votes -
Web scraping for me, but not for thee
19 votes -
Report: Potential New York Times lawsuit could force OpenAI to wipe ChatGPT and start over
75 votes -
‘Not for machines to harvest’: Data revolts break out against AI
40 votes -
The shady world of Brave selling copyrighted data for AI training
59 votes -
Google updates its privacy policy to clarify it can use public data for training AI models
44 votes -
Mastodon's dubious crawler exemption
4 votes -
Common Crawl: an open repository of web crawl data
9 votes -
Python web scraping with virtual private networks
3 votes -
I made a (very, very) basic Tildes scraper and CLI browser ruby gem
Here's the ruby gem page and here's the github. Right now it comes with a command line browser that can browse the front page and group pages with no sorting options, and you can view the contents of a topic (link or text) as well as the comments. The methods defined in lib/tilde-scraper/api.rb can be used to scrape tildes pages into Group, Page, Topic, and Comment objects.
Right now it's super basic and messy, but I figured if anyone was interested in it it would be the people here.
9 votes -
Web scraping doesn’t violate anti-hacking law, appeals court rules
12 votes -
Google open-sources their robots.txt parser and releases an RFC for formalizing the Robots Exclusion Protocol specification
10 votes -
Is it OK to scrape Tildes?
I wanted to keep the title (and the question, for that matter) generic, but my use case is that I want to make a backup of my posts on Tildes, and I'd fancy automating that with a script that curls my user page and periodically downloads fresh content from it. So for my personal case, the question is: is this an allowed / welcome practice?
The generic question: is it welcome to scrape Tildes' public pages in general?
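For the personal-backup case described above, the whole script can stay very small. A minimal sketch, assuming the user page lives at `https://tildes.net/user/<username>` (an assumption worth checking) and using an identifying User-Agent, which is generally considered good manners for this kind of scraping; the periodic part is best left to cron or a systemd timer:

```python
import datetime
import pathlib
import urllib.request

def snapshot_name(username, when):
    """Filename for one dated snapshot of a user page."""
    return f"{username}-{when:%Y-%m-%d}.html"

def backup_user_page(username, out_dir="tildes-backup"):
    """Save today's copy of the user page to disk."""
    url = f"https://tildes.net/user/{username}"  # assumed URL scheme
    req = urllib.request.Request(
        url, headers={"User-Agent": "personal-backup (your contact here)"}
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / snapshot_name(username, datetime.date.today())).write_bytes(html)
```

Dating each snapshot keeps old copies around, so nothing is lost if a later fetch returns an error page.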
19 votes -
Extract clean(er), readable text from web pages via the Mercury Web Parser
8 votes -
Uber, statistics, and a chrome extension
5 votes -
Ryanair, Berlin, and Hamiltonian cycles - finding a travel route using graph theory
8 votes