8 votes

Chrome/Firefox Plugin to locally scrape data from multiple URLs

As the title suggests, I am looking for a free Chrome or Firefox plugin that can locally scrape data from multiple URLs. To be a bit more precise, here is what I mean by that:

  • A free Chrome or Firefox plugin
  • Local scraping: it runs in the browser itself, with no cloud computing or "credits" required to run it
  • Scrape data: it collects predefined data from certain data fields within a website such as https://www.dastelefonbuch.de/Suche/Test
  • Infinite scroll: it loads data that only appears once the browser scrolls down (as on the page linked above)

I am not looking to program my own scraper using Python or anything similar. I have found plugins that "kind of" do what I am describing above, and about two weeks ago I found one that pretty much perfectly does what is described ("DataGrab"), but it starts asking to buy credits after running it a few times.

My own list:

  • DataGrab: Excellent, apart from asking to buy credits after a while
  • SimpleScraper: Excellent, but asks to buy credits pretty much immediately
  • Easy Scraper: Works well for single pages, but there is no way to feed in multiple URLs to crawl
  • Instant Data Scraper: Works well for single pages and infinite-scroll pages, but again no way to feed in multiple URLs to crawl
  • "Data Scraper - Easy Web Scraping" / dataminer.io: Doesn't work well
  • Scrapy.org: Too much programming, but looks quite neat and well documented

Any suggestions are highly welcome!

Edit: A locally run executable or cmd-line based program would be fine too, as long as it just needs to be configured (e.g., creating a list of URLs stored in a .txt or .csv file) instead of coded (e.g., coding an infinite scroll function from scratch).

7 comments

  1. [2]
    creesch

    edit: What are you using this for? I might be reading too much into this, but your only other interaction on Tildes so far has been one highly specific other question, which you seem to have scrubbed clean of your comments and content yesterday. Combined with the subject, and the fact that you seem to be scraping phone numbers (based on your example), it does make me question the motives involved.


    What sort of scraping are you looking for? I assume you are looking at something that only fetches specific information from specific bits of a page, not just the entire website?

    Looking at scrapy.org, it seems to cover the minimum of what is needed there: defining which CSS selectors to use and what data to fetch. I haven't looked at the other offerings, but I am guessing they offer a nice GUI for that aspect, since you didn't mention it. That, if I had to guess, is actually one of your more important requirements.

    It also severely limits your options: tooling offering this sort of GUI is likely service based. Not only does it take quite a bit of extra work to implement such a GUI, the developers also need to account for a variety of other things, like banners being in the way. And, to be frank, most people go for the slightly more technical approach, as there is no lack of options there. Scrapy is highly specific, but any test framework can also be used for scraping. You often see people using Playwright for this, or the Selenium WebDriver bindings for their favorite language.
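    For illustration, a minimal sketch of that approach with Playwright's Python bindings: scroll an infinite-scroll page until it stops growing, then pull out one field per entry. The CSS selector at the end is made up; the real one depends on the site's markup.

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://www.dastelefonbuch.de/Suche/Test")

        # keep scrolling until the page height stops growing
        previous_height = 0
        while True:
            page.mouse.wheel(0, 10_000)
            page.wait_for_timeout(1_000)  # give new entries time to load
            height = page.evaluate("document.body.scrollHeight")
            if height == previous_height:
                break
            previous_height = height

        # ".entry .name" is a placeholder selector for illustration
        names = page.locator(".entry .name").all_inner_texts()
        print(names)
        browser.close()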

    All of this is basically a long-winded way of me saying that you are likely stuck with service/cloud-based offerings, unless you put in a bit of extra time and effort to learn the very basics of CSS and HTML, plus a tiny bit of Python (for Scrapy).

    tl;dr: You can't have your cake and eat it too. Your options are basically paying for a low-friction, non-technical solution or putting in the time and effort and doing it for free.

    13 votes
    1. douchebag

      Cheers for this reply. I think I'll put in the effort to try and figure things out using Scrapy and/or another Python-based solution. I was rather just hoping someone might have come across a tool that already suits the need.

      As for your other question(s): I usually delete my comments and/or posts several times per week and only leave them up if they relate to some sort of technical solution or advice that may help someone down the road.

      As for the scraping: I am putting together an overview of the number of businesses in a certain sector and found that the online phone book would give a good indicator of this. However, this only works by searching for specific keywords and later cleansing the data of duplicates, false entries, irrelevant entries, etc. There are also other sources out there that I'll use for my statistics, but they are partially quite imprecise in other ways (industry reports, association reports, etc.).
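      If I do end up going the Python route, I imagine the cleansing step could boil down to a few lines of pandas. A sketch, with made-up file and column names:

      import pandas as pd

      # made-up file and column names, just to illustrate the dedup step
      df = pd.read_csv("phonebook_results.csv")
      df = df.drop_duplicates(subset=["name", "phone"])
      df = df.dropna(subset=["name"])  # drop rows without a business name
      df.to_csv("phonebook_results_clean.csv", index=False)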

      Hope that answers the questions :)

      4 votes
  2. [3]
    xk3


    The website you linked to loads additional pages as HTML. I wrote something with seleniumwire that can do this if the data loads as JSON. It is a bit experimental, and the results might be spread across a lot of different tables, but it works.

    # install the xklb toolkit
    pip install xklb
    # crawl $URL with auto-scrolling and store what it finds in output.db
    lb siteadd output.db --scroll $URL
    

    For HTML you might be able to use xidel to extract the content that you want.

    For kicks, I tried adding the same functionality for HTML, but I get a lot of this:

    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 19, column 24

    which I'm not very surprised by. HTML is difficult for programs to parse. It might work for some websites, but certainly not for the German phonebook site.
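    To illustrate why strict XML parsing chokes where a browser-style HTML parser copes, here is a small standalone sketch (not part of my tool):

    import xml.dom.minidom
    from html.parser import HTMLParser

    # typical "valid enough" HTML: unclosed tags and an HTML-only entity
    snippet = "<p>unclosed paragraph<br><p>&nbsp;"

    try:
        xml.dom.minidom.parseString(snippet)
    except Exception as e:
        print("XML parser:", e)  # an ExpatError much like the one above

    class TextCollector(HTMLParser):
        def handle_data(self, data):
            print("HTML parser:", data)

    TextCollector().feed(snippet)  # parses without complaint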

    2 votes
    1. [2]
      douchebag

      This is helpful input; thank you for taking the time to test it out and share your results.

      I was guessing the extra loading of pages might be a core issue, but I wasn't aware it was caused by the HTML format.

      1 vote
      1. xk3

        Yeah, HTML is not very strict about how information is stored, because it is mostly a visual document format. Parsing unstructured data intelligently is difficult.

        Loading infinite pages is comparatively easy to implement and works pretty much universally across all websites. I've never encountered a website where my auto-scrolling code didn't work.

        I spent some time over the past couple of days looking into this more, and I was able to get data out. But because HTML is so unstructured, the data is not very interesting:

        lb siteadd output.db --scroll --extract-html $URL -vvv
        

        -vvv will show the browser window so you can see what is going on. You can remove it to hide the automated browser.

        In output.db (these are just the first five rows of two tables from the first page, one row per bullet):

        body_div_main_div_div_div_div_div (46 rows)

        Columns: title, a_href, span_text, a_title, a_text, text, span, i_text

        • Deutschkurse bei der Universität München e.V. https://adresse.dastelefonbuch.de/M%C3%BCnchen/1-Universit%C3%A4ten-Deutschkurse-bei-der-Universit%C3%A4t-M%C3%BCnchen-e-V-M%C3%BCnchen-Agnesstr.html Deutschkurse bei der Universität München e.V.
        • # Agnesstr. 27, 80798 München, Schwabing-West ,, , Schwabing-West
        • https://www.dastelefonbuch.de/service/free-call/Deutschkurse%20bei%20der%20Universit%C3%A4t%20M%C3%BCnchen%20e.V./089%202%2044%2010%2049-0/1126/0099?freeCallCmd=3038393234343130343930404034383564396166633863666235653635386635623766363865363334616339613333346630643639&service=pre ... Gratis anrufen Gratis anrufen 49-0 {"text": "089 2 44 10", "span": {"text": "..."}}
        • http://www.dkfa.de www.dkfa.de Homepage
        • Branche: Universitäten, Sprachschulen

        body_div_main_div_div_div_div_div_a_span (30 rows)

        Column: text

        • Agnesstr. 27
        • 80798
        • München
        • Ettensberger Str. 1
        • 87544

        So my tool is probably not useful for the German phonebook website, but it might work okay on another site; it depends on how they structure their HTML.

        But if you know SQL, maybe my tool will make HTML parsing more accessible to you, and you might be able to make use of this data after some additional transformations...
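        If you do go the SQL route, a first query could look like this (a sketch; it assumes output.db is a plain SQLite file and that the table and column names match the dump above):

        import sqlite3

        con = sqlite3.connect("output.db")
        # pull the non-empty business names and their detail-page links
        rows = con.execute(
            """
            SELECT title, a_href
            FROM body_div_main_div_div_div_div_div
            WHERE title IS NOT NULL
            """
        ).fetchall()
        for title, href in rows:
            print(title, href)
        con.close()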

        2 votes
  3. [2]
    owyongsk

    Sorry, I'm out and on my phone, but I didn't see any mention of webscraper.io. It's a Chrome or Firefox extension; I've used it for some automation and local scraping, and it's free for local use.

    EDIT: To do multiple URLs, I create a spreadsheet in Google Sheets containing the URLs, publish it, and use it as a source to feed the scraper.

    1 vote
    1. douchebag

      webscraper.io
      To do multiple URLs I create a spreadsheet on Google Sheets with multiple URLs and publish it and use it as a source to feed the scraper.

      That's a smart suggestion! I'll look into this. After a quick test it seems quite promising!

      1 vote