8 votes

Chrome/Firefox Plugin to locally scrape data from multiple URLs

As the title suggests, I am looking for a free Chrome or Firefox plugin that can locally scrape data from multiple URLs. To be a bit more precise, here is what I mean by that:

  • A free Chrome or Firefox plugin
  • Local scraping: it runs in the browser itself, with no cloud computing or "credits" required to run it
  • Scrape data: it collects predefined data from certain data fields within a website such as https://www.dastelefonbuch.de/Suche/Test
  • Infinite scroll: it loads data that only appears once the browser scrolls down (as on the page linked above)

I am not looking to program my own scraper using Python or anything similar. I have found plugins that "kind of" do what I am describing above, and about two weeks ago I found one that pretty much perfectly does what is described ("DataGrab"), but it starts asking to buy credits after running it a few times.

My own list:

  • DataGrab: Excellent, apart from asking to buy credits after a while
  • SimpleScraper: Excellent, but asks to buy credits pretty much immediately
  • Easy Scraper: Works well for single pages, but there is no way to feed in multiple URLs to crawl
  • Instant Data Scraper: Works well for single pages and infinite-scroll pages, but again no way to feed in multiple URLs to crawl
  • "Data Scraper - Easy Web Scraping" / dataminer.io: Doesn't work well
  • Scrapy.org: Too much programming, but looks quite neat and well documented

Any suggestions are highly welcome!

Edit: A locally run executable or cmd-line based program would be fine too, as long as it just needs to be configured (e.g., creating a list of URLs stored in a .txt or .csv file) instead of coded (e.g., coding an infinite scroll function from scratch).

7 comments

  1. [2]
    creesch

    edit: What are you using this for? I might be reading too much into this, but your only other interaction on Tildes so far has been one highly specific other question, which you seem to have scrubbed clean of your comments and content yesterday. Combined with the subject, and the fact that you seem to be scraping phone numbers (based on your example), it does make me question the motives involved.


    What sort of scraping are you looking for? I assume you are looking at something that only fetches specific information from specific bits of a page, not just the entire website?

    Looking at scrapy.org, it seems to cover the minimum of what is needed there: defining which CSS selectors to use and what data to fetch. I haven't looked at the other offerings, but I am guessing they offer a nice GUI for that aspect, since you didn't mention it. That, if I had to guess, is actually one of your more important requirements.

    It also severely limits your options: tooling offering this sort of GUI is likely service based. Not only does it take quite a bit of extra work to implement such a GUI, the developers also need to account for a variety of other things, like banners being in the way. And, to be frank, most people go for the slightly more technical approach, as there is no lack of options there. Scrapy is highly specific, but any test framework can also be used for scraping. You often see people using Playwright for this, or the Selenium WebDriver bindings for their favorite language.
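    For illustration, a minimal sketch of that approach with Playwright's Python bindings: scroll an infinite-scroll page until it stops growing, then pull out one field per entry. The CSS selector at the end is made up; the real one depends on the site's markup.

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://www.dastelefonbuch.de/Suche/Test")

        # keep scrolling until the page height stops growing
        previous_height = 0
        while True:
            page.mouse.wheel(0, 10_000)
            page.wait_for_timeout(1_000)  # give new entries time to load
            height = page.evaluate("document.body.scrollHeight")
            if height == previous_height:
                break
            previous_height = height

        # ".entry .name" is a placeholder selector for illustration
        names = page.locator(".entry .name").all_inner_texts()
        print(names)
        browser.close()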

    All of this is basically a long-winded way of me saying that you are likely stuck with service/cloud-based offerings, unless you put in a bit of extra time and effort to learn the very basics of CSS and HTML, plus a tiny bit of Python (for Scrapy).

    tl;dr: You can't have your cake and eat it too. Your options are basically paying for a low-friction, non-technical solution or putting in the time and effort and doing it for free.

    13 votes
    1. douchebag

      Cheers for this reply. I think I'll put in the effort to try and figure things out using Scrapy and/or another Python-based solution. I was rather just hoping someone might have come across a tool that already suits the need.

      As for your other question(s): I usually delete my comments and/or posts several times per week and only leave them up if they relate to some sort of technical solution or advice that may help someone down the road.

      As for the scraping: I am putting together an overview of the number of businesses in a certain sector and found that the online phone book would give a good indicator of this. However, this only works by searching for specific keywords and later cleansing the data of duplicates, false entries, irrelevant entries, etc. There are also other sources out there that I'll use for my statistics, but they are partially quite imprecise in other ways (industry reports, association reports, etc.).
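      If I do end up going the Python route, I imagine the cleansing step could boil down to a few lines of pandas. A sketch, with made-up file and column names:

      import pandas as pd

      # made-up file and column names, just to illustrate the dedup step
      df = pd.read_csv("phonebook_results.csv")
      df = df.drop_duplicates(subset=["name", "phone"])
      df = df.dropna(subset=["name"])  # drop rows without a business name
      df.to_csv("phonebook_results_clean.csv", index=False)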

      Hope that answers the questions :)

      4 votes
  2. [3]
    xk3


    The website you linked to loads additional pages as HTML. I wrote something with seleniumwire that can do this if the data loads as JSON. It is a bit experimental, and the results might be spread across a lot of different tables, but it works.

    # install the xklb toolkit
    pip install xklb
    # crawl $URL with auto-scrolling and store what it finds in output.db
    lb siteadd output.db --scroll $URL
    

    For HTML you might be able to use xidel to extract the content that you want.

    For kicks, I tried adding the same functionality for HTML, but I get a lot of this:

    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 19, column 24

    which I'm not very surprised by. HTML is difficult for programs to parse. It might work for some websites, but certainly not for the German phonebook site.
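    To illustrate why strict XML parsing chokes where a browser-style HTML parser copes, here is a small standalone sketch (not part of my tool):

    import xml.dom.minidom
    from html.parser import HTMLParser

    # typical "valid enough" HTML: unclosed tags and an HTML-only entity
    snippet = "<p>unclosed paragraph<br><p>&nbsp;"

    try:
        xml.dom.minidom.parseString(snippet)
    except Exception as e:
        print("XML parser:", e)  # an ExpatError much like the one above

    class TextCollector(HTMLParser):
        def handle_data(self, data):
            print("HTML parser:", data)

    TextCollector().feed(snippet)  # parses without complaint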

    2 votes
    1. [2]
      douchebag

      This is helpful input; thank you for taking the time to test it out and share your results.

      I was guessing the extra loading of pages might be a core issue, but I wasn't aware it was caused by the HTML format.

      1 vote
      1. xk3

        Yeah, HTML is not very strict about how information is stored, because it is mostly a visual document format. Parsing unstructured data intelligently is difficult.

        Loading infinite pages is comparatively easy to implement and works pretty much universally across all websites. I've never encountered a website where my auto-scrolling code didn't work.

        I spent some time over the past couple of days looking into this more, and I was able to get data out. But because HTML is so unstructured, the data is not very interesting:

        lb siteadd output.db --scroll --extract-html $URL -vvv
        

        -vvv will show the browser window so you can see what is going on. You can remove it to hide the automated browser.

        In output.db (these are just the first five rows of two tables from the first page, one row per bullet):

        body_div_main_div_div_div_div_div (46 rows)

        Columns: title, a_href, span_text, a_title, a_text, text, span, i_text

        • Deutschkurse bei der Universität München e.V. https://adresse.dastelefonbuch.de/M%C3%BCnchen/1-Universit%C3%A4ten-Deutschkurse-bei-der-Universit%C3%A4t-M%C3%BCnchen-e-V-M%C3%BCnchen-Agnesstr.html Deutschkurse bei der Universität München e.V.
        • # Agnesstr. 27, 80798 München, Schwabing-West ,, , Schwabing-West
        • https://www.dastelefonbuch.de/service/free-call/Deutschkurse%20bei%20der%20Universit%C3%A4t%20M%C3%BCnchen%20e.V./089%202%2044%2010%2049-0/1126/0099?freeCallCmd=3038393234343130343930404034383564396166633863666235653635386635623766363865363334616339613333346630643639&service=pre ... Gratis anrufen Gratis anrufen 49-0 {"text": "089 2 44 10", "span": {"text": "..."}}
        • http://www.dkfa.de www.dkfa.de Homepage
        • Branche: Universitäten, Sprachschulen

        body_div_main_div_div_div_div_div_a_span (30 rows)

        Column: text

        • Agnesstr. 27
        • 80798
        • München
        • Ettensberger Str. 1
        • 87544

        So my tool is probably not useful for the German phonebook website, but it might work okay on another site; it depends on how they structure their HTML.

        But if you know SQL, maybe my tool will make HTML parsing more accessible to you, and you might be able to make use of this data after some additional transformations...
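        If you do go the SQL route, a first query could look like this (a sketch; it assumes output.db is a plain SQLite file and that the table and column names match the dump above):

        import sqlite3

        con = sqlite3.connect("output.db")
        # pull the non-empty business names and their detail-page links
        rows = con.execute(
            """
            SELECT title, a_href
            FROM body_div_main_div_div_div_div_div
            WHERE title IS NOT NULL
            """
        ).fetchall()
        for title, href in rows:
            print(title, href)
        con.close()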

        2 votes
  3. [2]
    owyongsk

    Sorry, I'm out and on my phone, but I didn't see any mention of webscraper.io. It's a Chrome or Firefox extension; I've used it for some automation and local scraping, and it's free for local use.

    EDIT: To do multiple URLs, I create a spreadsheet in Google Sheets containing the URLs, publish it, and use it as a source to feed the scraper.

    1 vote
    1. douchebag

      webscraper.io
      To do multiple URLs I create a spreadsheet on Google Sheets with multiple URLs and publish it and use it as a source to feed the scraper.

      That's a smart suggestion! I'll look into this. After a quick test it seems quite promising!

      1 vote