13 votes

Best solution to extract PDF data?

Hi folks--

To those more knowledgeable than I am:

What would be the best local solution to extract numerical data from a batch of PDF file reports? The values I want are interspersed among word-processor-formatted tables and irrelevant text. The text and table formatting are (nearly) identical across reports. The data I want vary across reports. The PDFs are not images; I can select and copy text without OCR. I have thousands to process, and the data themselves are confidential (I have clearance) and cannot be shared. I can use Windows or Linux but not macOS.

I am technically inclined, so I bashed my head against regular expressions just enough to use notepad++ to find and delete most of the irrelevant stuff and make a CSV, but it's a hacky, imprecise method and not nearly automated enough for batches. For reference, I don't code for a living or even as a hobby, but I use R and bash, am familiar with IDEs, and can follow pseudocode well enough to edit and use scripts.

Any thoughts? Thanks in advance!

23 comments

  1. [8]
    RNG
    Can you reasonably bring in software packages onto the network you are doing the work on? I recommend using Tesseract OCR with Pytesseract, but there are other PDF parsing libraries that have coin-flip odds for working with your particular PDF format.

    7 votes
    1. [6]
      sparksbet
      Is it necessary for OP to use something like Tesseract if they don't need OCR? I've only used Tesseract for the OCR side of things, and running OCR would be much slower and with probably less consistent results than extracting text more directly from the file.

      1 vote
      1. [5]
        RNG
        The problem is that there really isn't a reliable way to extract data from a PDF. PDFs are weird. If his system is connected to the internet, he can try a few different libraries that try to parse PDFs, but like I said before, these have basically coin-flip odds of success. If he has to request software be installed and approved by security or something, Tesseract will be about as sure a bet as any.
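For anyone wanting to try several parsers in turn, here is a minimal sketch of that idea. The extractor functions are hypothetical stand-ins (not real library calls), and `looks_valid` is an assumed sanity check you would write for your own reports.

```python
def first_usable_text(path, extractors, looks_valid):
    """Try each (name, extractor) pair in order.

    Returns (name, text) for the first extractor whose output passes
    looks_valid; raises RuntimeError if none does.
    """
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # a parser crashing is an expected outcome here
            errors.append(f"{name}: {exc!r}")
            continue
        if looks_valid(text):
            return name, text
        errors.append(f"{name}: output failed the sanity check")
    raise RuntimeError("no extractor produced usable text: " + "; ".join(errors))


# Hypothetical stand-ins for real PDF libraries, for illustration only.
def crashing_extractor(path):
    raise ValueError("unsupported PDF structure")


def garbled_extractor(path):
    return "\x00\x00\x00"


def working_extractor(path):
    return "Report\nTotal 12\n"


EXTRACTORS = [
    ("crashing", crashing_extractor),
    ("garbled", garbled_extractor),
    ("working", working_extractor),
]


def contains_total(text):
    # Assumed check: every valid report mentions the word "Total".
    return "Total" in text


if __name__ == "__main__":
    name, text = first_usable_text("report.pdf", EXTRACTORS, contains_total)
    print(name)
```

The sanity check matters as much as the fallback order: a parser that returns garbage without raising an error would otherwise be accepted.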

        5 votes
        1. [4]
          sparksbet
          I've worked on this problem before in a professional context, and while there are myriad ways PDFs can fail to be parsed by tools dedicated to extracting text from them, I think "a coin flip of success" is a big exaggeration. You'd be surprised how good the big libraries with a lot of work put into them can get at this.

          3 votes
          1. [3]
            Eji1700
            I have as well and so far have never found something robust enough to handle my use case with enough reliability to put into production, so I think it depends heavily on your data and file.

            2 votes
            1. [2]
              sparksbet
              Oh yeah it's absolutely highly variable depending on the specific tasks and files. My solution was in a context where we were processing a LOT of different files, so some percentage of failures was okay as long as it wasn't too high. PDFs were actually not the biggest offender on that front, iirc Word and Excel documents were the bigger offenders in sheer numbers of errors (if you exclude OCR timing out bc we didn't want to spend indefinite time and money on OCR-ing huge pdfs).
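A rough sketch of what "some percentage of failures is okay" can look like in a batch job: record each failure, keep going, and report the rate at the end. `fake_extract` is a hypothetical placeholder for whatever extractor is actually used, and the file names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    rows: dict = field(default_factory=dict)      # file name -> extracted data
    failures: dict = field(default_factory=dict)  # file name -> error description

    @property
    def failure_rate(self):
        total = len(self.rows) + len(self.failures)
        return len(self.failures) / total if total else 0.0


def run_batch(names, extract):
    """Run extract on every name, collecting failures instead of stopping."""
    result = BatchResult()
    for name in names:
        try:
            result.rows[name] = extract(name)
        except Exception as exc:  # one bad file should not stop the batch
            result.failures[name] = repr(exc)
    return result


def fake_extract(name):
    """Hypothetical stand-in for a real extractor; fails on names containing 'bad'."""
    if "bad" in name:
        raise ValueError("could not parse")
    return {"value": len(name)}


if __name__ == "__main__":
    result = run_batch(["a.pdf", "bad.pdf", "c.pdf", "d.pdf"], fake_extract)
    print(f"failure rate: {result.failure_rate:.0%}")
    for name, error in result.failures.items():
        print(f"  {name}: {error}")
```

Keeping the list of failed files means they can be rechecked by hand or retried with a different tool afterwards.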

              3 votes
              1. danke
                This is only tangentially related, but are there ways of reliably creating PDFs in a manner amenable to text extraction, or are issues just inherent to the format due to its glyph-based structure?

    2. thereticent
      Hmm, it sounds beyond my abilities, but I will ask one of our IT/sysadmin folks whether we can do that. Thanks!

  2. [2]
    streblo
    Have you tried pdfgrep?

    https://pdfgrep.org/

    If the tables are pretty consistent a suitable regex might work for you.
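To illustrate the regex idea once the text is out of the PDF (for example via a command-line converter), here is a minimal sketch. The sample text, field labels, and patterns are all invented for illustration; the real ones would have to be adapted to the actual report layout.

```python
import csv
import io
import re

# Hypothetical report text, standing in for what a text extractor would return.
SAMPLE_TEXT = """\
Quarterly Report
Site ID: 0042
Total Volume (mL)    153.7
Notes: nothing unusual.
Mean Score           12
"""

# Hypothetical field names mapped to patterns; one capture group per value.
FIELD_PATTERNS = {
    "site_id": re.compile(r"Site ID:\s*(\d+)"),
    "total_volume": re.compile(r"Total Volume \(mL\)\s+(-?\d+(?:\.\d+)?)"),
    "mean_score": re.compile(r"Mean Score\s+(-?\d+(?:\.\d+)?)"),
}


def extract_fields(text):
    """Return {field: matched string, or None if the pattern was not found}."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else None
    return row


def rows_to_csv(rows):
    """Serialize a list of field dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(FIELD_PATTERNS), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv([extract_fields(SAMPLE_TEXT)]))
```

Values are kept as strings so that leading zeros and the original number formatting survive; a missing field comes out as an empty CSV cell, which makes gaps easy to spot.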

    6 votes
    1. thereticent
      This is the kind of thing I was picturing! I'm going to check out a variety of options, and I'll update when I see some results. Appreciate it!

      2 votes
  3. [2]
    sparksbet
    Extracting text from PDFs can be a pain even without OCR. Apache Tika is pretty much the state of the art at text extraction and is what I used for my last project, but that project required extraction from more file types than just PDFs and from a variety of sources. If you know you'll get the same types of PDFs each time, you may be able to use something smaller and more dedicated rather than such a big component -- Tika includes OCR functionality through Tesseract but it can be turned off.

    4 votes
    1. thereticent
      Thank you! There seem like many options that are fairly complicated, so I'll put these on the list to look into more deeply.

      1 vote
  4. [2]
    bl4kers
    This Ruby gem should work: https://github.com/yob/pdf-reader

    2 votes
    1. thereticent
      Oh intriguing...Ruby and PHP were the two languages I checked out back in college, and this looked like it might make things really simple. Thanks!

      1 vote
  5. [2]
    Crespyl
    I don't use it much myself, but I have a couple family members who've used Microsoft's Power Query to extract data into Excel from a variety of sources, including PDFs.

    2 votes
    1. thereticent
      Ooh, I just saw people here singing the praises of PowerQuery on another thread. I'll go check it out. Thanks!

      1 vote
  6. [2]
    first-must-burn
    This may not be what you need, but MS PowerToys has a Text Extractor tool that is based on this tool.

    If the PDFs have the tables in the same region, you might be able to set up a capture region and page through the pdfs capturing the pages one at a time.

    Admittedly that's a pretty manual process, but if nothing else avails you, it will be better to have some automation around it.

    1 vote
    1. thereticent
      That's actually not too manual at all, as long as I'm only setting capture zones once for each page... I'll check it out. Thanks!

      1 vote
  7. [2]
    ignorabimus
    Some pretty good (albeit commercial) solutions are PDF query language and PDFxStream.

    1 vote
    1. thereticent
      They look right on! My cussedness requires that I try it the harder, cheaper way first, but this will be in my back pocket. Thanks. :)

  8. [2]
    whbboyd
    A former employer of mine built a tool for more-or-less this exact purpose: Textricator. Unfortunately, I was never involved in its use or development, so I can't help much with using it, but we were extracting structured data from court records which had been rendered as PDFs, which sounds reasonably close to what you're doing.

    1 vote
    1. thereticent
      Hahaha, this really is exact! I'll report back.

  9. xk3
    pypdf_table_extraction (formerly camelot) sounds exactly like what you need. I put an example Git-scraping repo here:...
    1 vote