Best solution to extract PDF data?
Hi folks--
To those more knowledgeable than I am:
What would be the best local solution for extracting numerical data from a batch of PDF reports? The values I want are interspersed among word-processor-formatted tables and irrelevant text. The text and table formatting are (nearly) identical across reports; the data I want vary from report to report. The PDFs are not scans of images... I can select and copy text without OCR. I have thousands to process, and the data themselves are confidential (I have clearance) and cannot be shared. I can use Windows or Linux, but not macOS.
I am technically inclined, so I bashed my head against regular expressions just enough to use Notepad++ to find and delete most of the irrelevant stuff and make a CSV, but it's a hacky, imprecise method and not nearly automated enough for batches. For reference, I don't code for a living or even as a hobby, but I use R and bash, am familiar with IDEs, and can follow pseudocode well enough to edit and use scripts.
Any thoughts? Thanks in advance!
Can you reasonably bring software packages onto the network you are doing the work on? I recommend using Tesseract OCR with pytesseract, but there are other PDF-parsing libraries that have coin-flip odds of working with your particular PDF format.
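If you do end up going the OCR route, the glue code is short. Here's a minimal sketch, assuming the Tesseract binary and poppler are installed and using pdf2image to rasterize pages; "report.pdf" is just a placeholder filename:

```python
# Minimal OCR sketch: rasterize each PDF page, then run Tesseract on it.
# Assumes the Tesseract binary and poppler are installed; "report.pdf" is a placeholder.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("report.pdf", dpi=300)   # render each page to a PIL image
for i, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page)       # OCR a single rendered page
    print(f"--- page {i} ---")
    print(text)
```

It's slower than pulling the embedded text directly, but it sidesteps the quirks of any one PDF parser.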
Is it necessary for OP to use something like Tesseract if they don't need OCR? I've only used Tesseract for the OCR side of things, and running OCR would be much slower and would probably give less consistent results than extracting the text directly from the file.
The problem is that there really isn't a reliable way to extract data from a PDF. PDFs are weird. If his system is connected to the internet, he can try a few of the libraries that attempt to parse PDFs, but like I said before, these have basically coin-flip odds of success. If he has to request that software be installed and approved by security or something, Tesseract will be about as sure a bet as any.
I've worked on this problem before in a professional context, and while there are myriad ways PDFs can fail to be parsed by tools dedicated to extracting text from them, I think "coin-flip odds of success" is a big exaggeration. You'd be surprised how good the big, heavily maintained libraries can get at this.
I have as well, and so far I've never found something robust enough to handle my use case with enough reliability to put into production, so I think it depends heavily on your data and files.
Oh yeah, it's absolutely highly variable depending on the specific tasks and files. My solution was in a context where we were processing a LOT of different files, so some percentage of failures was okay as long as it wasn't too high. PDFs were actually not the biggest offender on that front; IIRC, Word and Excel documents were the bigger offenders in sheer number of errors (if you exclude OCR timing out, because we didn't want to spend indefinite time and money on OCR-ing huge PDFs).
This is only tangentially related, but are there ways of reliably creating PDFs in a manner amenable to text extraction, or are issues just inherent to the format due to its glyph-based structure?
Hmm, it sounds beyond my abilities, but I will ask one of our IT/sysadmin folks whether we can do that. Thanks!
Have you tried pdfgrep?
https://pdfgrep.org/
If the tables are pretty consistent, a suitable regex might work for you.
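If it helps, here's a rough sketch of wrapping pdfgrep in a batch loop from Python; the regex, folder, and output file are placeholders you'd swap for your own:

```python
# Rough batch sketch around pdfgrep: run one regex across a folder of PDFs
# and dump the matches to a CSV. Pattern, paths, and output parsing are placeholders.
import csv
import subprocess
from pathlib import Path

NUMBER_PATTERN = r"Total:\s+[0-9][0-9.,]*"   # example regex; anchor it to your table labels

with open("matches.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "page_and_match"])
    for pdf in sorted(Path("reports").glob("*.pdf")):
        result = subprocess.run(
            ["pdfgrep", "-n", NUMBER_PATTERN, str(pdf)],  # -n prefixes each match with its page number
            capture_output=True, text=True,
        )
        for line in result.stdout.splitlines():
            writer.writerow([pdf.name, line.strip()])
```

Since you already know bash, the same loop works as a plain shell script too; Python just makes the CSV step less fiddly.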
This is the kind of thing I was picturing! I'm going to check out a variety of options, and I'll update when I see some results. Appreciate it!
Extracting text from PDFs can be a pain even without OCR. Apache Tika is pretty much the state of the art in text extraction and is what I used for my last project, but that project required extraction from more file types than just PDFs and from a variety of sources. If you know you'll get the same type of PDF each time, you may be able to use something smaller and more dedicated rather than such a big component; Tika includes OCR functionality through Tesseract, but it can be turned off.
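For what it's worth, the Python bindings make trying Tika pretty painless. A minimal sketch, assuming the tika-python package (which launches a local Tika server behind the scenes) and a placeholder filename:

```python
# Minimal Tika sketch via the tika-python bindings.
# The package downloads and starts a local Tika server (Java) on first use.
from tika import parser

parsed = parser.from_file("report.pdf")            # "report.pdf" is a placeholder
print(parsed["metadata"].get("Content-Type"))      # e.g. application/pdf
print(parsed["content"])                           # the extracted plain text
```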
Thank you! There seem like many options that are fairly complicated, so I'll put these on the list to look into more deeply.
This Ruby gem should work: https://github.com/yob/pdf-reader
Oh intriguing... Ruby and PHP were the two languages I checked out back in college, and this looks like it might make things really simple. Thanks!
I don't use it much myself, but I have a couple family members who've used Microsoft's Power Query to extract data into Excel from a variety of sources, including PDFs.
Ooh, I just saw people here singing the praises of Power Query on another thread. I'll go check it out. Thanks!
This may not be what you need, but Microsoft PowerToys has a Text Extractor tool that captures text from a selected region of the screen.
If the PDFs have the tables in the same region of the page, you might be able to set up a capture region and page through the PDFs, capturing them one page at a time.
Admittedly that's a pretty manual process, but if nothing else works out, it's better than having no automation at all.
That's actually not too manual at all, as long as I'm only setting capture zones once for each page... I'll check it out. Thanks!
Some pretty good (albeit commercial) solutions are PDF query language and PDFxStream.
They look right on! My cussedness requires that I try it the harder, cheaper way first, but this will be in my back pocket. Thanks. :)
A former employer of mine built a tool for more or less this exact purpose: Textricator. Unfortunately, I was never involved in its use or development, so I can't help much with using it, but we were extracting structured data from court records that had been rendered as PDFs, which sounds reasonably close to what you're doing.
Hahaha, this really is exact! I'll report back.
pypdf_table_extraction (formerly camelot) sounds exactly like what you need.
I put an example Git-scraping repo here:
https://github.com/chapmanjacobd/us_visa_statistics/blob/main/.github/workflows/immigrant.yaml
which uses the Python code here:
https://github.com/chapmanjacobd/library/blob/d26b854b4291b7f6bc143915a6ec6a3145233aa1/xklb/utils/file_utils.py#L655
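And in case it saves someone a docs dive, pypdf_table_extraction kept the camelot-style API after the rename. A minimal sketch with placeholder filename, page range, and flavor (use "stream" if your tables have no ruled lines):

```python
# Minimal sketch of the camelot-style API kept by pypdf_table_extraction.
# Filename, page range, and flavor are placeholders to adjust for your reports.
import pypdf_table_extraction

tables = pypdf_table_extraction.read_pdf("report.pdf", pages="1-end", flavor="lattice")
print(f"found {tables.n} tables")
print(tables[0].df.head())                     # each table is exposed as a pandas DataFrame
tables.export("report_tables.csv", f="csv")    # writes one CSV per detected table
```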