Best solution to extract PDF data?
Hi folks--
To those more knowledgeable than I am:
What would be the best local solution to extract numerical data from a batch of PDF file reports? The values I want are interspersed among word-processor-formatted tables and irrelevant text. The text and table formatting are (nearly) identical across reports. The data I want vary across reports. The PDFs are not scans of images...I can select and copy text without OCR. I have thousands to process, and the data themselves are confidential (I have clearance) and cannot be shared. I can use Windows or Linux but no MacOS.
I am technically inclined, so I bashed my head against regular expressions just enough to use Notepad++ to find and delete most of the irrelevant stuff and make a CSV, but it's a hacky, imprecise method and not nearly automated enough for batches. For reference, I don't code for a living or even as a hobby, but I use R and bash, am familiar with IDEs, and can follow pseudocode well enough to edit and use scripts.
Any thoughts? Thanks in advance!
Can you reasonably bring in software packages onto the network you are doing the work on? I recommend using Tesseract OCR with Pytesseract, but there are other PDF parsing libraries that have coin-flip odds for working with your particular PDF format.
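If you can get Python onto that machine, the core of this approach is only a few lines. Here's a minimal sketch assuming pdf2image (which needs Poppler installed) to rasterize the pages and pytesseract on top of Tesseract; the file path is a placeholder:

```python
# Minimal sketch: rasterize each PDF page, then OCR it with Tesseract.
# Assumes the pdf2image (requires Poppler) and pytesseract packages;
# "report.pdf" is a placeholder path.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("report.pdf", dpi=300)  # one PIL image per page
text = "\n".join(pytesseract.image_to_string(page) for page in pages)
print(text)
```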
Hmm, it sounds beyond my abilities, but I will ask one of our IT/sysadmin folks whether we can do that. Thanks!
I think Pytesseract is well known enough that if you have any familiarity with Python at all, you can likely get 99% of the way there by prompting an LLM to generate what you are looking for.
Is it necessary for OP to use something like Tesseract if they don't need OCR? I've only used Tesseract for the OCR side of things, and running OCR would be much slower and would probably give less consistent results than extracting the text directly from the file.
The problem is that there really isn't a reliable way to extract data from a PDF. PDFs are weird. If his system is connected to the internet, he can try a few different libraries that try to parse PDFs, but like I said before, these have basically coin-flip odds of success. If he has to request software be installed and approved by security or something, Tesseract will be about as sure a bet as any.
I've worked on this problem before in a professional context, and while there are myriad ways PDFs can fail to be parsed by tools dedicated to extracting text from them, I think "a coin flip of success" is a big exaggeration. You'd be surprised how good the big libraries with a lot of work put into them can get at this.
I have as well, and so far I've never found something robust enough to handle my use case reliably enough to put into production, so I think it depends heavily on your data and files.
Oh yeah, it's absolutely highly variable depending on the specific tasks and files. My solution was in a context where we were processing a LOT of different files, so some percentage of failures was okay as long as it wasn't too high. PDFs were actually not the biggest offender on that front; iirc Word and Excel documents were the bigger offenders in sheer numbers of errors (if you exclude OCR timing out, bc we didn't want to spend indefinite time and money on OCR-ing huge PDFs).
This is only tangentially related, but are there ways of reliably creating PDFs in a manner amenable to text extraction, or are issues just inherent to the format due to its glyph-based structure?
You can absolutely create easy-to-parse PDFs if you're just creating them from a text starting point and using any number of common libraries. The issue with PDFs is that they can be all over the place in content and functionality. Adobe kind of went nuts with all the baked-in tech in PDFs. Like, is it necessary to be able to render 3D models in a document? I wouldn't have said yes to that, but Adobe sure thought so. Using OCR lets you sidestep most of this wackiness, but OCR isn't perfect and probably never will be.
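To make the first point concrete, here's a rough sketch of generating a PDF from plain text with one common library, reportlab (the content is just a placeholder); text written this way as real text objects generally extracts cleanly afterwards:

```python
# Rough sketch: write real text objects (not images) into a PDF with reportlab.
# PDFs built this way are generally easy to extract text from later.
from reportlab.pdfgen import canvas

c = canvas.Canvas("easy_to_parse.pdf")
c.drawString(72, 720, "Subject ID: 12345")   # placeholder content
c.drawString(72, 700, "Measurement: 42.7")
c.save()
```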
I unfortunately was never on the creating side of things, so I can't say much there. Most of the non-OCR-related errors we got had to do with something not following the normal spec (which I'm more familiar with for the XML-based formats, personally).
My understanding from looking into this is that it's just a byproduct of the way the format works.
Images are easy to do with tools like Tesseract: if they're clean enough, it works really well.
But PDFs... PDFs are a nightmare. You can mix and match images and text (so what looks like text often isn't), and the text itself is regularly mangled and not usable without manual cleaning. PDFs are meant to look good and be printable; the ones whose text can be recovered cleanly are rare. Honestly, the best way to extract text from a PDF is to first convert it to images and run OCR on them.
Edit: Just to be clear, I'm talking from experience here. Maybe your PDFs are good enough to be extracted because they were made properly from good materials (like Markdown or Office files), but that's not the case for a good portion of them 🤷‍♂️
If you need the text as plaintext from most PDFs, this just isn't even close to true in my experience. Existing solutions (like Apache Tika, or even something simpler but less robust like pypdf) tend to be good enough for most PDFs in that scenario, and they're also FAR faster than doing OCR. OCR can really take ages, especially on larger files, so even if it were perfectly accurate (and I can say from experience that Tesseract is not), it's not preferable if you have large PDFs or a large number of PDFs to process quickly.
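For the simple "just give me the text" case, something like pypdf is only a few lines (a sketch; the path is a placeholder, and it only works when the PDF actually contains text objects rather than scanned images):

```python
# Sketch of direct (no-OCR) text extraction with pypdf.
# Works when the PDF contains real text objects; "report.pdf" is a placeholder.
from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text)
```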
Have you tried pdfgrep?
https://pdfgrep.org/
If the tables are pretty consistent a suitable regex might work for you.
This is the kind of thing I was picturing! I'm going to check out a variety of options, and I'll update when I see some results. Appreciate it!
Extracting text from PDFs can be a pain even without OCR. Apache Tika is pretty much the state of the art at text extraction and is what I used for my last project, but that project required extraction from more file types than just PDFs and from a variety of sources. If you know you'll get the same types of PDFs each time, you may be able to use something smaller and more dedicated rather than such a big component -- Tika includes OCR functionality through Tesseract but it can be turned off.
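If you do try Tika from Python, the tika-python wrapper keeps it short. A sketch (it needs Java and starts a local Tika server on first use; the path is a placeholder):

```python
# Sketch using the tika-python wrapper (needs Java; it downloads and starts
# a local Tika server the first time it runs). "report.pdf" is a placeholder.
from tika import parser

parsed = parser.from_file("report.pdf")
print(parsed["content"])    # extracted plain text; parsed["metadata"] has metadata
```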
Thank you! There seem like many options that are fairly complicated, so I'll put these on the list to look into more deeply.
I don't use it much myself, but I have a couple family members who've used Microsoft's Power Query to extract data into Excel from a variety of sources, including PDFs.
Ooh, I just saw people here singing the praises of PowerQuery on another thread. I'll go check it out. Thanks!
One of those people was me, and I was planning to suggest PQ. I’ve used it for PDF data before, and it works quite well, as long as it doesn’t have to OCR and the PDF is set up sensibly. It sounds like that is your use case, so it should work great.
I use PQ for extracting from PDFs. Some of them are very ugly and I have been able to successfully manipulate the data to give me what I need.
This Ruby gem should work: https://github.com/yob/pdf-reader
Oh intriguing...Ruby and PHP were the two languages I checked out back in college, and this looked like it might make things really simple. Thanks!
A former employer of mine built a tool for more-or-less this exact purpose: Textricator. Unfortunately, I was never involved in its use or development, so I can't help much with using it, but we were extracting structured data from court records which had been rendered as PDFs, which sounds reasonably close to what you're doing.
Hahaha, this really is exact! I'll report back.
This may not be what you need, but MS PowerToys has a Text Extractor tool that is based on this tool.
If the PDFs have the tables in the same region, you might be able to set up a capture region and page through the PDFs, capturing the pages one at a time.
Admittedly that's a pretty manual process, but if nothing else avails you, it will be better to have some automation around it.
That's actually not too manual at all, as long as I'm only setting capture zones once for each page... I'll check it out. Thanks!
Some pretty good (albeit commercial) solutions are PDF query language and PDFxStream.
They look right on! My cussedness requires that I try it the harder, cheaper way first, but this will be in my back pocket. Thanks. :)
pypdf_table_extraction (formerly camelot) sounds exactly like what you need.
I put an example Git-scraping repo here:
https://github.com/chapmanjacobd/us_visa_statistics/blob/main/.github/workflows/immigrant.yaml
Which uses the python code here:
https://github.com/chapmanjacobd/library/blob/d26b854b4291b7f6bc143915a6ec6a3145233aa1/xklb/utils/file_utils.py#L655
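For reference, the basic camelot-style API looks roughly like this (a sketch, not the exact code in that repo; I'd expect the renamed pypdf_table_extraction fork to be very similar, and the default "lattice" mode needs Ghostscript installed):

```python
# Sketch of table extraction with the camelot-py API; the renamed fork
# (pypdf_table_extraction) should be very similar. Paths are placeholders.
import camelot

tables = camelot.read_pdf("report.pdf", pages="1-end")
print(len(tables), "tables found")
tables.export("report_tables.csv", f="csv")   # writes one CSV per detected table
df = tables[0].df                             # or work with the pandas DataFrame directly
```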
Is the text in AcroFields / AcroForm fields, or is it just text on the PDF itself?
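If you're not sure, something like pypdf can tell you quickly (a sketch; the path is a placeholder):

```python
# Quick check for AcroForm (fillable) fields with pypdf.
# If this prints None, the values are just text drawn on the page.
from pypdf import PdfReader

reader = PdfReader("report.pdf")   # placeholder path
print(reader.get_fields())         # dict of form fields, or None if there aren't any
```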
I'm going to assume just on the PDF itself. They aren't fillable fields, if that's what those terms mean.
If there's not too much, honestly just screenshotting it and using a Mac to copy-paste is pretty easy.
ChatGPT is also surprisingly good at this sort of stuff.
Not sure if it'll fit your needs, but I self-host a PDF tool called Stirling PDF that has a surprising amount of flexibility as a general PDF tool. Since it's self-hosted and open source you can cut it off from the network if you're worried about sensitive data. It's also really easy to deploy if you're familiar with docker.
You mentioned you can use Linux, so I'd suggest pdftotext, which can be found in the poppler-utils package on many distros.

As I understand it, PDF kind of flattens things and doesn't really preserve much structure. When you select lines in a PDF for copy and paste, it's using heuristics on the positions of the characters, words, and lines to guess at what goes together.
By default, pdftotext will just dump out a text file with all the extracted text. That alone may or may not be suitable depending on your needs. You can also try it with the -layout option, where it will try to format the text file that it writes to look visually like the original PDF.

But since the text and table formatting are nearly identical across your reports, you should be able to take advantage of things being positioned in common places.
You can run
pdftotext -bbox-layout <pdfFileIn> <xmlFileOut>
to get an XML file with the positions of the boxes for all of the blocks, lines, and words throughout the document. You can inspect that to figure out the position of a box that will cover just your table (or, say, a row of a table).
Then you can ask the utility to extract the text from just that box instead of the entire document:
pdftotext -f <page> -l <page> -x <boxXMin> -y <boxYMin> -W <boxWidth> -H <boxHeight> <pdfFileIn> <textFileOut>
and repeat as needed to extract your tables.
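To give a sense of the glue code involved, a batch loop over all the reports might look roughly like this (a sketch only: the page number, box coordinates, output file, and regex are placeholders you'd take from your own -bbox-layout output and reports):

```python
# Rough sketch of batch extraction with pdftotext. The page number, box
# coordinates, and regex below are placeholders for your own values.
import csv, re, subprocess
from pathlib import Path

PAGE = "1"
BOX = ["-x", "100", "-y", "300", "-W", "400", "-H", "150"]   # from -bbox-layout
VALUE_RE = re.compile(r"Total\s+([\d.]+)")                   # placeholder pattern

with open("values.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "value"])
    for pdf in sorted(Path("reports").glob("*.pdf")):
        # "-" sends pdftotext's output to stdout instead of a file
        text = subprocess.run(
            ["pdftotext", "-f", PAGE, "-l", PAGE, *BOX, str(pdf), "-"],
            capture_output=True, text=True, check=True,
        ).stdout
        match = VALUE_RE.search(text)
        writer.writerow([pdf.name, match.group(1) if match else ""])
```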
For automating pdftotext with scripting and regular expressions, I've found that the GPT-like LLMs with coding capabilities have been surprisingly good at writing that sort of glue code when I'm feeling too lazy to do it myself. I still don't trust them with code in a big, complicated system, but this sort of thing is really their jam. (Assuming you have access to such a model, even if it's just a local or on-site model run by your org.)

After the many recommendations here already, I don't really have much practical advice to add; I just wanted to mention that PDF text extraction is a rather convoluted topic. Entire products and patents exist around it for a reason.