Best solution to extract PDF data?

programming

Ask (advice)

Hi folks--

To those more knowledgeable than I am:

What would be the best local solution for extracting numerical data from a batch of PDF reports? The values I want are interspersed among tables (formatted in a word processor) and irrelevant text. The text and table formatting are (nearly) identical across reports; only the data vary. The PDFs are not scanned images: I can select and copy text without OCR. I have thousands to process, and the data are confidential (I have clearance) and cannot be shared. I can use Windows or Linux, but not macOS.

I am technically inclined, so I bashed my head against regular expressions just enough to use Notepad++ to find and delete most of the irrelevant stuff and make a CSV, but it's a hacky, imprecise method and not nearly automated enough for batches. For reference, I don't code for a living or even as a hobby, but I use R and bash, am familiar with IDEs, and can follow pseudocode well enough to edit and use scripts.
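
To make the regex-to-CSV step above concrete, here is a rough sketch of how it might be automated over a batch. It is only an illustration under stated assumptions: it presumes each PDF has already been dumped to a plain-text file by some separate local tool, and the labels `Total count` and `Mean value` are made-up stand-ins, since the real ones can't be shared. The function names are hypothetical too.

```python
import csv
import re
from pathlib import Path

# Hypothetical field labels: stand-ins for whatever text precedes each
# value in the real (confidential) reports.
FIELD_PATTERNS = {
    "total_count": re.compile(r"Total count:\s*(-?\d+(?:\.\d+)?)"),
    "mean_value": re.compile(r"Mean value:\s*(-?\d+(?:\.\d+)?)"),
}


def extract_fields(text):
    """Return a dict of field name -> captured number, or "" if the label is absent."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def batch_to_csv(text_dir, csv_path):
    """Write one CSV row per .txt file found in text_dir."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *FIELD_PATTERNS])
        writer.writeheader()
        for path in sorted(Path(text_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow({"file": path.name, **extract_fields(text)})
```

Blank cells in the resulting CSV would flag reports where a label wasn't found, which seems easier to audit than silently skipping them.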

Any thoughts? Thanks in advance!

24 votes
