Best solution to extract PDF data?
Hi folks--
To those more knowledgeable than I am:
What would be the best local solution for extracting numerical data from a batch of PDF reports? The values I want are interspersed among word-processor-formatted tables and irrelevant text. The text and table formatting are (nearly) identical across reports; the data I want vary across reports. The PDFs are not scanned images: I can select and copy text without OCR. I have thousands to process, and the data themselves are confidential (I have clearance) and cannot be shared. I can use Windows or Linux, but not macOS.
I am technically inclined, so I bashed my head against regular expressions just enough to use Notepad++ to find and delete most of the irrelevant stuff and make a CSV, but it's a hacky, imprecise method and not nearly automated enough for batches. For reference, I don't code for a living or even as a hobby, but I use R and bash, am familiar with IDEs, and can follow pseudocode well enough to edit and use scripts.
Any thoughts? Thanks in advance!
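For what it's worth, the regex-to-CSV approach described above can be automated for a whole folder. Here is a minimal sketch in Python, not a definitive solution. It assumes the `pdftotext` command-line tool from poppler-utils is installed locally (so nothing leaves the machine). The labels "Total count" and "Mean score" and the function names are made up for illustration, since the real reports can't be shared; they would need to be replaced with labels from the actual reports.

```python
import csv
import re
import subprocess
from pathlib import Path

# Hypothetical labels and patterns: swap in the labels that precede
# the wanted numbers in the real reports.
FIELD_PATTERNS = {
    "total_count": re.compile(r"Total count\s*:?\s*(-?\d[\d,]*(?:\.\d+)?)"),
    "mean_score": re.compile(r"Mean score\s*:?\s*(-?\d[\d,]*(?:\.\d+)?)"),
}


def pdf_to_text(pdf_path):
    """Convert one PDF to text locally with pdftotext (poppler-utils).

    '-layout' tries to preserve column alignment; '-' sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_fields(text):
    """Return one dict of field -> number string (empty if not found)."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        # Drop thousands separators so R reads the value as numeric.
        row[name] = match.group(1).replace(",", "") if match else ""
    return row


def process_folder(pdf_dir, csv_path):
    """Write one CSV row per PDF found in pdf_dir."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *FIELD_PATTERNS])
        writer.writeheader()
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            row = extract_fields(pdf_to_text(pdf_path))
            writer.writerow({"file": pdf_path.name, **row})
```

Calling `process_folder("reports", "values.csv")` (both names hypothetical) would produce a CSV that can be read straight into R. Blank cells mark reports where a pattern did not match, which makes the "nearly identical" exceptions easy to spot and check by hand.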