• Activity
  • Votes
  • Comments
  • New
  • All activity
  • Showing only topics in ~comp with the tag "pdf". Back to normal view / Search all groups
    1. Best solution to extract PDF data?

      Hi folks-- To those more knowledgeable than I am: What would be the best local solution to extract numerical data from a batch of PDF file reports? The values I want are interspersed among word...

      Hi folks--

      To those more knowledgeable than I am:

      What would be the best local solution to extract numerical data from a batch of PDF file reports? The values I want are interspersed among word processor formatted tables and irrelevant text. The text and table formatting are (nearly) identical across reports. The data I want vary across reports. The PDFs are not of images...I can select and copy text without OCR. I have thousands to process, and the data themselves are confidential (I have clearance) and cannot be shared. I can use Windows or Linux but no MacOS.

      I am technically inclined, so I bashed my head against regular expressions just enough to use notepad++ to find and delete most of the irrelevant stuff and make a CSV, but it's a hacky, imprecise method and not nearly automated enough for batches. For reference, I don't code for a living or even as a hobby, but I use R and bash, am familiar with IDEs, and can follow pseudocode well enough to edit and use scripts.

      Any thoughts? Thanks in advance!

      24 votes