24 votes

Best solution to extract PDF data?

Hi folks--

To those more knowledgeable than I am:

What would be the best local solution to extract numerical data from a batch of PDF file reports? The values I want are interspersed among word processor formatted tables and irrelevant text. The text and table formatting are (nearly) identical across reports. The data I want vary across reports. The PDFs are not of images...I can select and copy text without OCR. I have thousands to process, and the data themselves are confidential (I have clearance) and cannot be shared. I can use Windows or Linux but no MacOS.

I am technically inclined, so I bashed my head against regular expressions just enough to use Notepad++ to find and delete most of the irrelevant stuff and make a CSV, but it's a hacky, imprecise method and not nearly automated enough for batches. For reference, I don't code for a living or even as a hobby, but I use R and bash, am familiar with IDEs, and can follow pseudocode well enough to edit and use scripts.

Any thoughts? Thanks in advance!

37 comments

  1. [14]
    RNG
    Can you reasonably bring in software packages onto the network you are doing the work on? I recommend using Tesseract OCR with Pytesseract, but there are other PDF parsing libraries that have coin-flip odds of working with your particular PDF format.
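
    For illustration, the usual pattern looks something like this (just a sketch; it assumes the Tesseract binary plus the pdf2image and pytesseract Python packages are installed, and "report.pdf" is a placeholder):

    # Render each PDF page to an image (pdf2image needs poppler installed),
    # then OCR each page image with Tesseract via pytesseract.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path("report.pdf", dpi=300)  # one PIL image per page
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    print(text)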

    11 votes
    1. [2]
      thereticent
      Hmm, it sounds beyond my abilities, but I will ask one of our IT/sysadmin folks whether we can do that. Thanks!

      3 votes
      1. RNG
        I think Pytesseract is well known enough that, if you have any familiarity with Python at all, you can likely get 99% of the way there by prompting an LLM to generate what you are looking for.

        3 votes
    2. [11]
      sparksbet
      Is it necessary for OP to use something like Tesseract if they don't need OCR? I've only used Tesseract for the OCR side of things, and running OCR would be much slower, with probably less consistent results, than extracting the text directly from the file.

      2 votes
      1. [8]
        RNG
        The problem is that there really isn't a reliable way to extract data from a PDF. PDFs are weird. If his system is connected to the internet, he can try a few different PDF-parsing libraries, but like I said before, these have basically coin-flip odds of success. If he has to request that software be installed and approved by security or something, Tesseract will be about as sure a bet as any.

        10 votes
        1. [7]
          sparksbet
          I've worked on this problem before in a professional context, and while there are myriad ways PDFs can fail to be parsed by tools dedicated to extracting text from them, I think "a coin flip of success" is a big exaggeration. You'd be surprised how good the big libraries with a lot of work put into them can get at this.

          5 votes
          1. [6]
            Eji1700
            I have as well and so far have never found something robust enough to handle my use case with enough reliability to put into production, so I think it depends heavily on your data and file.

            4 votes
            1. [5]
              sparksbet
              Oh yeah, it's absolutely highly variable depending on the specific tasks and files. My solution was in a context where we were processing a LOT of different files, so some percentage of failures was okay as long as it wasn't too high. PDFs were actually not the biggest offender on that front; iirc Word and Excel documents produced more errors in sheer numbers (if you exclude OCR timing out, because we didn't want to spend indefinite time and money on OCR-ing huge PDFs).

              4 votes
              1. [4]
                danke
                This is only tangentially related, but are there ways of reliably creating PDFs in a manner amenable to text extraction, or are issues just inherent to the format due to its glyph-based structure?

                3 votes
                1. zod000
                  You can absolutely create easy-to-parse PDFs if you are just creating them from a text starting point and use any number of common libraries. The issue with PDFs is that they can be all over the place in content and functionality. Adobe kind of went nuts with all the baked-in tech in PDFs. Like, is it necessary to be able to render 3D models in a document? I wouldn't have said yes to that, but Adobe sure thought so. Using OCR lets you sidestep most of this wackiness, but OCR isn't perfect and probably never will be.
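
                  For what it's worth, a minimal sketch of the "from a text starting point" case (assuming the Python reportlab library; the contents are made up):

                  # Draw plain text onto a page; text written this way comes
                  # back out cleanly with extractors like pdftotext or pypdf.
                  from reportlab.pdfgen import canvas

                  c = canvas.Canvas("simple.pdf")
                  c.drawString(72, 720, "Subject ID: 001")
                  c.drawString(72, 700, "Total score: 42.5")
                  c.save()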

                  2 votes
                2. sparksbet
                  I unfortunately was never on the creating side of things, so I can't say much there. Most of the non-OCR-related errors we got had to do with something not following the normal spec (which I'm more familiar with for the XML-based formats, personally).

                3. Eji1700
                  My understanding from looking into this is that it's just a byproduct of the way the format works.

      2. [2]
        0xSim
        Images are easy to do with tools like Tesseract: if they're clean enough, it works really well.

        But PDFs... PDFs are a nightmare. You can mix and match images and text (so what looks like text often isn't), and the text itself is regularly mangled and not usable without manual cleaning. PDFs are meant to look good and be printable; the ones whose text can be extracted cleanly are rare. Honestly, the best way to extract text from a PDF is to first convert it to images and run OCR on them.

        Edit: Just to be clear, I'm talking from experience here. Maybe your PDFs are good enough to be extracted because they were made properly from good materials (like Markdown or Office files), but that's not the case for a good portion of them 🤷‍♂️

        2 votes
        1. sparksbet
          If you need the text as plaintext from most PDFs, this just isn't even close to true in my experience. Existing solutions (like Apache Tika, or even something simpler but less robust like pypdf) tend to be good enough for most PDFs in that scenario, and they're also FAR faster than doing OCR. OCR can really take ages, especially on large files, so even if it were perfectly accurate (and I can say from experience that Tesseract is not), it's not preferable if you have large PDFs or a large number of PDFs to process quickly.
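
          For the direct-extraction route, the pypdf version is only a few lines (a sketch; the file name is a placeholder):

          # Pull the embedded text straight out of the PDF, no OCR involved.
          from pypdf import PdfReader

          reader = PdfReader("report.pdf")
          text = "\n".join(page.extract_text() for page in reader.pages)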

          3 votes
  2. [2]
    streblo
    Have you tried pdfgrep?

    https://pdfgrep.org/

    If the tables are pretty consistent a suitable regex might work for you.
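
    For example, something along these lines, where the pattern is a made-up placeholder for whatever label precedes your values:
    pdfgrep -n 'Total score: [0-9.]*' reports/*.pdf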

    10 votes
    1. thereticent
      This is the kind of thing I was picturing! I'm going to check out a variety of options, and I'll update when I see some results. Appreciate it!

      6 votes
  3. [2]
    sparksbet
    Extracting text from PDFs can be a pain even without OCR. Apache Tika is pretty much the state of the art at text extraction and is what I used for my last project, but that project required extraction from more file types than just PDFs and from a variety of sources. If you know you'll get the same types of PDFs each time, you may be able to use something smaller and more dedicated rather than such a big component -- Tika includes OCR functionality through Tesseract but it can be turned off.
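
    If you try it from Python, the tika-python wrapper keeps the call short (a sketch; under the hood it talks to a local Tika server, which needs Java):

    # Tika returns the extracted plaintext under the "content" key.
    from tika import parser

    parsed = parser.from_file("report.pdf")
    print(parsed["content"])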

    6 votes
    1. thereticent
      Thank you! There seem like many options that are fairly complicated, so I'll put these on the list to look into more deeply.

      4 votes
  4. [4]
    Crespyl
    I don't use it much myself, but I have a couple family members who've used Microsoft's Power Query to extract data into Excel from a variety of sources, including PDFs.

    6 votes
    1. [3]
      thereticent
      Ooh, I just saw people here singing the praises of PowerQuery on another thread. I'll go check it out. Thanks!

      3 votes
      1. Weldawadyathink
        One of those people was me, and I was planning to suggest PQ. I’ve used it for PDF data before, and it works quite well, as long as it doesn’t have to OCR and the PDF is set up sensibly. It sounds like that is your use case, so it should work great.

        4 votes
      2. PnkNBlck71817
        I use PQ for extracting from PDFs. Some of them are very ugly and I have been able to successfully manipulate the data to give me what I need.

        1 vote
  5. [2]
    bl4kers
    This Ruby gem should work: https://github.com/yob/pdf-reader

    5 votes
    1. thereticent
      Oh intriguing...Ruby and PHP were the two languages I checked out back in college, and this looked like it might make things really simple. Thanks!

      4 votes
  6. [2]
    whbboyd
    A former employer of mine built a tool for more-or-less this exact purpose: Textricator. Unfortunately, I was never involved in its use or development, so I can't help much with using it, but we were extracting structured data from court records which had been rendered as PDFs, which sounds reasonably close to what you're doing.

    4 votes
    1. thereticent
      Hahaha, this really is exact! I'll report back.

  7. [2]
    first-must-burn
    This may not be what you need, but MS PowerToys has a Text Extractor tool that is based on this tool.

    If the PDFs have the tables in the same region, you might be able to set up a capture region and page through the pdfs capturing the pages one at a time.

    Admittedly that's a pretty manual process, but if nothing else avails you, it will be better to have some automation around it.

    3 votes
    1. thereticent
      That's actually not too manual at all, as long as I'm only setting capture zones once for each page... I'll check it out. Thanks!

      3 votes
  8. [2]
    ignorabimus
    Some pretty good (albeit commercial) solutions are PDF query language and PDFxStream.

    2 votes
    1. thereticent
      They look right on! My cussedness requires that I try it the harder, cheaper way first, but this will be in my back pocket. Thanks. :)

      1 vote
  9. xk3
    pypdf_table_extraction (formerly camelot) sounds exactly like what you need. I put an example Git-scraping repo here:...
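
    For reference, the classic camelot-style usage is short (a sketch; I'm assuming the older camelot-py package name, and the renamed package may differ slightly):

    # Each detected table comes back as a pandas DataFrame.
    import camelot

    tables = camelot.read_pdf("report.pdf", pages="all")
    tables[0].df.to_csv("table0.csv")
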
    2 votes
  10. [2]
    TurtleCracker
    Is the text in acrofields / acroform fields or is it just text on the PDF itself?

    2 votes
    1. thereticent
      I'm going to assume just on the PDF itself. They aren't fillable fields, if that's what those terms mean.

      1 vote
  11. Maxi
    If there's not too much, honestly just screenshotting it and using a Mac to copy-paste is pretty easy.

    ChatGPT is also surprisingly good at this sort of stuff.

    2 votes
  12. Gaywallet
    Not sure if it'll fit your needs, but I self-host a PDF tool called Stirling PDF that has a surprising amount of flexibility as a general PDF tool. Since it's self-hosted and open source you can cut it off from the network if you're worried about sensitive data. It's also really easy to deploy if you're familiar with docker.

    2 votes
  13. Boojum
    You mentioned you can use Linux, so I'd suggest pdftotext, which can be found in the poppler-utils package on many distros.

    As I understand it, PDF kind of flattens things and doesn't really preserve much structure. When you select lines in a PDF for copy and paste, it's using heuristics on the positions of the characters, words and lines to guess at what goes together.

    By default, pdftotext will just dump out a text file with all the extracted text. That alone may or may not be suitable depending on your needs. You can also try it with the -layout option where it will try to format the text file that it writes to look visually like the original PDF.

    But since the text and table formatting are nearly identical across reports, you should be able to take advantage of things being positioned in common places.

    You can run
    pdftotext -bbox-layout <pdfFileIn> <xmlFileOut>
    to get an XML file with the positions of the boxes for all of the blocks, lines, and words throughout the document. You can inspect that to figure out the position of a box that will cover just your table (or, say, a row of a table).

    Then you can ask the utility to extract the text from just that box instead of the entire document:
    pdftotext -f <page> -l <page> -x <boxXMin> -y <boxYMin> -W <boxWidth> -H <boxHeight> <pdfFileIn> <textFileOut>
    and repeat as needed to extract your tables.

    For automating pdftotext with scripting and regular expressions, I've found that the GPT-like LLMs with coding capabilities have been surprisingly good at writing that sort of glue code when I'm feeling too lazy to do it myself. I still don't trust them with code in big complicated systems, but this sort of thing is really their jam. (Assuming you have access to such a model, even if it's just a local or on-site model run by your org.)
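
    As a rough sketch of that kind of glue (the pattern and file names are invented; adjust them to your reports' actual labels):

    # Run pdftotext on each PDF, regex out one value per file, build a CSV.
    import csv, glob, re, subprocess

    pattern = re.compile(r"Total score:\s*([0-9.]+)")  # made-up label

    with open("out.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "total_score"])
        for path in sorted(glob.glob("*.pdf")):
            # "-" tells pdftotext to write the extracted text to stdout
            text = subprocess.run(["pdftotext", path, "-"],
                                  capture_output=True, text=True).stdout
            m = pattern.search(text)
            writer.writerow([path, m.group(1) if m else ""])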

    2 votes
  14. tauon
    After the many recommendations here already, I don’t really have much practical advice to add, just wanted to mention that PDF text extraction is a rather convoluted topic. Entire products and patents exist around it for a reason.