Help with converting PDF to Excel and back to PDF?
I may be asking a dumb question or going about this wrong but I'm not sure what to do here.
So right now, I receive an estimation from one company in a PDF. It has a bunch of fields such as customer name, product, address, etc. Then I type that data and put it into Excel, where I add additional data that I have. From there I have a second PDF which has form fields that I fill with the data of the Excel spreadsheet.
My problem is with the first PDF that I get from this other company, unless I am doing something wrong I am unable to get that first PDF to show the data as fields. If I convert the first PDF into an Excel then the table data is very messy.
The amount of typing or copy and paste that it's not hard but it is time consuming. What is the best way for me to go about doing this? I've been Googling things but I'm not sure the right words of action I'm looking for.
I hope this all makes sense, but if not please ask questions and I'll do my best to try and clarify further.
The best solution is to ask them to provide you with a different format than PDF. If an Excel file, great. Otherwise, you could import a CSV. Even a text file is better than a PDF, because it would be much easier to parse.
If you're copy-pasting (highlighting segments on screen, etc.), consider copy-pasting the entire document at once, then using a scripting language and tools (awk, sed, etc.) on the text to generate a CSV. You can then import that into Excel.
Unfortunately, they won't give me another format other than PDF. I can get the text from bounding boxes in Adobe. Is there any way to turn those into form fields or pull the text into an Excel spreadsheet?
Notable mentions here of two versatile python libraries viz. PyPDF2 and pdfminer.six. I've used them both in client projects recently to parse large PDFs to text and then extract data out of that text using regular expressions, etc.
I've also used another pdf parsing library smalot/pdfparser, this one is written in PHP. I've handled as large PDF files as 800MB using this if memory serves correctly!
Of course, the success in these cases depends a lot on how well your PDF itself is formatted. Maybe you can experiment with one briefly as a proof of concept and see where it goes?
Ilovepdf.com has back and forth excel options.
Tabula is a popular tool that's great for extracting table data from PDFs. I've used it successfully many times for the type of work you're talking about.
https://tabula.technology/
Getting data from Excel into PDF form fields is harder, but assuming you're on Windows, you can set up a AHK script that will copy a row out of data from Excel, split it apart, focus the PDF window, fill in each field of the PDF with CTRL+V and TAB keystrokes, then refocus Excel and move your cursor down a row.
(E: and for clarity, AHK = "autohotkey", which is a simple tool that swaps windows/opens files/sends keystrokes/clicks mouse. Much easier for a non-techie to use than a traditional programming language.)
https://smallpdf.com/pdf-to-excel
I like smallpdf for their jpg to pdf conversion. I have not tried this conversion, but it's worth a shot.
Another option is Bing Chat or ChatGPT+, Bing chat allows you to extract information from PDF files, same thing with CharGPT+ but you need a plug-in (AFAIK) and that costs money, bing is free. You might be able to get them to make you the excel but just getting the data in a format to simply copy and paste shouldn’t be too hard.
I recently also had to do something like this. I ended up writing a small program in Python using Camelot. It worked surprisingly well.