2 votes

[Python] Trouble fetching checkbox and radio fields with PyPDF2

Posted August 23, 2022 by noble_pleb

My project involves reading text from a batch of PDF form files, for which I'm using the open-source PyPDF2 library. Getting the text data works fine:

from PyPDF2 import PdfReader

reader = PdfReader("data/test.pdf")
cnt = len(reader.pages)
print("reading pdf (%d pages)" % cnt)
# Extract the text of the last page
page = reader.pages[cnt-1]
lines = page.extract_text().splitlines()
print("%d lines extracted..." % len(lines))

However, this text doesn't contain the checked state of the radio buttons and checkboxes. I just get the plain label text (like "Yes No", for example) instead of the selected values.

I also tried the reader.get_fields() and reader.get_form_text_fields() methods described in the documentation, but they return empty values. I tried reading the fields through annotations as well, but there is no "/Annots" entry on the page. When I open the PDF in Notepad++ to look at its metadata, this is what I get:

%PDF-1.4
%²³´µ
%Generated by ExpertPdf v9.2.2

It appears to me that these checkboxes aren't the usual PDF form fields but something closer to HTML elements. Is there any way to extract these fields using Python?
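
For anyone reproducing the checks described above, here is a minimal sketch of how checkbox and radio states could be read when a PDF does carry real form widgets. It assumes the standard AcroForm keys ("/FT", "/T", "/AS", "/Parent"); the helper names button_states and page_button_states are made up for this example and are not part of PyPDF2. On a file like the one in this post, where the page has no "/Annots" entry, it would simply return an empty dict, which suggests the boxes are drawn as plain graphics rather than stored as fields.

```python
def _resolve(obj):
    # PyPDF2 may hand back indirect references; plain dicts/strings have no get_object().
    return obj.get_object() if hasattr(obj, "get_object") else obj


def _inherited(annot, key):
    """Look up `key` on the annotation, then up its /Parent chain."""
    node = _resolve(annot)
    while node is not None:
        if key in node:
            return _resolve(node[key])
        parent = node.get("/Parent")
        node = _resolve(parent) if parent is not None else None
    return None


def button_states(annotations):
    """Map field name -> list of appearance states for button widgets.

    `annotations` is any iterable of dict-like annotation objects, such as
    the entries of a page's "/Annots" array. "/Off" means unchecked; any
    other state name means that widget is switched on. Push buttons also
    have type "/Btn" and are not filtered out in this sketch.
    """
    states = {}
    for annot in annotations:
        annot = _resolve(annot)
        if _inherited(annot, "/FT") != "/Btn":
            continue  # not a checkbox/radio/push button
        name = str(_inherited(annot, "/T"))
        state = str(_resolve(annot.get("/AS", "/Off")))
        states.setdefault(name, []).append(state)
    return states


def page_button_states(path, page_index=0):
    """Hypothetical wrapper: read one page of a PDF file with PyPDF2."""
    from PyPDF2 import PdfReader  # imported lazily; only needed for real files

    page = PdfReader(path).pages[page_index]
    annots = page["/Annots"] if "/Annots" in page else []
    return button_states(annots)
```

Calling page_button_states("data/test.pdf", cnt - 1) would cover the same page as the text extraction above. A radio group shows up as one name with several states, one per option.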

2 comments

vivarium
August 23, 2022
Clicking through to the project's GitHub issue page, it looks like you've gotten this sorted out. :)

It's not printed because the PDF text is selectable but the checkboxes aren't, which makes me think that it's embedded HTML or something. That's actually more likely because there are also HTML documents that need to be processed and are formatted similarly.

PDF parsing is a neat problem domain! "What's so hard about PDF text extraction?" is another interesting read on the topic, for anyone else who's curious.

2 votes
1. noble_pleb (OP)
  August 24, 2022
  Thank you. The user agreed that the radio/checkbox values aren't needed, at least for now, so I closed the issue. But yeah, this is a neat problem domain that deserves further exploration!
  
  1 vote