What is the optimal way to convert an RPG book to a text format?
An RPG book is a book containing the rules and setting for a tabletop RPG, such as Dungeons and Dragons 5th Edition, Worlds Without Number, or Star Trek Adventures.
The fact that they are rarely in text format always puts me off reading RPG books. I don't want to diminish the importance of art, but importing printed RPG books is prohibitively expensive, and reading huge PDFs on a laptop is not a good experience for me.
I also find the complicated design of these books unpleasant to navigate. It's distracting.
I have a 6.8" Kindle Paperwhite, but reading RPG PDFs on it is awful. RPG books have lots of art and complicated layouts. Unfortunately, there doesn't seem to be an easy way to convert an RPG book into text. I was seriously considering just copying the text and converting it to markdown myself (it doesn't need to be markdown, just something that I can convert into a format my Kindle understands) when I remembered ChatGPT.
Copying the text and asking GPT to make it into markdown worked okay, but it missed the tables. Sending an image of a page worked pretty well, so I think AI is the way here. But I am not a GPT subscriber and I bet I'll hit a limit at some point. Also, instead of sending pages individually, I would prefer to send the PDF and get the result in text. Even if there were limitations (like only 10 pages in one go), it would be an improvement.
In any case, using ChatGPT will be much better than doing it by hand. But is there an AI or other kind of PDF service that is better suited for that task, so I can reduce the amount of manual input?
The best format is likely a wiki.
Preferably one that is very granular, well indexed and configured to automatically hyperlink/tooltip other mentions of specialized concepts.
The issue is that there tend to be a lot of wikis floating around for each game, and it's very difficult to identify the one that is most current and accurate, especially with so many newer games being updated on the fly and the constant push for new content that players can't keep up with.
Another option is fan-made apps. I used to use the pf and dnd5e companion guide apps, where the community would format every source book and hide them as addons on the forum. But this was years ago and I don't know if they've been taken down for copyright.
And it was incredibly easy to search up concepts, bookmark and tag pages, plan encounters, incorporate homebrew stuff and generate things on the fly. And because they're community driven, it's mostly features people need and not microtransactions all the way through.
Here are the Android apps I used for 5e:
https://play.google.com/store/apps/details?id=com.blastervla.ddencountergeneratorddencountergenerator
Never played the 2e version but this looks like the original app.
https://play.google.com/store/apps/details?id=com.misthero.pf2earchive
That last one is Pathfinder 2e.
I've used the D&D Beyond app and website as well, but it doesn't really substitute for being able to read the book, especially for people who just enjoy that.
Don't have any suggestions, but your experience using ChatGPT reminded me of this Ars Technica article: Why extracting data from PDFs is still a nightmare for data experts
LLMs have some advantages for OCR but I still prefer using traditional tools like Tesseract.
I've found ocrmypdf to be an easy-to-use wrapper around Tesseract and calibre's ebook-convert to be a reliable tool for converting between formats. I put both tools together in this tool:
This will OCR if necessary and convert to open-ebook format (which is essentially an unzipped ePub)
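For anyone who wants to try the same idea by hand, here is a rough sketch of chaining the two tools from Python. This is not the tool mentioned above, just an illustration; the function names and the ".ocr.pdf" intermediate file name are my own assumptions, and it expects ocrmypdf and calibre's ebook-convert to be installed and on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf: Path, out: Path, skip_text: bool = True) -> list[list[str]]:
    """Return the two commands to run: OCR first, then format conversion.

    build_commands is a hypothetical helper, not part of either tool.
    """
    # Intermediate file name is an arbitrary choice for this sketch.
    ocr_pdf = pdf.with_name(pdf.stem + ".ocr.pdf")
    ocr_cmd = ["ocrmypdf"]
    if skip_text:
        # --skip-text tells ocrmypdf to leave pages that already have text alone.
        ocr_cmd.append("--skip-text")
    ocr_cmd += [str(pdf), str(ocr_pdf)]
    # ebook-convert picks the output format from the output file's extension.
    convert_cmd = ["ebook-convert", str(ocr_pdf), str(out)]
    return [ocr_cmd, convert_cmd]


def convert(pdf: Path, out: Path) -> None:
    """Run OCR and conversion in sequence, stopping at the first failure."""
    for cmd in build_commands(pdf, out):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} is not installed or not on PATH")
        subprocess.run(cmd, check=True)
```

Calling convert(Path("book.pdf"), Path("book.epub")) would produce an EPUB; expect to tweak calibre's conversion options before the result looks right for a layout-heavy RPG book.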
While I agree that traditional tools are better than LLMs for this use case, OCR is not necessary if it's the type of PDF where you can copy and paste the text. It would take way longer (and often produce worse results) than other methods of text extraction, in my experience, so it's better used as a backup for when other methods aren't possible.
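One way to decide whether OCR is needed at all is to try plain text extraction first and see how much comes back. The sketch below is only a heuristic I'm assuming for illustration: the 50-character threshold and the "more than half the pages" rule are arbitrary, and it uses the third-party pypdf library for the extraction step.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF lacks a usable text layer.

    page_texts holds the text extracted from each page. A page counts as
    "empty" if it yields fewer than min_chars_per_page characters. RPG books
    have full-page art, so a few empty pages are normal; only flag the file
    when more than half of the pages come back empty.
    """
    if not page_texts:
        return True
    empty = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return empty > len(page_texts) / 2


def extract_page_texts(path: str) -> list[str]:
    """Extract the existing text layer page by page (no OCR involved)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

If needs_ocr(extract_page_texts("book.pdf")) comes back False, the extracted text is probably good enough to work with directly, and OCR can stay as the fallback.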
Yeah my tool will skip OCR if the document already has selectable text (if the file claims to be PDF/A compliant). ocrmypdf has these options as well (and my tool passes the same flags to ocrmypdf, when provided):
Ah cool, I've never used this tool in isolation since I've always needed to do other types of documents in addition to PDFs, so I wasn't aware!
If you use Obsidian for DM prep, the best resource I’ve found is ttrpg-convert-cli. It pulls JSON data directly from 5etools/pf2etools and converts directly to Markdown, which then can be plugged into any Obsidian Vault.
With Obsidian Sync, it’s probably a decent way to read any popular 5e/Pf2e book, but I’ve only used it as rules references while DMing.
I think there’s a GUI too, but I never tried it.
Seconded - I use it myself. As a DM, having all the books I use in easily searchable markdown form is amazing for my session prep; Obsidian can double as a planning/writing tool for this purpose as well, and has some extra useful tools (Dice Roller for using random tables, Leaflet for interactive maps, etc)
Yes. Like Mage: the Ascension, Stars Without Number, Dungeons and Dragons 5th Edition, etc.
Yes presumably
I know your pain.
Closest I ever got was reading the PDFs in Acrobat on my phone and turning on "Flow" mode which tries to take the main body of text and make it one continuous body.
It was OK, but it occasionally lost track of where it was and you'd either read something out of order or miss a paragraph/page.
But yeah, I don't know of a good way I'm afraid. Many of these rule books have important information in images, tables and sidebars which don't fit the Kindle experience.
"Calibre" is what you're looking for, though you might have to play around with different settings and formats to get things to look right.
Extracting raw text from PDFs is not a solved problem, but since you can copy-paste text from it, this PDF at least won't require extensive OCR. When I worked on a component to extract text from PDFs at my last job, I used Apache Tika. You need to be at least a little technical to set it up, but there's a Docker image, so getting it running for a single PDF shouldn't be too tough if you have enough knowledge for that.
If money's not an issue, the OCR solution from ABBYY works pretty well in my experience (we used it at work for a project a few years ago).
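For the Tika route, here is a minimal sketch of what sending one PDF to a locally running Tika server could look like. I'm assuming the server was started from the Docker image with something like "docker run -p 9998:9998 apache/tika" and is listening on its default port; the helper names are made up for this example.

```python
import urllib.request

# Assumed default address of a Tika server started from the Docker image.
DEFAULT_SERVER = "http://localhost:9998"


def build_tika_request(pdf_bytes: bytes, server: str = DEFAULT_SERVER) -> urllib.request.Request:
    """Build a PUT request to Tika's /tika endpoint asking for plain text back."""
    return urllib.request.Request(
        url=server.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_text(pdf_path: str, server: str = DEFAULT_SERVER) -> str:
    """Send one PDF to the server and return the extracted text."""
    with open(pdf_path, "rb") as f:
        request = build_tika_request(f.read(), server)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

The output is raw text, so tables and sidebars will likely still need manual cleanup before the result reads well on a Kindle.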