11 votes

What is the optimal way to convert an RPG book to a text format?

Posted March 31 by lou (edited March 31)

An RPG book is a book containing the rules and setting for a tabletop RPG, such as Dungeons and Dragons 5th Edition, Worlds Without Number, or Star Trek Adventures.

The fact that they are rarely in text format always puts me off reading RPG books. I don't want to diminish the importance of art, but importing printed RPG books is prohibitively expensive, and reading huge PDFs on a laptop is not a good experience for me.

I also find it unpleasant to navigate the complicated design of these books. They're distracting.

I have a 6.8" Kindle Paperwhite, but reading RPG PDFs on it is awful. RPG books have lots of art and complicated layouts. Unfortunately, there doesn't seem to be an easy way to turn an RPG book into text. I was seriously considering just copying the text and converting it to markdown myself (it doesn't need to be markdown, just something that I can convert into a format my Kindle understands) when I remembered ChatGPT.

Copying the text and asking GPT to make it into markdown worked okay, but it missed the tables. Sending an image of a page worked pretty well, so I think AI is the way here. But I am not a GPT subscriber and I bet I'll hit a limit at some point. Also, instead of sending pages individually, I would prefer to send the PDF and get the result in text. Even if there were limitations (like only 10 pages in one go), it would be an improvement.

In any case, using ChatGPT will be much better than doing it by hand. But is there an AI or other kind of PDF service that is better suited to the task, so I can reduce the amount of manual input?

15 comments

[2]
SloMoMonday
March 31 (edited March 31)
The best format is likely a wiki.
Preferably one that is very granular, well indexed and configured to automatically hyperlink/tooltip other mentions of specialized concepts.

The issue is that there tend to be a lot of wikis floating around for each game, and it's very difficult to identify the one that is most current and accurate. Especially with how so many newer games seem to be updated on the fly, and the constant push for new content that players can't keep up with.

Another option is the fan-made apps. I used to use the pf and dnd5e companion guide apps, where the community would format every source book and hide them as addons on the forum. But this was years ago and I don't know if they've been taken down for copyright.

And it was incredibly easy to search up concepts, bookmark and tag pages, plan encounters, incorporate homebrew stuff and generate things on the fly. And because they're community driven, it's mostly features people need and not microtransactions all the way through.

Here are the Android apps I used for 5e

https://play.google.com/store/apps/details?id=com.blastervla.ddencountergeneratorddencountergenerator

Never played the 2e version but this looks like the original app.
https://play.google.com/store/apps/details?id=com.misthero.pf2earchive

7 votes
1. DefinitelyNotAFae
  March 31
  That last one is Pathfinder 2e.
  
  I've used the D&D Beyond app and website as well, but it doesn't really substitute for being able to read the book especially for people who just enjoy that.
  
  2 votes
synergy-unsterile
March 31
Don't have any suggestions, but your experience using ChatGPT reminded me of this Ars Technica article: Why extracting data from PDFs is still a nightmare for data experts

4 votes
[4]
xk3
March 31
LLMs have some advantages for OCR but I still prefer using traditional tools like Tesseract.

I've found ocrmypdf to be an easy-to-use wrapper around Tesseract and calibre's ebook-convert to be a reliable tool for converting between formats.

I put both tools together in this tool:
```
pip install library
library process-text --no-delete-larger my_book.pdf
```
This will OCR if necessary and convert to open-ebook format (which is essentially an unzipped ePub).
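
For anyone who would rather run the two underlying tools by hand, here is a rough sketch. It assumes `ocrmypdf` and Calibre's `ebook-convert` are installed and on the PATH. The `pdf_to_epub` function, the `run` helper, the `DRY_RUN` switch and the `_ocr.pdf` intermediate file name are made up for illustration, and this is not necessarily what `library process-text` does internally.

```shell
#!/bin/sh
# Sketch: OCR a PDF only where needed, then convert it to EPUB.
# Assumes ocrmypdf and Calibre's ebook-convert are on the PATH.
# pdf_to_epub, run and DRY_RUN are illustrative names, not part of either tool.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

pdf_to_epub() {
    src=$1
    base=${src%.pdf}
    # --skip-text leaves pages that already contain text untouched.
    run ocrmypdf --skip-text "$src" "${base}_ocr.pdf" || return 1
    # ebook-convert picks the output format from the file extension.
    run ebook-convert "${base}_ocr.pdf" "${base}.epub"
}
```

Call it as `pdf_to_epub my_book.pdf`, or as `DRY_RUN=1 pdf_to_epub my_book.pdf` to preview the commands without running them. EPUB is only one possible target; the Kindle may need a different output extension.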
4 votes
1. [3]
  sparksbet
  April 9
  While I agree that traditional tools are better than LLMs for this use case, OCR is not necessary if it's the type of PDF where you can copy and paste the text. In my experience, OCR takes way longer (and often produces worse results) than other methods of text extraction, so it's better used as a backup for when other methods aren't possible.
  1. [2]
    xk3
    April 9 (edited April 9)
    Yeah my tool will skip OCR if the document already has selectable text (if the file claims to be PDF/A compliant). ocrmypdf has these options as well (and my tool passes the same flags to ocrmypdf, when provided):
    
    OCR options: Control how OCR is applied
      -f, --force-ocr  Rasterize any text or vector objects on each page, apply OCR, and save the rastered output (this rewrites the PDF)
      -s, --skip-text  Skip OCR on any pages that already contain text, but include the page in final output; useful for PDFs that contain a mix of images, text pages, and/or previously OCRed pages
      --redo-ocr       Attempt to detect and remove the hidden OCR layer from files that were previously OCRed with OCRmyPDF or another program. Apply OCR to text found in raster images. Existing visible text objects will not be changed. If there is no existing OCR, OCR will be added.
    
    2 votes
    
    sparksbet
    April 9
    Ah cool, I've never used this tool in isolation since I've always needed to do other types of documents in addition to PDFs, so I wasn't aware!
[2]
shrub
April 8
If you use Obsidian for DM prep, the best resource I’ve found is ttrpg-convert-cli. It pulls JSON data directly from 5etools/pf2etools and converts directly to Markdown, which then can be plugged into any Obsidian Vault.

With Obsidian Sync, it’s probably a decent way to read any popular 5e/Pf2e book, but I’ve only used it as rules references while DMing.

I think there’s a GUI too, but I never tried it.

3 votes
1. hamefang
  April 10
  Seconded - I use it myself. As a DM, having all the books I use in easily searchable markdown form is amazing for my session prep; Obsidian can double as a planning/writing tool for this purpose as well, and has some extra useful plugins (Dice Roller for using random tables, Leaflet for interactive maps, etc.).
[3]

Comment deleted by author
1. lou (OP)
  March 31 (edited March 31)
  Yes. Like Mage: the Ascension, Stars Without Number, Dungeons and Dragons 5th Edition, etc.
  
  2 votes
2. audiodude
  March 31
  Yes presumably
  
  1 vote
kaffo
March 31
I know your pain.
Closest I ever got was reading the PDFs in Acrobat on my phone and turning on "Flow" mode which tries to take the main body of text and make it one continuous body.
It was OK, but it occasionally lost track of where it was and you'd either read something out of order or miss a paragraph/page.

But yeah, I don't know of a good way, I'm afraid. Many of these rule books have important information in images, tables and sidebars which don't fit the Kindle experience.

1 vote
thearctic
April 3
"Calibre" is what you're looking for, though you might have to play around with different settings and formats to get things to look right.
sparksbet
April 9
Extracting raw text from PDFs is not a solved problem, but since you can copy-paste text from it, this PDF at least won't require extensive OCR. When I worked on a component to extract text from PDFs at my last job, I used Apache Tika -- you need to be at least a little technical to get it set up, but there's a Docker image, so running a single PDF through it shouldn't be too tough if you're comfortable with Docker.
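
A rough sketch of the Docker route, assuming Docker and curl are installed, the `apache/tika` image from Docker Hub, and Tika Server's default port 9998. The `tika_text` function, the `TIKA_URL` variable and the `DRY_RUN` switch are made up for illustration.

```shell
# Start the Tika server once (it keeps running in the background):
#   docker run -d -p 9998:9998 apache/tika

# Base URL of the running server; override it if the port is mapped elsewhere.
TIKA_URL=${TIKA_URL:-http://localhost:9998}

tika_text() {
    # PUT the file to the /tika endpoint and ask for plain text back.
    set -- curl -s -T "$1" -H "Accept: text/plain" "$TIKA_URL/tika"
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Usage would be something like `tika_text my_book.pdf > my_book.txt`. Expect tables and multi-column layouts to still need manual cleanup afterwards.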
hamefang
April 10
If money's not an issue, the OCR solution from ABBYY works pretty well in my experience (we used it at work for a project a few years ago).