16 votes

American Stories: A large-scale structured text dataset of historical US newspapers

3 comments

  1. skybrian

    Here's the abstract:

    Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver quality dataset for innovating multimodal layout analysis models and other multimodal applications.

    Here's the GitHub repo, which links to the dataset on Hugging Face. It seems to be a large number of tar.gz files.

    I don't see a way to browse it. Having a nicer way to search and browse old newspaper articles might be pretty cool?

    2 votes
  2. [2]
    boxer_dogs_dance

    I was surprised that, in their emphasis on training technology like large language models, they did not propose to also provide access to libraries so that the articles could be accessed by library users. I have done news article research on microfilm and it is a huge pain. The abstract pointed out problems with the OCR and other access features of the stored articles and claimed that their dataset was better quality. Why not provide access to people and institutions directly rather than only through AI?

    1 vote
    1. skybrian

      They're not actually providing access through AI either. The main goal of many researchers is to publish papers, often with associated data. They typically don't maintain anything, neither production code nor production websites. Once they've published something, it's done and they move on to their next research project. (Although, maybe they'll work on a followup paper using the same data.)

      But if anyone wants to do an interesting open source project, building a website for it might be fun? The first step would be looking at the data dump and seeing how much of a pain it would be. No AI needed, probably.
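      That first look could be as simple as peeking inside one of the tar.gz dumps with Python's standard library. A minimal sketch below: the archive layout, file name, and JSON fields (`headline`, `article_text`) are made up for illustration, since the actual structure of the Hugging Face files isn't described here; it builds a tiny stand-in archive so the snippet runs end to end.

      ```python
      import io
      import json
      import tarfile

      def peek_tarball(path, limit=5):
          """List the first few members of a .tar.gz and parse any JSON files."""
          records = []
          with tarfile.open(path, "r:gz") as tar:
              for member in tar.getmembers()[:limit]:
                  if member.isfile() and member.name.endswith(".json"):
                      # json.load accepts the binary file object extractfile returns
                      data = json.load(tar.extractfile(member))
                      records.append((member.name, data))
          return records

      # Build a tiny stand-in archive (hypothetical layout, not the real dataset's).
      article = {"headline": "Example", "article_text": "Example body text."}
      payload = json.dumps(article).encode()
      buf = io.BytesIO()
      with tarfile.open(fileobj=buf, mode="w:gz") as tar:
          info = tarfile.TarInfo(name="1885/example_scan.json")
          info.size = len(payload)
          tar.addfile(info, io.BytesIO(payload))
      with open("sample.tar.gz", "wb") as f:
          f.write(buf.getvalue())

      for name, data in peek_tarball("sample.tar.gz"):
          print(name, "->", sorted(data))  # member name and its JSON keys
      ```

      Swapping in one of the real downloaded archives for `sample.tar.gz` would show quickly whether the dump is one JSON record per scan or something messier.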

      Since it's research-quality code, though, you might also end up fixing their pipeline and running it again, and that would involve understanding and improving the AI stuff they used for the pipeline.

      A more direct way of answering your question would be to write to them and ask if their data would be suitable for making a website.

      2 votes