Request for help: Backing up NASA public databases
TL;DR: NASA's public Planetary Data System is at risk of being shut down. Anyone have any ideas for backing it up?
Hi everyone,
Bit of a long shot here, but I wanted to try high-quality Tildes before jumping back into the cesspool of reddit. I'm posting it in ~science rather than ~space as I figure interest in backing up public data is broader than just the space community.
I work regularly with NASA's Planetary Data System, or PDS. It's a massive (~3.5 petabytes!!) archive of off-world scientific data (largely, but not all, imaging data). PDS is integral to scientific research - public and private - around the world, and is maintained, for free, by NASA (with the support of a number of academic institutions).
The current state of affairs for NASA is grim:
- NASA Lays Off ISS Workers at Marshall Space Flight Center
- More layoffs at JPL
- NASA is sinking its flagship science center during the government shutdown — and may be breaking the law in the process, critics say
And as a result, I (and many of my industry friends) have become increasingly concerned that PDS will be taken down as NASA is increasingly torn down for spare parts and irreparably damaged. This administration seems bent on destroying all forms of record-keeping and public science, so who knows how long PDS will be kept up. Once it's down, it'll be a nightmare to try and collect it all again from various sources. I suspect we'll permanently lose decades' worth of data - PDS includes information going all the way back to the Apollo missions!
As such, we've been pushing to back up as much of PDS as we can, but have absolutely no hope of downloading it all within the next year or two, never mind in a few months if the current cuts impact us soon.
If you or someone you know would be interested in helping figure out how we can back up PDS before it's too late, please let me know here or in a DM. I've already tried reaching out to the Internet Archive, but did not hear anything back from them.
Edit: to clarify, the larger problem is download speeds - we've topped out at 20MB/s with 8 connections.
You should ask https://www.reddit.com/r/DataHoarder/; this is right up their alley. There are a few people there with multi-petabyte setups. Ballpark, I would say that’s about $20k in LTO-9 tape for 3.5PB if it isn’t compressible.
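For anyone who wants to check the tape ballpark, here is a rough sketch of the arithmetic. The 18 TB figure is the LTO-9 native (uncompressed) capacity per cartridge; the $100 per-tape price is purely an assumption chosen to illustrate the estimate, not a quote, and the cost of a tape drive is not included.

```python
import math

ARCHIVE_TB = 3500          # ~3.5 PB from the original post, in decimal terabytes
LTO9_NATIVE_TB = 18        # LTO-9 native (uncompressed) capacity per cartridge
PRICE_PER_TAPE_USD = 100   # assumption: rough per-cartridge price, not a quote


def tape_estimate(archive_tb, tape_tb=LTO9_NATIVE_TB, price=PRICE_PER_TAPE_USD):
    """Return (cartridge count, media cost in USD) for one uncompressed copy."""
    tapes = math.ceil(archive_tb / tape_tb)
    return tapes, tapes * price


tapes, cost = tape_estimate(ARCHIVE_TB)
print(f"{tapes} tapes, about ${cost:,} in media for one copy")
```

Under those assumptions a single copy lands just under the $20k figure; a second copy for redundancy would double the media cost.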
For temporal datasets there's likely a lot of redundant data across time: seasonal/solar/other cycles...
ArchiveTeam does not specifically list PDS as at-risk. But I am interested in distributed file systems--I think it would be possible to build a system that creates 7,000 torrents (each 500GB) of the PDS data (using the original URLs as web seeds) and index it like Anna's Archive so that people can see what is well-seeded or not. edit: ahh but this would require reading all the data at least once to build the hashes for the torrents
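To illustrate why every byte has to be read at least once: a BitTorrent v1 torrent stores one SHA-1 digest per fixed-size piece of the payload. Below is a minimal sketch of that hashing step only, assuming a single local file opened in binary mode. The 16 MiB piece length is an arbitrary choice for illustration, and a real torrent builder would also have to handle multi-file layouts, the metainfo encoding, and the web seed URLs.

```python
import hashlib
import io

PIECE_LENGTH = 16 * 1024 * 1024  # assumption: 16 MiB pieces, chosen for illustration


def piece_hashes(stream, piece_length=PIECE_LENGTH):
    """Read a binary stream once and return the concatenated SHA-1 digest of each piece."""
    digests = []
    while True:
        piece = stream.read(piece_length)
        if not piece:
            break  # end of stream
        digests.append(hashlib.sha1(piece).digest())
    return b"".join(digests)


# Tiny in-memory demo: 10 bytes split into 4-byte pieces gives 3 pieces.
demo = io.BytesIO(b"a" * 10)
hashes = piece_hashes(demo, piece_length=4)
print(len(hashes) // 20)  # each SHA-1 digest is 20 bytes
```

The hashing could be done while the data is being downloaded for the first time, so it does not have to mean a second full pass over the archive.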
Double replying to you but I hadn't seen the Archive Team website before - thanks! Do you know how they assess risk? Is it worth my reaching out to them to discuss? If so, do you have a contact I can use that won't just go to spam hell?
It looks like IRC is the best way to reach out:
https://wiki.archiveteam.org/index.php/Archiveteam:IRC
I would try to download the content that is specifically relevant to your research (whether work or a personal project) and--if it's too big--try to download a small portion of it and figure out how to compress it more. Hopefully you can find a way to fit the lossless data locally--if not, it may be better to have a lossy copy than no copy at all!
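One way to try the "download a small portion and see how well it compresses" idea is to run a sample through a few lossless codecs and compare the ratios. This is only a sketch using Python's standard library; the synthetic sample below is a stand-in, and real PDS products may compress very differently.

```python
import bz2
import lzma
import zlib


def compression_ratios(sample: bytes) -> dict:
    """Return original_size / compressed_size for a few stdlib lossless codecs."""
    codecs = {
        "zlib": lambda b: zlib.compress(b, 9),
        "bz2": lambda b: bz2.compress(b, 9),
        "lzma": lambda b: lzma.compress(b, preset=9),
    }
    return {name: len(sample) / len(fn(sample)) for name, fn in codecs.items()}


# Stand-in data; replace with the bytes of a real downloaded sample file.
sample = bytes(range(256)) * 4096
for name, ratio in compression_ratios(sample).items():
    print(f"{name}: {ratio:.1f}x")
```

A ratio close to 1 on real samples would suggest the data is already compressed, in which case the storage estimate should assume the full size.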
For lossy compression, ImageMagick converting to AV1 Image (.avif) works very well on image documents. You could also try ghostscript for large, scanned PDF documents:
gs -q -dNOPAUSE -dBATCH -dSAFER -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dDownsampleColorImages=true -dOverrideICC -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dColorImageDownsampleType=/Bicubic -dColorImageResolution=120 -dGrayImageDownsampleType=/Bicubic -dGrayImageResolution=120 -dMonoImageDownsampleType=/Bicubic -dMonoImageResolution=120 -sOutputFile=out.pdf input.pdf
Thanks! Very helpful reply. Without giving too much away on a public website, we are indeed closing in on backing up everything we need ourselves, but it's a small fragment of PDS as a whole and on a personal, community, and scientific level I would be devastated if we lost the rest of the archive to malicious stupidity.
Is this not something that The Internet Archive would be able to assist with? Have you tried reaching out to them to see what they say?
Is the main problem here storage space, or bandwidth limitations? I don't have a detailed specific solution either way; I'm just saying that any solution will likely be pretty different depending on the answer. I'm mainly asking because I see the other replies approaching this from a storage perspective, but I interpret your question as if it's mainly a bandwidth problem (of course you need to solve both).
You're right that the answer is truly both, but ultimately it's more of a bandwidth problem. IIRC the best we've managed was 20MB/s using 8 concurrent connections.
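To put the bandwidth problem in numbers, here is a quick back-of-the-envelope sketch using the figures quoted in this thread (~3.5 PB at 20 MB/s). It assumes decimal units and a sustained rate with no retries or downtime, so real-world times would be longer.

```python
ARCHIVE_BYTES = 3.5e15    # ~3.5 PB, decimal units
RATE_BYTES_PER_S = 20e6   # 20 MB/s, the best rate reported in the thread
SECONDS_PER_DAY = 86_400


def transfer_days(total_bytes, bytes_per_second):
    """Days needed to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


def required_rate_mb_s(total_bytes, days):
    """Sustained rate in MB/s needed to finish within the given number of days."""
    return total_bytes / (days * SECONDS_PER_DAY) / 1e6


days = transfer_days(ARCHIVE_BYTES, RATE_BYTES_PER_S)
print(f"{days:,.0f} days (~{days / 365:.1f} years) at 20 MB/s")
print(f"{required_rate_mb_s(ARCHIVE_BYTES, 180):.0f} MB/s needed to finish in 180 days")
```

That works out to roughly five and a half years from a single site at the current rate, which is why splitting the download across many independent sites matters more than adding storage.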
If universities, data hoarders, and other interested parties (ESA?) are serious about organizing and bandwidth is going to be the limiting factor, I think you should begin discussing fundraising and getting in touch with the maintainers at NASA.
It may be possible to fund/volunteer someone with the necessary clearances to do bulk replication on-site. Presumably there's a contact somewhere at NASA who could speak to that, and probably knows a lot about the best way to store the data for either archival or a useful production mirror.
Not a small undertaking.
Agreed, but as of now my NASA contacts have run dry. It seems the culture of fear is so strong atm that no one is willing to contemplate off the wall ideas to preserve data. If anyone knows the right people, I'm all ears.
You may be interested in Azure Data Box. You can download to the cloud with functionally unlimited bandwidth and then they will provide you with a physical export consisting of 525TB portable drive arrays that you can receive at your facility and transfer to your on-premises storage solution.