28 votes

Request for help: Backing up NASA public databases

TL;DR: NASA's public Planetary Data System is at risk of being shut down. Anyone have any ideas for backing it up?

Hi everyone,

Bit of a long shot here, but I wanted to try the high-quality Tildes community before jumping back into the cesspool of Reddit. I'm posting this in ~science rather than ~space as I figure interest in backing up public data is broader than just the space community.

I work regularly with NASA's Planetary Data System, or PDS. It's a massive (~3.5 petabytes!!) archive of off-world scientific data (largely, but not entirely, imaging data). PDS is integral to scientific research - public and private - around the world, and is maintained, for free, by NASA (with the support of a number of academic institutions).

The current state of affairs for NASA is grim:

And as a result, I (and many of my industry friends) have become increasingly concerned that PDS will be taken down as NASA is torn down for spare parts and irreparably damaged. This administration seems bent on destroying all forms of record-keeping and public science, so who knows how long PDS will be kept up. Once it's down, it'll be a nightmare to try to collect it all again from various sources. I suspect we'll permanently lose decades' worth of data - PDS includes information going all the way back to the Apollo missions!

As such, we've been pushing to back up as much of PDS as we can, but have absolutely no hope of downloading it all within the next year or two, never mind in a few months if the current cuts impact us soon.

If you or someone you know would be interested in helping figure out how we can back up PDS before it's too late, please let me know here or in a DM. I've already tried reaching out to the Internet Archive, but have not heard anything back from them.

Edit: to clarify, the larger problem is download speed - we've topped out at 20 MB/s with 8 connections.
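
For a rough sense of scale, here is a back-of-the-envelope sketch of how long a full copy would take at that rate. It assumes the ~3.5 PB figure above, decimal units (1 PB = 10^15 bytes), that the speed is 20 megabytes (not megabits) per second, and that the rate is sustained with no interruptions. The function name and the six-month target are only for illustration.

```python
import math

PETABYTE = 10**15  # decimal units (assumption)
MEGABYTE = 10**6


def transfer_days(total_bytes: float, bytes_per_second: float) -> float:
    """Days needed to move total_bytes at a constant rate."""
    seconds = total_bytes / bytes_per_second
    return seconds / 86_400


archive_bytes = 3.5 * PETABYTE  # approximate archive size from the post
rate = 20 * MEGABYTE            # best observed rate, read as megabytes/s

days = transfer_days(archive_bytes, rate)
print(f"{days:.0f} days (~{days / 365:.1f} years) at 20 MB/s")
# prints: 2025 days (~5.5 years) at 20 MB/s

# Independent 20 MB/s pipelines needed to finish in roughly six months (182 days).
parallel = math.ceil(days / 182)
print(parallel)  # prints: 12
```

In other words, under these assumptions a single downloader would need over five years, and something like a dozen independent downloaders at the same speed would be needed to finish in about six months.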

12 comments

  1. [4]
    unkz

    You should ask https://www.reddit.com/r/DataHoarder/; this is right up their alley. There are a few people there with multi-petabyte setups. Ballpark, I would say that’s about $20k in LTO-9 tape for 3.5PB if it isn’t compressible.
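
    A quick sanity check of that ballpark, as a sketch only: it assumes decimal terabytes, the 18 TB native (uncompressed) capacity of an LTO-9 cartridge, and a hypothetical price of about $100 per cartridge. It ignores the tape drive itself and any redundant copies.

```python
import math

ARCHIVE_TB = 3500         # ~3.5 PB from the post, in decimal terabytes
LTO9_NATIVE_TB = 18       # native (uncompressed) capacity of one LTO-9 cartridge
PRICE_PER_TAPE_USD = 100  # assumed price per cartridge, not a quote

tapes = math.ceil(ARCHIVE_TB / LTO9_NATIVE_TB)
cost = tapes * PRICE_PER_TAPE_USD
print(tapes, cost)  # prints: 195 19500
```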

    13 votes
    1. [3]
      xk3
      http://www.textfiles.com/programming/FORMATS/pds_form.txt http://justsolve.archiveteam.org/wiki/PDS

      Most images from planetary missions are gray scale 8-bit images and take on a
      narrow range of values in any given file. A simple compression algorithm can
      achieve 3 or 4 to 1 compression ratio.

      For temporal datasets there's likely a lot of redundant data across time: seasonal/solar/other cycles...

      ArchiveTeam does not specifically list PDS as at-risk. But I am interested in distributed file systems--I think it would be possible to build a system that creates 7,000 torrents (each 500GB) of the PDS data (using the original URLs as web seeds) and index it like Anna's Archive so that people can see what is well-seeded or not. edit: ahh but this would require reading all the data at least once to build the hashes for the torrents
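
      To illustrate why the data has to be read at least once: BitTorrent v1 metadata stores a SHA-1 hash for every fixed-size piece, so each byte must pass through a hasher. A minimal sketch, assuming the data arrives as a stream of byte chunks (as it would during a download); the function name and the toy input are hypothetical:

```python
import hashlib
from typing import Iterable, Iterator


def piece_hashes(chunks: Iterable[bytes], piece_length: int) -> Iterator[bytes]:
    """Yield the SHA-1 digest of each fixed-size piece of a byte stream."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit a digest every time a full piece has accumulated.
        while len(buffer) >= piece_length:
            yield hashlib.sha1(bytes(buffer[:piece_length])).digest()
            del buffer[:piece_length]
    if buffer:  # final piece may be shorter than piece_length
        yield hashlib.sha1(bytes(buffer)).digest()


# Toy stream standing in for a download: 10 bytes arriving in uneven chunks.
stream = [b"abc", b"defgh", b"ij"]
hashes = list(piece_hashes(stream, piece_length=4))
print(len(hashes))  # prints: 3 (pieces "abcd", "efgh", "ij")
```

      Since the hashing works on a stream, it could in principle run while the data is being downloaded, so the one required read would not have to be a separate pass.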

      9 votes
      1. [2]
        Aerrol

        Double replying to you but I hadn't seen the Archive Team website before - thanks! Do you know how they assess risk? Is it worth my reaching out to them to discuss? If so, do you have a contact I can use that won't just go to spam hell?

        5 votes
  2. [2]
    xk3

    I (and many of my industry friends) have become increasingly concerned that PDS will be taken down

    I would try to download the content that is specifically relevant to your research (whether work or a personal project) and--if it's too big--try to download a small portion of it and figure out how to compress it more. Hopefully you can find a way to fit the lossless data locally--if not, it may be better to have a lossy copy than no copy at all!

    For lossy compression, ImageMagick converting to AV1 Image (.avif) works very well on image documents. You could also try Ghostscript for large, scanned PDF documents:

    gs -q -dNOPAUSE -dBATCH -dSAFER -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray -dDownsampleColorImages=true -dOverrideICC -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dColorImageDownsampleType=/Bicubic -dColorImageResolution=120 -dGrayImageDownsampleType=/Bicubic -dGrayImageResolution=120 -dMonoImageDownsampleType=/Bicubic -dMonoImageResolution=120 -sOutputFile=out.pdf input.pdf

    4 votes
    1. Aerrol

      Thanks! Very helpful reply. Without giving too much away on a public website, we are indeed closing in on backing up everything we need ourselves, but it's a small fragment of PDS as a whole and on a personal, community, and scientific level I would be devastated if we lost the rest of the archive to malicious stupidity.

      6 votes
  3. plutonic

    Is this not something that The Internet Archive would be able to assist with? Have you tried reaching out to them to see what they say?

    3 votes
  4. [5]
    Bwerf

    As such, we've been pushing to back-up as much of PDS as we can, but have absolutely no hope of downloading it all within the next year or two, nevermind in a few months if the current cuts impact us soon.

    Is the main problem here storage space or bandwidth limitations? I don't have a detailed, specific solution either way; I'm just saying that the solution will likely be pretty different depending on the answer. I'm mainly asking because I see the other replies approaching this from a storage perspective, but I interpret your question as if it's mainly a bandwidth problem (of course you need to solve both).

    2 votes
    1. [4]
      Aerrol

      You're right that the answer is truly both, but ultimately it's more of a bandwidth problem. IIRC the best we've managed was 20MB/s using 8 concurrent connections.

      5 votes
      1. [2]
        FlippantGod

        If universities, data hoarders, and other interested parties (ESA?) are serious about organizing and bandwidth is going to be the limiting factor, I think you should begin discussing fundraising and getting in touch with the maintainers at NASA.

        It may be possible to fund/volunteer someone with the necessary clearances to do bulk replication on-site. Presumably there's a contact somewhere at NASA who could speak to that, and probably knows a lot about the best way to store the data for either archival or a useful production mirror.

        Not a small undertaking.

        10 votes
        1. Aerrol

          Agreed, but as of now my NASA contacts have run dry. It seems the culture of fear is so strong atm that no one is willing to contemplate off-the-wall ideas to preserve data. If anyone knows the right people, I'm all ears.

          3 votes
      2. unkz

        You may be interested in Azure Data Box. You can download to the cloud with functionally unlimited bandwidth and then they will provide you with a physical export consisting of 525TB portable drive arrays that you can receive at your facility and transfer to your on-premises storage solution.