60 votes

Request for help: Backing up NASA public databases

TL;DR: NASA's public Planetary Data System is at risk of being shut down. Anyone have any ideas for backing it up?

Hi everyone,

Bit of a long shot here, but I wanted to try high-quality Tildes before jumping back into the cesspool of reddit. I'm posting in ~science rather than ~space as I figure interest in backing up public data is broader than just the space community.

I work regularly with NASA's Planetary Data System, or PDS. It's a massive (~3.5 petabytes!) archive of off-world scientific data (largely, but not exclusively, imaging data). PDS is integral to scientific research - public and private - around the world, and is maintained, for free, by NASA (with the support of a number of academic institutions).

The current state of affairs for NASA is grim.

And as a result, I (and many of my industry friends) have become increasingly concerned that PDS will be taken down as NASA is torn down for spare parts and irreparably damaged. This administration seems bent on destroying all forms of record-keeping and public science, so who knows how long PDS will be kept up. Once it's down, it'll be a nightmare to try and collect it all again from various sources. I suspect we'll permanently lose decades' worth of data - PDS includes information going all the way back to the Apollo missions!

As such, we've been pushing to back up as much of PDS as we can, but have absolutely no hope of downloading it all within the next year or two, never mind in a few months if the current cuts impact us soon.

If you or someone you know would be interested in helping figure out how we can back up PDS before it's too late, please let me know here or in a DM. I've already tried reaching out to the Internet Archive, but did not hear anything back from them.

Edit: to clarify, the larger problem is download speed - we've topped out at 20 MB/s with 8 connections.

23 comments

  1. [4]
    unkz
    (edited )
    Link

    You should ask https://www.reddit.com/r/DataHoarder/ - this is right up their alley. There are a few people there with multi-petabyte setups. Ballpark, I would say that’s about $20k in LTO-9 tape for 3.5 PB if it isn’t compressible.
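
    For a rough sense of that ballpark (assuming ~18 TB native capacity and on the order of $100 per LTO-9 cartridge - both approximate, and drives/library hardware not included):

      # cartridges needed for 3.5 PB at LTO-9 native capacity (~18 TB each)
      echo "3500 / 18" | bc -l    # ≈ 194 tapes; at ~$100 per tape, media alone lands near $20k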

    28 votes
    1. [3]
      xk3
      (edited )
      Link Parent
      http://www.textfiles.com/programming/FORMATS/pds_form.txt
      http://justsolve.archiveteam.org/wiki/PDS

      Most images from planetary missions are gray scale 8-bit images and take on a
      narrow range of values in any given file. A simple compression algorithm can
      achieve 3 or 4 to 1 compression ratio.

      For temporal datasets there's likely a lot of redundant data across time: seasonal/solar/other cycles...
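
      If that redundancy holds, long-window lossless compression might already capture a fair bit of it. A minimal sketch with zstd (the level and window size are guesses to tune against a sample, and sample_dataset/ is a placeholder path):

        # pack a sample directory and let zstd's long-distance matching find cross-file redundancy
        # (decompress later with: zstd -d --long=27)
        tar -cf - sample_dataset/ | zstd -T0 -19 --long=27 > sample_dataset.tar.zst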

      ArchiveTeam does not specifically list PDS as at-risk. But I am interested in distributed file systems--I think it would be possible to build a system that creates 7,000 torrents (each 500GB) of the PDS data (using the original URLs as web seeds) and index it like Anna's Archive so that people can see what is well-seeded or not. edit: ahh but this would require reading all the data at least once to build the hashes for the torrents
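
      A minimal sketch of what creating one such torrent could look like, assuming mktorrent is available; the tracker, web-seed URL, and paths are placeholders, and -l 24 (16 MiB pieces) is just a reasonable guess for ~500GB chunks:

        # one ~500GB chunk of the archive, with its original PDS URL attached as an HTTP web seed
        # -a: tracker (placeholder), -w: web seed (placeholder), -l 24: 16 MiB pieces, -o: output
        mktorrent \
          -a udp://tracker.example.org:6969/announce \
          -w https://pds.example.org/data/chunk-0001/ \
          -l 24 \
          -o pds-chunk-0001.torrent \
          ./pds-chunk-0001/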

      16 votes
      1. [2]
        Aerrol
        Link Parent

        Double replying to you but I hadn't seen the Archive Team website before - thanks! Do you know how they assess risk? Is it worth my reaching out to them to discuss? If so, do you have a contact I can use that won't just go to spam hell?

        8 votes
  2. [2]
    xk3
    (edited )
    Link

    I (and many of my industry friends) have become increasingly concerned that PDS will be taken down

    I would try to download the content that is specifically relevant to your research (whether work or a personal project) and--if it's too big--try to download a small portion of it and figure out how to compress it more. Hopefully you can find a way to fit the lossless data locally--if not, a lossy copy may be better than no copy at all!

    For lossy compression, ImageMagick converting to AV1 Image (.avif) works very well on image documents. You could also try ghostscript for large, scanned PDF documents:

      gs -q -dNOPAUSE -dBATCH -dSAFER \
        -sProcessColorModel=DeviceGray -sColorConversionStrategy=Gray \
        -dDownsampleColorImages=true -dOverrideICC \
        -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
        -dColorImageDownsampleType=/Bicubic -dColorImageResolution=120 \
        -dGrayImageDownsampleType=/Bicubic -dGrayImageResolution=120 \
        -dMonoImageDownsampleType=/Bicubic -dMonoImageResolution=120 \
        -sOutputFile=out.pdf input.pdf
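
    For the ImageMagick route, a one-liner to start from (assumes ImageMagick 7 built with AVIF support, and that the source has already been converted to a common format like PNG; the quality value is just a starting point to tune against acceptable loss):

      # lossy re-encode of a grayscale planetary image to AVIF; tune -quality per dataset
      magick input.png -quality 45 output.avif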

    8 votes
    1. Aerrol
      Link Parent
      Thanks! Very helpful reply. Without giving too much away on a public website, we are indeed closing in on backing up everything we need ourselves, but it's a small fragment of PDS as a whole and...

      Thanks! Very helpful reply. Without giving too much away on a public website, we are indeed closing in on backing up everything we need ourselves, but it's a small fragment of PDS as a whole, and on a personal, community, and scientific level I would be devastated if we lost the rest of the archive to malicious stupidity.

      13 votes
  3. [5]
    plutonic
    Link

    Is this not something that The Internet Archive would be able to assist with? Have you tried reaching out to them to see what they say?

    8 votes
    1. [4]
      Aerrol
      Link Parent

      Lol as the OP says I did try emailing them and got 0 response. If anyone knows how to get in touch properly, I'm all ears!

      7 votes
      1. [3]
        kacey
        Link Parent

        Have you tried ArchiveTeam? There's an IRC channel listed on their site. They aren't from the Internet Archive, but they have a bunch of tools (and volunteers) available to help mirror parts of the Internet that are being shut down. E.g. they mirrored Flickr a while back, I think, and apparently US government sites as well.

        (edit) for clarity's sake, they wrote a piece of software which volunteers use to help download dying sites in parallel. It's explicitly set up for situations like yours, so they'll hopefully be receptive and helpful!

        (edit 2) Looks like someone beat me to this yesterday XD apologies.

        8 votes
        1. gil
          Link Parent

          This is probably our best option and anyone with some bandwidth and storage to spare can help by running a Warrior themselves.
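
          For anyone who wants to pitch in that way, the Warrior typically runs as a container; a sketch from memory (check the ArchiveTeam wiki for the current image name and instructions):

            # run the ArchiveTeam Warrior, then pick projects via the web UI on localhost:8001
            docker run -d --name archiveteam-warrior --restart=on-failure \
              -p 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile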

          4 votes
        2. Aerrol
          Link Parent

          I really need to reach out to them with two of you suggesting I do! I've been swamped this week but it's on my list now. Thank you for the suggestion, duplication or not!

          2 votes
  4. [4]
    rosco
    Link

    If I can make a secondary suggestion if you manage to figure this out: NAIP data would be another extremely worthwhile dataset to back up if there is any risk of it being taken down. Unlike LandSat imagery, NAIP is usually down to about a meter of resolution and is also critical for mapping terrestrial changes. Just a thought!

    Hope you guys figure it out!!!

    6 votes
    1. [3]
      Aerrol
      Link Parent

      Do you have any sense of who maintains this data? I have no experience and therefore can't speak to risk of takedown.

      1 vote
      1. [2]
        rosco
        Link Parent

        It's a partnership between USGS and USDA, but I think the USDA maintains it. It's the baseline for a lot of the Long Term Ecological Monitoring projects, so I just envision them seeing it as a threat, since it tracks impacts to biodiversity and climate change.

        2 votes
        1. Greg
          Link Parent

          Given that we've seen Federal datasets being silently altered on a political basis, I think it's justified to be concerned.

          2 votes
  5. [8]
    Bwerf
    Link

    As such, we've been pushing to back up as much of PDS as we can, but have absolutely no hope of downloading it all within the next year or two, never mind in a few months if the current cuts impact us soon.

    Is the main problem here storage space, or bandwidth limitations? I don't have a detailed solution either way; just saying that the two call for pretty different approaches. I'm mainly asking because I see the other replies approaching this from a storage perspective, but I interpret your question as mainly a bandwidth problem (of course you need to solve both).

    3 votes
    1. [7]
      Aerrol
      Link Parent

      You're right that the answer is truly both, but ultimately it's more of a bandwidth problem. IIRC the best we've managed was 20MB/s using 8 concurrent connections.

      7 votes
      1. [2]
        FlippantGod
        Link Parent

        If universities, data hoarders, and other interested parties (ESA?) are serious about organizing and bandwidth is going to be the limiting factor, I think you should begin discussing fundraising and getting in touch with the maintainers at NASA.

        It may be possible to fund/volunteer someone with the necessary clearances to do bulk replication on-site. Presumably there's a contact somewhere at NASA who could speak to that, and probably knows a lot about the best way to store the data for either archival or a useful production mirror.

        Not a small undertaking.

        16 votes
        1. Aerrol
          Link Parent

          Agreed, but as of now my NASA contacts have run dry. It seems the culture of fear is so strong atm that no one is willing to contemplate off-the-wall ideas to preserve data. If anyone knows the right people, I'm all ears.

          6 votes
      2. unkz
        (edited )
        Link Parent

        You may be interested in Azure Data Box. You can download to the cloud with functionally unlimited bandwidth and then they will provide you with a physical export consisting of 525TB portable drive arrays that you can receive at your facility and transfer to your on-premises storage solution.

        8 votes
      3. [3]
        Greg
        Link Parent

        I’m assuming the total outgoing bandwidth available has to be a fair amount higher than that, otherwise researchers wouldn’t be able to get enough data out of the system to realistically work on, so it’s probably reasonable to plan on the basis of anywhere from 10x that (full backup in about six months) to 100x that (full backup in a few weeks). Running 10 or 20 or 50 or even 500 lightweight cloud VMs on separate connections for a few weeks or months isn’t going to be that onerous in the context of what a few PB of storage costs.
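
        For reference, the arithmetic behind those estimates (3.5 PB total, rates treated as sustained averages):

          # time to move 3.5 PB at a given sustained rate, in days
          echo "3.5*10^15 / (20*10^6)  / 86400" | bc -l   # 20 MB/s  -> ~2000 days (~5.5 years)
          echo "3.5*10^15 / (200*10^6) / 86400" | bc -l   # 200 MB/s -> ~200 days (~6-7 months)
          echo "3.5*10^15 / (2*10^9)   / 86400" | bc -l   # 2 GB/s   -> ~20 days (a few weeks)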

        If NASA’s upstream bandwidth is actually a bottleneck, there’s going to be no way around getting them involved to give you physical access. Unless you’re OK with a six year timeline to get the data out, which I’m assuming you’re not, it just doesn’t work at 20MB/s. But the good news is I really doubt that’s close to a hard limit for any systems hosting a dataset of this size, because you’d have queues of PhD students waiting six months for their research samples to download if it were!

        In terms of actual storage, it’s a mid five figure to low six figure project, which isn’t horrible in the context of larger university budgets and is almost nothing in the context of large tech company budgets (although I’d understand not wanting to go all in on trusting them). If you can get someone to donate the space - Wikimedia, Internet Archive, Internet2, Google, AWS, Backblaze, Hugging Face, and/or any number of universities all seem like possible candidates there - I’d be more than happy to get involved on the software and data handling side. It’s not sounding wildly dissimilar to projects I’ve been part of in the past from the tech side, but I think pulling together the resources to make it happen will be the challenge on this one.

        8 votes
        1. [2]
          Aerrol
          Link Parent

          So my understanding is that there's a limit per connection, but plenty of aggregate bandwidth across multiple users connecting - or else, as you say, there would be a massive problem for the global user base.

          My worry about using VMs that way is overloading the system and/or getting rate limited for abusing the service, but maybe that's paranoid of me.

          1 vote
          1. Greg
            Link Parent

            I'd say it's totally reasonable for you to be considering those things, it's absolutely a positive that you want to be a polite user of the resources available, but I wouldn't let that lead to talking yourself out of doing an important piece of work altogether. NASA wants us to download this data, after all, that's why they spend time, money, and effort hosting it publicly in the first place!

            For what it's worth, framing this as the positive project that it genuinely is might help open doors a little if you do have the opportunity to talk to relevant people, too: something along the lines of "partnering with private organisations to share some of the cost and overhead that's currently being shouldered solely by the taxpayer" is a truthful, accurate, and politically viable way of explaining it to people who might otherwise be less receptive to discussing it.

            That said, asking permission of a large bureaucracy is a crapshoot at the best of times: nobody wants their name on the approval because that means they end up taking the blame if anyone gets pissed off about it, justifiably or otherwise. In a situation like this, where downloading the data is explicitly allowed and encouraged already, I'd just go ahead with it and keep that sales pitch in the back pocket just in case anyone questions it later.

            Going back to the tech, I'd be surprised if a sustained 2Gbps (~200MB/s) is enough for anyone to notice or care about on an installation of this size, especially when it's split across the eight "nodes" managed by different organisations, potentially in different datacenters. Realistically I'd assume each "node" is probably comprised of multiple servers, so even if the data is heavily weighted towards only one of them I doubt a couple of Gbps will raise an eyebrow.

            From a relatively quick glance at the software stack it looks as though at least some of the hosting is on AWS, and S3 has functionally unlimited bandwidth, so an alternative approach to minimising disruption if that turns out to be where the majority of data sits would be aiming for 100Gbps or more overall download rate and have the entire thing done over a long weekend. Perhaps more likely to hit a limiter that way, but also a lot more likely to get the project completed by doing so quickly. If it does get throttled, that's the point you configure the downloads to politely back off a bit, and in the worst case fall back to the plan of ~2Gbps sustained over six months - rate limiting isn't the end of the world, after all, it's just a request to slow down; you're only the asshole if you ignore that request! (And sometimes the severity of a situation justifies being a bit of an asshole for the greater good, but that's another topic and I'm not familiar enough to know if we're there yet in terms of risk)
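
            Sanity-checking the "long weekend" figure (3.5 PB as bits over the wire, ignoring protocol overhead):

              echo "3.5*10^15*8 / (100*10^9) / 86400" | bc -l   # 100 Gbps aggregate -> ~3.2 days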

            Assuming there's somewhere lined up to store and host all the data, the next steps in my mind would be figuring out how and where it's hosted - is it all S3? 90/10 split between S3 and in-house? 50/50? 10/90? That's going to heavily inform the next steps around planning downloads. Picking apart the open source tools to see what rate limits and assumptions are already baked in should give a decent starting point, and then verifying those assumptions with some smaller scale parallel tests follows pretty naturally.
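
            As a rough illustration of what a small-scale parallel test could look like (aria2c is just one option; urls.txt and the connection counts here are placeholders to adjust against whatever limits the PDS servers actually impose):

              # urls.txt: a sample of PDS file URLs, one per line
              # -j: concurrent downloads, -x/-s: connections/segments per file, -c: resume partials
              aria2c -i urls.txt -j 8 -x 4 -s 4 -c --dir=./pds-sample --summary-interval=60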

            After that it's a question of how and where to run the VMs; Amazon's egress charges are absolutely punitive, as are GCP and Azure's, so even if the data's already on AWS it might not be the best place to run the downloads unless they're also acting as a sponsor of the project (which they could well be willing to do, the actual cost to them is effectively zero even if they'd charge you $350k as a customer). Physical export from AWS could also be an option, depending where the data's going - you might need to copy it all into your own S3 bucket first for them to allow that, but doing so all within the same cloud should at least be fast. Cloudflare might be more willing to work with you on bandwidth and transfer, they're far more sensible about that side of things than the big three cloud providers in my experience.
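
            That egress figure is roughly consistent with list pricing (assuming ~$0.09/GB internet egress, which varies by tier and region):

              echo "3.5*10^6 * 0.09" | bc -l   # ≈ $315,000 to egress 3.5 PB at ~$0.09/GB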

            That all kind of ties back to the existing data location and intended download rate, though. If it's going to be single digit Gbps for months, you can do it as a background process across a few end user connections no problem. If you're going for speed, some kind of cloud-agnostic terraform setup will probably be smart to stay within limits across multiple providers.