Do data storage providers 'share' identical data among clients?
What I mean is, if two clients of a provider upload the same data, do the hosts identify the match and create only references to the identical data for the multiple clients, or do they simply have two copies of the same data on their server?
Caveat: I am merely geeky.
What prompted the thought was an article (maybe on /.) saying that Russia might run out of data storage, which would impact entertainment providers. Sharing identical data seems like a good way to save space if that's an issue.
I would guess not, as storage is pretty cheap these days (or was, at least, in the before times), and it would raise privacy and possibly copyright issues. But a semi-scrupulous and miserly provider might consider it, I suppose. Anyone here have any insight or knowledge?
It's the opposite. They make copies of data in multiple data centers to make sure it's not lost.
That may be true, but, at least in the past, Dropbox used to do block-level deduplication, and people used to exploit this to "teleport" files that someone else had already uploaded to Dropbox between users. I don't know whether Dropbox changed their API because this was being abused at scale or whether the feature was a casualty of something else, but in theory, for large, read-only files that many users keep in their Dropbox, it would make sense on Dropbox's end to dedupe them, at least within the same data center.
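A rough sketch of what that kind of block-level dedup could look like server-side, assuming a made-up hash-keyed block index (the block size, names, and the trust-the-client's-hash detail are illustrative, not Dropbox's actual API):

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size

# Hypothetical server-side index: block hash -> stored bytes.
block_store: dict[str, bytes] = {}

def upload_file(data: bytes) -> list[str]:
    """Upload a file block by block, transferring only blocks the server hasn't seen."""
    manifest = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:
            block_store[digest] = block   # new block: actually store the bytes
        manifest.append(digest)           # either way, the "file" is just a list of hashes
    return manifest

# A second client uploading the same file stores no new blocks -- and if the
# protocol trusts the client's hashes, knowing the hashes alone is enough to
# claim the file, which is the "teleport" trick.
```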
It seems unlikely that most of the datasets that data storage providers hold would be capable of much deduplication. Photos are fairly unique, documents are fairly unique - my guess would be that music and videos might fit this? That said, they're probably already using a file system like ZFS or btrfs with compression on, and potentially file-system-level deduplication.
A properly end-to-end encrypted service would also be fundamentally incapable of identifying any duplicate bits across user accounts.
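For illustration, a toy sketch (standard library only, the "cipher" is a stand-in, not something to actually use) of why per-user keys leave the provider with nothing it can match across accounts:

```python
import hashlib, os

def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy stream cipher (XOR with a key-derived keystream) standing in for
    real client-side encryption; for illustration only."""
    keystream = b""
    counter = 0
    while len(keystream) < len(plaintext):
        keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(p ^ k for p, k in zip(plaintext, keystream))

same_file = b"the exact same cat video" * 1000

alice_blob = toy_encrypt(os.urandom(32), same_file)
bob_blob = toy_encrypt(os.urandom(32), same_file)

# The provider only ever sees the two blobs, and their hashes don't match,
# so there is nothing for a dedup index to latch onto.
assert hashlib.sha256(alice_blob).digest() != hashlib.sha256(bob_blob).digest()
```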
It depends on the system being used and on what you mean by "identical".
If you're not allowed to know anything about the data that you're storing, the best you can really do is deduplication. Peeling away a few layers of abstraction, a file is really just a bunch of digits in order, and on their own those digits don't have any meaning. If we want to store the file "A", which holds the data "123", and the file "B", which holds the data "345", we could write down where we stuck the "1", the "2", and the "3" for A, and do the same for B. Or we could be clever and, for B, just update our metadata about where we're storing things, adding a note on the "3" saying that it is the last part of A and also the first part of B.
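A toy version of that bookkeeping, using the A = "123", B = "345" example (the slot/index structure here is invented for illustration):

```python
# Each distinct piece is written to one slot; a file is an ordered list of slot numbers.
slots: list[str] = []          # the stored pieces, each written exactly once
index: dict[str, int] = {}     # piece -> slot where it already lives
files: dict[str, list[int]] = {}

def put(name: str, pieces: list[str]) -> None:
    refs = []
    for piece in pieces:
        if piece not in index:
            index[piece] = len(slots)
            slots.append(piece)    # only stored the first time anyone uploads it
        refs.append(index[piece])
    files[name] = refs

put("A", ["1", "2", "3"])
put("B", ["3", "4", "5"])

print(slots)   # ['1', '2', '3', '4', '5']  -- five pieces stored, not six
print(files)   # {'A': [0, 1, 2], 'B': [2, 3, 4]}  -- both files share slot 2 for the '3'
```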
Deduping is AFAIK used in all production systems. It's low-hanging fruit for reducing storage space: the extra metadata and the compute spent on checking are generally considered a good tradeoff for the space saved.
The other idea that has a similar effect is shadow copies. The classic use case is OS images for remote desktops. You have one base image for all of your users, and they each start with the same copy of the OS files. Over time they will make changes, add things, or delete them, but the vast majority of the files won't be touched. This means that Alice and Bob can both be using the same OS, but Alice's desktop background is foo.png and Bob's is bar.png, so we mark those two files as different and point everything else back to the base image.
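A minimal copy-on-write sketch of that idea (the class and path names are made up):

```python
# The shared base image exists once; each user stores only their own changes,
# and reads fall through to the base for anything they haven't touched.
base_image = {
    "system/kernel": b"<os files shared by everyone>",
    "desktop_background.png": b"<default wallpaper>",
}

class UserImage:
    def __init__(self, base: dict):
        self.base = base
        self.overlay: dict[str, bytes] = {}   # this user's changes only

    def write(self, path: str, data: bytes) -> None:
        self.overlay[path] = data             # the base image is never modified

    def read(self, path: str) -> bytes:
        return self.overlay.get(path, self.base.get(path, b""))

alice, bob = UserImage(base_image), UserImage(base_image)
alice.write("desktop_background.png", b"<foo.png>")
bob.write("desktop_background.png", b"<bar.png>")

# Different wallpapers for each user, but the untouched OS files are stored once.
assert alice.read("desktop_background.png") != bob.read("desktop_background.png")
assert alice.read("system/kernel") is bob.read("system/kernel")
```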