Showing only topics in ~comp with the tag "data".

Shopping around for a new-and-improved backup solution

Ask (advice)

A few days ago, I posted this and quickly realized that the world of data backups is far richer than just sudo rsync -av --delete --exclude=Videos /home /home_bkup.

So now I'm window shopping the top Linux-supported backup solutions: borg, duplicacy, kopia, restic and--oh look--a core borg dev just dropped his own new-and-improved solution, vykar.

Restic was the first tool I started to research, and I thought I really liked it, got as far as installing, initializing a test repo, creating a couple of snapshots. But restic seems to be, hmm, fussy about the source and destination paths, absolute vs relative paths, etc.

The fact that merely renaming a parent directory (or grandparent, or great-grandparent, etc) causes restic to treat every unchanged byte below that as brand new ... that's a recipe for giant, bloated repos, and it's unacceptable to me ... and hey, lookit that, borg does not do that. So now, restic is out and borg is in.

But what other pros v cons are there, that I haven't even realized need to be considered? What advantages/disadvantages do other apps offer? Which ones can I easily automate with nightly/hourly cron jobs? Which ones have their own even-better automated solutions?

Do I even want encryption? All of my drives/volumes are LUKS encrypted, and anything I would store remotely would also get encrypted before it ever left my LAN ... plus, I'm just a bit nervous about having the backups encrypted, requiring working, functional software to restore/recover data from them....

That may not seem like such a big concern, perhaps, but I am currently working my way thru decrypting a bunch of 10-15 year old TrueCrypt-ed volumes, which requires using an old, outdated version of VeraCrypt and a somewhat "cross-my-fingers" effort to find KeePass repos old enough (also outdated, KeePass 1.0 repos) to still contain the various passwords I used to encrypt those ancient volumes ... but also still use new enough master passwords that I can still get the KeePass repos unlocked.

With rsync, I can literally just go into any backup, find the specific version of the specific file(s) I want to recover, and manually copy it back to my workspace. Is anything like that option available in any of these deduplicated/encrypted solutions, even if they're not encrypted? If (eg) a borg repo is created w/o encryption, the data is still all just borg-specific blobs, right? Or can I navigate into the repo and just manually grab files?

Oh yeah ... for reference, the past 10-ish years, my backup routine has been to create a new, dated, destination folder, starting with a full backup of my /home folder (excluding things like Videos, Music, VMs, other bulky stuff that gets backed up separately/differently), and then running nightly diff backups into the same folder, while also maintaining a "one-day-older" second backup of the whole thing on a 2nd HDD ... then, every 3-6 months, zipping up the current backup folder and starting a new one.

At any rate, there you go; that's the kind of stuff I'm thinking about now, as I overhaul my 20-year-old, 20TB (but could be 2TB) backup system.

Any and all feedback, recommendations, tips are welcome. Danke.

13 comments

Eric_the_Cerise

5 days ago

18 votes
Medium term cold storage options?

Ask (recommendations)

Increasingly I'm looking at my backup solution and I'm not totally happy. My "threat model" I guess is if the house burns down and we only make it out with the shirts on our backs. Alternatively if I get hit by a bus I'd like a backup of passwords and maybe some instructions for my wife.

Mostly irrelevant discussion on my current backup or lack of situation

Up until recently I had a VPS running syncthing as a central backup for all my devices but it kind of looks like that got randomly wiped or something... my plan up until that happened was that I have a computer in a locker at work that I occasionally fired up to sync my syncthing stuff. This has some issues, the big one being that it doesn't deal with bus factor.

My next plan (and the point of this topic) is to have some data stored offline in a safe deposit box at the bank or some other secure location and swap the data out at some interval like 6 months or 1 year. The stuff I REALLY care about is easily under 1 GB, and the stuff I kind of care about (photos and that kind of thing) is < 1 TB.

Also, I'm currently paying for iCloud each month even though I've mostly left the mac-osphere. This is where my < 1 TB of photos are. I intend to download all of that and stop paying for iCloud in the coming months.

TL;DR: What are decent medium-term cold storage options for < 1 GB that I can be really sure will be good for several years (maybe 10 or 20 years at the extreme end) and that are fairly cheap? I was thinking optical media, but I'm kind of lost as to what specifically to get and how to not get conned into buying fake media (M-Discs). I (somewhat randomly) have an M-Disc drive in my computer, but I don't know if that's overkill or not. My important stuff may even fit on a CD, actually...

33 comments

mild_takes

April 15

24 votes
Sovereign Cloud stats every CIO needs before their next board meeting
- privacy
Article 385 words
0 comments

windriver.com

April 16

7 votes
Applying Chinese Wall Reverse Engineering to LLM Code Editing

Article

1 comment

arXiv

July 22, 2025

8 votes
How data travels in F1

Article 1572 words

5 comments

ideefixe.co

May 27, 2025

12 votes
Who will maintain Vim? A demo of Git Who

Article 1346 words

6 comments

sinclairtarget.com

March 19, 2025

20 votes
How I analyzed 1,378 restaurants using Places API to find hotspots in my city
- programming
Article 2844 words, published Feb 12 2025
0 comments

mattsayar.com

February 20, 2025

14 votes
Undergraduate upends a forty-year-old data science conjecture

Article 644 words

6 comments

Quanta Magazine

February 12, 2025

26 votes
Why I will always be angry about software engineering
- programming
Article 3599 words
4 comments

mataroa.blog

November 12, 2024

34 votes
Get me out of data hell
- programming
Article 2913 words
2 comments

mataroa.blog

October 11, 2024

30 votes
I will fucking piledrive you if you mention AI again

Article 4269 words

32 comments

mataroa.blog

June 19, 2024

119 votes
Looking for help scraping and deleting a Reddit account

Ask (advice)

I have a couple of old Reddit accounts I’d like to delete as fully as possible. However one of them dates back to my teenage years and it’s some of the only writings I have from that time. Any recommendations on good simple ways to scrape all the comments off of it and save them? Then what’s the best way to completely erase a Reddit footprint these days?

Looking for as simple a solution as possible, I’m not tech illiterate by any means but it’s also not a real strong suit for me.

11 comments

AnEarlyMartyr

April 23, 2024

18 votes
Those free USB sticks in your drawer are somehow crappier than you thought

Article 609 words, published Feb 7 2024

54 comments

Ars Technica

February 14, 2024

24 votes
Insomnia 8 forces users to login and use cloud storage
- open source
Link
14 comments

GitHub: Kong

September 27, 2023

29 votes
EU Spreadsheet risks interest group: Horror Stories

Article 10,217 words

3 comments

eusprig.org

August 24, 2023

9 votes
Representing heterogeneous data
- programming languages
- programming
Article 2821 words
1 comment

stuffwithstuff.com

August 8, 2023

6 votes
Looking for a remote storage provider to use for storing backups

Ask (recommendations)

I'm looking for mountable remote storage that I can use for my backup solution at home. I'm trying to get set up with backuppc and need to be able to mount a large remote filesystem to store my archives. I've tried renting a 1TB storage box from Hetzner, but my account was rejected (I assume because of a recent legal name change). Can anybody recommend a similar provider of remote storage that I can rent and mount onto my server?

22 comments

h3x

June 15, 2023

27 votes
We spoke with the last person standing in the floppy disk business
- hardware
Article 3484 words, published Sep 12 2022
3 comments

aiga.org

September 17, 2022

11 votes
Help me decide what technology should I use for this project

Text 252 words

I’m a solo freelance programmer and want to write an app for internal project management, somewhere I can add projects, milestones, tasks, etc. and track them as I work on them, occasionally remind me of things like take a break, lunch time, etc. and over time I can track on which category I worked how many hours, etc.

I’m torn between building this as a web app or a Windows desktop app. I’m considering the latter because it can run efficiently on my laptop in the system tray using minimal memory and resources; a web-based app, on the other hand, will force me to keep running an Apache server too, which is extra overhead (unless I host it on Google Cloud or someplace, which might be an option?).

The only reason for considering web-based is that eventually I’m planning to make this tool open source and with web-based, many others can find this useful too (including OSX/Linux users). At that point, I may consider expanding its schema to include multi-user connectivity, client login, etc. but that’s going too far at this point!

The idea is that this tool should be useful not just for me but other freelancers, students, etc. who might be in my shoes. From that perspective, what do you think is the right technology to use? Web based or Windows based?

(I’ve worked extensively on C#/WinForms projects before and I’m thinking of Visual Studio Express for desktop development. If web-based, it’ll be PHP/MySQL.)

14 comments

noble_pleb

September 8, 2022

5 votes
Common Crawl: an open repository of web crawl data
- open source
Link
1 comment

commoncrawl.org

January 12, 2022

9 votes
McDonald's leaks password for Monopoly VIP database to winners
- security
Article 469 words
0 comments

bleepingcomputer.com

September 8, 2021

16 votes
How to design a database?

Ask (advice)

I'm working on an application that allows a user to view playlists belonging to a particular radio show and stream/download/favourite the tracks in them. It has 4 core entities: User, Show, Playlist and Track.
- Each show has multiple playlists (one-to-many)
- Each playlist has multiple tracks (one-to-many)
To be able to reference a playlist belonging to a particular show, I gave those playlists the same uuid as the show they belong to. A few questions though.
1. Is this the right/best way to associate data?
2. As a track could potentially belong to multiple playlists, I can't take the same approach as I do for show/playlist. What would be the best way to handle this? Ideally I would like to have a single "Track" table containing all tracks for all playlists.
For any experienced database designers out there, how would you structure this data? What would you consider in designing the schema and why? If I did go with 4 tables only, presumably there would be performance implications given the potential amount of data in any one of those tables, particularly tracks. If that is the case, how best to structure this kind of thing with performance in mind? Thanks in advance for any help :)

For reference, in case it's of importance, I'm using sqlite3.
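
One common way to model the relationships described above, sketched with Python's built-in sqlite3 module. This is an illustration under assumptions, not the only valid design: each playlist gets its own id plus a `show_id` foreign key (rather than sharing the show's uuid), and the track/playlist many-to-many relationship goes through a junction table, so there is still a single table of tracks. All table, column, and function names are hypothetical, and the User entity and favourites are left out for brevity.

```python
import sqlite3

# Hypothetical schema: shows -> playlists is one-to-many via playlists.show_id;
# playlists <-> tracks is many-to-many via the playlist_tracks junction table.
SCHEMA = """
CREATE TABLE shows (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE playlists (
    id      INTEGER PRIMARY KEY,
    show_id INTEGER NOT NULL REFERENCES shows(id),
    title   TEXT NOT NULL
);
CREATE TABLE tracks (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE playlist_tracks (
    playlist_id INTEGER NOT NULL REFERENCES playlists(id),
    track_id    INTEGER NOT NULL REFERENCES tracks(id),
    position    INTEGER NOT NULL,
    PRIMARY KEY (playlist_id, position)
);
-- Indexes on the foreign keys used for lookups in the other direction.
CREATE INDEX idx_playlists_show ON playlists(show_id);
CREATE INDEX idx_playlist_tracks_track ON playlist_tracks(track_id);
"""


def build_demo():
    """Create an in-memory database with a little made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO shows (id, name) VALUES (1, 'Morning Show')")
    conn.executemany(
        "INSERT INTO playlists (id, show_id, title) VALUES (?, ?, ?)",
        [(1, 1, "Monday"), (2, 1, "Tuesday")],
    )
    conn.executemany(
        "INSERT INTO tracks (id, title) VALUES (?, ?)",
        [(1, "Track A"), (2, "Track B")],
    )
    # Track 1 appears in both playlists but is stored only once in tracks.
    conn.executemany(
        "INSERT INTO playlist_tracks (playlist_id, track_id, position) VALUES (?, ?, ?)",
        [(1, 1, 1), (1, 2, 2), (2, 1, 1)],
    )
    conn.commit()
    return conn


def playlist_titles(conn, playlist_id):
    """Return the track titles of one playlist, in playlist order."""
    rows = conn.execute(
        "SELECT t.title FROM playlist_tracks AS pt "
        "JOIN tracks AS t ON t.id = pt.track_id "
        "WHERE pt.playlist_id = ? ORDER BY pt.position",
        (playlist_id,),
    )
    return [row[0] for row in rows]
```

With this layout, listing a show's playlists is a lookup on `playlists.show_id`, and a track shared between playlists costs one extra row in `playlist_tracks` rather than a duplicated track record.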
12 comments

milkbones_4_bigelow

August 15, 2020

5 votes
Data Analysis with Dr Mike Pound | Computerphile

Video

1 comment

YouTube

July 10, 2019

6 votes

XML Data Munging Problem

- programming

Text 568 words

Here’s a problem I had to solve at work this week that I enjoyed solving. I think it’s a good programming challenge that will test if you really grok XML.

Your input is some XML such as this:

<DOC>
<TEXT PARTNO="000">
<TAG ID="3">This</TAG> is <TAG ID="0">some *JUNK* data</TAG> .
</TEXT>
<TEXT PARTNO="001">
*FOO* Sometimes <TAG ID="1">tags in <TAG ID="0">the data</TAG> are nested</TAG> .
</TEXT>
<TEXT PARTNO="002">
In addition to <TAG ID="1">nested tags</TAG> , sometimes there is also <TAG ID="2">junk</TAG> we need to ignore .
</TEXT>
<TEXT PARTNO="003">*BAR*-1
<TAG ID="2">Junk</TAG> is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . *JUNK*-123
</TEXT>
<TEXT PARTNO="004">
Note that <TAG ID="4">*this*</TAG> is just emphasized . It's not <TAG ID="2">junk</TAG> !
</TEXT>
</DOC>

The above XML has so-called in-line textual annotations because the XML <TAG> elements are embedded within the document text itself.

Your goal is to convert the in-line XML annotations to so-called stand-off annotations where the text is separated from the annotations and the annotations refer to the text via slicing into the text as a character array with starting and ending character offsets. While in-line annotations are more human-readable, stand-off annotations are equally machine-readable, and stand-off annotations can be modified without changing the document content itself (the text is immutable).

The challenge, then, is to convert to a stand-off JSON format that includes the plain-text of the document and the XML tag annotations grouped by their tag element IDs. In order to preserve the annotation information from the original XML, you must keep track of each <TAG>’s starting and ending character offset within the plain-text of the document. The plain-text is defined as the character data in the XML document ignoring any junk. We’ll define junk as one or more uppercase ASCII characters [A-Z]+ between two asterisks *, optionally followed by a dash - and one or more digits [0-9]+.

Here is the desired JSON output for the above example to test your solution:

{
  "data": "\nThis is some data .\n\n\nSometimes tags in the data are nested .\n\n\nIn addition to nested tags , sometimes there is also junk we need to ignore .\n\nJunk is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . \n\nNote that *this* is just emphasized . It's not junk !\n\n",
  "entities": [
    {
      "id": 0,
      "mentions": [
        {
          "start": 9,
          "end": 18,
          "id": 0,
          "text": "some data"
        },
        {
          "start": 41,
          "end": 49,
          "id": 0,
          "text": "the data"
        }
      ]
    },
    {
      "id": 1,
      "mentions": [
        {
          "start": 33,
          "end": 60,
          "id": 1,
          "text": "tags in the data are nested"
        },
        {
          "start": 80,
          "end": 91,
          "id": 1,
          "text": "nested tags"
        }
      ]
    },
    {
      "id": 2,
      "mentions": [
        {
          "start": 118,
          "end": 122,
          "id": 2,
          "text": "junk"
        },
        {
          "start": 144,
          "end": 148,
          "id": 2,
          "text": "Junk"
        },
        {
          "start": 326,
          "end": 330,
          "id": 2,
          "text": "junk"
        }
      ]
    },
    {
      "id": 3,
      "mentions": [
        {
          "start": 1,
          "end": 5,
          "id": 3,
          "text": "This"
        }
      ]
    },
    {
      "id": 4,
      "mentions": [
        {
          "start": 289,
          "end": 295,
          "id": 4,
          "text": "*this*"
        }
      ]
    }
  ]
}

Python 3 solution here.

If you need a hint, see if you can find an event-based XML parser (or if you’re feeling really motivated, write your own).
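
For readers who want to see the event-based approach from the hint, here is a minimal sketch using Python's built-in `xml.parsers.expat`. It is not the linked solution. Two rules in it are assumptions inferred from the expected output rather than stated in the problem: one whitespace character directly after a junk token is dropped along with it, and character data that appears before the first <TEXT> element is ignored. The names `to_standoff` and `JUNK` are hypothetical.

```python
import re
from xml.parsers import expat

# Junk as defined in the problem, plus (assumption) one trailing whitespace char.
JUNK = re.compile(r"\*[A-Z]+\*(?:-[0-9]+)?\s?")


def to_standoff(xml_text):
    """Convert in-line <TAG ID=...> annotations to a stand-off dict."""
    chunks = []      # cleaned pieces of plain text, in document order
    length = 0       # number of plain-text characters emitted so far
    open_tags = []   # stack of (tag id, start offset) for TAGs still open
    spans = {}       # tag id -> list of (start, end) offsets
    started = False  # becomes True once the first <TEXT> has been seen

    def on_start(name, attrs):
        nonlocal started
        if name == "TEXT":
            started = True
        elif name == "TAG":
            open_tags.append((int(attrs["ID"]), length))

    def on_end(name):
        if name == "TAG":
            tag_id, start = open_tags.pop()
            spans.setdefault(tag_id, []).append((start, length))

    def on_text(data):
        nonlocal length
        if not started:
            return  # assumption: skip text before the first <TEXT>
        clean = JUNK.sub("", data)
        chunks.append(clean)
        length += len(clean)

    parser = expat.ParserCreate()
    # Deliver each run of text between tags in one callback where possible,
    # so a junk token is not split across callbacks (very long runs may still
    # be split once the parser's internal buffer fills).
    parser.buffer_text = True
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)

    text = "".join(chunks)
    entities = []
    for tag_id in sorted(spans):
        mentions = [
            {"start": s, "end": e, "id": tag_id, "text": text[s:e]}
            for s, e in sorted(spans[tag_id])
        ]
        entities.append({"id": tag_id, "mentions": mentions})
    return {"data": text, "entities": entities}
```

Passing the result to `json.dumps(..., indent=2)` gives output in the shape shown above. Offsets are recorded against the cleaned text as it is built, which is why nested tags need nothing more than a stack of start positions.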

4 votes

Microsoft sinks data centre off Orkney

Article 786 words

0 comments

BBC

June 6, 2018

8 votes