Showing only topics in ~comp with the tag "data".
    1. Shopping around for a new-and-improved backup solution

      A few days ago, I posted this and quickly realized that the world of data backups is far richer than just sudo rsync -av --delete --exclude=Videos /home /home_bkup.

      So now I'm window shopping the top Linux-supported backup solutions: borg, duplicacy, kopia, restic and--oh look--a core borg dev just dropped his own new-and-improved solution, vykar.

      Restic was the first tool I started to research, and I thought I really liked it. I got as far as installing it, initializing a test repo, and creating a couple of snapshots. But restic seems to be, hmm, fussy about the source and destination paths, absolute vs relative paths, etc.

      The fact that merely renaming a parent directory (or grandparent, or great-grandparent, etc) causes restic to treat every unchanged byte below that as brand new ... that's a recipe for giant, bloated repos, and it's unacceptable to me ... and hey, lookit that, borg does not do that. So now, restic is out and borg is in.

      But what other pros v cons are there, that I haven't even realized need to be considered? What advantages/disadvantages do other apps offer? Which ones can I easily automate with nightly/hourly cron jobs? Which ones have their own even-better automated solutions?

      Do I even want encryption? All of my drives/volumes are LUKS encrypted, and anything I would store remotely would also get encrypted before it ever left my LAN ... plus, I'm just a bit nervous about having the backups encrypted, requiring working, functional software to restore/recover data from them....

      That may not seem like such a big concern, perhaps, but I am currently working my way thru decrypting a bunch of 10-15 year old TrueCrypt-ed volumes, which requires using an old, outdated version of VeraCrypt and a somewhat "cross-my-fingers" effort to find KeePass repos old enough (also outdated, KeePass 1.0 repos) to still contain the various passwords I used to encrypt those ancient volumes ... but also still use new enough master passwords that I can still get the KeePass repos unlocked.

      With rsync, I can literally just go into any backup, find the specific version of the specific file(s) I want to recover, and manually copy it back to my workspace. Is anything like that option available in any of these deduplicated/encrypted solutions, even if they're not encrypted? If (eg) a borg repo is created w/o encryption, the data is still all just borg-specific blobs, right? Or can I navigate into the repo and just manually grab files?

      Oh yeah ... for reference, the past 10-ish years, my backup routine has been to create a new, dated, destination folder, starting with a full backup of my /home folder (excluding things like Videos, Music, VMs, other bulky stuff that gets backed up separately/differently), and then running nightly diff backups into the same folder, while also maintaining a "one-day-older" second backup of the whole thing on a 2nd HDD ... then, every 3-6 months, zipping up the current backup folder and starting a new one.

      At any rate, there you go; that's the kind of stuff I'm thinking about now, as I overhaul my 20-year-old, 20TB (but could be 2TB) backup system.

      Any and all feedback, recommendations, tips are welcome. Danke.

      18 votes
    2. Medium term cold storage options?

      Increasingly I'm looking at my backup solution and I'm not totally happy. My "threat model", I guess, is if the house burns down and we only make it out with the shirts on our backs. Alternatively, if I get hit by a bus, I'd like a backup of passwords and maybe some instructions for my wife.

      Mostly irrelevant discussion of my current backup situation (or lack of one)

      Up until recently I had a VPS running Syncthing as a central backup for all my devices, but it kind of looks like that got randomly wiped or something... My plan up until that happened was to keep a computer in a locker at work that I occasionally fired up to sync my Syncthing stuff. This has some issues, the big one being that it doesn't deal with bus factor.

      My next plan (and the point of this topic) is to have some data stored offline in a safe deposit box at the bank or some other secure location and swap the data out at some interval like 6 months or 1 year. The stuff I REALLY care about is easily under 1 GB and the stuff I kind of care about (photos and that kind of thing) is under 1 TB.

      Also, I'm currently paying for iCloud each month even though I've mostly left the mac-osphere. This is where my < 1 TB of photos are. I intend to download all of that and stop paying for iCloud in the coming months.

      TL;DR What are decent medium term cold storage options for < 1 GB that I can be really sure will be good for several years (maybe 10 or 20 years at the extreme end) and that are fairly cheap? I was thinking optical media, but I'm kind of lost as to what specifically to get and how to avoid getting conned into buying fake media (M-DISCs). I (somewhat randomly) have an M-DISC drive in my computer, but I don't know if that's overkill or not. My important stuff may even fit on a CD actually...

      24 votes
    3. Looking for help scraping and deleting a Reddit account

      I have a couple of old Reddit accounts I’d like to delete as fully as possible. However one of them dates back to my teenage years and it’s some of the only writings I have from that time. Any recommendations on good simple ways to scrape all the comments off of it and save them? Then what’s the best way to completely erase a Reddit footprint these days?

      Looking for as simple a solution as possible, I’m not tech illiterate by any means but it’s also not a real strong suit for me.

      18 votes
    4. Looking for a remote storage provider to use for storing backups

      I'm looking for mountable remote storage that I can use for my backup solution at home. I'm trying to get set up with BackupPC and need to be able to mount a large remote filesystem to store my archives. I've tried renting a 1 TB storage box from Hetzner, but my account was rejected (I assume because of a recent legal name change). Can anybody recommend a similar provider of remote storage that I can rent and mount onto my server?

      27 votes
    5. Help me decide what technology I should use for this project

      I’m a solo freelance programmer and want to write an app for internal project management: somewhere I can add projects, milestones, tasks, etc. and track them as I work on them, get occasional reminders (take a break, lunch time, etc.), and over time see how many hours I worked in each category.

      I’m torn between building this as a Web app or a Windows Desktop app. I’m considering the latter because it can run efficiently on my laptop in the system tray using minimal memory and resources. Web-based, on the other hand, will force me to keep an Apache server running too, which is extra overhead (unless I host it on Google Cloud or someplace, which might be an option?)

      The only reason for considering web-based is that eventually I’m planning to make this tool open source and with web-based, many others can find this useful too (including OSX/Linux users). At that point, I may consider expanding its schema to include multi-user connectivity, client login, etc. but that’s going too far at this point!

      The idea is that this tool should be useful not just for me but other freelancers, students, etc. who might be in my shoes. From that perspective, what do you think is the right technology to use? Web based or Windows based?

      (I’ve worked extensively on C#/WinForms projects before and I’m thinking Visual Studio Express for desktop development. If web-based, it’ll be PHP/MySQL based)

      5 votes
    6. How to design a database?

      I'm working on an application that allows a user to view playlists belonging to a particular radio show and stream/download/favourite the tracks in them. It has 4 core entities: User, Show, Playlist and Track.

      • Each show has multiple playlists (one-to-many)
      • Each playlist has multiple tracks (one-to-many)

      To be able to reference a playlist belonging to a particular show, I gave those playlists the same uuid as the show they belong to. A few questions though.

      1. Is this the right/best way to associate data?
      2. As a track could potentially belong to multiple playlists, I can't take the same approach as I do for show/playlist. What would be the best way to handle this? Ideally I would like to have a single "Track" table containing all tracks for all playlists.

      For any experienced database designers out there, how would you structure this data? What would you consider in designing the schema and why? If I did go with 4 tables only, presumably there would be performance implications given the potential amount of data in any one of those tables, particularly tracks. If that is the case, how best to structure this kind of thing with performance in mind? Thanks in advance for any help :)

      For reference, in case it's of importance, I'm using sqlite3.
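      For illustration only, here is a minimal sketch of one common way to model these relationships in sqlite3: a foreign key on the playlist for show/playlist (one-to-many) and a separate link table for playlist/track (many-to-many). All table and column names (`shows`, `playlists`, `tracks`, `playlist_tracks`, `position`) and the sample rows are assumptions made up for the example, not part of the application described above, and the User/favourites side is left out.

```python
import sqlite3

# Hypothetical schema: names are illustrative assumptions.
SCHEMA = """
CREATE TABLE shows (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE playlists (
    id      INTEGER PRIMARY KEY,
    show_id INTEGER NOT NULL REFERENCES shows(id),  -- one show, many playlists
    title   TEXT NOT NULL
);
CREATE TABLE tracks (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
-- Link table: one row per (playlist, track) pair, so a track can
-- appear in many playlists while being stored only once in tracks.
CREATE TABLE playlist_tracks (
    playlist_id INTEGER NOT NULL REFERENCES playlists(id),
    track_id    INTEGER NOT NULL REFERENCES tracks(id),
    position    INTEGER NOT NULL,
    PRIMARY KEY (playlist_id, track_id)
);
CREATE INDEX idx_playlists_show ON playlists(show_id);
CREATE INDEX idx_playlist_tracks_track ON playlist_tracks(track_id);
"""


def playlists_for_track(conn, track_id):
    """Return the titles of every playlist that contains the given track."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM playlists AS p
        JOIN playlist_tracks AS pt ON pt.playlist_id = p.id
        WHERE pt.track_id = ?
        ORDER BY p.id
        """,
        (track_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(SCHEMA)

# Made-up sample data: one show, two playlists, one track shared by both.
conn.execute("INSERT INTO shows (id, title) VALUES (1, 'Morning Show')")
conn.executemany(
    "INSERT INTO playlists (id, show_id, title) VALUES (?, ?, ?)",
    [(1, 1, "Monday"), (2, 1, "Tuesday")],
)
conn.execute("INSERT INTO tracks (id, title) VALUES (1, 'Shared Track')")
conn.executemany(
    "INSERT INTO playlist_tracks (playlist_id, track_id, position) VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 1)],
)

print(playlists_for_track(conn, 1))
```

      In this sketch each playlist keeps its own id and points at its show through `show_id`, rather than sharing the show's uuid, and the indexes on the foreign-key columns are the usual first step when lookup performance is a concern.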

      5 votes
    7. XML Data Munging Problem

      Here’s a problem I had to solve at work this week that I enjoyed solving. I think it’s a good programming challenge that will test if you really grok XML.

      Your input is some XML such as this:

      <DOC>
      <TEXT PARTNO="000">
      <TAG ID="3">This</TAG> is <TAG ID="0">some *JUNK* data</TAG> .
      </TEXT>
      <TEXT PARTNO="001">
      *FOO* Sometimes <TAG ID="1">tags in <TAG ID="0">the data</TAG> are nested</TAG> .
      </TEXT>
      <TEXT PARTNO="002">
      In addition to <TAG ID="1">nested tags</TAG> , sometimes there is also <TAG ID="2">junk</TAG> we need to ignore .
      </TEXT>
      <TEXT PARTNO="003">*BAR*-1
      <TAG ID="2">Junk</TAG> is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . *JUNK*-123
      </TEXT>
      <TEXT PARTNO="004">
      Note that <TAG ID="4">*this*</TAG> is just emphasized . It's not <TAG ID="2">junk</TAG> !
      </TEXT>
      </DOC>
      

      The above XML has so-called in-line textual annotations because the XML <TAG> elements are embedded within the document text itself.

      Your goal is to convert the in-line XML annotations to so-called stand-off annotations where the text is separated from the annotations and the annotations refer to the text via slicing into the text as a character array with starting and ending character offsets. While in-line annotations are more human-readable, stand-off annotations are equally machine-readable, and stand-off annotations can be modified without changing the document content itself (the text is immutable).
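      To make the slicing idea concrete, here is a small sketch using the first part of the expected output shown further down. The helper name `mention_text` is an assumption for illustration; the offsets and text come from that expected output.

```python
# Plain text of the first <TEXT> part, as it appears in the expected output.
data = "\nThis is some data .\n"

# Two stand-off annotations that point into `data` by character offsets.
mentions = [
    {"start": 1, "end": 5, "id": 3, "text": "This"},
    {"start": 9, "end": 18, "id": 0, "text": "some data"},
]


def mention_text(data, mention):
    """Recover an annotation's text by slicing the plain text (end is exclusive)."""
    return data[mention["start"]:mention["end"]]


for mention in mentions:
    # The slice should reproduce the stored "text" field exactly.
    print(mention["id"], repr(mention_text(data, mention)))
```

      Because the annotations only hold offsets, they can be added, removed, or corrected without touching `data`.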

      The challenge, then, is to convert to a stand-off JSON format that includes the plain-text of the document and the XML tag annotations grouped by their tag element IDs. In order to preserve the annotation information from the original XML, you must keep track of each <TAG>’s starting and ending character offset within the plain-text of the document. The plain-text is defined as the character data in the XML document ignoring any junk. We’ll define junk as one or more uppercase ASCII characters [A-Z]+ between two *, optionally followed by a trailing dash - and one or more digits [0-9]+.
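      The junk definition can be sketched as a regular expression. This is only one reading of the definition above, and the names `JUNK` and `find_junk` are assumptions for illustration. Note that the expected output also appears to drop whitespace following each junk marker, which this pattern does not attempt to handle.

```python
import re

# One or more uppercase ASCII letters between two asterisks,
# optionally followed by a dash and one or more digits.
JUNK = re.compile(r"\*[A-Z]+\*(?:-[0-9]+)?")


def find_junk(text):
    """Return every junk marker found in `text`, in order."""
    return [match.group(0) for match in JUNK.finditer(text)]


print(find_junk("some *JUNK* data"))        # a bare marker
print(find_junk("*BAR*-1"))                 # marker with dash and digit
print(find_junk("digits . *JUNK*-123"))     # marker with several digits
print(find_junk("Note that *this* is"))     # lowercase emphasis is not junk
```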

      Here is the desired JSON output for the above example to test your solution:

      {
        "data": "\nThis is some data .\n\n\nSometimes tags in the data are nested .\n\n\nIn addition to nested tags , sometimes there is also junk we need to ignore .\n\nJunk is marked by uppercase characters between asterisks and can also optionally be followed by a dash and then one or more digits . \n\nNote that *this* is just emphasized . It's not junk !\n\n",
        "entities": [
          {
            "id": 0,
            "mentions": [
              {
                "start": 9,
                "end": 18,
                "id": 0,
                "text": "some data"
              },
              {
                "start": 41,
                "end": 49,
                "id": 0,
                "text": "the data"
              }
            ]
          },
          {
            "id": 1,
            "mentions": [
              {
                "start": 33,
                "end": 60,
                "id": 1,
                "text": "tags in the data are nested"
              },
              {
                "start": 80,
                "end": 91,
                "id": 1,
                "text": "nested tags"
              }
            ]
          },
          {
            "id": 2,
            "mentions": [
              {
                "start": 118,
                "end": 122,
                "id": 2,
                "text": "junk"
              },
              {
                "start": 144,
                "end": 148,
                "id": 2,
                "text": "Junk"
              },
              {
                "start": 326,
                "end": 330,
                "id": 2,
                "text": "junk"
              }
            ]
          },
          {
            "id": 3,
            "mentions": [
              {
                "start": 1,
                "end": 5,
                "id": 3,
                "text": "This"
              }
            ]
          },
          {
            "id": 4,
            "mentions": [
              {
                "start": 289,
                "end": 295,
                "id": 4,
                "text": "*this*"
              }
            ]
          }
        ]
      }
      

      Python 3 solution here.

      If you need a hint, see if you can find an event-based XML parser (or if you’re feeling really motivated, write your own).
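      As a rough illustration of that hint (not the linked solution), here is a minimal sketch built on Python's standard `xml.parsers.expat` event parser. It tracks a running character offset and a stack of open <TAG> elements so nested tags get correct start and end offsets. It deliberately leaves out junk removal, so it only covers junk-free input; the function name `standoff` and the sample string are assumptions for the example.

```python
import xml.parsers.expat
from collections import defaultdict


def standoff(xml_text):
    """Convert in-line <TAG> annotations to stand-off form (no junk handling)."""
    chunks = []                   # plain-text pieces, in document order
    length = 0                    # length of the plain text seen so far
    open_tags = []                # stack of (tag id, start offset) for open <TAG>s
    mentions = defaultdict(list)  # tag id -> list of mentions

    def start(name, attrs):
        if name == "TAG":
            open_tags.append((int(attrs["ID"]), length))

    def end(name):
        if name == "TAG":
            tag_id, begin = open_tags.pop()
            mentions[tag_id].append({"start": begin, "end": length, "id": tag_id})

    def chars(data):
        nonlocal length
        chunks.append(data)
        length += len(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)

    text = "".join(chunks)
    # Fill in each mention's text by slicing the finished plain text.
    for group in mentions.values():
        for mention in group:
            mention["text"] = text[mention["start"]:mention["end"]]
    entities = [{"id": i, "mentions": mentions[i]} for i in sorted(mentions)]
    return {"data": text, "entities": entities}


# Hypothetical junk-free sample based on the nested-tag line of the input.
SAMPLE = (
    '<DOC><TEXT>Sometimes <TAG ID="1">tags in '
    '<TAG ID="0">the data</TAG> are nested</TAG> .</TEXT></DOC>'
)
result = standoff(SAMPLE)
print(result["data"])
print(result["entities"])
```

      Extending this to the full challenge would mean filtering junk out of the character data before it is appended, so that the running offset only ever counts characters that end up in the plain text.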

      4 votes