8 votes

What do you think is one thing every sysadmin should know how to do?

Blatantly stealing from the excellent post by /u/judah, I figured I'd make a sysadmin version because sysadmins tend to be underrepresented in tech discussions. Please keep your answers as cross-platform as possible without being uselessly generic.

I'll start. Realize that the system is going to go down, and accept that reality. Accept failure. How you respond to failure is how people who aren't sysadmins will see and value you.

10 comments

  1. [4]
    Emerald_Knight

    Automate, automate, automate.

    You know what I did when I set up the dev, QA, and production environments for my work? I wrote scripts. Why? Because I'm an idiot. An intelligent, college-educated idiot who still does stupid shit like grabbing baking sheets fresh out of the oven with his bare hand because he goes into autopilot every now and then like normal human beings do. And I don't want to find myself stupidly typing rm -rf old_release_directory and, as a result, inadvertently deleting the symlinked files and directories contained within, causing me to lose essential assets, configuration files, certs, and other important items because my brain couldn't remind me to do the smart thing and perform a find old_release_directory -type l -delete first. I don't want to perform a production release of new code only to find that I forgot to include the necessary symlink to an essential config file for a tool that only runs in the background and thus has no noticeable errors until we find that things suddenly don't seem to be happening as frequently as they used to.
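
    That find-before-rm dance is exactly the kind of thing to script once and never type again. A minimal sketch (the directory name is just an example):

    ```shell
    #!/bin/sh
    # Remove a release directory without touching anything its symlinks
    # point at: drop the links first, then the directory is safe to rm.
    cleanup_release() {
        dir="$1"
        [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
        find "$dir" -type l -delete   # kill symlinks before the recursive rm
        rm -rf "$dir"
    }
    ```

    Then `cleanup_release old_release_directory` does the smart thing even when your brain is on autopilot.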

    For that matter, errors should automatically be reported periodically. You don't want to find out about that broken tool several hours or days later, you want to know immediately that things are broken.
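
    A periodic error check can be as small as a cron-driven script that scans the logs and yells about anything new. This is only a sketch - the log path is made up and the actual alert (mail, pager, whatever) is left as a comment:

    ```shell
    #!/bin/sh
    # Periodic log check: alert only on errors that appeared since the
    # last run, tracked via a tiny state file next to the log.
    check_errors() {
        log="$1"
        state="$log.reported"                 # how far we've alerted so far
        offset=$(cat "$state" 2>/dev/null || echo 0)
        total=$(wc -l < "$log")
        # Only look at lines added since the last run.
        new_errors=$(tail -n +"$((offset + 1))" "$log" | grep -c ERROR || true)
        echo "$total" > "$state"
        if [ "$new_errors" -gt 0 ]; then
            echo "ALERT: $new_errors new error(s) in $log"
            # mail -s "errors in $log" oncall@example.invalid ... (site-specific)
        fi
    }
    ```

    Run it from cron every few minutes and the broken background tool announces itself instead of hiding for days.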

    Anything you do manually that has an identical set of steps every single time needs to be automated to some degree. Always, always, always remove human error from the equation when feasible.

    15 votes
    1. [3]
      Amarok

      I can give you a good, lazy example of this.

      Every system I maintain has a dev, a test, and a prod (and if it doesn't I am a shrieking bastard until they do). The thing is, these are all really the same system.

      I restore test from prod automatically, every night or week depending on how intensive that process is. This doesn't just guarantee my developers are happy with a prod-like instance to beat the hell out of, with no worry of it affecting production. It also guarantees that my backups are actually sane and functional, avoiding the write-only backup problem so many of us have been surprised by one dark weekend. The developers can call me at any time and request I refresh dev from prod, and I do the same thing there - dev gets replaced by last night's production backup.
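
      The shape of that nightly refresh, with the backup tooling abstracted away to a plain directory copy (paths and the schedule are illustrative, not from any real setup):

      ```shell
      #!/bin/sh
      # Refresh a non-prod environment from the latest production backup.
      # Restoring nightly means the backups get exercised every night too.
      refresh_from_backup() {
          backup_dir="$1"   # e.g. /backups/prod/latest
          target_dir="$2"   # e.g. /srv/test
          [ -d "$backup_dir" ] || { echo "missing backup: $backup_dir" >&2; return 1; }
          rm -rf "$target_dir"
          cp -a "$backup_dir" "$target_dir"
      }
      # Cron would drive it, e.g.:
      # 0 2 * * * /usr/local/sbin/refresh_from_backup /backups/prod/latest /srv/test
      ```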

      Devs have root on dev, do whatever the hell they want and I don't care. Only admins have root on prod/test, though on test the services are opened up so devs can jump in with analysis tools and whatnot. Devs deploy nothing - their time is too valuable. They give me a release and a move sheet detailing the deployment, admins deploy to test and testers test it. Anything goes wrong, I kick it back to dev and go to lunch - or help troubleshoot if they are stumped.

      Now this sounds like a lot of work. It's not - I picked Veeam as our backup solution for a reason. It turns servers into files, as far as backups are concerned, and that's a welcome simplification. I'm just moving files around. Trouble is, dev/test/prod have different names, config options, root passwords, ip addresses, firewall rules, and other system specific things. Changing all of that by hand is a gargantuan pain in the ass, so I automated it.

      Boot in single user mode, log in as root, type "become dev" or "become prod" and all of this information is changed instantly by copying config files to the right locations. Reboot and it's done. I can change the entire system identity with a simple one-parameter command - and so can all of my other scripts and tools. Work hard to be lazy, it's worth it, and you make fewer mistakes. Store your configuration files in an external repo like git rather than locally, and if someone forgets to make changes to the masters you'll find out on dev/test long before prod. You're also sure that you are always running those config files, since replacing the ones on the systems is a natural part of the process. You can check configs without even having to log in.
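
      A become script along those lines might look like this - the directory layout is invented, and a real one would also handle passwords, firewall rules, and so on:

      ```shell
      #!/bin/sh
      # Sketch of a "become" identity switcher. Each identity (dev/test/prod)
      # keeps its own copies of the host-specific files; switching identity
      # is just copying them into place. The layout here is made up.
      IDENTITY_ROOT="${IDENTITY_ROOT:-/etc/identities}"
      SYSTEM_ROOT="${SYSTEM_ROOT:-/}"

      become() {
          id="$1"
          src="$IDENTITY_ROOT/$id"
          [ -d "$src" ] || { echo "unknown identity: $id" >&2; return 1; }
          # Mirror every file the identity defines onto the running system,
          # e.g. etc/hostname -> /etc/hostname, etc/hosts -> /etc/hosts, ...
          (cd "$src" && find . -type f) | while read -r f; do
              mkdir -p "$SYSTEM_ROOT/$(dirname "$f")"
              cp "$src/$f" "$SYSTEM_ROOT/$f"
          done
          echo "identity set to $id; reboot to apply"
      }
      ```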

      13 votes
      1. [2]
        teaearlgraycold

        I restore test from prod automatically

        Sadly at work we have a lot of sensitive information that shouldn't be copied over to the staging environment.

        1. Amarok

          So did we. Just get some scripts that target the affected tables with a replace and change security as necessary. It helps you keep track of it too. That become process can purge sensitive data, and it's good to have a script around that knows how to do that for each project. Plus, if you overlook some sensitive data with those scripts, you'd rather your developers told you than strangers on the internet after a security breach. The script's existence is like a double-check of the stuff you really need to keep track of and protect. Handy when it's audit-time.
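
          For example, a scrub step can run over an exported table before it ever lands in dev - here as a hypothetical CSV with an id,name,email layout:

          ```shell
          #!/bin/sh
          # Hypothetical scrub step for the become/refresh process: rewrite
          # the sensitive columns of an exported table so it still loads in
          # dev/test but leaks nothing real. The CSV layout is invented.
          scrub_users_csv() {
              in="$1"; out="$2"
              # Keep the row ids, blank the names, synthesize per-row emails.
              awk -F, 'BEGIN { OFS = "," }
                  NR == 1 { print; next }   # header row passes through
                  { print $1, "redacted", "user" $1 "@example.invalid" }' "$in" > "$out"
          }
          ```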

          3 votes
  2. Amarok

    Learn to take notes and document things. That's the single most important skill I see missing sometimes. It's also the key to being a lazy admin, which is a goal everyone should have. Work hard so you can be lazy later. It pays off. You won't remember what the hell you did on that system with that service six months from now. If you leave yourself notes, or make good documentation, it'll take you five minutes to get back up to speed, and you're a lot less likely to make a mistake.

    Good docs also save you a lot of training of co-workers. Look at any co-worker who needs to get up to speed on a system as a doc-update opportunity. Have them go through the doc to learn the task, and make their own notes. Then go back and update/clear up any confusion in the documents. This also saves you having to look over their shoulder all the time while they are learning something. Do this a couple times with a couple people and that doc will become bulletproof.

    Every system should have a bare-metal recovery doc - how to recreate the system from scratch on a brand new machine where you assume different hardware. Every system should also have an 'ownership' doc - who is the project manager, what's their contact info, what is it doing, what services is it running, why is it doing that, what parts of the org depend on this? What version is it running? What's the task schedule? It's kinda like a dog-tag for a server. You can combine these into one doc if you like, just make sure you've got that info.

    If there's a complex service with complex processes running or some procedure that a human being must babysit on a regular basis, be sure to document all of those as well. The more complicated it is, the more critical that documentation becomes.

    My second bit of advice is - do some reading before you dive into a new technology. There are two things you want to read. The first is the 'white paper' or equivalent. This tells you what the technology is and what it's good for (and not good for). The next is the 'best practices guide', which is more of a list of things not to do when setting it up and building it out. If you like, now you can move on to the manual. Play with it in a lab (if you don't have one, get one, even if it's just VMware Workstation) before you try to do anything serious with it. Do this and you can wrap your brain around anything in 1-3 days depending on how complex the tech is.

    10 votes
  3. [3]
    dredmorbius

    Walk away.

    Go home at the end of the day, unplug for the weekend, take a complete vacation.

    If the show goes to shit without you, the bus factor is way too low. And you're heading for burnout.

    7 votes
    1. Amarok

      Amen.

      A lack of planning on your part does not constitute an emergency on mine. I beat this into the heads of every manager I ever work for - and most of them need it. It's not that they don't know about these 'emergencies'; more often they simply forget to tell us about something until it's happening.

      If there's a real emergency, that's fine. A lightning strike is a bitch and nobody's fault, and it has to be dealt with (though I'm sleeping in tomorrow after it's fixed). Working for an emergency factory, on the other hand, is no fun at all. Anyone in that situation who hasn't been granted the power to fix it (which is a fun challenge) should move on to work somewhere else.

      Sysadmins are like Sith. There must always be two. One of you could catch a meteorite with your face at any time.

      4 votes
    2. xstresedg

      I need to do more of this: work with my colleagues to cross-train everyone and focus on documentation. I'm a tech peon (techeon?) just like them, though.

      2 votes
  4. ianw

    I think Git is up there as a need-to-know at this point. Whatever level of Infrastructure as code you're at, it's always a good idea to save scripts and configuration files in some sort of version control system.
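
    Even without full infrastructure-as-code, the habit can be as small as a helper that commits every config change to a repo (the paths and commit identity here are placeholders):

    ```shell
    #!/bin/sh
    # Minimal form of "configs in version control": every config edit goes
    # through a commit, so `git log` and `git diff` answer "what changed
    # and when" later. Repo location and identity are illustrative.
    commit_config() {
        repo="$1"; file="$2"; msg="$3"
        ( cd "$repo" || exit 1
          [ -d .git ] || git init -q
          git add "$file"
          git -c user.email=admin@example.invalid -c user.name=admin \
              commit -q -m "$msg" )
    }
    ```

    Something like `commit_config /srv/configs app.conf "bump listen port"` keeps the history growing with zero ceremony.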

    6 votes
  5. tgiles

    This is a pretty open-ended question. I'll give you some thoughts after my 20 years in sysadmin and computer security...

    • Every sysadmin should know (and be comfortable with!) the command line of every OS they may come across. Not everything is readily accessible by GUI. And there will be times when you must make changes across dozens (if not hundreds!) of systems at once. "Being comfortable with" means you must learn Bash and PowerShell at a bare minimum.

    • I'll amplify Emerald_Knight's comment. Automate, automate, automate. Child, you have to automate. You have to be comfortable putting commands together and running them on a recurring basis. Learn glue code like Python or invest time in an orchestration system (like SaltStack) and automate your workflow.

    • Keep Security In Mind. Too often security is something considered far, far after the fact. We get so busy putting out fires that we neglect those temporary accounts, or the firewall ports "that should have been taken care of". Be responsible. Set reminders. Try to engineer the solution instead of slapping a bandaid on it and walking away. I can guarantee you that the bad guys are always scanning and they're waiting for you to slip up.

    • The only thing you should do more than Automate is Document. Put up a wiki. Use whatever documentation facilities your company has. Document your code. Document your scripts. Hell, document it and email everyone who could possibly care. Just get it out there.

    • And best for last: assume everyone is an idiot. I know idiot is a strong word, but it's a placeholder. People are tired, busy, had an argument with their spouse, suck at time management, or plainly forget things. It's your job to keep track of everything that falls under your purview and take ownership when it makes sense. That means being the bad guy sometimes, yeah. But if you don't see it through to the end, it can fall through the cracks and disappear.
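
    The fan-out in that first bullet - one command, many systems - boils down to a loop like this. The host file format and the ssh default are assumptions; swap in whatever transport you use:

    ```shell
    #!/bin/sh
    # Run the same command on every host in a list. The runner defaults to
    # ssh but is parameterized so the loop can be dry-run (e.g. RUNNER=echo)
    # before it touches hundreds of machines.
    for_each_host() {
        hostfile="$1"; shift
        runner="${RUNNER:-ssh}"
        while read -r host; do
            case "$host" in ''|'#'*) continue ;; esac   # skip blanks/comments
            "$runner" "$host" "$@" || echo "FAILED: $host" >&2
        done < "$hostfile"
    }
    ```

    `RUNNER=echo for_each_host hosts.txt uptime` shows exactly what would run, host by host, before you commit to the real thing.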

    5 votes