Situational Awareness as a Sysadmin

System Administration, or more accurately the Operations side of IT, is at its heart a technically complex job.  However, there are some soft skills that are important.  Note that I'm using 'soft' in a non-derogatory sense.  The raw technical aspects are 'hard' in that they have well-defined edges, and typically very clear right and wrong answers.  The 'soft' aspects tend to be fuzzier, with softer edges and more nuance.  One of these soft skills is Situational Awareness.

Situational Awareness is an interesting skill, and one I've only recently become aware of enough to give it a name.  It's the ability to pay attention to the various clues about what's going on around you, remember them for a while, and use them to infer things about what's going on in the systems you're managing.

At its simplest, it's being aware of the areas that your colleagues are working on Right Now(tm).  Then, when you see alerts or errors, or your log review shows something unusual, you can correlate that with what you know and either ignore it (because you're confident it's transient, being worked on, and/or not a real problem), point it out to the right person to fix, or (in rare, sad cases) fix it yourself because you know the person who did it has no clue and never will, and it's just easier that way.  I'm not advocating that last option as ideal; I just know that in some situations it truly is the right choice.

You don't need to keep track of the full details of the work going on, just the general areas: which set of servers, the name of the software, and so on.
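If it helps to see that correlation written down, here is a minimal sketch in Python.  Everything in it is made up for illustration: the `OngoingWork` record, the `who_to_ask` helper, the names, and the alert text are all assumptions, and a plain substring match stands in for whatever you actually do in your head.  The point is only that a person plus a general area is enough to work with.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class OngoingWork:
    """One remembered fact: who is working on which general area."""
    person: str  # hypothetical colleague name
    area: str    # general area only, e.g. a server group or software name


def who_to_ask(alert_text: str, ongoing: list[OngoingWork]) -> list[str]:
    """Return the people whose current work area is mentioned in the alert."""
    text = alert_text.lower()
    return [w.person for w in ongoing if w.area.lower() in text]


# Hypothetical notes on what colleagues are working on right now.
ongoing = [
    OngoingWork("alice", "web-frontend"),
    OngoingWork("bob", "postgres"),
]

# An alert that mentions an area someone is known to be working on.
print(who_to_ask("CRITICAL: postgres replication lag on db03", ongoing))
# An alert nobody is known to be near; an empty list means "look closer".
print(who_to_ask("disk full on mail01", ongoing))
```

An empty result doesn't mean the alert is unimportant, only that nothing you already know explains it.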

Deeper awareness can be obtained through other channels, like overhearing local (non-private) conversations or reading other communications flying around.  For example, if you see e-mails or chat messages indicating someone has just purchased a replacement SSL cert, then you'll have a plausible cause should there be some unexpected certificate errors in the near future.  If absolutely nothing else, pay attention to the change control process for your org (however weak it might be); that should be the highest quality signal for Things That Are Happening.  YMMV based on the sanity of the process, of course, but even scanning a huge list of changes and noting the important-looking ones is better than nothing.

The difficult bit is to collate, remember, filter, and eventually forget this information.  I don't have specific suggestions for this; it's just something I've learned how to do.  If you're finding it hard to do in your head, perhaps take the opportunity to write things down.  If you do standups or similar regular group meetings, write down what people are doing, and then at the next one, see what's still ongoing vs what's finished, and update your list.  The long-running projects are often the ones most likely to be relevant anyway, so if you find yourself copying an item from the old list to the new, that will help cement it.  The tricky bit is managing to discard, or disregard in the first place, information that isn't relevant.  Repetition is probably important here, so play with your own habits and tools, such as typing things out, if simply seeing something once isn't enough to remember it.
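For anyone who'd rather keep that list in a file than on paper, the carry-over habit can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the `carry_over` and `long_running` helpers, the threshold of three meetings, and the example items are all hypothetical.  Items that are no longer mentioned simply drop off, which is the "forget" half of the job.

```python
from __future__ import annotations


def carry_over(previous: dict[str, int], current_items: set[str]) -> dict[str, int]:
    """Build the new list from the old one.

    Each item still being worked on gets its count of consecutive
    meetings bumped by one; anything not mentioned this time is dropped.
    """
    return {item: previous.get(item, 0) + 1 for item in current_items}


def long_running(notes: dict[str, int], threshold: int = 3) -> list[str]:
    """Items that have been carried over for at least `threshold` meetings."""
    return sorted(item for item, count in notes.items() if count >= threshold)


# Three hypothetical standups in a row.
week1 = carry_over({}, {"db migration", "cert renewal"})
week2 = carry_over(week1, {"db migration", "new monitoring"})
week3 = carry_over(week2, {"db migration", "new monitoring"})

print(week3)
print(long_running(week3))
```

The items that keep surviving the copy are the ones worth holding in your head.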

One key detail to this may be something I was reminded of when I started at my current job: "You don't have to know everything".  And they're right, but I believe strongly that you should be 'aware' of everything, or at least a decent subset, so that you know the detail exists and can be looked for later when it is relevant.  A teacher once told me a story about an aircraft mechanic from Britain in World War 2.  Years after the war, he could still remember which page of the manual had the details of something like the spark plug gaps in the engines he worked on, but didn't remember the actual numbers.  But he said that it was much more important that he be able to look up the correct numbers quickly and get the adjustment correct than to mis-remember a number, get it wrong, and have a pilot die because the plane didn't perform.  What we do in Ops isn't usually quite so life-or-death, but a similar principle can be derived: knowing the information exists, and where to find it, is definitely preferable to not knowing anything at all, or guessing, and better than remembering it wrongly.

But why?  Why care at all?  Why expend all this effort?  To be quite fair, it's entirely possible to isolate yourself in your corner of the world.  If you're part of a team, you can focus on your specific areas of work and let everyone else take care of their corners.  If you're not on-call, this may well be fine.  If you do participate in an on-call roster, and have enough of a clue about who to call when a specific area goes wrong, you may also be fine, although you're on shaky ground.  But I promise you that you'll save a lot of time and effort at the pointy end of an incident if you've been paying attention.

And above all, it's kinda fun, and is definitely satisfying.  It develops your sense of mastery, understanding, and comfort.  If you know what's going on around you, then you can work in and around others without stepping on technical toes, avoid breaking things more than normal, and not make things worse in a time-critical situation.