Situational Awareness as a Sysadmin

System Administration, or more accurately the Operations side of IT, is at its heart a technically complex job.  However, some soft skills are just as important.  Note that I'm using 'soft' in a non-derogatory sense.  The raw technical aspects are 'hard' in that they have well-defined edges and typically very clear right and wrong answers.  The 'soft' aspects are fuzzier, with less distinct edges and more nuance.  One of these soft skills is Situational Awareness.

Fixing (one case of) AWS EFS timeouts/stalls

AWS Elastic File System (EFS) is an NFS-compatible, network-accessible shared storage system.  It lets you outsource the problem of HA network storage, which is highly attractive in some circumstances.  But there are some sharp edges, which we discovered at work.

The problem

We're using it for the shared file storage of our GitLab HA cluster.  For quite a while it worked fine; then a month or two back it started occasionally stalling or timing out.  In kern.log we'd see:

