Things I have learned - Part 1

I've been a system administrator for nearly 10 years now, and there are a few things I've learned along the way. This is the first in a series of short posts sharing some of these things.

One of my colleagues has a saying: "Just because it's working, doesn't mean it's not broken". Being able to build a server and make it do something, and even fix it when it breaks, does not mean you can run, maintain, and monitor what I call an Operational System. A fully operational system requires a whole lot of things to be correct; otherwise the system is unmaintainable. It might look like it's working; it might even be nominally performing the required task (e.g. serving a web site), but that's quite a step from having some confidence that it will continue to do so for the foreseeable future without unnecessary intervention, that it will tell you when things are wrong, and that it won't be a needy child requiring constant attention. And that's what I mean by "Operational".

Having a properly operational system means you can add another server, then another, with very little increase in management overhead. It means you're doing the minimum amount of work required to maintain them all.

It also means there is sufficient documentation. There are plenty of arguments to be had about how much documentation that is, but in short it depends on what can be assumed from whatever amounts to "standard practice", plus additional details about per-system variations from that practice and about system-specific custom processes. So far so obvious, but what might be less obvious is that too much documentation is also bad. There are several ways in which documentation can become excessive:

  1. Unnecessary repetition - when you have standard practice, and you document the details of it again for each server/system. This makes the documentation harder to maintain, and makes it harder to see where the standard practice has been used and where there is variation/customisation. This is the documentation equivalent of code re-use via shared code vs copy/paste.
  2. Details better obtained from the running system - I cannot count the number of times I've found configuration files or scripts copied verbatim from a running system into a wiki which then, on investigation, turn out to have been updated in production but not in the documentation. The documentation should instead point the reader at the relevant place on the running system (e.g. the config file), with perhaps some comments on the important elements. If you have active documentation (e.g. a wiki with plugins), it's not out of the question to pull data (e.g. IP addresses) dynamically from some configuration database; that's ok because, by definition, it is always going to be up to date.
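
To illustrate how easily a pasted copy goes stale, here is a minimal Python sketch that compares a config snippet copied into documentation against the live file. The function name `config_drift` and the example path are made up for illustration; this is not a tool mentioned in this post, just one way such a check might look.

```python
import difflib
from pathlib import Path


def config_drift(documented_text: str, live_path: str) -> list[str]:
    """Return unified-diff lines between a documented copy and the live file.

    An empty list means the copy in the documentation still matches the
    running system. `documented_text` is whatever was pasted into the wiki;
    `live_path` is the file on the server (both are hypothetical inputs).
    """
    live_text = Path(live_path).read_text()
    diff = difflib.unified_diff(
        documented_text.splitlines(),
        live_text.splitlines(),
        fromfile="wiki copy",
        tofile=live_path,
        lineterm="",
    )
    return list(diff)


if __name__ == "__main__":
    # Hypothetical usage: the wiki says port 80, production may have moved on.
    wiki_copy = "port = 80\nworkers = 4\n"
    for line in config_drift(wiki_copy, "/etc/myapp/app.conf"):
        print(line)
```

If a check like this ever reports differences, that's usually a sign the copy shouldn't have been in the documentation in the first place; a pointer to the file would never have drifted.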

The documentation requirements imply that there is "standard practice" that can be assumed. An operational system consisting of more than a couple of servers must have such standardised practice. A configuration management tool such as Puppet or Chef is mandatory. But so is some sort of automated OS installation step, such as FAI, Kickstart, or preseed, that gets your system built to a point where the configuration management system can bring it up to speed. It's ok to have some manual steps, provided they're well documented and, ideally, specific to a small set of servers. For example, establishing replication between database servers can often be a bit tricky to automate, and it's ok (in my book at least) for that to be a well documented set of steps that are performed when the replicating pair is set up. However, you need to weigh this against how often the manual steps are performed; if this is your bread-and-butter build and you're doing it daily, then it may be worth the time to automate. If you're doing it once a year, not so much.
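
The "daily vs once a year" trade-off can be made concrete with a rough break-even calculation. The sketch below is only an illustration: the function name and all the numbers are invented, and it ignores things like the cost of maintaining the automation or the value of avoiding manual mistakes.

```python
import math
from typing import Optional


def break_even_runs(manual_minutes: float,
                    automated_minutes: float,
                    build_minutes: float) -> Optional[int]:
    """Number of runs after which automating a task has paid for itself.

    manual_minutes:    time one manual run takes
    automated_minutes: time one run takes once automated
    build_minutes:     one-off time spent writing the automation
    Returns None if automation saves no time per run.
    """
    saved_per_run = manual_minutes - automated_minutes
    if saved_per_run <= 0:
        return None
    return math.ceil(build_minutes / saved_per_run)


if __name__ == "__main__":
    # Made-up figures: 45 minutes by hand, 5 minutes automated,
    # and two working days (960 minutes) to write the automation.
    runs = break_even_runs(45, 5, 960)
    print(runs)  # 960 / 40 = 24 runs
```

With those made-up figures, a task done daily pays for its automation in about a month, while a task done once a year would take 24 years to break even.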

The recurring theme here is optimisation/minimisation of effort and data. Minimising the steps we have to take to build systems. Minimising how much documentation we write, and how much we have to read to understand the system. Minimising the amount of e-mail we get from the systems. All of which leads to maximising the number of servers we are able to manage (or minimising the time it takes to do so, thus leading to more time for gaming, drinking, or whatever else makes you happy). This optimisation is the only way we as sysadmins can manage the ever-increasing number of systems expected of us, without our heads exploding.

So over the next few posts, which I expect will be a lot shorter, I'll explore some of the ways I try to achieve the above.