Things I have learned - Part 2

Systems should be silent unless they have something important to say that requires action on the part of the system administrator. In particular, notifications (e.g. e-mails, but potentially other mechanisms) should only be sent if they require action. If the action required is not completely obvious from the contents of the notification itself, then you need to add a pointer to external documentation detailing what needs to happen (e.g. a "trouble code", or a link to your wiki).

To clarify the action requirements: it's obvious that "DISK FULL" means find/fix the extraneous disk usage, or add more space, but something like "Found a virus" may not be so obvious. In the latter case, what's the policy? Is it a customer file that needs to be preserved in case it's a false positive? Is it a file we will check out, then summarily delete if it's infected? This usually requires a process, which should probably be in your internal documentation, for which you need that link.

Systems MUST NOT (in the RFC sense) send "All OK" status messages, or notice/info level script output, to humans. They MAY send "All OK" messages to an automated system which knows to expect a certain set of OK messages, and which will then report to a human if not all are received as expected, or if some of them have reported an error.
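As a rough illustration of that automated middleman, here is a minimal Python sketch. Everything in it is an assumption made for the example: the EXPECTED inventory, the (host, check) naming, the "OK" status string and the 24-hour window are all hypothetical, and a real setup would more likely use an existing monitoring system than a hand-rolled script. The point is only that the machine keeps the list of what should have reported, and the human hears about the gaps and the errors, never the successes.

```python
from datetime import datetime, timedelta

# Hypothetical inventory: every (host, check) pair that must report OK
# at least once per period.
EXPECTED = {("web1", "backup"), ("web2", "backup"), ("db1", "backup")}


def find_problems(reports, now, max_age=timedelta(hours=24)):
    """Return the problems worth a human's attention.

    reports maps (host, check) -> (status, time of the last report).
    An empty result means everything reported OK: send nothing.
    """
    problems = []
    for key in sorted(EXPECTED):
        host, check = key
        if key not in reports:
            problems.append(f"{host}/{check}: no report received")
            continue
        status, seen = reports[key]
        if now - seen > max_age:
            problems.append(
                f"{host}/{check}: last report is stale ({seen:%Y-%m-%d %H:%M})"
            )
        elif status != "OK":
            problems.append(f"{host}/{check}: reported {status}")
    return problems


if __name__ == "__main__":
    now = datetime.now()
    reports = {
        ("web1", "backup"): ("OK", now - timedelta(hours=2)),
        ("web2", "backup"): ("ERROR", now - timedelta(hours=1)),
        # db1 never reported at all.
    }
    problems = find_problems(reports, now)
    if problems:  # stay silent unless there is something to act on
        print("\n".join(problems))
```

With the sample data only web2 (which reported an error) and db1 (which never reported) are mentioned; web1's success produces no output at all.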

The rationale for this is that with any significant number of systems, it's nearly impossible to notice that system X has not reported All OK about application Y when there are 100 other status messages from all the other systems. And if you regularly get normal notice/info level output, it's much harder to notice when that output changes; it becomes habit to skip over certain subject lines or from addresses (or worse, filter them out automatically). Ideally, receiving an e-mail at all should be the exception, one which kicks the human who reads it into action.

In that regard, I'm a big fan of "logcheck", and a huge hater of "logwatch". The latter summarises your log files periodically (typically daily) and sends you an e-mail with details of who logged in today, errors seen, and all manner of other detritus. It's never (in my experience) less than a single page of text; that's fine if you have one or two servers, but even a dozen is too many to process efficiently in your head while remembering what's important. Logcheck, on the other hand, looks through your logs, ignores anything that is normal, and reports everything else. Adding to the ignore list is pretty easy. It doesn't give you the ability to say "report more than X events in period Y", sadly, but it's still a huge step up from logwatch, or from seeing nothing. I cannot count the number of pending system issues I have found and fixed after logcheck reported something amiss. Conversely, when I don't receive e-mails from logcheck, I can be confident that nothing new or unexpected has been logged. This is a great position to be in.
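To make the "ignore what is normal, report everything else" idea concrete, here is a small Python sketch of that style of filtering. It is not logcheck's implementation or its rule-file format; the IGNORE patterns, the THRESHOLDS table and the log lines are invented for the example. The threshold part is an extra, a crude stand-in for the "more than X events in period Y" wish above, where the "period" is simply whatever batch of lines gets passed in.

```python
import re
from collections import Counter

# Hypothetical patterns describing "normal" lines that should never be reported.
IGNORE = [re.compile(p) for p in (
    r"sshd\[\d+\]: Accepted publickey for \w+",
    r"CRON\[\d+\]: \(\w+\) CMD ",
)]

# Hypothetical thresholds: lines matching these are only worth reporting,
# as a single summary, when more than `limit` of them appear in one batch.
THRESHOLDS = {re.compile(r"sshd\[\d+\]: Failed password"): 3}


def unusual_lines(lines):
    """Return the lines (and threshold summaries) a human should see."""
    report = []
    counts = Counter()
    for line in lines:
        if any(p.search(line) for p in IGNORE):
            continue  # known-normal: stay silent
        for pattern in THRESHOLDS:
            if pattern.search(line):
                counts[pattern] += 1  # counted now, judged after the loop
                break
        else:
            report.append(line)  # unknown, so report it verbatim
    for pattern, limit in THRESHOLDS.items():
        if counts[pattern] > limit:
            report.append(
                f"{counts[pattern]} lines matched {pattern.pattern!r} (limit {limit})"
            )
    return report
```

An empty result means no e-mail gets sent. Anything that shows up and turns out to be harmless becomes a new entry in the ignore list, so the report gets quieter over time rather than longer.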

Speaking of automatically deleting e-mails: if you find yourself about to do that, stop. If the e-mail isn't useful, stop whatever is sending it. If it's useful, why are you deleting it? Are you perhaps hoping someone else in the team might deal with it? Then you need a better process for handling reported issues than an e-mail sent to many people.

To summarise: Silence is golden, and let computers do the hard yards of ignoring "normality" and emphasising "abnormality".