Prometheus, Grafana, rates, and statistical kerfluffery

I love performance graphs and monitoring software; graphs are pretty, and there's nothing quite like the feeling of using a graph to identify precisely the cause of a technical problem. It means, however, that every few years I end up delving deep into some aspect of them to figure out why my graphs don't look 'right'. A few years ago it was Cacti + RRD losing information we thought we were keeping (pro-tip: consolidating to the maximum value of an average seen over a given period, rather than keeping the maximum of all maximums seen, is probably not what you want). This is the story of my latest battle with Prometheus and Grafana; there are quite a few different moving parts involved, and it makes for an interesting tale. Come on a journey with me (Spoiler: I win in the end).
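To make that RRD pro-tip concrete, here's a minimal sketch (the sample values are invented for illustration) of why taking the maximum of per-interval averages under-reports peaks compared with keeping the maximum of per-interval maximums:

```python
# Illustrative only: a brief spike (100) inside one consolidation interval,
# with a baseline of 10 everywhere else.
samples = [[10, 10, 100, 10], [10, 10, 10, 10]]  # two consolidation intervals

# Consolidating to the max of each interval's AVERAGE smears the spike away:
max_of_averages = max(sum(i) / len(i) for i in samples)

# Keeping the max of each interval's MAX preserves it:
max_of_maxes = max(max(i) for i in samples)

print(max_of_averages)  # 32.5 -- the spike is mostly gone
print(max_of_maxes)     # 100  -- the spike survives
```

In RRD terms, this is the difference between which consolidation function your RRA applies and what data it is applied to; if you care about peaks, you want the maximum carried all the way through.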

DR for Puppet

I recently had to set up a DR (Disaster Recovery) capability for our Puppet Master (Puppet 4, Open Source version); until now, we'd run with just a single puppet master in a single geographical location. Certain events brought DR to the forefront of our minds and priority lists, and the task fell to me.

What Could Possibly Go Wrong

At my work we often throw around the phrase WCPGW (What Could Possibly Go Wrong) in response to ill-advised or just plain crazy ideas. It's fun, and lets off some steam, but it occurred to me recently that there's a useful kernel of truth in it. Indeed, a good sysadmin is always asking this question: when designing systems, when preparing to make a change, in the heat of an emergency, and in security design and response.
