Things I have learned - Part 5

Short and sweet today: there is always a point of failure between your redundant, non-single-point-of-failure components.

You know, the single cable or switch that connects your VRRP firewalls, which on failure results in two machines that both think they're master. Or the RAID controller that connects to both disks in your RAID-1 mirror, which on failure takes out both disks (or worse, corrupts data on them). Or that little Environmental Monitoring Unit in your lovely big SAN that, on failure, makes the redundant SAN controllers decide that they cannot and should not be serving any delicious data from your racks of redundant disks to your servers over the multi-path, multi-switch SAN fabric. That last one? Yeah, saw that in production once.

Removing all single points of failure is actually rather hard. I'm not saying you shouldn't try, but when you think you're done, look again. Look in the cracks between your components, and ask yourself what will happen if those cracks widen. It's kinda fun, in a "watch a scary movie" way.
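
If you want to make "look again" a bit more mechanical, here is a rough sketch of one way to do it in Python: write down the data path, write down what each component quietly depends on, then knock out one component at a time and see whether the servers can still reach the disks. Everything in it is made up for illustration (the names LINKS, REQUIRES, "emu" and so on are hypothetical, loosely modelled on the SAN story above), and it only finds the dependencies you remembered to write down.

```python
from collections import deque

# Hypothetical data path: which component can hand traffic to which.
LINKS = {
    "server": ["switch_a", "switch_b"],
    "switch_a": ["controller_a", "controller_b"],
    "switch_b": ["controller_a", "controller_b"],
    "controller_a": ["disks"],
    "controller_b": ["disks"],
}

# Hypothetical hidden dependencies: a component stops working
# if anything it requires is gone, even if it is not on the data path.
REQUIRES = {
    "controller_a": ["emu"],
    "controller_b": ["emu"],
}


def surviving(components, requires, failed):
    """Return the components still working after `failed` dies,
    including knock-on failures through `requires`."""
    alive = set(components) - {failed}
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if any(dep not in alive for dep in requires.get(node, ())):
                alive.remove(node)
                changed = True
    return alive


def reachable(links, alive, source, target):
    """Breadth-first search from source to target over alive components only."""
    if source not in alive or target not in alive:
        return False
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in links.get(node, ()):
            if neighbour in alive and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, requires, source, target):
    """List every component whose failure alone cuts source off from target."""
    components = set(links) | set(requires)
    for group in list(links.values()) + list(requires.values()):
        components.update(group)
    spofs = []
    for candidate in sorted(components - {source, target}):
        alive = surviving(components, requires, candidate)
        if not reachable(links, alive, source, target):
            spofs.append(candidate)
    return spofs


print(single_points_of_failure(LINKS, REQUIRES, "server", "disks"))  # ['emu']
```

Two switches, two controllers, and the only thing that takes the whole lot down is the little box nobody drew on the diagram. Drop the REQUIRES entries and the list comes back empty, which is exactly the false comfort the post is about.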