DR for Puppet

I recently had to set up a DR (Disaster Recovery) capability for our puppet master (Puppet 4, Open Source version); until now, we'd run with just a single puppet master in a single geographical location. Certain events brought DR to the forefront of our minds and priority lists, and the task fell to me.

Note that I'm not talking about resilient/redundant arrangements of multiple always-live servers, which are very well documented at https://docs.puppet.com/guides/scaling_multiple_masters.html, but rather a somewhat simpler approach more suited to our current needs. The solution was fairly simple in the end, and worth recording in case it is of value to someone else.

The constraints:

  • Geographically diverse (tens of milliseconds apart)
  • Always up-to-date and ready to go
  • Minimal intervention required on a failover

The solution has three main components:

  • Set 'certname' in the '[main]' section of puppet.conf on the DR puppet master to the FQDN of the primary puppet master.
  • Synchronise /etc/puppetlabs/puppet/ssl/ from the primary to the DR server (a one-off copy initially, then periodically).
  • Ensure your process for updating the puppet manifests updates both servers (in our case this is git-based, via Gerrit; your mileage may, and very likely will, vary).
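
The first two components can be sketched as a small shell script run on the DR master. This is only an illustration under stated assumptions, not the exact setup described here: the hostname puppet.example.com, the choice to pull over SSH as root, the --delete flag and the PUPPET_DR_MODE variable are all hypothetical. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: run on the DR master. Values marked "assumed" are illustrative.

PRIMARY_FQDN="puppet.example.com"       # assumed: FQDN of the primary puppet master
SSL_DIR="/etc/puppetlabs/puppet/ssl/"   # trailing slash: copy the directory contents

# Print the command that pins the DR master's certname to the primary's FQDN.
certname_cmd() {
    printf 'puppet config set certname %s --section main\n' "$1"
}

# Print the rsync command that pulls the ssl directory from the primary.
# -a preserves ownership and permissions; --delete (assumed) mirrors removals too.
sync_cmd() {
    printf 'rsync -a --delete root@%s:%s %s\n' "$1" "$SSL_DIR" "$SSL_DIR"
}

# PUPPET_DR_MODE (hypothetical): "setup" = one-off certname + sync,
# "sync" = periodic sync only, anything else = just show the commands.
case "${PUPPET_DR_MODE:-show}" in
    setup)
        sh -c "$(certname_cmd "$PRIMARY_FQDN")"
        sh -c "$(sync_cmd "$PRIMARY_FQDN")"
        ;;
    sync)
        sh -c "$(sync_cmd "$PRIMARY_FQDN")"
        ;;
    *)
        echo "would run: $(certname_cmd "$PRIMARY_FQDN")"
        echo "would run: $(sync_cmd "$PRIMARY_FQDN")"
        ;;
esac
```

For the periodic part, a cron entry on the DR master that runs the script with PUPPET_DR_MODE=sync every 15 minutes would match the interval we use; where the script lives and which user runs it are up to you.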

Failover is then just a matter of updating the DNS entry for the normal FQDN of the puppet master to point to the IP of the DR puppet master. Clients begin connecting to the DR instance and carry on largely unaware that anything has changed.

The only potential for unexpected behaviour comes from newly signed node certificates that haven't yet been synchronised to the DR server. We sync (a very light operation) every 15 minutes, which is possibly overkill but limits the risk around a failover event. Worst case, any nodes which had their certs signed between the last sync and the loss of the primary need to be re-signed against the DR server after failover.

Conveniently, certname doesn't affect the fqdn or hostname facts, just which certificate is used by the puppet master (and client) on the DR instance. I did have to update some of the puppet-master-specific manifests to use $trusted['certname'] instead of $fqdn, but this was quite reasonable and made things a little clearer overall.

Easy as, once you know how.
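
For the failover step itself, a quick check of where the master's name currently points can save some head-scratching. The following is a rough sketch, not part of our actual tooling: the hostname, both IP addresses (taken from the documentation ranges) and the PUPPET_DR_CHECK variable are hypothetical, and the lookup uses getent, so it assumes a Linux host.

```shell
#!/bin/sh
# Sketch only: the name and addresses below are placeholders.

MASTER_FQDN="puppet.example.com"   # assumed: the name agents connect to
PRIMARY_IP="192.0.2.10"            # assumed: primary puppet master
DR_IP="198.51.100.10"              # assumed: DR puppet master

# Classify a resolved address as primary, dr, or unknown.
which_master() {
    case "$1" in
        "$PRIMARY_IP") echo "primary" ;;
        "$DR_IP")      echo "dr" ;;
        *)             echo "unknown" ;;
    esac
}

# Only perform the live lookup when PUPPET_DR_CHECK=1 (hypothetical switch).
if [ "${PUPPET_DR_CHECK:-0}" = "1" ]; then
    resolved=$(getent hosts "$MASTER_FQDN" 2>/dev/null | awk '{print $1; exit}')
    echo "$MASTER_FQDN -> ${resolved:-unresolved} ($(which_master "$resolved"))"
fi

# After failover, on the DR master (Puppet 4 CLI), nodes signed after the
# last sync can be handled with:
#   puppet cert list              # show outstanding certificate requests
#   puppet cert sign <certname>   # sign each request you recognise
```

Run with PUPPET_DR_CHECK=1 after changing the DNS entry; once the output reports "dr", agents resolving the name should be reaching the DR instance, subject to DNS caching.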