What's so different about development vs operations?

On starting a new job recently (well, last year when I started this entry), I was reminded yet again just how long it takes an Operations person (Sysadmin, SRE, whatever you want to call us this year; I'm going to use 'Ops' for this post) to get up to speed, compared to a developer.

I noticed this in particular because a person I worked with at my previous job moved to the same new employer a week earlier than me, him as a developer, me as an SRE.  He was up and running within a couple of weeks, committing code to the core repo and getting features shipped.  I took at least 3-4 weeks before I was in a position to do anything other than super-trivial things, and honestly, it was at least 3 months before I was even close to fully operational, and probably 6 months before I was sufficiently up to speed that I felt I could reasonably handle most things that might hit me while on call.  And this is normal for Ops; perhaps a little extended at this job because the company is all-remote, world-wide, and deliberately asynchronous, and I'm in a timezone a bit apart from the bulk of my team mates, but 'a few months' isn't unusual at all.

What is it that makes this take so much longer?  In a smoothly functioning SRE team, code review can and should happen, so it's a stretch to suggest that devs' good use of code review is a major factor (although it probably enjoys better general acceptance, practice, and habits amongst dev teams than amongst ops teams, simply because of history).

Is it that production systems are often much more unique than apps are?  Perhaps.  A Ruby-on-Rails application (which is the core at my current job, as well as my previous one, interestingly enough) has a lot of convention to fall back on, a lot of standard practices and things you can expect to Just Be.  These definitely ease the introduction to a new code-base, allowing a dev to find their feet a lot quicker.  I assume other frameworks have at least a few conventions, although perhaps not as strong as in the RoR ecosystem, monkey-patching notwithstanding.  This uniqueness feels like it could be an important aspect.  For ops, it's not just a choice of database, it's a choice of how to deploy it, how to handle high availability, disaster recovery, backups, logging, and a myriad of other details, and in my experience no one solves these in quite the same way.  I've seen as many unique and customised backup systems as I have had jobs, if not more.  Add in combinations of these across all the layers, cloud providers, etc., and you end up with no two companies being even similar: they're all utterly unique.  But part of me suspects I'm over-egging this, and that code-bases are the same, with history and layers and gotchas (technical debt, if you will).  So I'm not sure this fully explains the difference.

The other thing devs often have is automated tests.  A well-provisioned set of unit and integration tests provides a valuable safety net for a developer new to a code-base, allowing them to confidently modify one corner of the code with the expectation that many unintended consequences elsewhere will be caught by those tests.  There are obviously still many ways tests can be insufficient, but such issues are often missed (or caused) at coding time even by experienced senior developers who know the code base well, because the issue is obscure and complicated.  On the other side, automated testing for infrastructure is hard.  There are a lot of tools for configuration-management testing, but ultimately very little compares to actually running a bit of the infrastructure for real, from the new code/configuration, and seeing how it behaves.  There is often (I'd love to say always, but I know it's not true) a non-production 'staging' environment, but this is usually where code is tested on production-like infrastructure, and breaking it while testing operational/infrastructure changes is problematic, as doing so can block normal pre-deployment testing, and thus releases.  And even if you have your own infrastructure environment to play with, scale is often lacking.  How do you test that your pgbouncer configuration is going to behave properly when you add another 8 web servers to the existing 24, in your scaled-down test environment, without faking so much that the test is meaningless?  If you're operating on 10% of the data size, with made-up data, how can you be reasonably sure that performance is what it should be?  A full production-equivalent environment is also often hideously expensive, even if your configuration system lets you spin it up on your cloud provider of choice in a few minutes or hours.
But oh how I wish we could have such a thing; how much confidence would an Ops person have if they could make a change on an independent environment that was scaled to the size of production, with plausible data, and real-scale traffic being thrown at it?  We'd become masters of our domain, happy to make any change anytime, because we could simply try it out first in a safe place.  Dreams are free.
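To make the pgbouncer worry concrete, here's a back-of-envelope sketch (all numbers hypothetical, not taken from any real environment) of the kind of capacity arithmetic that only bites at real scale: each web server holds a pool of client connections into pgbouncer, which multiplexes them onto a much smaller pool of actual Postgres connections, and both pools have hard limits.

```python
# Back-of-envelope check: does a pgbouncer config still have headroom
# after adding web servers?  All figures below are made up for
# illustration -- plug in your own.

def connection_headroom(web_servers, conns_per_server,
                        max_client_conn, default_pool_size, max_db_conns):
    """Compare aggregate client demand against pgbouncer's limits."""
    client_demand = web_servers * conns_per_server
    return {
        "client_demand": client_demand,
        # Will pgbouncer accept all the app-side connections?
        "client_ok": client_demand <= max_client_conn,
        # Does the backend pool fit within Postgres's connection limit?
        "backend_ok": default_pool_size <= max_db_conns,
    }

# 24 web servers, 40 app connections each, against hypothetical limits.
before = connection_headroom(24, 40, max_client_conn=1000,
                             default_pool_size=50, max_db_conns=100)
# The same limits after adding another 8 web servers.
after = connection_headroom(32, 40, max_client_conn=1000,
                            default_pool_size=50, max_db_conns=100)

print(before)  # 960 client connections: fits under max_client_conn
print(after)   # 1280 client connections: quietly over the limit
```

At 10% scale both configurations look perfectly healthy; only the full-size numbers reveal that `max_client_conn` would be exceeded, which is exactly the class of problem a scaled-down test can't show you.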

Is it a mindset thing?  Is a good (or just long-lived) Ops person simply painfully aware of all the ways a complex system can fail, and thus needs (and perhaps just wants) to understand it in depth before they dive in and change something?  There's certainly a difference in the way devs and ops view the world, which is entirely natural and valuable, as both groups have different responsibilities and skillsets necessary to do their jobs well.  I suspect this drive to deeply understand complex running systems is part of what leads some people to Ops, although that may very well just be my own personal bias and how I view the world.  This also feels like I'm doing a disservice to those devs who really do think through all these things.

It could simply be the combination of the above, plus some other factors.  But I don't know.  I still feel like I'm missing some critical detail, some factor that would explain it succinctly.  What I do know is that it's a hard road starting a new Ops job.

Also: don't go changing Ops jobs too regularly, or you'll spend all your time just trying to learn the new systems.  It'll be fun, for a bit, but mastery is nice too.