<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="https://www.stroppykitten.com"  xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>stroppykitten.com</title>
 <link>https://www.stroppykitten.com</link>
 <description></description>
 <language>en</language>
<item>
 <title>What&#039;s so different about development vs operations?</title>
 <link>https://www.stroppykitten.com/sysadmin/ops-vs-dev</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;On starting a new job recently (well, last year when I started this entry), I have been reminded yet again just how long it takes an Operations person (Sysadmin, SRE, whatever you want to call us this year; I&#039;m going to use &#039;Ops&#039; for this post) to get up to speed, compared to a developer.&lt;/p&gt;
&lt;p&gt;I noticed this in particular because a person I worked with at my previous job moved to the same new employer a week earlier than me, him as a developer, me as an SRE.  He was up and running within a couple of weeks, committing code to the core repo, and getting features shipped.  It took at least 3-4 weeks before I was in a position to start doing anything other than super trivial things, and honestly, it was 3 months at least before I was even close to fully operational, and probably 6 months to get sufficiently up to speed that I felt like I could reasonably handle &lt;em&gt;most&lt;/em&gt; things that might hit me while on call.  And this is &lt;strong&gt;normal&lt;/strong&gt; for ops; perhaps a little bit extended at this job because the company is all-remote, world-wide and deliberately asynchronous, and I&#039;m in a timezone a bit apart from the bulk of my team mates, but &#039;a few months&#039; isn&#039;t unusual at all.&lt;/p&gt;
&lt;p&gt;What is it that makes this take so much longer?  In a smoothly functioning SRE team, code-review can and should happen, so it&#039;s a stretch to suggest that good use of that by devs is a major factor (although it probably has better general acceptance, practice, and habits amongst dev teams than it does in ops teams, just because of history). &lt;/p&gt;
&lt;p&gt;Is it that production systems are often much more unique than apps are?  Perhaps.  A Ruby-on-Rails application (which is the core at my current job, as well as my previous, interestingly enough) has a lot of convention to fall back on, a lot of standard practices and things you can expect to &lt;em&gt;Just Be&lt;/em&gt;.  These definitely ease the introduction to a new code-base, allowing a dev to find their feet a lot quicker.  I assume other frameworks have at least a few conventions, although perhaps not as strong as they are in the RoR ecosystem, monkey-patching notwithstanding.  This uniqueness feels like it could be an important aspect.  For ops, it&#039;s not just things like a choice of database, it&#039;s a choice of how to deploy it, how to handle High-Availability, Disaster-Recovery, backups, logging, and a myriad of other details, and in my experience, no-one solves these all in the same way.  I&#039;ve seen as many unique and customised backup systems as I have jobs, if not more.  Add in combinations of these across all the layers, cloud providers, etc, and you end up not only with no two companies being similar, but with each one utterly unique.  But part of me suspects I&#039;m over-egging this, and that code-bases are the same, with history and layers and gotchas (technical debt, if you will).  So I&#039;m not sure this fully explains the difference.&lt;/p&gt;
&lt;p&gt;The other thing devs often have is automated tests.  A well-provisioned set of unit and integration tests provides a valuable safety net for a developer new to a code-base, allowing them to confidently modify one corner of the code with the expectation that many unintended consequences on other areas will be caught by those tests.  There are obviously still many ways tests can be insufficient, but often these sorts of issues are missed (or caused) at coding time even by experienced senior developers who know the code base well, because the issue is obscure and complicated.  On the other side, automated testing for infrastructure is &lt;strong&gt;hard&lt;/strong&gt;.  There are a lot of tools for configuration management testing, but ultimately very little compares to actually running a bit of the infrastructure for real, from the new code/configuration, and seeing how it works.  There is often (I&#039;d love to say always, but I know it&#039;s not true) a non-production &#039;staging&#039; environment, although this is often where &lt;strong&gt;code&lt;/strong&gt; is tested in production-like infrastructure, and breaking this environment while testing operational/infrastructure changes is problematic, as such things can prevent normal pre-deployment testing, and thus releases.  And even if you have your own infrastructure environment to play with, scale is often lacking.  How do you test that your pgbouncer configuration is going to behave properly when you add another 8 web servers to the existing 24, in your scaled down test environment, without faking so much that the test is meaningless?  If you&#039;re operating on 10% of the data size, with made-up data, how can you be reasonably sure that performance is what it should be?  A full production-equivalent environment is also often hideously expensive, even if your configuration system allows you to spin it up on your cloud-provider of choice in a few minutes or hours.  
But oh how I wish we could have such a thing; how much confidence would an Ops person have if they could make a change on an independent environment that was scaled to the size of production, with plausible data, and real-scale traffic being thrown at it?  We&#039;d become masters of our domain, happy to make any change anytime, because we could simply try it out first in a safe place.  Dreams are free.&lt;/p&gt;
&lt;p&gt;Is it a mindset thing?  Is a good (or just long-lived) Ops person simply painfully aware of all the ways a complex system can fail, and thus needs (and perhaps just wants) to understand it in depth before they dive in and change something?  There&#039;s certainly a difference in the way devs and ops view the world, which is entirely natural and valuable as both groups have different responsibilities and skillsets necessary to do their jobs well.  I suspect this drive to understand deeply complex &lt;em&gt;running&lt;/em&gt; systems is part of what leads some people to Ops, although that may very well just be my own personal bias and how I view the world.  This also feels like I&#039;m doing a disservice to those devs who really do think through all these things.&lt;/p&gt;
&lt;p&gt;It could simply be the combination/addition of the above, and some other factors.  But I don&#039;t know.  I still feel like I&#039;m missing some critical detail, some factor that would explain it succinctly.  What I do know is that it&#039;s a hard road starting a new Ops job.&lt;/p&gt;
&lt;p&gt;Also: don&#039;t go changing Ops jobs too regularly, or you&#039;ll spend all your time just trying to learn the new systems.  It&#039;ll be fun, for a bit, but mastery is nice too.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Fri, 19 Jun 2020 21:50:00 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">35 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/sysadmin/ops-vs-dev#comments</comments>
</item>
<item>
 <title>My struggle with Kubernetes</title>
 <link>https://www.stroppykitten.com/rants/technical/kubernetes-a-ball-of-yarn</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;I struggle with Kubernetes&lt;/p&gt;
&lt;p&gt;I&#039;ve been circling Kubernetes for a couple of years now at work (two different jobs), slowly getting up to speed and coming to terms with what it is and how it works.  I&#039;m still a long way from being an expert, but even as I should be getting at least *comfortable* with it, I&#039;m finding myself still struggling.  Not with the raw technical matters; to be blunt, there&#039;s not a large number of fundamental concepts to grok with Kubernetes, just a few key ones and then a fair amount of nitty-gritty detail with each thing.  But when I am tasked with &#039;deploy this thing to Kubernetes&#039;, or when I start thinking about how Kubernetes will impact some other system if and when we deploy to it, I start feeling tense and anxious.  I have probed these feelings, much like one might probe a sore tooth, feeling the pain and trying to figure out what it is that makes me feel this way, and the extent of those feelings of pain.&lt;/p&gt;
&lt;p&gt;I&#039;ve been a professional Linux systems administrator for between 15 and 20 years, depending how you count experience (it wasn&#039;t officially my job title for some of those early years, but I was sort of doing it at least part time anyway).  I started before virtualisation was a usable thing (I assume it was around, but wasn&#039;t mainstream and practically usable until several years into my career), and installing server Operating Systems onto bare metal was, if not common, at least something done occasionally (as opposed to &#039;practically never&#039; now).  Load-balancing wasn&#039;t common (at least where I was working, which may just have been a matter of scale not tech), configuration management was shell scripts and dreams, NoSQL was just an early fever-dream of a mad few (some things never change... but I jest), and there was absolutely no commodity Cloud at all (Amazon S3 wasn&#039;t launched until about 8 years into my IT career).  I have seen these things come, and I have adapted.  Each required re-learning things, and adjusting my habits and thought patterns, but it always seemed reasonable.  None of them cause me the same feelings that Kubernetes does.  It&#039;s possible I&#039;m just getting old and set in my ways, but I see other new things coming and developing and they don&#039;t do that to me, so I *think* it&#039;s not just me.&lt;/p&gt;
&lt;p&gt;And finally, I think I have a handle on it, and it all comes from a metaphor.  See, Kubernetes is like a big ball of yarn.  But, so are the systems I have always designed, built, and managed.  They&#039;re made of bits and pieces of tools, techniques, and configuration that combine to produce the result we want.  There are common bits to everything, things you can replace with similar yarn (same thickness, different colour), and unique bespoke things custom to any particular ball of yarn.  &lt;/p&gt;
&lt;p&gt;The difference with *my* ball of yarn vs Kubernetes, is that it&#039;s entirely my ball of yarn.  I composed it with the parts that I understand and know; as I learned virtualisation, the cloud, load balancing and so on, I was just learning new types of yarn, how to cut them, and how to tie them together.  Kubernetes, on the other hand, is a ball of yarn into which I poke some baubles (containers), and then the little magic pixies that live inside the ball of yarn put those baubles somewhere inside the ball, and tie them together for me.  Those same pixies can magically make the ball bigger or smaller at any time (within limits), if they see the need.  Rather than me adding in new chunks of yarn, the pixies do it for me, based on the guidance I give them (oh my hamster, so much YAML).&lt;/p&gt;
&lt;p&gt;Where I have trouble is in my understanding of how those pixies will do their job; they still seem magical to me, and the instructions I&#039;m allowed to give them feel obscure and somehow limited (although I can&#039;t seem to quantify that feeling).  And those pixies are able to go on strike, or get sick, or just misbehave, and my ability to peer inside the ball of yarn feels limited; I *can*, to a degree, but the tools are sometimes different (or limited, or missing), the picture I&#039;m looking at is different, and the pixies might still be running around doing things while I&#039;m looking.&lt;/p&gt;
&lt;p&gt;And all of that bugs me.  I was talking with my wife recently about something work related, and she got this look on her face and said to me: &quot;Oh, you&#039;re a control freak&quot;.  It&#039;s true, I am, and I&#039;ve known it for a while; one of the things I enjoy about systems administration is understanding and controlling (to the degree I need) complex systems.  And until my knowledge, comfort, and understanding get better, Kubernetes feels like it&#039;s taking those away from me.&lt;/p&gt;
&lt;p&gt;I will get there; once I spend more time working with it, I&#039;m sure I&#039;ll get to a point where it feels as comfortable as all the other tools I use.  But until then, I&#039;m still going to firmly gird my loins before entering battle, and overcome that feeling of squick.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Thu, 19 Dec 2019 03:15:47 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">38 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/rants/technical/kubernetes-a-ball-of-yarn#comments</comments>
</item>
<item>
 <title>Bread part 2</title>
 <link>https://www.stroppykitten.com/random/bread-part-2</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;Since &lt;a href=&quot;/random/bread&quot;&gt;the previous bread post&lt;/a&gt;, we have moved cities, and I&#039;m continuing to make bread.  It&#039;s been fine, but a week or so ago I did it on autopilot and put 1.25 cups of water in again. &lt;/p&gt;
&lt;p&gt;And it worked fine.  The loaf turned out perfectly, no problems.&lt;/p&gt;
&lt;p&gt;I&#039;ve since made a couple more loaves, with 1.25 cups of water.  All fine. &lt;/p&gt;
&lt;p&gt;I find it fascinating that 20-25% variation in the amount of water has so little effect on the bread; maybe if I made both variants and did side-by-side taste testing I could tell there was a difference, but from just eating the bread daily I haven&#039;t noticed anything yet.&lt;/p&gt;
&lt;p&gt;Also I have even less idea what the hell was going on when the loaves weren&#039;t working and the dough was wet.  Different flour maybe?  Surely the subtleties of protein content in flour wouldn&#039;t have &lt;em&gt;that&lt;/em&gt; much effect though.  And stop calling me Shirley.&lt;/p&gt;
&lt;p&gt;Oh well.  I have bread, which means I have toast, which means I have breakfast.  I&#039;m just going to quietly put this episode behind me, and pretend the world is actually sane.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Sat, 16 Nov 2019 22:43:20 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">37 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/random/bread-part-2#comments</comments>
</item>
<item>
 <title>Bread</title>
 <link>https://www.stroppykitten.com/random/bread</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;Or, how I lost my mind&lt;/p&gt;
&lt;p&gt;I&#039;ve made most of the bread my wife and I eat for the last 8 years or so, in a home-grade breadmaking machine, the sort where you throw the ingredients in a metal pan/tin/container, push a couple of buttons, and 3-4 hours later a nice hot fresh loaf of bread is ready.&lt;/p&gt;
&lt;p&gt;This has worked well; it tastes better than store bought, is cheaper, and it&#039;s kinda fun to stick it to The Man on a regular basis.&lt;/p&gt;
&lt;p&gt;After a bit of experimentation early on, I settled on a recipe that works well, and quickly memorised it.  I was able to prepare a loaf almost on auto pilot without much effort at all.  Every now and then, I&#039;d make a mistake in one of the minor ingredients and the loaf would come out wrong, but it was typically easy to diagnose.  It&#039;s pretty obvious to taste if you forget the salt, or to size if you forget the sugar.&lt;/p&gt;
&lt;p&gt;Then a couple of weeks ago, it all went wrong.  A loaf turned out wrong; it was half the normal height and dense, with a flat top, and the texture was full of bubbles rather than a nice bread crumb.  It looked like the yeast was working (creating the bubbles), but something else had gone wrong.  Maybe I&#039;d only put in 2 cups of flour, not 3.  Maybe too much salt?  Or not enough?  It tasted largely ok; a bit different perhaps, but not really off.&lt;/p&gt;
&lt;p&gt;Oh well, this happens.  So I put another loaf on.  It happened again, in exactly the same way.  OK, that&#039;s annoying.  The next day, I tried again, being really careful with the recipe.  And it worked, mostly.  The loaf was still a bit small, but had risen fairly well and had a nice domed top.  Not ideal, but things were back on track.&lt;/p&gt;
&lt;p&gt;The next loaf was bad again, in the same way.  I fiddled a bit with things like the temperature of the water, and using other baking plans (e.g. the one for wheat-grain bread which has a 45-minute pre-soak with a bit of heating, to soak the grains).  Nothing worked.  I did notice that the dough wasn&#039;t holding together in as much of a solid lump during the early phases, but I hadn&#039;t looked at the dough for years and couldn&#039;t be sure if it was normal or not.  Some research suggested that the texture was possibly a case of too much liquid; the bigger bubbles could be (if I understand correctly, and I&#039;m not an expert) because the gas produced by the yeast wasn&#039;t held properly by the structure of the dough, and it couldn&#039;t sustain the height during rising.&lt;/p&gt;
&lt;p&gt;But I knew this recipe backwards and inside out.  It called for 1.25 cups of water, and I&#039;d been doing that for years.  In the interests of science, and taking the actual observations into account, I dropped that to 1 cup, and lo, the loaf turned out normal.&lt;/p&gt;
&lt;p&gt;This should make me happy.  But it doesn&#039;t, because there are only two explanations I can think of:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Something invisible has changed; I&#039;m thinking the mineral content of the town water supply, or maybe the yeast has gone off (except it&#039;s a fairly fresh batch, and I&#039;d used some from that batch successfully before it all went wrong).  There was no obvious time correlation with any other changes I could see.&lt;/li&gt;
&lt;li&gt;I had mis-remembered the recipe. &lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;At this stage, option 2 seems the most likely.  This distresses me, because it means not just that my memory is playing silly buggers, but that it&#039;s lying to me.  I didn&#039;t wake up one morning and forget the recipe.  I woke up, was convinced I knew the recipe, but a cosmic ray had flipped a bit in my memory, and I was now wrong.  Before the successful loaf, I would have sworn in a court of law that the recipe called for 1.25 cups of water, and something else must have changed.  But the evidence suggests that I was simply wrong.&lt;/p&gt;
&lt;p&gt;I don&#039;t know that I can trust my memory any more, and this is unpleasant. &lt;/p&gt;
&lt;p&gt;Getting older sucks.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Sat, 05 Oct 2019 06:07:23 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">36 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/random/bread#comments</comments>
</item>
<item>
 <title>Situational Awareness as a Sysadmin</title>
 <link>https://www.stroppykitten.com/sysadmin/situational-awareness</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;System Administration, or more accurately the Operations side of IT, is at its heart a technically complex job.  However, there are some soft skills that are important.  Note that I&#039;m using &#039;soft&#039; in a non-derogatory sense.  The raw technical aspects are &#039;hard&#039; in that they have well-defined edges, and typically very clear right and wrong answers.  The &#039;soft&#039; aspects tend to be fuzzier, with softer edges and more nuance.  One of these soft skills is Situational Awareness.&lt;/p&gt;
&lt;p&gt;Situational Awareness is an interesting skill which I&#039;ve only recently become aware enough of to give a name.  It&#039;s the ability to pay attention to the various clues about what&#039;s going on around you, remember them for a while, and thus infer things about what&#039;s going on in the systems you&#039;re managing. &lt;/p&gt;
&lt;p&gt;At its simplest, it&#039;s being aware of the areas that your colleagues are working on Right Now(tm).  Then, when you see alerts, errors, or your log review shows something unusual, you can correlate that to either ignore it (because you&#039;re confident it&#039;s transient, being worked on, and/or not a real problem), point it out to the right person to fix it, or (in rare, sad cases), fix it yourself because you know the person who did it has no clue and never will, and it&#039;s just easier that way.  I&#039;m not advocating that last option as ideal, I just know that in some situations it truly is the right choice.&lt;/p&gt;
&lt;p&gt;You don&#039;t need to keep track of the full details of the work going on, just the general areas e.g. which set of servers, the name of the software, etc.&lt;/p&gt;
&lt;p&gt;Deeper awareness can be obtained by other channels like overhearing local (non-private) conversations, or reading other communications flying around.  For example, if you see the e-mails or chat indicating someone has just purchased a replacement SSL cert, then you&#039;ll have a plausible cause should there be some unexpected certificate errors in the near future.  If absolutely nothing else, pay attention to the change control process for your org (however weak it might be); that should be the highest quality signal for Things That Are Happening.  YMMV based on the sanity of the process, of course, but even scanning a huge list of changes and noting the important looking ones is better than nothing.&lt;/p&gt;
&lt;p&gt;The difficult bit is to collate, remember, filter, and eventually forget this information.  I don&#039;t have specific suggestions for this, it&#039;s just something I&#039;ve learned how to do.  If you&#039;re finding it hard to do in your head, perhaps take the opportunity to write things down.  If you do standups or similar regular group meetings, write down what people are doing, and then at the next one, see what&#039;s still ongoing vs what&#039;s finished, and update your list.  The long-running projects are often the ones most likely to be relevant anyway, so if you find yourself copying an item from the old list to the new, that will help cement it. The tricky bit is managing to discard, or disregard in the first place, information that isn&#039;t relevant.  Repetition is probably important here, so play with your own habits and tools, such as typing it out, if simply seeing it once isn&#039;t enough to remember.&lt;/p&gt;
&lt;p&gt;One key detail to this may be something I was reminded of when I started at my current job: &quot;You don&#039;t have to know everything&quot;.  And they&#039;re right, but I believe strongly that you should be &#039;aware&#039; of everything, or at least a decent subset, so that you know the detail exists and can be looked for later when it is relevant.  A teacher once told me a story about an aircraft mechanic from Britain in World War 2.  Years after the war, he could still remember which page of the manual had the details of something like the spark plug gaps in the engines he worked on, but didn&#039;t remember the actual numbers.  But he said that it was much more important that he be able to look up the correct numbers quickly and get the adjustment correct, than mis-remember a number, get it wrong, and have a pilot die because the plane didn&#039;t perform.  What we do in Ops isn&#039;t usually quite so life-or-death, but a similar principle can be derived: knowing the information exists, and where to find it is definitely preferable to not knowing anything at all, or guessing, and better than remembering it wrongly.&lt;/p&gt;
&lt;p&gt;But why?  Why care at all?  Why expend all this effort?  To be quite fair, it&#039;s entirely possible to isolate yourself in your corner of the world.  If you&#039;re part of a team, you can focus on your specific areas of work and let everyone else take care of their corners.  If you&#039;re not on-call, this may well be fine.  If you do participate in an on call roster, and have enough clues who to call when a specific area goes wrong, you may also be fine, although you&#039;re on shaky ground.  But I promise you that you&#039;ll save a lot of time and effort at the pointy-end of an incident if you&#039;ve been paying attention.  &lt;/p&gt;
&lt;p&gt;And above all, it&#039;s kinda fun, and is definitely satisfying.  It develops your sense of mastery, understanding, and comfort.  If you know what&#039;s going on around you, then you can work in and around others without stepping on technical toes, avoid breaking things more than normal, and not make things worse in a time-critical situation.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Sat, 15 Jun 2019 04:24:43 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">32 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/sysadmin/situational-awareness#comments</comments>
</item>
<item>
 <title>HTTP Cookie Date format - oh the huge manatee</title>
 <link>https://www.stroppykitten.com/rants/technical/cookie-dates</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;For a holiday project I&#039;m enhancing &lt;a href=&quot;https://www.stroppykitten.com/cookiemaster&quot;&gt;Cookiemaster&lt;/a&gt; to be able to force cookies for your choice of domains to be Session (transient, go away when you close the browser) rather than persistent.  In doing so, I found it wasn&#039;t parsing the date in &#039;Expires&#039; correctly, and in discovering why, I found the horror that is the Date format as specified in the RFC.&lt;/p&gt;
&lt;p&gt;Specifically, I mean &lt;a href=&quot;https://tools.ietf.org/html/rfc6265#section-5.1.1&quot;&gt;https://tools.ietf.org/html/rfc6265#section-5.1.1&lt;/a&gt;.  Feel free to go read, ponder and digest it.  If you need somewhere to scream, I recommend you do so in the shower.  The acoustics of the average bathroom are excellent for such activities.&lt;/p&gt;
&lt;p&gt;If you don&#039;t want to read it in full (and frankly, I don&#039;t blame you), I draw your attention to the &#039;delimiter&#039; definition:&lt;/p&gt;
&lt;pre&gt;
    delimiter = %x09 / %x20-2F / %x3B-40 / %x5B-60 / %x7B-7E
&lt;/pre&gt;&lt;p&gt;One or more delimiters can appear between any of the meaningful parts (e.g. day, month, hour, second etc).  For those who don&#039;t have an ascii table handy (or memorised; I&#039;m looking at you, lurbs), the set of characters that can be delimiters is:&lt;/p&gt;
&lt;pre&gt;
    !&quot;#$%&amp;amp;&#039;()*+,-./;&amp;lt;=&amp;gt;?@[\]^_`{|}~
&lt;/pre&gt;&lt;p&gt;as well as a space, and a hard tab (%x09).  This is absurd, IMO.  This means that the following is a perfectly valid date to include as the Expires value of a cookie:&lt;/p&gt;
&lt;pre&gt;
    Tue!#01&amp;amp;Jan&amp;lt;2019?00:24:20}GMT
&lt;/pre&gt;&lt;p&gt;I mean, it&#039;s parseable in only one way, so is perfectly fine for a machine.  But it is still absurd.&lt;/p&gt;
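&lt;p&gt;For the curious, the tolerant approach boils down to something like the following sketch (my own illustration, &lt;em&gt;not&lt;/em&gt; the actual tough-cookie code; real parsers also handle two-digit years, range validation, and error cases, all omitted here):&lt;/p&gt;
&lt;pre&gt;
// Split on the RFC 6265 delimiter set, then classify the tokens.
// Note that %x3A (&#039;:&#039;) is *not* a delimiter, so hh:mm:ss survives
// as a single token.
const DELIM = /[\x09\x20-\x2f\x3b-\x40\x5b-\x60\x7b-\x7e]+/;
const MONTHS = [&#039;jan&#039;,&#039;feb&#039;,&#039;mar&#039;,&#039;apr&#039;,&#039;may&#039;,&#039;jun&#039;,
                &#039;jul&#039;,&#039;aug&#039;,&#039;sep&#039;,&#039;oct&#039;,&#039;nov&#039;,&#039;dec&#039;];

function parseCookieDate(s) {
  let day, month, year, time;
  for (const t of s.split(DELIM).filter(Boolean)) {
    const m = MONTHS.indexOf(t.slice(0, 3).toLowerCase());
    if (time === undefined &amp;amp;&amp;amp; /^\d{1,2}:\d{1,2}:\d{1,2}/.test(t)) time = t;
    else if (month === undefined &amp;amp;&amp;amp; m !== -1) month = m;
    else if (day === undefined &amp;amp;&amp;amp; /^\d{1,2}$/.test(t)) day = Number(t);
    else if (year === undefined &amp;amp;&amp;amp; /^\d{4}$/.test(t)) year = Number(t);
  }
  const [h, min, sec] = time.split(&#039;:&#039;).map(Number);
  return new Date(Date.UTC(year, month, day, h, min, sec));
}

// parseCookieDate(&#039;Tue!#01&amp;amp;Jan&amp;lt;2019?00:24:20}GMT&#039;) gives the same
// instant as parsing &#039;Tue, 01 Jan 2019 00:24:20 GMT&#039;
&lt;/pre&gt;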
&lt;p&gt;Thankfully, the splendid people at Salesforce have published javascript code for parsing this abominable format already, so I didn&#039;t have to do it myself: &lt;a href=&quot;https://github.com/salesforce/tough-cookie/blob/master/lib/cookie.js#L152&quot;&gt;https://github.com/salesforce/tough-cookie/blob/master/lib/cookie.js#L152&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I am eternally grateful, while also still being just a little bit saddened that this is necessary.  &lt;a href=&quot;https://en.wikipedia.org/wiki/ISO_8601&quot;&gt;ISO8601&lt;/a&gt; fo&#039; lyfe, yo.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Tue, 01 Jan 2019 01:46:44 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">34 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/rants/technical/cookie-dates#comments</comments>
</item>
<item>
 <title>Fixing (one case of) AWS EFS timeouts/stalls</title>
 <link>https://www.stroppykitten.com/technical/aws-efs-stalls</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;AWS Elastic File System (EFS) is an NFS compatible network-accessible shared storage system.  It allows you to outsource the problem of HA network storage, which is highly attractive in some circumstances.  But, there are some sharp edges, which we discovered at work.&lt;/p&gt;
&lt;h2&gt;The problem&lt;/h2&gt;
&lt;p&gt;We&#039;re using it for the shared file storage of our gitlab HA cluster.  For quite a while it worked fine, then a month or two back it started occasionally stalling/timing out.  In kern.log we&#039;d see:&lt;/p&gt;
&lt;pre&gt;
nfs: server &amp;lt;redacted&amp;gt; not responding, still trying&lt;/pre&gt;&lt;p&gt;Then a few minutes later:&lt;/p&gt;
&lt;pre&gt;
nfs: server &amp;lt;redacted&amp;gt; OK&lt;/pre&gt;&lt;p&gt;Load would spike as a bunch of processes tried to do I/O to a non-responsive NFS share, then it&#039;d all calm down.  During this period, gitlab was unresponsive (no big surprise there).  For a while, we tolerated this, assuming some sort of slow down/rate limiting with EFS.  But eventually we looked closer and found that there was no network traffic at all between our server and EFS when it was broken.  We parked that for a bit, then I got curious one day and decided to dig into it.&lt;/p&gt;
&lt;h2&gt;The hunt&lt;/h2&gt;
&lt;p&gt;The first clue was that when the outage ended, the first packets were our server starting a new TCP connection (with NFS4, TCP 2049 is the only port used, thank goodness).  This suggested that it wasn&#039;t a simple network slow down, and that our end was possibly partly at fault, i.e. our server had decided something was wrong, and it cleared after a timeout when it reconnected.  This gave me fresh inspiration to go hunting.  I found, on the AWS forums, a &lt;strong&gt;lot&lt;/strong&gt; of people reporting similar problems, but it usually ended with someone from the EFS team contacting them by private message, and there were no further updates.  Eventually though, I found &lt;a href=&quot;https://forums.aws.amazon.com/thread.jspa?threadID=280554&quot;&gt;https://forums.aws.amazon.com/thread.jspa?threadID=280554&lt;/a&gt; .  You need an AWS account login to see it (annoyingly), so in case you don&#039;t have that to hand, the summary is that the linux NFS client (deliberately) re-uses the same TCP source port on a re-connection, and this sometimes confuses some stateful connection tracking somewhere (I&#039;d bet within EC2, not in the linux kernel), meaning packets get dropped.  The post had a suggested solution about adding wide open security group rules to the affected instances, from the EFS instance, but this felt a bit wrong to me.&lt;/p&gt;
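&lt;p&gt;(An aside: if you want to watch this behaviour yourself, something as simple as the following makes the source-port reuse visible; the interface name is a placeholder for whatever yours is.)&lt;/p&gt;
&lt;pre&gt;
# Watch NFS traffic; during a stall you&#039;ll see retransmissions from
# the same client source port going unanswered, and on recovery a
# fresh connection attempt from that same port
sudo tcpdump -n -i eth0 port 2049
&lt;/pre&gt;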
&lt;p&gt;So I looked harder.&lt;/p&gt;
&lt;h2&gt;The solution&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-mount-cmd-general.html&quot;&gt;https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-mount-cmd-general.html&lt;/a&gt; came to my attention, as did the man page for nfs.  I&#039;m sure we looked at this documentation page when we were setting up this NFS share, and that it didn&#039;t say then what it says now.  Which suggests a lot of other people may be in the same boat as us.  The short of it is:&lt;/p&gt;
&lt;p&gt;&lt;u&gt;&lt;strong&gt;&lt;em&gt;Turn on the NORESVPORT mount option&lt;/em&gt;&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;
&lt;p&gt;To explain:&lt;/p&gt;
&lt;p&gt;On the AWS page it says (currently):&lt;/p&gt;
&lt;pre&gt;
With the noresvport option, the NFS client uses a new Transmission Control Protocol (TCP) source port when a network connection is reestablished. Doing this helps ensure uninterrupted availability after a network recovery event.

&lt;/pre&gt;&lt;p&gt;Then a little bit later:&lt;/p&gt;
&lt;pre&gt;
Amazon EFS ignores source ports. If you change Amazon EFS source ports, it doesn&#039;t have any effect.

&lt;/pre&gt;&lt;p&gt;This relates to the man page, which says, in the section on security considerations:&lt;/p&gt;
&lt;pre&gt;
An NFS server assumes that if a connection comes from a privileged port, the UID and GID numbers in the NFS requests on this connection have been verified by the client&#039;s kernel or some other local authority.  This is an easy system to spoof, but on a trusted physical network between trusted hosts, it is entirely adequate.

Roughly speaking, one socket is used for each NFS mount point.  If a client could use non-privileged source ports as well, the number of sockets allowed, and thus the maximum number of concurrent mount points, would be much larger.

Using non-privileged source ports may compromise server security somewhat, since any user on AUTH_SYS mount points can now pretend to be any other when making NFS requests.  Thus NFS servers do not support this by default.  They explicitly allow it usually via an export option.

To retain good security while allowing as many mount points as possible, it is best to allow non-privileged client connections only if the server and client both require strong authentication, such as Kerberos.&lt;/pre&gt;&lt;p&gt;I was suspicious about the AWS assertion that noresvport made the client use a new source port for a reconnection (it seemed like an awfully arbitrary bit of behaviour that wasn&#039;t mentioned by the man page), but some quick testing confirmed it.  And clearly EFS didn&#039;t care that the source port wasn&#039;t &#039;reserved&#039; (privileged).  Also, for the record, I was able to observe a reconnection in the wild on our gitlab servers (in the default state), where it re-used the same source port.&lt;/p&gt;
&lt;p&gt;This all looked promising enough, and I have been running the production gitlab servers with noresvport for a couple of days now, with no stalls/timeouts.  I&#039;d like to see it go for a week or so before I call this done, but I&#039;ve seen reconnections using a new source port with no hiccups, and I&#039;m fairly confident this will be the solution.&lt;/p&gt;
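For reference, the whole fix is one extra mount option. A sketch of what the /etc/fstab entry might look like (the filesystem hostname and mount point here are invented; the other options are the ones the AWS mounting guide suggests):

```
fs-12345678.efs.us-east-1.amazonaws.com:/  /var/opt/gitlab  nfs4  nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev  0  0
```

Note that, as I understand it, the option only takes effect on a fresh mount; a remount of an already-mounted share won't change the transport behaviour.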
&lt;h2&gt;Remaining concerns&lt;/h2&gt;
&lt;p&gt;I&#039;m not entirely happy with the security model of all this.  The implication is that if someone can implement NFS4 in user space (there are some promising candidates, but none of them quite work yet, for various reasons) then they could mount anything on the EFS instance simply by executing code as &lt;strong&gt;any user&lt;/strong&gt; running on the NFS client server.  At least in the default linux case there&#039;s a little bit of a hurdle to overcome (you&#039;d have to be root to bind to a reserved source port) although that&#039;s still just a matter of the NFS server trusting an arbitrary number being below an arbitrary limit.  But, this is out of our control with EFS; noresvport doesn&#039;t materially decrease the actual security of the system, it just requires and takes advantage of the existing, possibly flawed, situation.  And it solves my problem, so I&#039;m good.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Thu, 20 Sep 2018 03:40:26 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">33 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/technical/aws-efs-stalls#comments</comments>
</item>
<item>
 <title>Actively avoiding bias</title>
 <link>https://www.stroppykitten.com/rants/avoiding-bias</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;I recently read this excellent article &lt;a href=&quot;https://www.producttalk.org/2018/02/co-creating/&quot;&gt;https://www.producttalk.org/2018/02/co-creating/&lt;/a&gt; which, amongst other things, talks about avoiding a commitment and confirmation bias.  It&#039;s well worth a read, quite by itself.&lt;/p&gt;
&lt;p&gt;But I had an additional thought.  I have, on the wall of my office, one of these: &lt;a href=&quot;https://www.designhacks.co/products/cognitive-bias-codex-poster&quot;&gt;https://www.designhacks.co/products/cognitive-bias-codex-poster&lt;/a&gt; which really just goes to show how many ways humans have of completely mucking up logical/rational thought processes.  They&#039;re all traps to be avoided, and I have up until now largely assumed that my best bet was to be &lt;em&gt;aware&lt;/em&gt; of them and &lt;em&gt;try harder&lt;/em&gt; to avoid them.  Sadly, that way lies inevitable failure, because by definition these biases are slippery buggers and we fall into them without realising.  &lt;em&gt;Trying harder&lt;/em&gt; may help a little, but isn&#039;t going to be a magic wand.&lt;/p&gt;
&lt;p&gt;What the co-creating article suggested to me, was that I should &lt;em&gt;actively&lt;/em&gt; avoid them.  I should try to find ways to structure my decision making processes that minimise the opportunity for those biases to manifest.  As yet I have very little idea &lt;strong&gt;how&lt;/strong&gt; to do this, but I think it&#039;s a good plan.  Which is probably the result of a bias.  Sigh.&lt;/p&gt;
&lt;p&gt;NB: All this presupposes that rational and logical thought is the ultimate desirable state.  I agree with that whole-heartedly, and I&#039;ll take &lt;strong&gt;that&lt;/strong&gt; as a good starting point.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Sun, 26 Aug 2018 05:29:20 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">31 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/rants/avoiding-bias#comments</comments>
</item>
<item>
 <title>AWS Security Groups: A glimpse behind the curtain</title>
 <link>https://www.stroppykitten.com/technical/aws-security-groups-clue</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;Or: a little clue as to how the sausage is made.&lt;/p&gt;
&lt;p&gt;At work recently I was creating an internal &lt;a href=&quot;https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html&quot;&gt;NLB&lt;/a&gt; (Network Load Balancer) in AWS.  An NLB is a magic bit of engineering that behaves a lot like &lt;a href=&quot;https://en.wikipedia.org/wiki/Linux_Virtual_Server&quot;&gt;LVS&lt;/a&gt; (Linux Virtual Server); it does healthchecks like a normal load balancer, but when forwarding requests it sends the packets on, having only modified their destination IP to that of the chosen backend.  Compare this to the original ELB (Elastic Load Balancer) or ALB (Application Load Balancer), both of which create a brand new TCP/IP connection from the ELB to the backend, and will have gotten all up in the HTTP request and potentially done all manner of things to/with the HTTP payload (adding headers at least, perhaps much much more).&lt;/p&gt;
&lt;p&gt;This makes NLBs really useful, because the IP packet that arrives at your backend instance has the original source IP address, not the internal IP of the ELB/ALB.  Sure, for HTTP(S) load balancing you&#039;ve got the X-Forwarded-For header, but for something like an SSH NLB that&#039;s just rewriting and forwarding packets, this is quite handy.&lt;/p&gt;
&lt;p&gt;As an aside on security groups (if you&#039;re not familiar), they have &lt;strong&gt;two&lt;/strong&gt; quite distinct roles:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Contain security group rules that specify allowed traffic to (inbound) or from (outbound) the entities (e.g. EC2 instances, RDS instances, or ELBs/ALBs) that have the SGs attached&lt;/li&gt;
&lt;li&gt;By being attached to entities, act as an identifier that can be used as the source/target of the rules just described.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;It&#039;s worth remembering that it is entirely valid and reasonable to create an SG with no rules in it, assign it to entities as a tag marking that they&#039;re in a certain class, then use that group as the source/target in the actual rules included in other SGs.&lt;/p&gt;
&lt;p&gt;Notably, NLBs do not have security groups (SGs) attached to them like ELBs or ALBs do, and this is where things get interesting.  I was setting up a path from our internet-facing reverse proxy (P) to an internal backend target service (B) that lives on 2 identical servers, with an NLB in between to deal with healthchecks and load balancing the 2 backends.  Yes, I do know I could have done this many other ways, including with HAProxy or something similar hosted on the reverse proxy, or with an ELB etc; don&#039;t judge me for my poor life choices.  Because the traffic arriving at the backend came from the internal interfaces of the reverse proxy, it felt a little weird adding a rule allowing traffic from 0.0.0.0/0 to the SG attached to the backends (B), so I instead added a rule for each of:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;The /24 address for the subnet the NLB lives in, to allow the healthchecks to succeed&lt;/li&gt;
&lt;li&gt;A security group uniquely associated with the reverse proxy instances, for the traffic that has come through the NLB.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;I was expecting the latter to allow traffic because a packet arriving at a backend would have the source IP of the reverse proxy instance the packet originated from.  I was rather surprised when it didn&#039;t work; cracking out tcpdump showed that the packets from P -&amp;gt; NLB -&amp;gt; B were simply not being seen by the backends.  I checked, and the proxy instances could connect directly to the backend instances, as would be expected.  So for debugging I added 0.0.0.0/0 to the SG on the backends, and the traffic started flowing; tcpdump showed that the source IP addresses were &lt;strong&gt;exactly&lt;/strong&gt; what they should be (the internal IP of the proxy instances).  So I removed the 0.0.0.0/0 rule and added one for the /24 address for the subnet the reverse proxy instances were in.  Everything continued to work, so it wasn&#039;t some magic caused by a rule for 0.0.0.0/0.  Curiouser and curiouser.&lt;/p&gt;
&lt;p&gt;Up until this moment, I had believed that the Security Groups worked much like a traditional firewall.  In my mind, a rule that allowed packets to port 80 from Security Group &#039;A&#039; was implemented as a bunch of IP-address based rules that had an entry for the IP address of each entity that had Security Group &#039;A&#039; attached to it.  If you added a Security Group to an instance/entity, something would trawl around and update all those firewall entries.  Clearly, based on what I had just observed, this was not true.  When the rule had a Security Group as the allowed source, the &lt;strong&gt;actual&lt;/strong&gt; IP address of the packet had no bearing on whether it was allowed or not, &lt;em&gt;after the packet had passed through the NLB&lt;/em&gt;.  The most obvious conclusion is that under normal circumstances, rules with Security Group references are implemented with some sort of tagging; a packet &lt;strong&gt;leaving&lt;/strong&gt; an entity is tagged (encapsulated I guess) with identifiers of the SGs attached to that entity.  When the packet arrives at the target, those &lt;strong&gt;tags&lt;/strong&gt; are compared against the SGs in the rules of the target, not the literal source IP in the packet.&lt;/p&gt;
&lt;p&gt;It also appears that when passing through an internal NLB those SG tags are stripped, so that on arrival at the backend, the only thing left to check against is the IP address, giving the behaviour I saw.&lt;/p&gt;
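My mental model of this, as a toy sketch (pure speculation on my part; the `Packet`/`allowed` model here is hypothetical, not anything AWS has published):

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass(frozen=True)
class Packet:
    src_ip: str
    sg_tags: frozenset = frozenset()  # SG identifiers stamped on egress

def allowed(packet, rules):
    """Evaluate inbound rules: ('cidr', net) matches the source IP,
    ('sg', tag) matches the egress tags and ignores the IP entirely."""
    for kind, value in rules:
        if kind == 'cidr' and ip_address(packet.src_ip) in ip_network(value):
            return True
        if kind == 'sg' and value in packet.sg_tags:
            return True
    return False

def through_nlb(packet):
    # The behaviour I observed: source IP preserved, SG tags stripped.
    return Packet(src_ip=packet.src_ip, sg_tags=frozenset())

rules = [('sg', 'sg-proxy')]
direct = Packet('10.0.1.10', frozenset({'sg-proxy'}))
via_nlb = through_nlb(direct)
print(allowed(direct, rules))                        # True: tag matches
print(allowed(via_nlb, rules))                       # False: same IP, tags gone
print(allowed(via_nlb, [('cidr', '10.0.1.0/24')]))   # True: CIDR still matches
```

This reproduces exactly what I saw: the SG-referenced rule passes direct traffic but not NLB-forwarded traffic, while a CIDR rule doesn't care either way.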
&lt;p&gt;I find this absolutely fascinating, even delighting.  It&#039;s fairly obvious in hindsight, and is quite a reasonable way to work (for reasons I&#039;ll go into shortly), but in nearly 6 years working with AWS I had not &lt;em&gt;once&lt;/em&gt; seen even the slightest clue that this was the case. &lt;/p&gt;
&lt;p&gt;So why is this reasonable?  Mainly, I posit, for performance.  It is likely much quicker to check a set of tags against the rules than it is to do all the usual IP address matching (even /32&#039;s).  In thinking about this some more, it seems likely that the tags won&#039;t be the security group names/identifiers that are used in the UI/API, because they&#039;re quite long and not quick to check.  Rather, I&#039;m guessing that they are small numeric identifiers, probably only unique per VPC.  This is slightly backed up by the default limits in EC2:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;5 security groups per interface, which can be increased if you ask, but only to a maximum of 16, with a corresponding constraint on the limit of rules per security group.  Most importantly though, 16 looks to me like a limit to the number of tags the encapsulation can contain&lt;/li&gt;
&lt;li&gt;500 Security Groups per VPC which has no stated direct maximum, but the documentation contains the statement that &quot;The multiple of the number of VPCs in the region and the number of security groups per VPC cannot exceed 10000.&quot; &lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;This suggests that maybe it&#039;s a &lt;strong&gt;customer&lt;/strong&gt; or &lt;strong&gt;account&lt;/strong&gt; id in the tag, not a VPC id, and that each security group tag is perhaps 14 bits long (up to 16K groups), although it seems odd that there&#039;s such a gap between the documented limit (10K) and the theoretical capacity of that size of field, so I may be missing something interesting here.&lt;/p&gt;
&lt;p&gt;The other reason it&#039;s a good way to work is that there&#039;s no need to go updating firewall rules when an entity gets a Security Group attached to it; packets &lt;em&gt;leaving&lt;/em&gt; the entity after that will get the new SG tag, and the rules being applied at the target end will not even need to be identified, let alone touched.  Clearly this is much more efficient.&lt;/p&gt;
&lt;p&gt;It&#039;s always fun finding little clues like this, and using them to extrapolate how things are working under the hood.  It&#039;s one of the most delightful aspects of working in IT, in my humble opinion.  No doubt in another 5 years I&#039;ll find some other hint that will blow my mind further.  Yay!&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Sun, 05 Aug 2018 03:50:22 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">30 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/technical/aws-security-groups-clue#comments</comments>
</item>
<item>
 <title>Fun with SCSI tapes on Linux</title>
 <link>https://www.stroppykitten.com/technical/fun-with-scsi-tapes</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;I inherited an LTO-4 SCSI-attached tape drive recently (yes, old-school, I know, it&#039;s mostly just for fun), and have been fiddling with it to do some tape-based offsite backups (to complement my other offsite backups).  In doing so, I learned a bunch of things about the SCSI protocol (including its many many tendrils and offshoots), and about handy tools for interacting with SCSI devices.  Some of these things may be useful to others, so I&#039;m putting them here.  Enjoy!&lt;/p&gt;
&lt;p&gt;Talking to the tape drive natively is trivial; it just showed up as &lt;em&gt;/dev/nst0&lt;/em&gt; (&lt;em&gt;st&lt;/em&gt; for &quot;SCSI Tape&quot;, &lt;em&gt;n&lt;/em&gt; for non-rewinding; &lt;em&gt;/dev/st0&lt;/em&gt; will automatically rewind after every operation, which can be a bit annoying at times).  However, I want &lt;strong&gt;encrypted&lt;/strong&gt; backups, because my offsite tapes will spend a non-trivial amount of their life in places other than my house.  While I can afford to &lt;strong&gt;lose&lt;/strong&gt; one or more of these (there will be many copies), I&#039;d rather random strangers weren&#039;t able to trivially restore all my precious data and rifle through my digital belongings.  Yes, I&#039;m aware of the slight silliness of being worried that a random stranger might have even a first clue what to do with an LTO-4 tape, let alone have access to a functioning LTO-4/5 tape drive.  But still, the principle is there.&lt;/p&gt;
&lt;p&gt;Now I can roll my own encryption with openssl (surprisingly easy, basically pipe &lt;em&gt;tar&lt;/em&gt; through &lt;em&gt;openssl enc -aes256&lt;/em&gt; and redirect the output of that to the tape drive; it prompts for a passphrase to derive a key from, and then just gets on with the job), but I was vaguely aware that some LTO tape drives could do encryption.  Some googling convinced me that any LTO-4 tape drive should be able to.  Incidentally, I was wrong, it&#039;s an optional feature (sort of - see &lt;a href=&quot;https://en.wikipedia.org/wiki/Linear_Tape-Open#Encryption&quot;&gt;https://en.wikipedia.org/wiki/Linear_Tape-Open#Encryption&lt;/a&gt;).  It seemed like a handy thing to try and achieve though; it would require no CPU on the backup machine, and should in theory be able to happen at wire speed, where for some reason putting openssl in my backup pipeline causes throughput to drop.&lt;/p&gt;
&lt;p&gt;The next question was &quot;how?&quot;.  There&#039;s surprisingly little information about it, but what there was suggested &lt;em&gt;stenc&lt;/em&gt; was the software package I needed.  It&#039;s not available in Ubuntu standard package repositories, so I had to download and compile it.  If you find yourself needing this, here&#039;s some tips:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;If you just clone the git repo from &lt;a href=&quot;https://github.com/scsitape/stenc&quot;&gt;https://github.com/scsitape/stenc&lt;/a&gt; then you&#039;re gonna have to figure out how to generate the &lt;em&gt;configure&lt;/em&gt; script that creates a Makefile so you can build it.  I couldn&#039;t (although I didn&#039;t try for long).  My naive invocations of autoconf and automake didn&#039;t get there, and it wasn&#039;t clearly documented.&lt;/li&gt;
&lt;li&gt;The easier way is to go to &lt;a href=&quot;https://github.com/scsitape/stenc/releases&quot;&gt;https://github.com/scsitape/stenc/releases&lt;/a&gt; instead and download the source archive.  This has been prepared for building.  You can run &lt;em&gt;./configure&lt;/em&gt; then &lt;em&gt;make&lt;/em&gt; then optionally &lt;em&gt;make install&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;The author of stenc, while a splendid person for releasing the code, is a white-space monster.  The indenting is a mix of tabs and spaces (not consistently so), and only formats nicely with tabstop set to 8.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;Having compiled stenc, I run it.  It fails:&lt;/p&gt;
&lt;p&gt;Sense Code:              Illegal Request (0x05)&lt;br /&gt;
 ASC:                    0x24&lt;br /&gt;
 ASCQ:                   0x00&lt;/p&gt;
&lt;p&gt;Running with &lt;em&gt;--detail&lt;/em&gt; shows it&#039;s actually talking to my tape drive and knows what model it is, but still with the Illegal Request.  To be honest, I searched around a bit, couldn&#039;t find much at all, let alone a solution, and gave up for a week.  Coming back to it a week later I was refreshed.  Checking the code for stenc (the reason I &lt;strong&gt;love&lt;/strong&gt; opensource so much is being able to go to the source when necessary) it turns out there&#039;s a compile-time option to spit out all the SCSI commands/responses.  I was delighted; there&#039;s very little I like more than turning on debugging and gleaning clues from whatever torrent of data it spits out.  So, I run &lt;em&gt;./configure --with-scsi-debug; make clean; make&lt;/em&gt; and try again.  The command it was failing on was the ever delightful:&lt;/p&gt;
&lt;p&gt;a22000200000000020040000&lt;/p&gt;
&lt;p&gt;The stenc code gave me a mere hint of the structure of this block of bytes, with a couple of slightly helpfully named constants, and some hard-coded numbers (zeros, and 0x20).  More helpfully, it gave me the acronym &lt;em&gt;SSP&lt;/em&gt; which (checking stenc docs) I learned stands for SCSI Security Protocol.  SPIN was also relevant.  It took a bit of searching to find the following gem of a document: &lt;a href=&quot;https://www.seagate.com/staticfiles/support/disc/manuals/Interface%20manuals/100293068c.pdf&quot;&gt;https://www.seagate.com/staticfiles/support/disc/manuals/Interface%20manuals/100293068c.pdf&lt;/a&gt; [1] the &lt;em&gt;SCSI Commands Reference Manual&lt;/em&gt;, 446 pages of nerd delight that doesn&#039;t seem to be Seagate-specific; they just happen to host the docs.  Command code 0xA2 is the &lt;em&gt;SECURITY PROTOCOL IN&lt;/em&gt; command.  This explains the &lt;em&gt;spin_&lt;/em&gt; prefix in some of the code.  &lt;a href=&quot;http://www.t10.org/lists/asc-num.htm&quot;&gt;http://www.t10.org/lists/asc-num.htm&lt;/a&gt; tells me that ASC 0x24, ASCQ 0x00 means &lt;em&gt;Invalid field in CDB&lt;/em&gt;.  CDB is the command block, so one of the bytes in &lt;em&gt;a22000200000000020040000&lt;/em&gt; is &#039;wrong&#039;.&lt;/p&gt;
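To keep the bytes straight I sketched a quick decoder for that CDB. The field offsets are my reading of the SECURITY PROTOCOL IN description in that reference manual, so treat them as an interpretation rather than gospel:

```python
def decode_spin_cdb(hexstr):
    """Split a 12-byte SECURITY PROTOCOL IN (0xA2) CDB into its fields,
    per my reading of the SCSI Commands Reference Manual."""
    cdb = bytes.fromhex(hexstr)
    assert len(cdb) == 12 and cdb[0] == 0xA2
    return {
        'security_protocol': cdb[1],                            # which protocol to query
        'protocol_specific': int.from_bytes(cdb[2:4], 'big'),   # e.g. a page code
        'allocation_length': int.from_bytes(cdb[6:10], 'big'),  # reply buffer size
    }

fields = decode_spin_cdb('a22000200000000020040000')
print(hex(fields['security_protocol']))  # 0x20
print(hex(fields['protocol_specific']))  # 0x20
print(fields['allocation_length'])       # 8196
```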
&lt;p&gt;How to experiment from here?  Recompiling stenc to try and change the commands seems like an annoying proposition.  I wondered if there was a way to send arbitrary SCSI commands to a SCSI device, from a linux command line.  Turns out there is; it&#039;s the &lt;em&gt;sg_raw&lt;/em&gt; program, from (on Ubuntu) the &lt;em&gt;sg3-utils&lt;/em&gt; package.  Rapture!  Delight!  I run this:&lt;/p&gt;
&lt;p&gt;sg_raw /dev/nst0  a2 20 00 20 00 00 00 00 20 04 00 00&lt;/p&gt;
&lt;p&gt;And it tells me:&lt;/p&gt;
&lt;p&gt;SCSI Status: Check Condition&lt;/p&gt;
&lt;p&gt;Sense Information:&lt;br /&gt;
 Fixed format, current;  Sense key: Illegal Request&lt;br /&gt;
 Additional sense: Invalid field in cdb&lt;br /&gt;
  Field replaceable unit code: 48&lt;br /&gt;
  Sense Key Specific: Error in Command: byte 2&lt;/p&gt;
&lt;p&gt;Well that&#039;s handy.  Some tips:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;&lt;em&gt;sg_raw&lt;/em&gt; is much better at translating errors, and also seems to have some other information that tells me it&#039;s byte 2 that&#039;s at fault.&lt;/li&gt;
&lt;li&gt;From what I learned later, it appears the byte count for &#039;2&#039; is 1-based, not 0-based like you might expect.  It&#039;s the 0x20 that&#039;s the problem, not the first 0x00.  Why do these people do this?  WHY????&lt;/li&gt;
&lt;li&gt;Don&#039;t run &lt;em&gt;sg_raw&lt;/em&gt; on an active device like a SCSI disk drive, or a tape drive that&#039;s actively in use.  You will screw things up for whatever thinks it&#039;s in control of the state of the device.  But on a tape drive that&#039;s doing nothing?  Sure, go for it.  Probably worth being careful which commands you send (don&#039;t send random bytes and expect a good result), but I imagine it&#039;s pretty hard to break it in a way that couldn&#039;t be resolved by a reboot or cold power off/on, as long as you&#039;re generally paying attention.  It&#039;s probably a very good way to learn how SCSI works.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;This all gave me some more clues to plug into a search engine, which got me to &lt;a href=&quot;https://www.veritas.com/support/en_US/article.100037886&quot;&gt;https://www.veritas.com/support/en_US/article.100037886&lt;/a&gt;.  I could probably have saved a lot of time by finding this page a week earlier, but no matter.  This suggested I might like to run:&lt;/p&gt;
&lt;p&gt;sg_raw /dev/nst0  -r 44 a2 00 00 00 00 00 00 01 00 00 00 00&lt;/p&gt;
&lt;p&gt;which I did, giving:&lt;/p&gt;
&lt;p&gt;SCSI Status: Good&lt;/p&gt;
&lt;p&gt;Received 9 bytes of data:&lt;br /&gt;
 00     00 00 00 00 00 00 00 01  00    &lt;/p&gt;
&lt;p&gt;The Veritas page then says that the 7th/8th bytes (00 01) mean it supports 1 page, and the 9th byte (0x00) means it only supports page 00h.  If my tape drive actually supported encryption, the response would be longer, the 01 would be at least 02, and the bytes after that would contain an 0x20, indicating it supports page 0x20, which is the SPIN/SPOUT capability.&lt;/p&gt;
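Turning the Veritas article's interpretation into a tiny parser (the offsets are taken from that article; I haven't cross-checked them against the spec):

```python
def supported_security_protocols(resp):
    """Parse the 'supported security protocols' reply (SECURITY PROTOCOL IN,
    protocol 0x00): bytes 7-8 (1-based) are the list length, followed by one
    byte per supported protocol."""
    count = int.from_bytes(resp[6:8], 'big')
    return list(resp[8:8 + count])

# The 9 bytes my drive returned:
protocols = supported_security_protocols(bytes.fromhex('000000000000000100'))
print(protocols)          # [0]: only protocol 0x00 itself
print(0x20 in protocols)  # False: no SPIN/SPOUT tape encryption support
```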
&lt;p&gt;Tips:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;The &lt;em&gt;44&lt;/em&gt; is fairly arbitrary, and just needs to be longer than the response we&#039;re expecting (maybe 44 is the longest it can ever be, I don&#039;t know for sure, and haven&#039;t looked) &lt;/li&gt;
&lt;li&gt;The allocation length in the command (00010000) is just a &quot;very large&quot; number compared to what we expect to get back.  It could actually be as low as 00000044 and this would all work.  I think.&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;I now have some fairly good proof that my tape drive &lt;strong&gt;doesn&#039;t&lt;/strong&gt; natively support transparent magical encryption.  I sort of wish &lt;em&gt;stenc&lt;/em&gt; could have told me this, rather than just giving me sense errors and Illegal Request messages.  It would have saved me some time, but then I wouldn&#039;t have learned as much, so it&#039;s not a complete loss.  Some further internet searching reveals that my model of IBM Tape Drive only supports &#039;Application Managed Encryption&#039; in the SAS connection form factor, and mine connects by Ultra160 SCSI.  This seems arbitrary to me, but I&#039;m sure there&#039;s a good reason for it (hah).&lt;/p&gt;
&lt;p&gt;So that was my journey.  I learned about some fun tools (&lt;em&gt;sg_raw&lt;/em&gt;) and got a lot more comfortable with SCSI in general. &lt;/p&gt;
&lt;p&gt;Oh, also, the SCSI command reference mentioned Security Protocol 0x41: IKEv2-SCSI.  I thought it might have been a coincidental name collision with the IKE from IPsec, but no, it turns out it&#039;s IKEv2 adapted for SCSI (see &lt;a href=&quot;http://www.t10.org/ftp/t10/document.06/06-449r5.pdf&quot;&gt;http://www.t10.org/ftp/t10/document.06/06-449r5.pdf&lt;/a&gt;).  It is, quite literally, IKEv2 from RFC 4306, adapted to SCSI, to provide transport encryption of your SCSI bus.  Before today, I had no idea this could even possibly be a thing.  I&#039;m not sure if I should be delighted, or horrified, but I&#039;m tending towards the latter.  That might just be my IPsec experience speaking, mainly the horror of interop between heterogeneous endpoints.  Oh well.  The more you learn, the more you realise you know so little.&lt;/p&gt;
&lt;p&gt;[1] Update 2020-06-20: The link is now &lt;a href=&quot;https://www.seagate.com/files/staticfiles/support/docs/manual/Interface%20manuals/100293068j.pdf&quot;&gt;https://www.seagate.com/files/staticfiles/support/docs/manual/Interface%...&lt;/a&gt;, and 518 pages; looks like they update it occasionally and increment the last letter (was &#039;c&#039;, now &#039;j&#039;)&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Sun, 20 May 2018 00:01:20 +0000</pubDate>
 <dc:creator>craig</dc:creator>
 <guid isPermaLink="false">29 at https://www.stroppykitten.com</guid>
 <comments>https://www.stroppykitten.com/technical/fun-with-scsi-tapes#comments</comments>
</item>
</channel>
</rss>
