Fun times with random IPSec corruption

Let me tell you a story of woe, intermittent/random corruption, and confusion.

Background

We had reason to stand up a new VPN, from a new data center, to a VPC in AWS. For some semi-philosophical, semi-technical reasons, this was not a "VPC VPN", but rather a GRE-over-IPSec tunnel to an EC2 instance inside the VPC, and it was the first of its kind we'd deployed (i.e. GRE over IPSec, to/from AWS).
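
For anyone who hasn't built one of these: at its simplest, the GRE side is just a few "ip" commands on each end, with IPSec protecting the GRE (IP protocol 47) traffic between the two endpoint addresses underneath. A rough sketch of one end, with invented addresses and interface names rather than our actual setup:

    # One end of the GRE tunnel; 10.0.0.5 is "our" address, 192.0.2.10 the peer
    ip tunnel add gre1 mode gre local 10.0.0.5 remote 192.0.2.10 ttl 255
    ip addr add 172.16.255.1/30 dev gre1
    ip link set gre1 up
    # IPSec between 10.0.0.5 and 192.0.2.10 then encrypts the GRE packets
    # that this tunnel emits.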

Pre-problem 1: Racoon sucks.

It sucks even harder with the sort of NAT that goes on in AWS: bi-directional NAT, where the instance has an RFC 1918 address on its network interface, and the AWS networking layer translates that, in both directions, to the instance's assigned "Elastic IP".

The upshot is that the IP address the non-AWS end thinks the AWS end has is not the IP address on eth0 of said EC2 instance. In the end, a colleague of mine discovered a combination of NAT and duplicated ipsec-tools configuration that mostly worked, most of the time. Even he didn't truly understand how or why it worked, but it did.

The actual problem: round 1

With the VPN up, and ping working, it was time to pass some real traffic. The primary purpose of this VPN was for backup traffic, so, we fired up dirvish and initiated the rsync-over-ssh. Within seconds, BOOM. Checking the rsync logs, I saw "Corrupted MAC on input" and "Bad packet length" messages. Well, that's weird.

Retry: BOOM, same again.

Never at quite the same point, i.e. it wasn't a single file that caused the problem.

Elimination steps:

It's all a bit of a blur now, and this was spread over several days, but here's some of the things we tried:

  1. Eliminate software: rsync over ssh is two layers of complexity; the messages are nominally from ssh, but it's not clear that rsync isn't involved in causing them. So, we tried scp instead. Same results.
  2. Eliminate hops: the first rsyncs were between a source and target one hop either side of the VPN terminators. So, we moved to just scp between the IPSec endpoints, and eventually just over the GRE tunnel itself (i.e. to/from the link-net IP addresses on the GRE interfaces). Same results, although we did tend to get further through the file before it fell over.
  3. The most common Google result for these messages implicated various "offload" functionality on network interface cards (e.g. transmit/receive segmentation offload). This certainly seemed the most plausible possibility. Sadly, disabling all available such options on all the hops had no effect (the first sketch after this list shows the sort of thing we ran).
  4. MTU: the GRE encapsulation was definitely reducing the effective MTU, and it was possible something in the path was behaving badly as a result. So, we tried dropping the MTU on all the interfaces involved, right down to 576 bytes (also in the first sketch below), but without any positive effect.
  5. Change the IPSec software: given the issues and hackery required with racoon (noted earlier), it seemed worth trying something else. The non-AWS end was already using StrongSwan, so we converted the AWS end to that also (the second sketch after this list shows the general shape). The configuration was simpler and clearer (not requiring quite the level of hackery), which was a win. But it still didn't solve our corruption problem.
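
For the record, the offload and MTU fiddling in steps 3 and 4 amounted to the usual incantations, roughly as below (interface names invented; exactly which offload flags exist depends on the NIC and driver, and "ethtool -k eth0" will list them):

    # Turn off the various offloads, repeated for each interface in the path:
    ethtool -K eth0 rx off tx off sg off tso off gso off gro off lro off

    # Drop the MTU, on the GRE interface and on the underlying interface:
    ip link set dev gre1 mtu 576
    ip link set dev eth0 mtu 576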
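
As for step 5: I won't reproduce our exact StrongSwan configuration, but a minimal GRE-over-IPSec setup on the EC2 end looks roughly like the sketch below (ipsec.conf syntax, transport mode, pre-shared key, all addresses invented). The usual trick for AWS's 1:1 NAT is that "left" is the private address actually on eth0, while "leftid" is the Elastic IP, i.e. the identity the peer sees:

    # /etc/ipsec.conf on the EC2 instance -- a sketch, not our real config.
    # left   = the private (RFC 1918) address on eth0
    # leftid = the Elastic IP, which is what the peer thinks it's talking to
    conn gre-to-datacenter
        type=transport
        left=10.0.0.5
        leftid=203.0.113.20
        leftprotoport=gre
        right=198.51.100.7
        rightprotoport=gre
        authby=secret
        auto=start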

The end game:

Now we were getting mad. All the relatively simple stuff had been ruled out, and we still weren't sure whether this was just some issue specific to SSH. Were we seeing actual corrupted packets, or was SSH being silly? So my colleague used a ping flood with a 1000+ byte payload of all letter 'a'. With some clever tcpdump and grep on the receiving end, we could see actual corruption in the decrypted ping packets after receipt at the non-AWS end. Intermittently, 16 bytes were simply not the original series of 'a's. This was helpful, as it confirmed it was definitely packet corruption somewhere, not something weirdly openssh-specific.
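
The flood-ping trick is worth writing down; it was something along these lines (sizes, addresses and the exact grep are from memory and partly invented):

    # Sender: flood-ping across the tunnel, padding the payload with 0x61 ('a')
    ping -f -s 1200 -p 61 172.16.255.2

    # Receiver: hex-dump the arriving (already decrypted) pings, 16 bytes per
    # line, and filter out every line whose payload is the expected sixteen
    # 'a's. What remains is the per-packet headers plus any corrupted chunks.
    tcpdump -nli gre1 -X icmp | grep -v 'aaaaaaaaaaaaaaaa'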

I ended up using some more ping flood + tcpdump + grep, plus the ability of a modern version of Wireshark to decrypt the encrypted packets, captured as they went over the internet, using the SPI and the current encryption/HMAC keys obtained from "ip xfrm state" on one of the endpoints. This is really, really cool to be able to do, and I can see myself doing it again in the future.
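
For anyone wanting to repeat that trick, the rough recipe is below. The SA dump shown in the comments is illustrative rather than a real one, and the menu path is from a recent Wireshark, so it may differ between versions:

    # On one of the IPSec endpoints, dump the kernel's current SAs:
    ip xfrm state
    #
    # Each SA includes lines of roughly this shape (values invented here):
    #   proto esp spi 0xc1a2b3d4 reqid 1 mode transport
    #   auth-trunc hmac(sha1) 0x<hmac-key> 96
    #   enc cbc(aes) 0x<encryption-key>
    #
    # The SPI, encryption algorithm/key and authentication algorithm/key go
    # into Wireshark under Preferences -> Protocols -> ESP -> "ESP SAs", with
    # "Attempt to detect/decode encrypted ESP payloads" enabled; Wireshark
    # will then decrypt the ESP packets in the on-the-wire capture.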

Doing so, I was able to quickly confirm that the encrypted contents of the packets were corrupted. The HMAC was verifying correctly at the receiving end, so the packets weren't being mangled in transit; the corruption therefore had to be happening in the encryption or encapsulation step on the AWS end, before the HMAC was calculated and shoved into the packet.

My colleague then looked at the AWS endpoint, and asked (nonchalantly, as is his way): "Can you disable the aesni_intel module?".

At this point, I was willing to try anything, so I did. Some modprobe config, a reboot, and the EC2 instance was up with aesni_intel disabled. And our corruption issues went away.
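
For the curious, "some modprobe config" here means something along these lines (the filename is arbitrary):

    # /etc/modprobe.d/no-aesni.conf
    # Stop aesni_intel being auto-loaded via its module aliases...
    blacklist aesni_intel
    # ...and make an explicit "modprobe aesni_intel" a no-op as well.
    install aesni_intel /bin/true

After the reboot, "lsmod | grep aesni" coming back empty (and /proc/crypto no longer listing any aesni-backed AES drivers) should confirm that the kernel has fallen back to its plain software AES implementation.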

We were gobsmacked. AESNI is a set of CPU instructions (created by Intel, duplicated by AMD et al. later) for doing AES on the processor itself, rather than in software (I'll ignore arguments about whether modern microcode on a processor is itself software or not :D). The aesni_intel module is some useful bits of code to wrap that up and, presumably, do it properly inside a multi-processing kernel. And somehow, that was messing up the encryption. The 16-byte chunks of corruption made more sense now too: 16 bytes is the AES block size, i.e. the unit AESNI works on (at least as far as I could tell).

Knowing the right keywords, Google was able to throw up this little gem of a thread: http://www.serverphorums.com/read.php?12,967232, which is pretty much exactly what we were seeing. Xen + paravirt + aesni_intel => sporadic corruption of the encryption results. The paranoid part of me suspects some sort of state leakage between guests, which may explain the private RedHat bugzilla bug for this issue.

So, aesni_intel on AWS is a no-go. At least not if you want consistent encryption results. Who knew? :)