Fixing (one case of) AWS EFS timeouts/stalls

AWS Elastic File System (EFS) is an NFS-compatible, network-accessible shared storage system.  It lets you outsource the problem of HA network storage, which is highly attractive in some circumstances.  But there are some sharp edges, which we discovered at work.

The problem

We're using it for the shared file storage of our gitlab HA cluster.  For quite a while it worked fine; then, a month or two back, it started occasionally stalling/timing out.  In kern.log we'd see:

nfs: server <redacted> not responding, still trying

Then a few minutes later:

nfs: server <redacted> OK

Load would spike as a bunch of processes tried to do I/O to a non-responsive NFS share, then it'd all calm down.  During this period, gitlab was unresponsive (no big surprise there).  For a while we tolerated this, assuming some sort of slowdown/rate limiting on the EFS side.  But eventually we looked closer and found that there was no network traffic at all between our server and EFS while it was broken.  We parked that for a bit, then I got curious one day and decided to dig into it.
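
If you want to confirm the same thing on your own servers, a packet capture filtered to the NFS port will show whether anything is flowing during a stall.  Something along these lines will do it (the interface name is a placeholder for whatever your instance uses):

sudo tcpdump -n -i eth0 port 2049

If the wire goes completely silent while the mount is stalled, you're seeing the same failure mode we did.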

The hunt

The first clue was that when the outage ended, the first packets were our server starting a new TCP connection (with NFS4, TCP 2049 is the only port used, thank goodness).  This suggested that it wasn't a simple network slowdown, and that our end was possibly partly at fault, i.e. our server had decided something was wrong, and it cleared after a timeout when it reconnected.  This gave me fresh inspiration to go hunting.  I found, on the AWS forums, a lot of people reporting similar problems, but it usually ended with someone from the EFS team contacting them by private message, and there were no further updates.  Eventually, though, I found https://forums.aws.amazon.com/thread.jspa?threadID=280554 .  You need an AWS account login to see it (annoyingly), so in case you don't have that to hand, the summary is that the linux NFS client (deliberately) re-uses the same TCP source port on a re-connection, and this sometimes confuses some stateful connection tracking somewhere (I'd bet within EC2, not in the linux kernel), meaning packets get dropped.  The post suggested a workaround of adding wide-open security group rules to the affected instances, allowing everything from the EFS instance, but this felt a bit wrong to me.
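
As an aside, it's easy to see which source port your client is currently using for the EFS connection, which is how you can tell whether a reconnection re-used it.  Something like this shows the established session to port 2049:

ss -tn | grep :2049

The local address column is the client's source port; with the default behaviour it will be a 'reserved' (privileged) port below 1024, which becomes relevant below.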

So I looked harder.

The solution

https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-mount-cmd-general.html came to my attention, as did the man page for nfs.  Now, I'm sure we looked at this documentation page when we were setting up this NFS share, and that it didn't say then what it does now.  Which suggests a lot of other people may be in the same boat as us.  The short of it is:

Turn on the noresvport mount option
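
Concretely, that just means adding noresvport to the mount options.  As a rough sketch (the file system hostname and mount point are placeholders, and the other options are the ones AWS currently recommends for EFS, so check the documentation rather than copying these verbatim):

sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs

Or the equivalent line in /etc/fstab:

fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 0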

To explain:

On the AWS page it says (currently):

With the noresvport option, the NFS client uses a new Transmission Control Protocol (TCP) source port when a network connection is reestablished. Doing this helps ensure uninterrupted availability after a network recovery event.

Then a little bit later:

Amazon EFS ignores source ports. If you change Amazon EFS source ports, it doesn't have any effect.

This relates to the man page which says, in the section on security considerations

An NFS server assumes that if a connection comes from a privileged port, the UID and GID numbers in the NFS requests on this connection have been verified by the client's kernel or some other local authority.  This is an easy system to spoof, but on a trusted physical network between trusted hosts, it is entirely adequate.

Roughly speaking, one socket is used for each NFS mount point.  If a client could use non-privileged source ports as well, the number of sockets allowed, and thus the maximum number of concurrent mount points, would be much larger.

Using non-privileged source ports may compromise server security somewhat, since any user on AUTH_SYS mount points can now pretend to be any other when making NFS requests.  Thus NFS servers do not support this by default.  They explicitly allow it usually via an export option.

To retain good security while allowing as many mount points as possible, it is best to allow non-privileged client connections only if the server and client both require strong authentication, such as Kerberos.

I was suspicious of the AWS assertion that noresvport made the client use a new source port for a reconnection (it seemed like an awfully arbitrary bit of behaviour that wasn't mentioned by the man page), but some quick testing confirmed it.  And clearly EFS didn't care that the source port wasn't 'reserved' (privileged).  Also, for the record, I was able to observe a reconnection in the wild on our gitlab servers (in the default state), where it re-used the same source port.
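
I won't pretend this is exactly how I tested it, but a rough way to provoke a reconnection in a test environment (definitely not on a production box) would be to temporarily drop the NFS traffic until the client gives up on the existing connection, let it through again, and then repeat the ss check from earlier to see whether the source port changed.  Something like:

sudo iptables -I OUTPUT -p tcp --dport 2049 -j DROP
sleep 600
sudo iptables -D OUTPUT -p tcp --dport 2049 -j DROP

Bear in mind the client can take several minutes to abandon the old connection, and any processes doing I/O on the mount will hang in the meantime.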

This all looked promising enough, and I have been running the production gitlab servers with noresvport for a couple of days now, with no stalls/timeouts.  I'd like to see it go for a week or so before I call this done, but I've seen reconnections using a new source port with no hiccups, and I'm fairly confident this will be the solution.
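
Incidentally, if you want to double-check that the option actually took effect after remounting, the live mount options are visible in /proc/mounts (nfsstat -m shows much the same information); noresvport should appear in the option list when it's set:

grep nfs4 /proc/mounts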

Remaining concerns

I'm not entirely happy with the security model of all this.  The implication is that if someone can implement NFS4 in user space (there are some promising candidates, but none of them quite work yet, for various reasons), then they could mount anything on the EFS instance simply by executing code as any user on the NFS client server.  At least in the default linux case there's a little bit of a hurdle to overcome (you'd have to be root to bind to a reserved source port), although that's still just a matter of the NFS server trusting an arbitrary number to be below an arbitrary limit.  But this is out of our control with EFS; noresvport doesn't materially decrease the actual security of the system, it just relies on, and takes advantage of, the existing, possibly flawed, situation.  And it solves my problem, so I'm good.