AWS: Having Issues in Your Infrastructure? It Might Be Your ELBs

Over the course of 2015 (and the first part of 2016) I was lucky enough to support a reasonably major API service hosted in AWS. It was a fairly heavily utilised service (approximately 1,700 requests/second) and depended to a large extent on Amazon's Elastic Load Balancers (ELBs). As time progressed and traffic to the service increased, we started seeing strange and apparently random 4xx and 5xx errors at the ELBs, which appeared to be coming from both the ELBs themselves and the backend servers.

Over the course of six months, many calls were raised with the developers asking them to re-analyse their code and come up with a solution, but nothing was forthcoming. The Ops team (myself included) started going through our infrastructure one element at a time to try to identify why this was happening. We scaled out far more than was actually necessary to try to alleviate the problems, but we were still seeing errors. The number of support calls into AWS increased every time we highlighted a problem, and it got to the point where we couldn't add or remove instances from the ELB without errors occurring. Things were not looking good.

Then one day a conversation with an AWS support representative led me to look at a particular latency graph. I've published it below:

Maximum Latency Metric

On a standard day, I would normally look at the 'Average' or 'Sum' statistics of the graphs to see the latency of the incoming traffic and make sure our customers were getting a low response time. When I switched to the 'Maximum' statistic, though, it was immediately obvious what was wrong. The graph shows a plateau at a certain value on the Y axis (it's hard to see the value on the screenshot, but it is at 120 seconds). What was this value? Well, the answer had been there all along, but we hadn't even looked at it before that point because we all thought it was a code problem.
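To see why the 'Average' view hid the problem, here is a minimal sketch using made-up numbers (the request counts and the 0.2-second latency are assumptions for illustration, not figures from our service). A handful of requests that hang until a 120-second timeout barely move the average, but they pin the maximum at exactly the timeout value:

```python
from statistics import mean

# Hypothetical latencies (in seconds) for one monitoring period.
# Most requests are fast; a few hang until a 120-second timeout cuts them off.
IDLE_TIMEOUT_SECONDS = 120.0
fast_requests = [0.2] * 995
timed_out_requests = [IDLE_TIMEOUT_SECONDS] * 5
latencies = fast_requests + timed_out_requests

average_latency = mean(latencies)
maximum_latency = max(latencies)

# The average stays under a second; the maximum sits on the timeout value.
print(f"Average: {average_latency:.2f}s")
print(f"Maximum: {maximum_latency:.2f}s")
```

A maximum that sits on the same round number period after period is the plateau described above.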

The core of the issue was an incorrectly configured ELB parameter known as the 'Idle Connection Timeout', which on our ELBs was set to 120 seconds. This setting dictates how long a connection can stay open without traffic, both from the client to the ELB and from the ELB to any attached backend servers. It is also used by the ELB to manage TCP connections behind the scenes, so it's fairly important to get this value right. If no data has been sent or received over the connection by the time the timeout expires, the ELB assumes the connection is finished and closes it. In practical terms, this means that if a process takes a long time to complete on the backend server and nothing has been sent back to the ELB within the idle time, the connection will be closed. If this happens repeatedly, your graph of Maximum Latency will look like the one shown above. To add a bit more complexity, there is also a timeout value on the backend HTTP server (whether that is Apache httpd, Nginx or IIS), which also comes into play and affects overall performance.

So now we had found the issue and we knew why it was happening. We just had to fix it.

In a basic ELB/Backend Server scenario, fixing the issue is fairly simple, with both timeout settings being altered as shown in the diagram below:

In this situation, we just have to make sure that the timeout on the Backend EC2 Server is set higher than the timeout of the ELB. This means that it is the ELB which terminates any long-running requests, not the server, and therefore we allow the ELB to manage the TCP connections correctly and everything is great. :-)

(Incidentally, the best way to start working out the timeouts in this situation is to first determine what the timeout should be on the Backend Server and then move forward to the ELB. As an example, let's say that it takes your application 190 seconds (at most) to process a request. To allow for some extra room, we would put the idle timeout of the App Server at 200 seconds. The idle timeout of the ELB sitting in front of it has to be lower than this, to ensure that your App Server does not close the connection before the ELB does, which would cause HTTP 504 errors to be generated.)
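The ELB side of this change can be scripted. The following is a rough sketch, assuming a Classic ELB and the boto3 library; the load balancer name and the 195-second value are hypothetical (195 is simply one value below the 200-second App Server timeout in the example). The App Server timeout itself still has to be changed in your HTTP server's own configuration.

```python
def idle_timeout_attributes(idle_timeout_seconds: int) -> dict:
    """Build the attribute payload that sets a Classic ELB's idle timeout."""
    if idle_timeout_seconds < 1:
        raise ValueError("idle timeout must be a positive number of seconds")
    return {"ConnectionSettings": {"IdleTimeout": idle_timeout_seconds}}


def set_idle_timeout(elb_client, load_balancer_name: str, idle_timeout_seconds: int):
    """Apply the idle timeout through a boto3 Classic ELB ('elb') client."""
    return elb_client.modify_load_balancer_attributes(
        LoadBalancerName=load_balancer_name,
        LoadBalancerAttributes=idle_timeout_attributes(idle_timeout_seconds),
    )


# Usage (needs boto3 and AWS credentials; "my-internal-elb" is a made-up name):
#   import boto3
#   set_idle_timeout(boto3.client("elb"), "my-internal-elb", 195)
```

Passing the client in as an argument keeps the sketch easy to try against a stub before pointing it at a real load balancer.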

Of course, in our particular configuration we had multiple ELBs and Servers in series, i.e. a Public ELB connecting to a Proxy Server, which in turn connects to an Internal ELB which has App Servers attached. A bit like this:

The solution to setting the timeouts here is to use the same principle as before, i.e. choose a higher timeout on the Server than on the ELB in front of it. Also remember that the Public ELB timeout must be greater than the Application Server timeout, so that it does not close the connection before the Application Server can finish processing the request. With the 200-second Application Server timeout from the earlier example, 210 seconds for the Public ELB is fine. Finally, the Proxy Server timeout needs to be greater than the Public ELB timeout, for the same reason that the Application Server needs a greater timeout than the Internal ELB.

In both scenarios, with correctly or incorrectly configured timeout settings, you will see an HTTP 504 error generated when the timeout occurs. The difference is that having the ELB idle timeout lower allows the ELB to close the connection instead of the backend. This is important because the ELB pre-opens connections to the backend and maintains them for the period of the idle timeout. This ultimately reduces the TCP handshake overhead, as the ELB forwards each request using one of these pre-opened connections.

To summarise the correct timeout settings:

  • Internal ELB Timeout < Application Server Timeout 
  • Reverse Proxy Timeout > Internal ELB Timeout 
  • Public ELB Timeout < Reverse Proxy Timeout 
  • Public ELB Timeout > Application Server Timeout 
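
The four rules above are easy to check mechanically before rolling out a change. Here is a minimal sketch; the class and function names are my own invention, and the 195 and 220 second values are assumed examples (the 200 and 210 second values come from the example earlier in this post).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChainTimeouts:
    """Idle timeouts (in seconds) for each hop in the two-ELB chain."""
    public_elb: int
    reverse_proxy: int
    internal_elb: int
    app_server: int


def timeout_violations(t: ChainTimeouts) -> List[str]:
    """Return a message for every one of the four rules that is broken."""
    rules = [
        (t.internal_elb < t.app_server,
         "Internal ELB Timeout must be < Application Server Timeout"),
        (t.reverse_proxy > t.internal_elb,
         "Reverse Proxy Timeout must be > Internal ELB Timeout"),
        (t.public_elb < t.reverse_proxy,
         "Public ELB Timeout must be < Reverse Proxy Timeout"),
        (t.public_elb > t.app_server,
         "Public ELB Timeout must be > Application Server Timeout"),
    ]
    return [message for satisfied, message in rules if not satisfied]


# A layout that follows the rules (195 and 220 are assumed values).
good = ChainTimeouts(public_elb=210, reverse_proxy=220, internal_elb=195, app_server=200)
# A made-up layout where the servers time out before the ELBs in front of them.
bad = ChainTimeouts(public_elb=120, reverse_proxy=60, internal_elb=120, app_server=60)

print(timeout_violations(good))  # no rules broken, so an empty list
for message in timeout_violations(bad):
    print(message)
```

Something like this could run as a sanity check in whatever tooling you use to manage your ELB and web server configuration.
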

Anyway, to cut an exceptionally long story short, by altering the timeout values using the methods shown above, we were able to fix all of the issues that were occurring within our infrastructure, and the Developers breathed a huge sigh of relief! So, by way of a moral to this story, if you take nothing else from it, remember these two things:

  • If you see spurious and seemingly random 4xx and 5xx errors at the ELB, don't automatically assume something is wrong with the code; it might be that the ELB is incorrectly configured.
  • When using an ELB, there is a fair chance that the default Idle Connection Timeout value is not what you should be using.
