A while ago, I noticed a significant number of 502 and 504 errors in our production APIs running on AWS. This is the story of how I resolved them.

Our setup:

  • RESTful API running behind an Apache server
  • Elastic Container Service (ECS) on Fargate
  • Application Load Balancer directing traffic to the ECS containers

What I learned after hours (days?) of investigation is that everything came down to various timeout settings throughout the environment. It's important to align timeout values in a number of areas within an ECS cluster to avoid issues such as gateway timeout (504) and bad gateway (502) errors.

TLDR

Possible causes of 502 errors:

  • backend server keep-alive timeout is shorter than the load balancer's idle timeout (load balancer tries to reuse an already closed connection to backend server)

Possible causes of 504 errors:

  • load balancer's idle timeout is too short (long-running task is timed out)
  • target group deregistration delay is too short (long-running task is timed out)

Application Load Balancer Idle Timeout

The Idle Timeout on the ALB controls how long a connection to a client will remain open while no data is being sent. In our case, this happens when a long-running API call is being processed and the client is waiting for a response. For our API, we needed to set this to 120 secs to allow plenty of time for typical long-running tasks to complete.

While the application is processing a request, no data is being sent between the load balancer and the client, so the connection is considered idle. If no data is sent in either direction before the idle timeout is hit, while the request is still being processed and the client is still waiting for a response, the ALB will return a 504 error.
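For reference, here is a minimal sketch of changing that setting with boto3; the same attribute can be edited in the console or via infrastructure-as-code. The load balancer ARN below is a placeholder, not a value from our environment.

    # Raise the ALB idle timeout to 120 seconds.
    import boto3

    elbv2 = boto3.client("elbv2")
    elbv2.modify_load_balancer_attributes(
        # Placeholder ARN - substitute your own load balancer's ARN.
        LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123",
        Attributes=[
            # Seconds a connection may sit idle before the ALB closes it.
            {"Key": "idle_timeout.timeout_seconds", "Value": "120"},
        ],
    )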

Apache/NginX Keep-Alive Timeout

Keep-alives on backend servers like Apache and NginX allow a client to reuse an open connection to the server to make multiple requests. By default, Apache has keep-alives turned on and the keep-alive timeout is set to 5 seconds. These are typically the recommended settings because:

  • Having keep-alives enabled improves performance by not requiring a new connection to be opened for each request that a client makes (as long as those requests come frequently enough)
  • A 5 second timeout ensures that idle connections aren't left open too long, where they would unnecessarily block new connections once the server's limit of concurrent connections has been reached (Apache's MaxKeepAliveRequests, 100 by default, also caps how many requests a single keep-alive connection will serve)
  • Each connection takes up memory on the server, so having the keep-alive timeout set low (5 secs) ensures that these are cycled in and out frequently and memory usage remains under control

In the load balanced scenario, things are a little different. Instead of each client connecting directly to Apache, the client connects to the load balancer and the load balancer makes a connection to the backend server. In the case of AWS, connections from the ALB to the backend server are reused across multiple clients/requests. This means that a single keep-alive connection made from the ALB to Apache can serve multiple client connections to the ALB. Likewise, multiple requests from the same client can be served over different backend connections, even to different backend servers. This has the following implications:

  • Connections to the backend server (Apache) will be limited by the number of connections that the ALB opens to the server, not by the number of clients hitting the ALB. The load balancer is good at spreading traffic across multiple backend servers, so each one will see a more limited number of keep-alive connections
  • Because connections between the ALB and the backend server can be reused for multiple client connections, keep-alive connections won't wastefully stay open if a large keep-alive timeout is set on the backend server, i.e. Client B's request can be served by the same keep-alive connection that was created to serve Client A's request, etc
  • If the keep-alive timeout is set to a lower value than the ALB's idle timeout, there will inevitably be times when no data is sent on an open keep-alive connection until that timeout is hit and the backend server closes the connection. The problem is that until its own idle timeout has been hit, the ALB will try to reuse the same keep-alive connection that it opened with the backend server. When it discovers that the connection has already been closed by Apache/NginX, the ALB will return a 502 error

With the above in mind, it's best to set the keep-alive timeout on the backend servers to be larger than the idle timeout on the ALB. This will prevent the backend server from closing connections that the ALB expects to still be open. Since it is the ALB, and not each client, that opens these connections to the backend server, it is safe to increase the keep-alive timeout. So if you set the idle timeout on the load balancer to 120 secs, you could set the keep-alive timeout on Apache/NginX to 150 secs (longer to provide a buffer).
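If you want to sanity-check that alignment, the rough boto3 sketch below reads the ALB's idle timeout and compares it against the keep-alive timeout configured on the backend. The ARN, the 150 sec keep-alive value and the 30 sec buffer are placeholders matching the example numbers above.

    # Compare the backend keep-alive timeout against the ALB idle timeout.
    import boto3

    ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123"
    BACKEND_KEEPALIVE_TIMEOUT = 150  # e.g. Apache's KeepAliveTimeout, in seconds
    BUFFER = 30                      # headroom over the ALB idle timeout

    elbv2 = boto3.client("elbv2")
    attributes = elbv2.describe_load_balancer_attributes(LoadBalancerArn=ALB_ARN)["Attributes"]
    idle_timeout = int(next(a["Value"] for a in attributes
                            if a["Key"] == "idle_timeout.timeout_seconds"))

    if BACKEND_KEEPALIVE_TIMEOUT < idle_timeout + BUFFER:
        print(f"Keep-alive timeout ({BACKEND_KEEPALIVE_TIMEOUT}s) is too close to the ALB "
              f"idle timeout ({idle_timeout}s) - expect intermittent 502s.")
    else:
        print(f"Keep-alive timeout ({BACKEND_KEEPALIVE_TIMEOUT}s) comfortably exceeds the "
              f"ALB idle timeout ({idle_timeout}s).")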

Apache/NginX Connection Timeout

There is another directive in backend servers that's important here - the connection timeout. This should also be set long enough to handle all appropriate calls to the server. Unlike the two timeouts above, this one doesn't care whether data is being sent to the server or not. Let's say there's an endpoint that streams data for 45 secs. If the connection timeout is set to 30 secs, the stream will be dropped at 30 secs. In our API, Apache is set to the default 5 min connection timeout, which is more than sufficient for the types of requests served in our example scenario. Namely, it easily covers the 120 secs idle timeout and any additional non-idle time on typical requests.
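A simple way to confirm the whole chain holds up is to time a known slow endpoint end to end, as in the sketch below using the requests library. The URL is a hypothetical long-running endpoint, and the generous client-side read timeout is only there so the client itself isn't the first thing to give up.

    # Time a slow endpoint end to end to confirm nothing in the chain cuts it off early.
    import time
    import requests

    URL = "https://api.example.com/slow-report"  # hypothetical long-running endpoint

    start = time.monotonic()
    response = requests.get(URL, stream=True, timeout=(5, 300))  # (connect, read) timeouts
    total_bytes = sum(len(chunk) for chunk in response.iter_content(chunk_size=8192))
    elapsed = time.monotonic() - start

    # A 502/504 status or a truncated body here points back at one of the timeouts above.
    print(f"HTTP {response.status_code}: {total_bytes} bytes in {elapsed:.1f}s")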

Target Group Deregistration Delay

One last type of timeout that needs to be catered for is the deregistration delay on the Target Group in AWS. This is important when dealing with ECS, which will deregister old targets and register new ones when a code deployment is made (in a "rolling update" deployment type). When a deployment is initiated, ECS will start new containers and perform health checks on them. If all is well, it will add the new containers as targets to the target group and mark the old targets as "draining". While in the draining state, no NEW connections will be made to those targets, but existing connections will be allowed to remain open until their requests complete OR the deregistration delay has been hit. If the deregistration delay were set to 0 secs, these targets would be deregistered and connections to them would be closed immediately once they entered the draining state. If connections are closed in this fashion, the ALB will throw a 504 error.

As such, it's important to set the deregistration delay to a high enough value that all existing, appropriate requests have enough time to complete. It's also a good idea to keep this value low enough that old targets don't hang around unnecessarily. In our example scenario the deregistration delay could be set to 150 secs, which is more than the 120 secs idle timeout on the ALB. During this deregistration period, all in-flight requests will either complete or be cancelled by the ALB by the 150 secs mark.
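Like the idle timeout, the deregistration delay is a target group attribute that can be set via the console, infrastructure-as-code, or the API; below is a minimal boto3 sketch with a placeholder target group ARN.

    # Set the target group deregistration delay to 150 seconds.
    import boto3

    elbv2 = boto3.client("elbv2")
    elbv2.modify_target_group_attributes(
        # Placeholder ARN - substitute your own target group's ARN.
        TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-api/def456",
        Attributes=[
            # Seconds a draining target keeps serving in-flight requests before deregistration.
            {"Key": "deregistration_delay.timeout_seconds", "Value": "150"},
        ],
    )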