Load balancer health checks


A load balancer has to know where and when it can forward traffic. A malfunctioning application server, or one that is offline, should not receive traffic. To detect both of these cases, we use active health checks.

Basic functionality

In our default configuration, a hosted application is only checked passively. The load balancer forwards each incoming request to the application server with the least connections. If that application server cannot serve the request, it is marked down for 30 seconds and the request is forwarded to the next available application server. After those 30 seconds, the previously failed application server is marked up again, and the load balancer once more tries to forward requests to it, starting the same procedure from the beginning.
This is not visible from the outside, since failing requests are forwarded to the next application server without notifying the client, but there are some drawbacks:

  • this check only ensures that the web server is running and the corresponding port is open
  • if the application server is not completely down but extremely slow (e.g. due to load), the request has to wait for a timeout to occur before it is forwarded to the next application server
  • if the application server or the web server returns an HTTP 5xx error code, it is delivered to the client without even trying other possibly available application servers
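The passive behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not our actual load balancer implementation; the class and method names are invented, and only the 30-second mark-down window and least-connections scheduling come from the description above.

```python
import time

MARK_DOWN_SECONDS = 30  # mark-down window from the default configuration


class PassiveBalancer:
    """Toy model: least-connections scheduling with a 30 s mark-down
    for servers that failed to serve a request."""

    def __init__(self, servers):
        # server name -> number of active connections
        self.connections = {s: 0 for s in servers}
        # server name -> timestamp until which the server is marked down
        self.down_until = {s: 0.0 for s in servers}

    def pick_server(self, now=None):
        now = time.monotonic() if now is None else now
        # only consider servers whose mark-down window has expired
        up = [s for s in self.connections if self.down_until[s] <= now]
        if not up:
            return None  # every server is marked down
        # least-connections scheduling
        return min(up, key=lambda s: self.connections[s])

    def mark_down(self, server, now=None):
        now = time.monotonic() if now is None else now
        self.down_until[server] = now + MARK_DOWN_SECONDS
```

A request that fails on `app1` would call `mark_down("app1")`; for the next 30 seconds `pick_server` only returns the remaining servers, after which `app1` automatically re-enters the rotation.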

Limits of the default check

Our default check cannot verify that the application is working correctly. There could be an issue in the code, or the application server could be overloaded and responding very slowly. It is not possible to create a generic check that works the same for all our customers, so we need your help to set up HTTP health checks (see below).
If, for example, our default check were a generic HTTP check against the / URI expecting the web server to return a 200 HTTP response code, an application that does not serve the / URL at all would be marked down, causing an outage of its platform. Furthermore, / may not reflect the state of the application very well: / might be a static file that is served fine even while the database cannot be accessed. Another example is a very slow application that takes many seconds to render every health check. All load balancers (even the followers) perform these health checks; with 3 load balancers and a check interval of 3 seconds, that amounts to 1 request per second (assuming the requests do not happen at the same time).
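The request-rate estimate above is simple arithmetic; the numbers are the ones stated in the text (3 load balancers, 3-second interval):

```python
load_balancers = 3    # all load balancers, including followers, run the check
check_interval_s = 3  # seconds between checks on each load balancer

# Assuming the checks are spread out rather than simultaneous,
# the aggregate rate at the application is:
requests_per_second = load_balancers / check_interval_s
print(requests_per_second)  # → 1.0
```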
To tailor the health check to your application, read the following and contact our operations team.

Improving load balancer health checks

As explained above, our default health check is pretty basic. Before enabling HTTP health checks for your application, think about which request to your application shows that it is really working. Best practice is to implement a health check endpoint which triggers a self test of your application, or at least touches all of its critical parts. If everything is fine, it returns status code 200; if anything is deemed unhealthy, it returns 503.
We then implement a check that requests this endpoint and expects it to return status code 200. If anything other than 200 is returned (or the configured timeout is exceeded), the load balancer assumes that the server is not healthy, marks it down, and stops forwarding traffic to it.
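A minimal sketch of such an endpoint, using Python's standard library. The `/health` path and the self-test body are assumptions to illustrate the idea; your application framework will have its own way to define routes, and the self test should touch whatever is critical for you (database, cache, etc.).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def application_is_healthy():
    # Hypothetical self test: touch the critical parts of the
    # application, e.g. run a trivial database query.
    try:
        # db.execute("SELECT 1")  # placeholder for a real probe
        return True
    except Exception:
        return False


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
        elif application_is_healthy():
            self.send_response(200)  # load balancer keeps the server up
        else:
            self.send_response(503)  # load balancer marks the server down
        self.end_headers()


# To run standalone (port 8080 is arbitrary):
# HTTPServer(("", 8080), HealthHandler).serve_forever()
```

The load balancer then only needs the URL and the expected status code; everything application-specific stays inside `application_is_healthy`.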

Please be aware that if all application servers respond with an unhealthy status code, or if they are unresponsive for another reason, all of them can be marked down, essentially taking your application offline. That is why the health check thresholds must be adapted to work under high-load situations too. We can help you find the right URLs and check thresholds for your application.

Asynchronous deployment without downtime

If you have implemented a health check endpoint in your application, you can configure your deployment process so that the endpoint returns HTTP status code 503 during a deployment. This lets you deploy to the first application server, marking it down on purpose, while the other application server(s) still serve incoming requests. Once the deploy is done on the first server, let the health check endpoint return HTTP status code 200 again; the load balancer will then forward traffic to this server once more. Afterwards you can continue with the other application server(s) in the same way. Please mind that this way of deploying is complex and takes a long time. It is also not suitable for breaking changes and migrations! If you need consultancy regarding this, feel free to contact our operations team.
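One common way to wire this up is a maintenance flag file that the health check endpoint inspects. This is a sketch under assumptions: the flag path and function names are invented, and your deployment tool would call the equivalents of `start_deployment`/`finish_deployment` around the deploy on each server.

```python
import os

MAINTENANCE_FLAG = "/tmp/maintenance.flag"  # hypothetical path


def health_status():
    """Return the HTTP status code the health check endpoint should emit."""
    if os.path.exists(MAINTENANCE_FLAG):
        return 503  # deployment in progress: load balancer marks this server down
    return 200      # normal operation: server stays in rotation


def start_deployment():
    # Create the flag before deploying; the load balancer will mark
    # this server down on its next health check.
    open(MAINTENANCE_FLAG, "w").close()


def finish_deployment():
    # Remove the flag after the deploy; the server re-enters the rotation.
    os.remove(MAINTENANCE_FLAG)
```

Per server, the sequence is: `start_deployment()`, wait until the load balancer has marked the server down and in-flight requests are drained, deploy, then `finish_deployment()`; repeat on the next server.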

Source code in this card is licensed under the MIT License.
Posted by Claus-Theodor Riegg to opscomplete (2017-06-01 17:26)