
Loadbalancer health checks

A load balancer must know to which application server traffic can be forwarded. A malfunctioning application server, or one that is offline, should not receive traffic. For this purpose we use health checks.

Basic functionality

In our default configuration a hosted application is checked passively. This means the load balancer forwards a received request directly to the application server with the least connections (depending on the configuration of your Loadbalancer Upstream this behavior can differ, see 310 Load Balancers). If the request cannot be served by that application server, the appserver is marked down for 30 seconds and the request is forwarded to the next application server. After those 30 seconds the previously failing appserver is marked up again and the load balancer will try to forward requests to it again.
This procedure is not visible from the outside, as failing requests are forwarded to the next appserver without notifying the client.
But there are some drawbacks:
- this check basically only ensures that the webserver is running and the corresponding port is open
- if the appserver is not completely down but extremely slow, you have to wait for a timeout before the request is forwarded to the next server
- if the application or the webserver returns an HTTP 5xx error code, it is delivered to the client without trying other appservers
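
To make this behavior a bit more tangible, here is a heavily simplified Python model of such a passive check. Only the least-connections selection, the 30-second mark-down and the retry on connection failures come from the description above; the names and the `send` callable are made up for this sketch, and the real load balancer logic is of course more involved:

```python
import time

MARK_DOWN_SECONDS = 30  # a failing appserver is skipped for this long

class Appserver:
    def __init__(self, name):
        self.name = name
        self.active_connections = 0
        self.down_until = 0.0  # timestamp until which this server counts as down

    def is_up(self):
        return time.time() >= self.down_until

def pick_appserver(appservers):
    """Least-connections selection among the servers currently marked up."""
    candidates = [s for s in appservers if s.is_up()]
    return min(candidates, key=lambda s: s.active_connections) if candidates else None

def forward_request(appservers, send):
    """Try appservers until one accepts the request; 5xx responses are NOT retried."""
    while (server := pick_appserver(appservers)) is not None:
        try:
            return send(server)  # only connection-level failures raise ConnectionError
        except ConnectionError:
            # mark the server down and retry with the next one
            server.down_until = time.time() + MARK_DOWN_SECONDS
    raise RuntimeError("no appserver available")
```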

Limits of the default check

The default check is purely passive. That a request is accepted does not necessarily mean that the application is working correctly: there could be a problem in the code, or the application server could be overloaded. In both cases an HTTP 5xx error code would be returned or the request would take a very long time, and the passive check cannot recognize that there is a problem. We cannot enable any other check method by default, because the diversity of customer deployments would always lead to setups for which it does not work.
If, for example, our default check were an HTTP check against the / URI expecting the webserver to return a 200 response code, an application that does not serve the / URL at all would be marked down, leading to an outage of that platform. Another example would be a very slow application that takes many seconds to render every health check. All load balancers (even the slaves) perform these health checks: with 3 load balancers and a check interval of 3 seconds, that amounts to 1 request per second against the application (assuming the requests do not happen at the same time).
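
For illustration, the numbers from the paragraph above add up like this (a trivial sketch, the variable names are made up):

```python
# Estimated health check load on a single appserver, numbers taken from the text above.
load_balancers = 3      # every load balancer, including the slaves, checks on its own
check_interval_s = 3    # seconds between two checks from the same load balancer

checks_per_second = load_balancers / check_interval_s
print(checks_per_second)  # => 1.0 health check request per second per appserver
```
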
To improve the health check for your application, read the next two sections and contact our operations team.

Improving load balancer health checks

As shown above, our default health check does not detect a malfunctioning application or an unresponsive (slow) application server. But before you ask us to enable HTTP health checks for your application, you should think about which request to your application actually shows that it is working. You could implement a health check endpoint which triggers a self test of your application: if everything is fine it returns status code 200, if something is not working it returns 503. We would then configure a check against this endpoint that expects status code 200. If 200 is not returned (or the configured timeout is exceeded), the load balancer assumes that this server is down and stops forwarding traffic to it.
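
A minimal sketch of such an endpoint, using only Python's standard library (the /health path, the port and the contents of self_test are assumptions; in practice you would implement this inside your application framework):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def self_test():
    """Cheap checks of everything the application needs to serve requests.
    What exactly is tested here (database, cache, ...) depends on your application."""
    try:
        # e.g. open a database connection, query a trivial record, ping a cache ...
        return True
    except Exception:
        return False

class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":                # the endpoint path is an assumption
            healthy = self_test()
            self.send_response(200 if healthy else 503)  # 200 = keep traffic, 503 = mark down
            self.end_headers()
            self.wfile.write(b"OK" if healthy else b"FAILED")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthCheckHandler).serve_forever()
```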

Asynchronous deployment without downtime

If you have implemented a health check endpoint in your application (see the preceding point), you can configure your deployment to return HTTP status code 503 during a deployment. This allows you to deploy to the first application server and mark it down on purpose while the other application server(s) still serve the incoming requests. Once the deployment on the first server is done, the health check endpoint returns HTTP status code 200 again and the load balancer forwards traffic to this server again. Afterwards you continue with the other application server(s) in the same way. Please mind that this way of deploying is complex and takes a long time. It is also not suitable for breaking changes and migrations! If you need consultancy regarding this, feel free to contact our operations team.
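
One possible way to wire this up is a maintenance marker file that the health check endpoint looks at and that the deployment script creates and removes. The file path, the ./deploy.sh command and the drain time below are made-up placeholders:

```python
import pathlib
import subprocess
import time

MAINTENANCE_FILE = pathlib.Path("/var/www/app/tmp/maintenance.txt")  # hypothetical path
DRAIN_SECONDS = 10  # longer than the health check interval, so the load balancers notice the 503

def health_check_ok():
    # Called by the health check endpoint from the previous sketch: while the marker
    # file exists the endpoint answers 503 and the load balancers stop sending traffic.
    return not MAINTENANCE_FILE.exists()

def deploy_one_appserver():
    MAINTENANCE_FILE.touch()                     # health check now returns 503
    time.sleep(DRAIN_SECONDS)                    # wait until this server is marked down
    subprocess.run(["./deploy.sh"], check=True)  # hypothetical deployment command
    MAINTENANCE_FILE.unlink()                    # health check returns 200, traffic comes back
```

You would run deploy_one_appserver() on each application server in turn, so that at least one server keeps serving requests at any time.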
