300 Monitoring

Managed Hosting includes a wide variety of monitoring methods to ensure stable operation of your servers and applications.

Checks for all services are created automatically by our configuration management. When we set up a new server, basic performance checks are generated. When we install a web server, HTTP checks are generated, and when we install a new PostgreSQL server, PostgreSQL service checks are generated.

The accuracy of the checks is limited by the complexity of each service setup. For example, we can't automatically monitor a status-check JSON document that your application returns at a custom URL.

We don't just check thresholds and send notifications when they are exceeded. Many of our metrics (e.g. load, memory or disk usage) are also recorded as time series so that we can analyze past problems or forecast possible future issues.
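To illustrate how a recorded time series can be used for forecasting, here is a minimal Python sketch that fits a straight line through disk-usage samples and estimates when the disk would be full. The function name `hours_until_full`, the sample format and the linear model are assumptions made for this example; they do not describe the tooling makandra actually uses.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]],
    capacity: float = 100.0,
) -> Optional[float]:
    """Estimate the hours left until usage reaches `capacity`.

    `samples` holds (hours since start, usage in percent) pairs in
    chronological order. Returns None when no growth trend can be derived.
    """
    count = len(samples)
    if count < 2:
        return None

    mean_time = sum(t for t, _ in samples) / count
    mean_usage = sum(u for _, u in samples) / count
    time_variance = sum((t - mean_time) ** 2 for t, _ in samples)
    if time_variance == 0:
        return None  # all samples share one timestamp

    # Ordinary least-squares slope: usage growth in percent per hour.
    slope = sum((t - mean_time) * (u - mean_usage) for t, u in samples) / time_variance
    if slope <= 0:
        return None  # usage is flat or shrinking

    intercept = mean_usage - slope * mean_time
    time_when_full = (capacity - intercept) / slope
    # Report the remaining time relative to the newest sample.
    return max(0.0, time_when_full - samples[-1][0])
```

A linear fit is the simplest possible model. Real usage patterns (log rotation, nightly jobs) are often not linear, so a result like this is a rough hint rather than a prediction.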

Custom service checks

Before you get lost in the long list of service checks already managed by us, you should know that we can implement service checks customized for your environment and your application. The effort and cost of creating such checks can vary heavily. Please ask us via ops@makandra.de before going on a monitoring excursion without us.
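As an illustration of what a customized check could look like, the following Python sketch evaluates a status-check JSON document returned by an application. The JSON shape (a `checks` object that maps component names to `"ok"` or another state), the function names and the `OK`/`CRITICAL` result strings are hypothetical; an actual check would be agreed on with the operations team.

```python
import json
from typing import Tuple
from urllib.request import urlopen


def evaluate_status(body: str) -> Tuple[str, str]:
    """Turn a status JSON document into a (state, message) pair.

    Assumed (hypothetical) format:
    {"checks": {"database": "ok", "queue": "failed"}}
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "CRITICAL", "response is not valid JSON"

    checks = payload.get("checks") if isinstance(payload, dict) else None
    if not isinstance(checks, dict):
        return "CRITICAL", "unexpected JSON structure"

    failed = sorted(name for name, state in checks.items() if state != "ok")
    if failed:
        return "CRITICAL", "failing components: " + ", ".join(failed)
    return "OK", "all components healthy"


def check_url(url: str, timeout: float = 5.0) -> Tuple[str, str]:
    """Fetch the status document from `url` and evaluate it."""
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as error:  # covers URLError, HTTPError and timeouts
        return "CRITICAL", f"request failed: {error}"
    return evaluate_status(body)
```

Keeping the evaluation separate from the HTTP request makes the logic easy to test without a running application.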

Notifications

By default only the makandra operations team receives our monitoring notifications. Service checks implemented for your server can send notifications to an e-mail address of your choice. Contact us if you want this.

Default service checks

The lists below show only part of the service checks created by makandra. These checks monitor the basic functionality of the hosted services.
We create at least one service check for every service we provide to our customers (no matter if it's Elasticsearch, PostgreSQL, Redis or something else).

Because thresholds change from time to time, and checks are added or removed, we can't supply a complete and current list.
If you think one of your services is not sufficiently monitored, don't hesitate to contact us!

Virtual or physical server

server time is correct
ntpd is running
processes not in state dead or zombie
mailq size not greater than a threshold
enough disk space free
enough swap free
number of logged in users not exceeding a limit
cron is running
load is not exceeding a threshold depending on the number of CPU cores of the server
puppet runs regularly and runs are successful
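The load check in this list scales its threshold with the number of CPU cores. A minimal Python sketch of that idea follows; the per-core limits of 1.0 (warning) and 2.0 (critical) and the function names are assumptions for illustration, not the values used in production.

```python
import os


def load_state(
    load_1min: float,
    cpu_cores: int,
    warn_per_core: float = 1.0,
    crit_per_core: float = 2.0,
) -> str:
    """Classify the load average relative to the number of CPU cores."""
    per_core = load_1min / max(cpu_cores, 1)
    if per_core >= crit_per_core:
        return "CRITICAL"
    if per_core >= warn_per_core:
        return "WARNING"
    return "OK"


def current_load_state() -> str:
    """Classify the load of the local machine (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    return load_state(load_1min, os.cpu_count() or 1)
```

With these assumed limits, a load of 3.0 is unremarkable on a 4-core server but would already be critical on a single-core machine.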

Physical server

out-of-band management is reachable (DRAC card or iLO)
no IPMI failures (hardware health check)

Backups

timestamp created on backup source is recent on backup destination
newest database backup is at most one day old (if you use a database)
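The timestamp check can be pictured like this: the backup source writes the current time into a file, the file travels to the destination with the backup, and the check verifies that the copy is recent. The Python sketch below assumes the file contains a Unix timestamp as text and uses a hypothetical maximum age of one day; the file format and names are illustrative only.

```python
import time
from pathlib import Path
from typing import Optional


def backup_is_fresh(
    timestamp_file: Path,
    max_age_seconds: float = 86400,
    now: Optional[float] = None,
) -> bool:
    """Return True if the timestamp stored in the file is recent enough.

    A missing or unreadable file counts as a stale backup.
    """
    current_time = time.time() if now is None else now
    try:
        written_at = float(timestamp_file.read_text().strip())
    except (OSError, ValueError):
        return False
    return current_time - written_at <= max_age_seconds
```

Checking a marker that was written on the source, rather than the modification time of files on the destination, also detects a backup job that runs but copies outdated data.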

GlusterFS (shared directories on application servers)

gluster directory is mounted
gluster has no split-brain
gluster memory usage

Website/application VHost (monitoring `/`)

response code of the VHost is in [200,301,302,401,403,404] (contact the operations team to change the default allowed response codes)
(if SSL is enabled) the SSL certificate is valid and not expiring soon
at least the first hostname (the primary hostname) of the VHost points via DNS to the configured IP
the state of the configured upstream servers (appservers behind the loadbalancer)
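The response-code check above can be sketched in Python as follows. The list of allowed codes is taken from this card; the function names and the use of `http.client` are choices made for this example and do not describe the actual check implementation.

```python
import http.client
from typing import Collection

# Default allowed response codes as listed in this card.
DEFAULT_ALLOWED = frozenset({200, 301, 302, 401, 403, 404})


def status_is_acceptable(
    status: int,
    allowed: Collection[int] = DEFAULT_ALLOWED,
) -> bool:
    """Return True if the HTTP status code is in the allowed set."""
    return status in allowed


def fetch_status(hostname: str, use_tls: bool = True, timeout: float = 10.0) -> int:
    """Request `/` from the VHost and return the raw status code.

    http.client does not follow redirects, so a 301 or 302 is reported
    as such instead of the status of the redirect target.
    """
    connection_class = (
        http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    )
    connection = connection_class(hostname, timeout=timeout)
    try:
        connection.request("GET", "/")
        return connection.getresponse().status
    finally:
        connection.close()
```

With the default list, any 5xx response fails the check. If your application legitimately answers `/` with a different code, the allowed set is the part that would need to change.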

Elasticsearch

Cluster state of Elasticsearch (must be green)

ClamAV

check for recent version of the database

MySQL

Replication is healthy
Number of connections
MySQL is reachable

nginx / Apache

number of connections not exceeding 75% of the maximum
VHost checks (see Website/application)

Redis

Redis answers on the configured port (PING, PONG)
Redis dump size is not exceeding a limit
Redis cluster nodes are in the configured state (MASTER, SLAVE)
(with sentinel) sentinel answers on the configured port (PING, PONG)
(with sentinel) enough sentinels are available to reach the quorum needed to trigger a failover
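The PING/PONG checks in this list can be sketched with a plain TCP socket in Python. This example assumes an instance without authentication that listens on the given host and port; the function names are hypothetical, and a production check would more likely use a Redis client library.

```python
import socket


def is_pong(reply: bytes) -> bool:
    """Return True if the server answered PING with a simple PONG reply."""
    return reply.strip() == b"+PONG"


def redis_answers(host: str = "127.0.0.1", port: int = 6379, timeout: float = 3.0) -> bool:
    """Send an inline PING command and check for PONG.

    Connection errors and timeouts are treated as a failed check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as connection:
            connection.sendall(b"PING\r\n")
            return is_pong(connection.recv(64))
    except OSError:
        return False
```

The same probe works against a sentinel by pointing it at the sentinel port. An instance that requires authentication answers with an error reply instead of `+PONG`, so this sketch would report it as failing.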

PostgreSQL

number of connections is not exceeding a limit
connection to PostgreSQL is possible
replication is in a healthy state (or delay is not too high)
no queries running too long
PostgreSQL time is synchronous to system time
transaction time not exceeding a limit
locks not exceeding a limit or time threshold
PostgreSQL instance must have the configured replication state (MASTER, SLAVE)

Claus-Theodor Riegg

Last edit

2021-08-09

Marius Schuller

License

Source code in this card is licensed under the MIT License.

Posted by Claus-Theodor Riegg to opscomplete (2017-03-31 11:33)