Posted over 1 year ago. Visible to the public.

Monitoring

Managed Hosting includes a range of monitoring methods to ensure stable operation of your servers and applications.

Corresponding checks for all services are created automatically by our configuration management. If we install a server, there will be host performance checks. If we install a web server, there will be HTTP checks, and if we install PostgreSQL, there will be PostgreSQL service checks.

The accuracy of the checks is limited by the complexity of each service setup (for example, we can't automatically monitor your application's status JSON returned by a custom URL).

We don't just check thresholds and get notified when they are exceeded. Some of our metrics (like load or disk usage) are also recorded so that we can analyze past problems or forecast possible future issues.
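
To illustrate how recorded metrics can support forecasting, here is a minimal sketch that fits a straight line to disk usage samples and estimates how long it would take for the disk to fill up. The function name, the sample format and the linear model are assumptions made for this example; they do not describe the tooling makandra actually uses.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Estimate the days left until disk usage reaches 100%.

    samples: (day, used_percent) pairs, a hypothetical format for
    recorded metrics. Returns None if usage is flat or shrinking,
    or if there is not enough data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    denom = sum((x - mean_x) ** 2 for x, _ in samples)
    if denom == 0:
        return None
    # Least-squares slope: growth in percentage points per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / denom
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = max(x for x, _ in samples)
    day_full = (100.0 - intercept) / slope
    # Never report a negative number of days.
    return max(0.0, day_full - last_day)
```

A linear fit is the simplest possible model. Real usage often grows in bursts, so a forecast like this is only a rough indication.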

Custom service checks

Before you get lost in the long list of service checks we already manage, you should know that we can implement service checks customized for your environment and your application. The effort and cost of creating such checks can vary heavily, so please ask before implementing something in your service that we could only monitor with huge effort. An e-mail to ops@makandra.de describing what you want to achieve is enough.

Notifications

By default, only the makandra operations team receives our monitoring notifications. Service checks implemented for your server can send notifications to an e-mail address of your choice. Contact us if you want this.

Default service checks

This is only a part of the service checks created by makandra. These checks monitor the basic functionality of the hosted services.
We create at least one service check for every service we supply to our customers (no matter if it's Elasticsearch, PostgreSQL, Redis or something else).

As these checks change their thresholds from time to time, and new checks are added or old ones removed, we can't supply a complete list at all times.
If you think one of your services is not sufficiently monitored, don't hesitate to contact us.

Virtual or physical server

  • server time is correct
  • ntpd is running
  • processes not in state dead or zombie
  • mailq size not greater than a threshold
  • enough disk space free
  • enough swap free
  • number of logged in users not exceeding a limit
  • cron is running
  • load is not exceeding a threshold depending on the number of CPU cores of the server
  • puppet runs regularly and runs are successful
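
As an illustration of a load threshold that depends on the number of CPU cores, the following sketch classifies a load average relative to the core count. The function name, the status labels and the per-core limits are hypothetical values chosen for this example, not the thresholds makandra uses.

```python
def load_status(load1: float, cores: int,
                warn_per_core: float = 1.0,
                crit_per_core: float = 2.0) -> str:
    """Classify a load average relative to the number of CPU cores.

    warn_per_core and crit_per_core are assumed example limits.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Normalizing by core count lets one rule cover servers of any size.
    per_core = load1 / cores
    if per_core >= crit_per_core:
        return "CRITICAL"
    if per_core >= warn_per_core:
        return "WARNING"
    return "OK"
```

On a Unix-like system the inputs could come from `os.getloadavg()` and `os.cpu_count()` in the Python standard library.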

Physical server

  • out of band management is reachable (DRAC-Card or ILO)
  • no IPMI failures (hardware health check)

Backups

  • timestamp created on backup source is recent on backup destination
  • newest database backup is at most one day old (if you use a database)
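
The timestamp check can be pictured as follows: the backup source writes a marker with the current time, the marker travels with the backup, and the check on the destination compares its age to a limit. This is a sketch under assumptions: the marker format (Unix epoch seconds as text), the function name and the default limit of one day are made up for illustration.

```python
import time
from typing import Optional


def backup_is_recent(timestamp_text: str,
                     max_age_seconds: float = 86400.0,
                     now: Optional[float] = None) -> bool:
    """Return True if the marker timestamp is at most max_age_seconds old.

    timestamp_text: contents of a hypothetical marker file written on the
    backup source (Unix epoch seconds) and copied along with the backup.
    """
    current = time.time() if now is None else now
    written_at = float(timestamp_text.strip())
    age = current - written_at
    # A negative age means the marker lies in the future (for example due
    # to clock skew); this sketch treats that as a failure as well.
    return 0 <= age <= max_age_seconds
```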

GlusterFS (shared directories on application servers)

  • gluster directory is mounted
  • gluster has no split-brain
  • gluster memory usage

Website/application vhost (monitoring /)

  • response code of the vhost is in [200, 301, 302, 401, 403, 404] (contact the operations team to change the default allowed response codes)
  • (if SSL is enabled) the SSL certificate is valid and not expiring soon
  • at least the first hostname (the primary hostname) of the vhost points via DNS to the configured IP
  • the state of the configured upstream servers (app servers behind the load balancer)
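
The response code and certificate checks above can be sketched with the Python standard library. The function names and the way the inputs are obtained are assumptions for illustration; only the list of default allowed response codes is taken from this card.

```python
import http.client
import ssl
import time
from typing import Iterable, Optional

# Default allowed response codes as listed in this card.
DEFAULT_ALLOWED_CODES = (200, 301, 302, 401, 403, 404)


def response_code_ok(status: int,
                     allowed: Iterable[int] = DEFAULT_ALLOWED_CODES) -> bool:
    """Return True if the vhost answered with an allowed status code."""
    return status in set(allowed)


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days until a certificate expires.

    not_after uses the certificate time format understood by
    ssl.cert_time_to_seconds, e.g. "Jan  1 00:00:00 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def fetch_status(host: str, use_tls: bool = True, timeout: float = 10.0) -> int:
    """Request "/" and return the status code without following redirects."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    finally:
        conn.close()
```

Because `http.client` does not follow redirects, a 301 or 302 is reported as such, which matches a check that accepts those codes directly.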

Elasticsearch

  • cluster state of Elasticsearch (must be green)
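
Elasticsearch reports its cluster state in the `status` field of the cluster health API response (`_cluster/health`), with the values `green`, `yellow` or `red`. A minimal sketch of evaluating such a response could look like this; the function name is hypothetical, and how the JSON is fetched is left out.

```python
import json


def cluster_is_green(health_json: str) -> bool:
    """Return True only if the cluster health response reports "green".

    health_json: body of a cluster health response as a JSON string.
    A missing or unexpected status counts as not green.
    """
    health = json.loads(health_json)
    return health.get("status") == "green"
```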

ClamAV

  • the virus signature database contains recent data

MySQL

  • Replication is healthy
  • Number of connections
  • MySQL is reachable

Nginx / Apache

  • number of connections not exceeding 75% of the maximum
  • Vhost checks (see Website/application)

Redis

  • redis answers on the configured port (PING, PONG)
  • redis dump size is not exceeding a limit
  • redis cluster nodes are in the configured state (MASTER, SLAVE)
  • (with sentinel) sentinel answers on the configured port (PING, PONG)
  • (with sentinel) enough sentinels are available to reach the quorum needed to trigger a failover
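
The PING/PONG check relies on the Redis protocol: a server that receives `PING` answers with the simple string `+PONG`. The sketch below shows the idea with a plain socket. The function names and the timeout are assumptions, and an instance that requires authentication would answer with an error instead, which this sketch reports as a failure.

```python
import socket


def is_pong(reply: bytes) -> bool:
    """Return True if the raw reply is the Redis simple string PONG."""
    return reply.startswith(b"+PONG")


def redis_ping(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send an inline PING to host:port and check for a PONG reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            reply = sock.recv(64)
    except OSError:
        # Connection refused, timeout or any other socket error.
        return False
    return is_pong(reply)
```

Sentinel understands `PING` as well, so the same function could be pointed at the sentinel port.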

PostgreSQL

  • number of connections is not exceeding a limit
  • connection to PostgreSQL is possible
  • replication is in a healthy state (or delay is not too high)
  • no queries running too long
  • PostgreSQL time is in sync with system time
  • transaction time not exceeding a limit
  • locks not exceeding a limit or time threshold
  • PostgreSQL instance must have the configured replication state (MASTER, SLAVE)

Owner of this card:

Claus-Theodor Riegg
Last edit:
about 1 year ago
by Claus-Theodor Riegg
Posted by Claus-Theodor Riegg to opscomplete