300 Monitoring


Managed Hosting includes a wide variety of monitoring methods to ensure stable operation of your servers and applications.

Our configuration management automatically creates the corresponding checks for every service. When we set up a new server, basic performance checks are generated. When we install a web server, HTTP checks are generated, and when we install a new PostgreSQL server, PostgreSQL service checks are generated.

The accuracy of the checks is limited by the complexity of each service setup. For example, we can't automatically monitor a status-check JSON document that your application returns via a custom URL.
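
To illustrate why such a check has to be written per application, here is a minimal sketch of how a status JSON document could be turned into a check result. The field name `status` and its values `ok` and `degraded` are hypothetical, and the exit codes follow the convention common to Nagios-compatible plugins, which is an assumption about the monitoring system rather than a description of our setup.

```python
import json
from typing import Tuple

# Exit codes as used by Nagios-compatible plugins (assumed convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_status(body: str) -> Tuple[int, str]:
    """Map a hypothetical status JSON document to (exit_code, message)."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return UNKNOWN, "UNKNOWN - response is not valid JSON"
    if not isinstance(data, dict):
        return UNKNOWN, "UNKNOWN - response is not a JSON object"

    status = data.get("status")  # hypothetical field name
    if status == "ok":
        return OK, "OK - application reports healthy"
    if status == "degraded":
        return WARNING, "WARNING - application reports degraded state"
    return CRITICAL, f"CRITICAL - application reports status {status!r}"
```

Every application defines its own fields and semantics, so the mapping in `evaluate_status` would need to be agreed on for each project.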

We don't just check thresholds and receive notifications when they are exceeded. Many of our metrics (e.g. load, memory or disk usage) are also recorded as time series, so that we can analyze past problems and forecast possible future issues.
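
As an illustration of forecasting from a recorded time series, the following sketch fits a straight line through disk usage samples and estimates when the disk would be full. It is a simplified example under the assumption of roughly linear growth, not the method our tooling uses. The function name `days_until_full` and the sample format are made up for this example.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[float, float]],
                    limit: float = 100.0) -> Optional[float]:
    """Estimate the days until usage reaches `limit` with a least-squares line.

    `samples` holds (day, used_percent) pairs. Returns None when there are
    fewer than two distinct days or when usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None  # all samples were taken on the same day

    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None  # usage is flat or shrinking, no forecast possible

    intercept = mean_y - slope * mean_x
    day_full = (limit - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, day_full - last_day)
```

For example, samples of 50%, 52% and 54% on three consecutive days grow by 2 percentage points per day, so the sketch reports 23 days remaining after the last sample.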

Custom service checks

Before you get lost in a long list of service checks already managed by us, you should know that we can implement service checks customized for your environment and your application. The effort and costs for creating such checks can vary heavily. Please ask us via ops@makandra.de before going on a monitoring excursion without us.

Notifications

By default, only the makandra operations team receives our monitoring notifications. Service checks implemented for your server can also send notifications to an e-mail address of your choice. Contact us if you want this.

Default service checks

This is only a part of the service checks created by makandra. These checks monitor the basic functionality of the hosted services.
We create at least one service check for every service we supply to our customers (no matter if it's Elasticsearch, PostgreSQL, Redis or something else).

Because thresholds change from time to time, new checks are added and old ones are removed, we can't supply a complete and current list.
If you think one of your services is not sufficiently monitored, don't hesitate to contact us!

Virtual or physical server

  • server time is correct
  • ntpd is running
  • processes not in state dead or zombie
  • mailq size not greater than a threshold
  • enough disk space free
  • enough swap free
  • number of logged in users not exceeding a limit
  • cron is running
  • load is not exceeding a threshold depending on the number of CPU cores of the server
  • puppet runs regularly and runs are successful
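
The load check above scales with the number of CPU cores. A minimal sketch of that idea follows. The per-core factors `warn_per_core` and `crit_per_core` are hypothetical defaults chosen for illustration, not our actual thresholds.

```python
import os

OK, WARNING, CRITICAL = 0, 1, 2


def load_state(load: float, cores: int,
               warn_per_core: float = 1.0,
               crit_per_core: float = 2.0) -> int:
    """Compare a load average against thresholds scaled by core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if load >= crit_per_core * cores:
        return CRITICAL
    if load >= warn_per_core * cores:
        return WARNING
    return OK


def local_load_state() -> int:
    """Evaluate the 5-minute load average of this host (Unix only)."""
    _, load5, _ = os.getloadavg()
    return load_state(load5, os.cpu_count() or 1)
```

With these example factors, a load of 5 is a warning on a 4-core server but perfectly fine on a 16-core server.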

Physical server

  • out-of-band management is reachable (DRAC card or iLO)
  • no IPMI failures (hardware health check)

Backups

  • the timestamp written on the backup source is recent on the backup destination
  • the newest database backup is at most one day old (if you use a database)
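
The "at most one day old" rule can be sketched as a simple age comparison. The helper names and the idea of reading file modification times from a backup directory are assumptions for this example. How backups are actually stored and checked may differ.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional


def backup_is_recent(newest_backup: datetime, now: datetime,
                     max_age: timedelta = timedelta(days=1)) -> bool:
    """Return True when the newest backup is at most `max_age` old."""
    return now - newest_backup <= max_age


def newest_backup_time(directory: str) -> Optional[datetime]:
    """Return the newest file modification time in a (hypothetical) backup directory."""
    mtimes = [p.stat().st_mtime for p in Path(directory).iterdir() if p.is_file()]
    if not mtimes:
        return None  # no backup files found at all
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)
```

A check would combine both: call `newest_backup_time`, treat `None` as a failure, and pass the result to `backup_is_recent` together with `datetime.now(timezone.utc)`.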

GlusterFS (shared directories on application servers)

  • gluster directory is mounted
  • gluster has no split-brain
  • gluster memory usage

Website/application VHost (monitoring /)

  • response code of the VHost is in [200,301,302,401,403,404] (contact operations team to change default allowed response codes)
  • (if SSL is enabled) the SSL certificate is valid and not expiring soon
  • at least the first (primary) hostname of the VHost points via DNS to the configured IP
  • the configured upstream servers (application servers behind the load balancer) are in a healthy state
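
The response code rule for `/` can be sketched as a set membership test. The set below repeats the default codes from the list above. The function names are made up for this example, and the request helper is only one possible way to obtain the status code.

```python
import http.client
from typing import Collection

# Default allowed response codes as listed above.
DEFAULT_ALLOWED = frozenset({200, 301, 302, 401, 403, 404})


def status_is_allowed(status: int,
                      allowed: Collection[int] = DEFAULT_ALLOWED) -> bool:
    """Return True when the response code counts as healthy."""
    return status in allowed


def fetch_status(host: str, path: str = "/", use_tls: bool = True,
                 timeout: float = 10.0) -> int:
    """Request `path` and return the raw status code.

    http.client does not follow redirects, so a 301 or 302 is reported
    as such instead of the status of the redirect target.
    """
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Codes such as 401 or 404 are accepted by default because they still show that the web server and VHost are answering. A stricter set like `{200}` can be passed as `allowed` when the root path must render successfully.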

Elasticsearch

  • Cluster state of Elasticsearch (must be green)

ClamAV

  • check for recent version of the database

MySQL

  • Replication is healthy
  • Number of connections
  • MySQL is reachable

nginx / Apache

  • number of connections not exceeding 75% of the maximum
  • VHost checks (see Website/application)

Redis

  • Redis answers on the configured port (PING, PONG)
  • Redis dump size is not exceeding a limit
  • Redis cluster nodes are in the configured state (MASTER, SLAVE)
  • (with sentinel) sentinel answers on the configured port (PING, PONG)
  • (with sentinel) enough sentinels are reachable to form the quorum needed to trigger a failover
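
The PING/PONG checks above can be sketched with a plain TCP socket. Redis accepts `PING` as an inline command and answers with the simple string `+PONG`. The function names are hypothetical, and the sketch assumes an instance without authentication or TLS.

```python
import socket


def is_pong(reply: bytes) -> bool:
    """Return True when the reply is the Redis simple string PONG."""
    return reply == b"+PONG\r\n"


def ping_redis(host: str, port: int = 6379, timeout: float = 3.0) -> bool:
    """Send an inline PING to a Redis-compatible port and check the reply.

    An instance that requires authentication answers with an error line
    starting with "-" instead, which this sketch treats as a failure.
    The short reply is assumed to arrive in a single recv() call.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"PING\r\n")
        return is_pong(sock.recv(64))
```

The same call could be pointed at a sentinel port (commonly 26379), since sentinels also answer `PING`.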

PostgreSQL

  • number of connections is not exceeding a limit
  • connection to PostgreSQL is possible
  • replication is in a healthy state (or delay is not too high)
  • no queries running too long
  • PostgreSQL time is synchronous to system time
  • transaction time not exceeding a limit
  • locks not exceeding a limit or time threshold
  • PostgreSQL instance must have the configured replication state (MASTER, SLAVE)
License
Source code in this card is licensed under the MIT License.
Posted by Claus-Theodor Riegg to opscomplete (2017-03-31 11:33)