In Linux System Monitoring, Part 2, we looked at trend and log monitoring. We continue in this article with a closer examination of exactly what should be monitored.
Most monitoring systems will monitor a wide range of parameters out of the box, including:
- free partition space for every disk partition
- validity of SSL certificates
- that specific applications are running
- mounts of network disks
- data replication status (databases, LDAP)
- UPS status
- RAID status
It’s also possible to write custom checks to monitor additional parameters, which might be operational parameters or business-specific parameters. Such custom checks might include:
- status of housekeeping tasks (database dumps, cron jobs, log rotation)
- status of backups (age, errors, run time)
- outstanding security updates to install
- whether the “orders” database table has been updated in the last ten minutes
In reality, the list is almost endless. One popular open source monitoring tool is Nagios, which is both a monitoring system in its own right and the basis of a number of other monitoring systems. The principle Nagios uses (as do many systems not based on Nagios) is to run a small program on the client server and report the outcome.
From the perspective of Nagios, there are four possible “statuses” when the client program has finished running:
- OK
- Warning
- Critical
- Internal error
The last is just a safety check which should be interpreted as “The monitoring process itself has a problem”. In the case of the Nagios web interface, hosts and services that are “OK” are displayed with a green background, those in state “Warning” with a yellow background, and those in state “Critical” with a red background. Any checks that give an internal error are displayed with an amber background.
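Nagios check programs conventionally signal these statuses through their exit code: 0 for OK, 1 for Warning, 2 for Critical and 3 for what Nagios itself calls Unknown (the “internal error” case). The sketch below captures that mapping. The names `Status` and `format_result` are made up for illustration and are not part of Nagios.

```python
from enum import IntEnum


class Status(IntEnum):
    """Conventional exit codes for a Nagios check program."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3  # the "internal error" case: the check itself has a problem


def format_result(status, message):
    """Return the one line of text a check would print for display."""
    return f"{status.name}: {message}"
```

A check would print the formatted line and then exit with `int(status)`, which is how the monitoring system learns which background colour to use.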
Let’s look at a simple example. Suppose we have a job that runs every night and produces a report. A simple check that can be incorporated into Nagios might be:
Does the report file exist?
Yes: output "Report present" and set status to OK
No: output "Report not present" and set status to Critical
If the report file is present, Nagios will display this check with a green background with the text, “Report present”, otherwise the check will have a red background and the text “Report not present”.
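As a minimal sketch, the check could look something like this in Python. The function name `check_report_exists` is hypothetical, and the numeric codes assume the usual Nagios convention of 0 for OK and 2 for Critical.

```python
import os

OK, CRITICAL = 0, 2  # conventional Nagios exit codes (assumed here)


def check_report_exists(path):
    """Return (exit_code, message) depending on whether the report file exists."""
    if os.path.isfile(path):
        return OK, "Report present"
    return CRITICAL, "Report not present"

# As a real check program, one would print the message and then call
# sys.exit() with the exit code so the monitoring system can read both.
```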
That test is very simplistic in that there’s no guarantee that the report was actually created in the previous 24 hours, but it’s not hard to extend the test to fix that (and other problems).
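One way to add the age condition is sketched below. It uses the file's modification time as a stand-in for "when the report was created", which is an assumption: Linux does not reliably expose a creation time, and a job that rewrites the file in place will update the modification time anyway. The function name and the 24 hour default are illustrative.

```python
import os
import time

OK, CRITICAL = 0, 2  # conventional Nagios exit codes (assumed here)
MAX_AGE_SECONDS = 24 * 60 * 60


def check_report_fresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return (exit_code, message); Critical if the report is missing or stale."""
    if not os.path.isfile(path):
        return CRITICAL, "Report not present"
    now = time.time() if now is None else now
    age = now - os.path.getmtime(path)  # seconds since last modification
    if age > max_age:
        return CRITICAL, f"Report is stale ({age / 3600:.1f} hours old)"
    return OK, "Report present"
```

The `now` parameter exists only so the function can be exercised with a fixed clock; a real check would leave it unset.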
The resulting test will be far more robust, but maybe one day we discover that the report file exists, it was created in the previous 24 hours, but it is a zero-byte (i.e. empty) file. Something in the report creation process failed, but our monitoring script didn’t detect that.
The correct approach now is to consider that we have two problems:
- Our report-creating job has failed in some way, resulting in a zero-byte report file.
- Our monitoring has failed because it didn’t report the problem.
It’s straightforward to modify the check to also ensure that the report file is longer than zero bytes, and that in turn improves our monitoring system. Taking this approach – that every failure our monitoring system doesn’t detect is actually two failures – quickly leads to a robust monitoring system.
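Because a check like this tends to grow one condition at a time, it can help to structure it so that each newly discovered failure becomes one more entry in a list. The following is only a sketch under that assumption. The name `check_report`, the messages and the use of modification time for "age" are all illustrative choices.

```python
import os
import time

OK, CRITICAL = 0, 2  # conventional Nagios exit codes (assumed here)


def check_report(path, max_age=24 * 60 * 60, now=None):
    """Run each condition in order; the first one that fails decides the result."""
    now = time.time() if now is None else now
    # Each entry is (condition, message if the condition fails). The lambdas
    # are evaluated lazily, so later ones never touch a missing file.
    checks = [
        (lambda: os.path.isfile(path), "Report not present"),
        (lambda: now - os.path.getmtime(path) <= max_age, "Report is older than expected"),
        (lambda: os.path.getsize(path) > 0, "Report is empty"),
    ]
    for passed, failure_message in checks:
        if not passed():
            return CRITICAL, failure_message
    return OK, "Report present"
```

The next time the report job fails in a way the check misses, the fix to the monitoring is one more line in `checks`.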
As well as checks that are proactively carried out by the monitoring system (disk space, page load times, etc.), many monitoring systems allow external processes to create status reports. For example, in the case of backups, the backup process may detect a problem and directly notify the monitor. A good habit is for routine home-grown scripts to report their status directly to the monitoring system.
Using external notifications in this way is more efficient than waiting for the monitoring server to check the backups itself. It also reduces the complexity of the monitoring system: rather than having it understand how to check backups, it passively reports what it has been told by the backup process.
It’s also simpler. The goal should be to have only one place to go to understand the status of the infrastructure. A mixture of “status emails” and multiple “dashboards”, all reporting on different elements of the infrastructure, is not nearly as helpful as One Place To Go For Everything.
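To illustrate a script reporting its own status: in Nagios, one mechanism for this is a "passive check result", submitted as a `PROCESS_SERVICE_CHECK_RESULT` line written to the Nagios external command file. The sketch below builds such a line. The helper names are invented for this example, the host and service names must match what is defined in your Nagios configuration, and the location of the command file varies between installations, so it is passed in rather than assumed.

```python
import time


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one external-command line reporting a service check result."""
    if timestamp is None:
        timestamp = int(time.time())
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


def submit_passive_result(command_file, line):
    """Write the line to the command file (a named pipe read by Nagios)."""
    # Opening a named pipe for writing blocks until a reader is attached,
    # so this assumes the Nagios daemon is running on the same machine.
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

A backup script might end by calling `format_passive_result("backup01", "Nightly backup", 2, "Backup failed: disk full")` and submitting the result, where the host name, service name and message are placeholders. When the script runs on a different machine from the monitoring server, a forwarding tool is needed instead of a direct write.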
In the next article in this series, we’ll look at how to use monitoring as a practical part of your day to day operations.