Linux System Monitoring, Part 1

Effective system monitoring is key to keeping your servers available. This is the first in a series of articles where we’ll look at exactly what that means. We won’t get too technical, but even if you believe you already have effective system monitoring in place, you may want to check that you’ve incorporated all of these suggestions.

Disk Failures

System monitoring is about more than just reducing downtime. Take disk drives, for example. How likely is it that a disk will fail? Research by Google suggests that, for disks aged 2-5 years, an annual failure rate between 6% and 9% is realistic. If you have 15 drives installed across your servers – a very modest number of disks – you should expect around one failure per year. So you are going to experience disk failure, and soon.
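The arithmetic behind that estimate can be sketched in a few lines of Python. This is only an illustration: the function names are our own, and the second calculation assumes that drives fail independently of one another, which is a simplification.

```python
def expected_failures(drives: int, annual_failure_rate: float) -> float:
    """Expected number of drive failures in one year."""
    return drives * annual_failure_rate


def prob_at_least_one_failure(drives: int, annual_failure_rate: float) -> float:
    """Chance of one or more failures in a year.

    Assumes failures are independent, which is a simplification.
    """
    return 1 - (1 - annual_failure_rate) ** drives


# 15 drives at the low and high ends of the 6%-9% range quoted above
for afr in (0.06, 0.09):
    print(
        f"AFR {afr:.0%}: expected failures/year = {expected_failures(15, afr):.2f}, "
        f"P(at least one) = {prob_at_least_one_failure(15, afr):.0%}"
    )
```

For 15 drives this gives between 0.9 and 1.35 expected failures per year, and under the independence assumption a chance of at least one failure in the year of roughly 60% to 76%.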

The impact of a disk failure is high: it’s likely that some, maybe all, data on that disk will be lost. However, that’s no problem if you have RAID implemented in such a way as to have redundant disks. The data on the failed disk has gone, but that data is replicated on other disks.

Part of the point of RAID is to make disk failures transparent. So long as you are monitoring the health of the RAID arrays, you’ll be aware of, and can react to, the disk failure. If you’re not monitoring them, you’ll only find out that you have a problem when sufficient disks have failed to give you data loss.
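As a rough sketch of what a RAID health check might look like, the Python below scans text in the format of /proc/mdstat for arrays with missing members. It assumes Linux software RAID (md), which is only one of several ways to implement RAID; hardware controllers need their vendor's own tools. The function name and the sample text are made up for illustration.

```python
import re

# Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
ARRAY_RE = re.compile(r"^(md\w+)\s*:\s*(\w+)")
# Member status, e.g. "[2/2] [UU]" (healthy) or "[2/1] [_U]" (one member missing)
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text: str) -> list:
    """Return the names of md arrays that report fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Illustrative sample only; on a real system, read the text from /proc/mdstat
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630464 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[0](F)
      104792064 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

A check like this would be run by the monitoring system at regular intervals, with a non-empty result raising an alert.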

Even if you detect the failed disk, you could be unlucky. You wouldn’t be the first person to replace a RAID disk and have another disk fail during the RAID array rebuild. This is not uncommon, for two reasons. Firstly, it’s likely that all the disks in the RAID array are the same age, from the same batch and from the same manufacturer, and are thus likely to have a similar lifetime. Secondly, rebuilding a RAID array will stress the remaining disks, possibly enough to cause one more to fail.

But that’s OK: you have backups – but they need to be monitored too. Was the last backup successful? If it was successful, how old is it? Monitoring the success and age of the most recent backup is a sensible policy.
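A minimal sketch of such a check in Python follows. It assumes the backup is a single file, and it treats "exists and is not empty" as a stand-in for success; a real check should also use the exit status or log of whatever backup tool is in use. The function name, the 26-hour limit and the example path are our own illustrative choices.

```python
import os
import time


def check_backup(path, max_age_hours=26.0, min_size_bytes=1, now=None):
    """Return (ok, message) describing the most recent backup file.

    `now` can be supplied (seconds since the epoch) to make testing easier.
    """
    now = time.time() if now is None else now
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return False, f"CRITICAL: no backup found at {path}"

    # An empty file suggests the backup job started but did not complete
    if info.st_size < min_size_bytes:
        return False, f"CRITICAL: backup at {path} is only {info.st_size} bytes"

    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        return False, f"WARNING: backup at {path} is {age_hours:.1f} hours old"
    return True, f"OK: backup at {path} is {age_hours:.1f} hours old"


if __name__ == "__main__":
    # Hypothetical path, for illustration only
    ok, message = check_backup("/var/backups/nightly.tar.gz")
    print(message)
```

The 26-hour default allows a nightly backup a couple of hours of slack before it is reported as stale.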

So we can see that monitoring is not simply about making sure services are available. It’s also about checking that our safety measures – RAID disks, backups, etc. – are working as intended.

Types of Monitoring

We will consider three types of monitoring:

Status monitoring
Trend monitoring
Log monitoring

Status Monitoring

The principle of status monitoring is simple: periodically, the monitoring system connects to each server in turn and runs some checks, usually with the aid of a locally-installed agent. This takes place every five minutes or so. The results of those checks are passed back to the monitoring server, which will typically:

record the data for later analysis if required
present the results, perhaps via a web page or other program
instigate notifications as required
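
The three responsibilities listed above can be sketched as a small Python class. This is not how any particular monitoring product works; the class and field names are hypothetical, and notifying only when a status changes is our own design choice to avoid repeating the same alert every five minutes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CheckResult:
    host: str
    check: str
    status: str  # "OK", "WARNING" or "CRITICAL"
    detail: str
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class MonitoringServer:
    def __init__(self, notify):
        self.history = []     # every result, recorded for later analysis
        self.current = {}     # latest result per (host, check), for presentation
        self.notify = notify  # callable that delivers a notification

    def handle(self, result):
        """Record a result, update the current view, and notify on a change."""
        self.history.append(result)
        key = (result.host, result.check)
        previous = self.current.get(key)
        self.current[key] = result
        # A check we have not seen before is treated as having been OK
        previous_status = previous.status if previous else "OK"
        if result.status != previous_status:
            self.notify(result)

    def summary(self):
        """One line per check, suitable for a simple status page."""
        return [
            f"{host} {check}: {r.status} ({r.detail})"
            for (host, check), r in sorted(self.current.items())
        ]


sent = []
server = MonitoringServer(notify=sent.append)
server.handle(CheckResult("web1", "disk", "OK", "42% used"))
server.handle(CheckResult("web1", "disk", "WARNING", "83% used"))
server.handle(CheckResult("web1", "disk", "WARNING", "84% used"))
print(server.summary())
print(len(sent), "notification(s) sent")
```

In this sketch a recovery (a return to OK) also triggers a notification, so whoever received the warning learns that the problem has cleared.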

The monitoring server will usually also facilitate:

running single checks, or perhaps all checks for a specific client, on demand
creation of reports showing host or service availability statistics, periods of downtime, notifications issued, and so on
scheduling downtime for a host or service such that checks (and notifications) are suspended
the acknowledgement of issues reported
the recording of performance data – for example, CPU load

An example of status monitoring would be monitoring how full a disk partition is. If we are alerted once a partition is 80% full, we can investigate. Crucially, at that point there is nothing wrong with the system: all that has happened is that the partition has gone from below 80% full to above 80% full. It may be that someone has saved a number of large temporary files or that user data in general has increased, or even that there is a system problem that is causing free disk space to be used, but the warning should allow us to resolve the problem before it causes an unscheduled outage.
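A check of this kind might look something like the following Python sketch, which uses the standard library's shutil.disk_usage. The 80% warning level comes from the example above; the 95% critical level and the function names are our own assumptions. Note that the percentage is computed as used space over total space, which may differ slightly from the figure reported by df.

```python
import shutil


def classify_usage(percent_used, warn_at=80.0, crit_at=95.0):
    """Map a percentage to a status string."""
    if percent_used >= crit_at:
        return "CRITICAL"
    if percent_used >= warn_at:
        return "WARNING"
    return "OK"


def check_partition(mount_point="/", warn_at=80.0, crit_at=95.0):
    """Return (status, percent_used) for the filesystem holding mount_point."""
    usage = shutil.disk_usage(mount_point)
    # Guard against pseudo-filesystems that report a total size of zero
    percent = usage.used / usage.total * 100 if usage.total else 0.0
    return classify_usage(percent, warn_at, crit_at), percent


if __name__ == "__main__":
    status, percent = check_partition("/")
    print(f"{status}: / is {percent:.1f}% full")
```

Keeping the threshold logic in its own function makes it easy to test without needing a partition that happens to be nearly full.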

Summary

The concept of system monitoring has been around for a long time, but to use it effectively we need to think carefully about exactly what should be monitored.

In the next article, we’ll look at trend and log monitoring.

