Server monitoring: a practical example

Users normally save their documents, spreadsheets and so on to a network disk on the server. One day, a user tries to do that and gets an error. The diagnostic process, which costs time and money, begins. Is this a problem with the user, the program, the user's PC or something else? Does rebooting the PC help? Are other users affected? After a while, it becomes apparent that the problem is on the server, but even before it is resolved there has been considerable disruption, probably to multiple users.

This is a situation where server monitoring would have prevented any disruption.

Types of monitoring

There are three types of remote monitoring: status monitoring, trend monitoring and event monitoring. Let's put event monitoring to the side for the moment and consider the other two.

Status monitoring

Status monitoring looks at the state of the server right now. There are a number of parameters that may be monitored including, but not limited to:

  • How much free disk space is available
  • How busy the server is
  • How much of the available memory is being used
  • The speed of the fans inside the server
  • The temperature inside the server
  • Whether essential services are running
  • Whether any security updates need to be installed

In reality, many more parameters are measured. Each parameter has defined acceptable limits: for example, it may be that a disk is permitted to be up to 80% full. Once that threshold is breached, the support staff are notified and corrective action can be taken. In some cases the support staff can resolve the problem without involving the customer at all. This, however, is only half the story.
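The threshold check described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring package works: the function names `evaluate` and `check_disk` are hypothetical, the 80% limit is the example figure from the text, and a real system would notify support staff rather than print a message.

```python
import shutil
from typing import Optional

# Example limit from the text: a disk may be up to 80% full.
DISK_LIMIT_PERCENT = 80.0


def evaluate(name: str, value: float, limit: float) -> Optional[str]:
    """Return an alert message if `value` exceeds `limit`, otherwise None."""
    if value > limit:
        return f"ALERT: {name} at {value:.1f}% (limit {limit:.0f}%)"
    return None


def check_disk(path: str = "/", limit: float = DISK_LIMIT_PERCENT) -> Optional[str]:
    """Measure how full the disk holding `path` is and compare it to the limit."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = 100.0 * usage.used / usage.total
    return evaluate(f"disk {path}", percent_used, limit)


if __name__ == "__main__":
    # Hypothetical notification step: here we only print the result.
    print(check_disk("/") or "OK: disk usage within limits")
```

The same `evaluate` function could be reused for any other parameter expressed as a percentage, such as memory use, each with its own limit.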

Trend monitoring

While status monitoring is helpful in identifying problems when they begin to make themselves apparent, trend monitoring is concerned with looking at various system parameters over a longer period of time. Many of the same parameters are measured, but they are displayed as graphs, which allows reasonable predictions to be made. The graph below shows the disk usage over time of our example server.

[Image: graph of disk usage over time for the example server]

It can be seen that one part of the disk, represented by the top line, was filling up between June and early October. Once it breached the 80% mark, the system status monitor alerted support staff. By looking at the trend graph, a judgment was made that, unless something was done, the system would run out of space in about two months’ time. The customer was informed, and in this particular case some files that were no longer required were deleted as shown by the drop in mid-October. Other possible courses of action would have been to schedule the fitting of an additional or larger disk, or to archive old data. The important point is that action was taken proactively rather than reactively.
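The kind of judgment made from the trend graph can be approximated by fitting a straight line to past measurements and seeing where it reaches 100%. The sketch below assumes roughly linear growth, which real disks do not always follow. The function name `days_until_full` is hypothetical, and the sample figures are invented for illustration, not read from the graph.

```python
from typing import Optional, Sequence


def days_until_full(
    days: Sequence[float],
    percent_used: Sequence[float],
    full_at: float = 100.0,
) -> Optional[float]:
    """Estimate days from the last sample until usage reaches `full_at`.

    Fits a least-squares straight line through the samples. Returns None
    if usage is flat or falling, since no fill date can then be predicted.
    """
    n = len(days)
    if n < 2 or n != len(percent_used):
        raise ValueError("need at least two matching samples")
    mean_x = sum(days) / n
    mean_y = sum(percent_used) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("samples must cover more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, percent_used))
    slope = sxy / sxx  # percentage points gained per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (full_at - intercept) / slope
    return max(0.0, day_full - days[-1])


if __name__ == "__main__":
    # Invented samples: one reading every 30 days, ending at the 80% mark.
    sample_days = [0, 30, 60, 90, 120]
    sample_usage = [40.0, 50.0, 60.0, 70.0, 80.0]
    remaining = days_until_full(sample_days, sample_usage)
    print(f"Estimated days until full: {remaining:.0f}")
```

With these invented figures the estimate comes out at about 60 days, that is, around two months. A forecast like this is only a guide: it informs the human judgment rather than replacing it.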

Event monitoring

Event monitoring is looking for needles in haystacks. There are many things that happen on a server that are exactly what is expected – a user logs in, a mail is sent, another is received, the server corrects its internal clock by 12 milliseconds, the anti-virus utility is updated, and so on. All of these events are logged on the system just in case we need to be able to check what has happened, but largely the system logs are ignored. What, then, if there is something more serious in there? An error reading from a disk that succeeds on the second attempt, which could signify a failing disk? Someone trying to gain access to the administration account on the server?

One way of analysing these log files is to look for anything suspicious. The problem with that approach is that one must define in advance what constitutes “suspicious” in order for an automated process to find it. The alternative is to use a human being, but quite apart from being time-consuming, and thus expensive, humans are not particularly adept at spotting needles in haystacks.

A far better approach is to have an automated process that has been told what all the "expected" events are. It discards them and reports what's left to the support staff for further analysis. That way, anything that is unexpected will be found.
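The "discard what is expected, report the rest" approach can be sketched as follows. The patterns and log lines here are invented for illustration and do not reflect any real log format. In practice the list of expected events would be built up from the server's own logs. The name `unexpected_events` is likewise hypothetical.

```python
import re
from typing import Iterable, List, Pattern, Sequence

# Hypothetical patterns describing events we expect to see.
EXPECTED_PATTERNS: List[Pattern[str]] = [
    re.compile(r"session opened for user \w+"),
    re.compile(r"clock adjusted by \d+ ms"),
    re.compile(r"anti-virus definitions updated"),
]


def unexpected_events(
    lines: Iterable[str],
    expected: Sequence[Pattern[str]] = EXPECTED_PATTERNS,
) -> List[str]:
    """Drop every line that matches an expected pattern; return what is left."""
    leftover = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        if any(pattern.search(text) for pattern in expected):
            continue  # an expected event: discard it
        leftover.append(text)
    return leftover


# Invented log lines, loosely based on the examples in the text.
SAMPLE_LOG = [
    "10:01 session opened for user alice",
    "10:02 clock adjusted by 12 ms",
    "10:03 read error on disk sda, retry succeeded",
    "10:04 anti-virus definitions updated",
    "10:05 failed login for admin from 203.0.113.7",
]

if __name__ == "__main__":
    for event in unexpected_events(SAMPLE_LOG):
        print("REPORT:", event)
```

In this sample, only the disk read error and the failed administrator login remain for the support staff to review. Nobody had to define in advance what "suspicious" looks like. The cost is that the list of expected patterns has to be maintained as the server's normal behaviour changes.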

Summary

Effective server monitoring will allow proactive measures to be taken to resolve potential issues before they affect your business, and will also alert support staff to unexpected server activity, increasing both security and reliability. All of the features discussed on this page can be provided with Open Source software.

 

© 2008 Tiger Computing Limited. All rights reserved. Registered in England, number 3389961.
