Do I Need 24/7 Support?
Before you ask suppliers whether they can offer 24/7 support, consider whether what you really need is 24/7 availability of your services, rather than support that will fix them after they fail. Let’s look at some of the techniques you can use to reduce the likelihood of a service failure.
In our experience, effective system monitoring can warn about the vast majority of problems before they impact business services. We monitor around 45-50 parameters on every server we support, but let’s take a simple example to illustrate the point.
If a disk (or, more accurately, a disk partition) runs out of space it is likely to cause a problem, so we monitor how full each disk partition is. Once a partition becomes 80% full, we get a notification. Now at this point, of course, nothing is broken. All that has happened is that the disk usage has moved from 79% to 80%. We then investigate to find out why.
It may be that there is a developing problem – perhaps temporary files are not being removed for some reason, or maybe the log files are growing faster than expected. Or it may just be that the amount of data being stored has simply increased over time, as it will, and that increase has filled the disk to 80% capacity. Whatever the cause, it can be identified and the appropriate action taken – all before the service has been affected.
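A minimal sketch of the kind of threshold check described above, written in Python. It is an illustration only, not the monitoring system this article refers to: the function names are hypothetical, and the 80% default simply mirrors the figure used in the example.

```python
import shutil

# Threshold from the example in the text; a real system would tune this per partition.
WARN_PERCENT = 80.0


def percent_used(total_bytes, used_bytes):
    """Return how full a partition is, as a percentage of its total size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def partition_alert(path, warn_percent=WARN_PERCENT):
    """Return a warning string if the partition holding `path` has reached
    the threshold, otherwise None. Nothing is broken at this point; the
    message is a prompt to investigate why usage is growing."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    if pct >= warn_percent:
        return f"WARNING: {path} is {pct:.0f}% full"
    return None
```

Calling partition_alert("/") from a scheduled job and sending any non-None result to whoever is on call would give the early warning described above. The figure may differ slightly from what df reports, because some filesystems reserve blocks for the root user.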
Key here, of course, is ensuring that we are monitoring anything and everything that has the potential to impact service availability. Back in the early days of Tiger Computing, fifteen-plus years ago, we’d sometimes get a call from a client telling us about a problem we weren’t aware of.
We would treat that as two problems: the problem they had just told us about, and the failure of our monitoring to detect the problem. We would add new checks to our monitoring, or enhance existing ones, to ensure that if that problem ever occurred again – for any client – we’d know about it before they did. By developing our server monitoring in this way, we improved the reliability of all of our clients’ systems.
We’ve been doing that since 2002, so you’d expect our monitoring to be very comprehensive by now – and you’d be right.
No matter how effective the monitoring, it can seldom predict hardware failures. You can mitigate some types of hardware failure by having, for example, redundant power supplies or redundant disks (“RAID” systems), but it is critical that such redundancy measures are themselves monitored.
The point of having redundant disks is that if one fails, the system carries on working as normal, so unless you are monitoring the health of the RAID arrays, you won’t know the disk has failed. We have been contacted in the past by businesses that have redundant disks but can no longer read from them. Upon investigation, we have found that one disk failed years ago, and now that a second one has failed, the data is lost. If you use RAID, monitor it.
There are some hardware failures that cannot be mitigated. For example, problems with the system motherboard, CPU, memory or disk controller will often cause a system to fail without warning. Even when the system itself has failed, though, it is possible to ensure that the services it supplies continue to be available.
How Does That Work, Then?
By way of example, we have a number of clients who run web-based applications that require a database. One technique to mitigate catastrophic hardware failure is to configure two servers such that the web service (for example, Apache) must always be running on one of them, and the database must always be running on one of them – and ideally those two services are not running on the same server.
Under normal circumstances the web service is running on Server A and the database is running on Server B. The data on each server is automatically replicated between them in real time. Suddenly Server B develops a hardware fault and goes offline. The management software on Server A notices that there is no longer a database running. It is configured to ensure that there is always a database service running, and preferably not on the same server as the web service – but now there is only one server available, so it starts the database on Server A.
So, for the most part, the workload is shared between the two servers. When a failure takes one server down, the services that were running on the failed server are brought up on the remaining one, typically within a few seconds, and the service offered to the business or customers continues.
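The placement rule described above can be sketched in a few lines of Python. This is a toy model of the decision only, under the assumption that each service has a preferred "home" server. It is not the management software referred to in the text, which also has to detect failures, handle data replication and start or stop the real services. The function and parameter names are hypothetical.

```python
def place_services(preferred, servers_up):
    """Decide where each service should run.

    preferred  -- dict mapping service name to its usual server
    servers_up -- set of servers that are currently reachable

    Every service must run somewhere. A service stays on its usual server
    if that server is up; otherwise it moves to the surviving server that
    currently has the fewest services assigned.
    """
    if not servers_up:
        raise RuntimeError("no servers are available to run any service")

    # First keep every service whose usual server is still up.
    placement = {svc: home for svc, home in preferred.items() if home in servers_up}

    # Then relocate the services whose usual server has failed.
    for service in preferred:
        if service in placement:
            continue
        load = {server: 0 for server in servers_up}
        for server in placement.values():
            load[server] += 1
        # sorted() makes the choice deterministic when loads are equal.
        placement[service] = min(sorted(servers_up), key=lambda s: load[s])
    return placement
```

With preferred = {"web": "A", "database": "B"}, both servers up gives the normal split, while place_services(preferred, {"A"}) puts both services on Server A, which matches the failure scenario described above.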
There are other techniques for ensuring services are available around the clock, and naturally we’d be happy to discuss your requirements with you and help find the most appropriate solution for you.
But I Really Want 24/7 Support!
No problem. Our Premier Support contracts provide 24/7/365 support for your Linux systems.