Modern hardware – like modern cars – is reliable, but not immune to failure. One server may run for ten years without a hiccup, whereas its sibling may fail twice in the first two months.
When a working server is critical to your business success, you need to have a plan to deal with server failure. More accurately, it isn’t (or shouldn’t be) the server that is critical, but rather the services it provides. High Availability Linux Systems can help keep services available in the face of hardware failures.
Beef It Up
One approach is to build as resilient a server as possible. Give it a fault-tolerant storage system, such as a RAID array, and add a redundant power supply: a low-cost option that increases reliability.
That takes care of some of the more vulnerable parts of a server, but you can’t mitigate every failure. You can have multiple CPUs on a system board, but if the system board itself fails, all bets are off. Likewise with memory: errors there can bring down the system.
Having two or more servers configured as a High Availability Cluster removes a physical server as a single point of failure. The traditional way of configuring two such servers was to have one live and one on standby, with the data kept synchronised between them. The two servers would each run a “heartbeat” process and monitor the other’s heartbeat. If a certain number of consecutive heartbeats were missed, the standby would take over as primary.
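The missed-heartbeat logic can be sketched in a few lines of shell (a simplified illustration only; real deployments used a dedicated heartbeat daemon rather than a hand-rolled script, and the "up"/"down" argument here stands in for an actual network check of the peer):

```shell
#!/bin/sh
# Declare failover after THRESHOLD consecutive missed heartbeats.
THRESHOLD=3
missed=0

check_peer() {
    # In a real cluster this would probe the peer over the network;
    # here a simulated result ("up" or "down") is passed in.
    [ "$1" = "up" ]
}

beat() {
    if check_peer "$1"; then
        missed=0               # peer seen: reset the counter
    else
        missed=$((missed + 1)) # peer missed: count it
    fi
    if [ "$missed" -ge "$THRESHOLD" ]; then
        echo "FAILOVER"        # too many misses: promote ourselves
    else
        echo "OK"
    fi
}
```

A single missed beat (a dropped packet, a busy moment) must not trigger a takeover, hence the threshold.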
The heartbeat system worked, but it wasn’t optimal. For a start, it required an additional server that did little most of the time other than monitor the live server. There are also other technical challenges with a heartbeat solution, such as “split-brain”, where a broken heartbeat link leaves both servers believing they should be primary. Today there are better techniques.
Let’s consider the example of a web-based service with a database back-end. There are two key elements to this service: a web server (typically Apache or nginx) and a database (perhaps MariaDB or PostgreSQL). However, there’s no requirement that both elements run on the same hardware.
By arranging to run the database on one server and the web service on the other, we achieve some simple load balancing. We can then use cluster software (in this case Pacemaker) to implement a few rules:
1. Ensure Apache is running
2. Ensure MariaDB is running
3. Prefer not to run MariaDB on the same server as Apache
Under normal circumstances, the two services will run on different servers. Should the server running MariaDB fail, Pacemaker will notice that rule 2 is not being observed. It can’t follow rule 3, so it will start MariaDB on the same server as Apache, and thus service is resumed.
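With Pacemaker, the three rules above can be expressed directly. Here is a sketch using the `pcs` command-line tool (the resource names, agent choices and score value are illustrative assumptions, not taken from a real deployment):

```shell
# Rule 1: keep the web server running, checked every 30 seconds.
pcs resource create WebServer ocf:heartbeat:apache \
    configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s

# Rule 2: keep the database running (the mysql agent also manages MariaDB).
pcs resource create Database ocf:heartbeat:mysql op monitor interval=30s

# Rule 3: prefer to keep the two apart. A finite negative score means
# "avoid", not "never": if one server fails, Pacemaker will still
# start both resources on the survivor.
pcs constraint colocation add Database with WebServer -1000
```

The choice of score is the key design decision: a mandatory `-INFINITY` score would forbid the two resources from ever sharing a server, which would defeat the failover behaviour described above.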
There’s a need to ensure that the data is suitably replicated between the two physical servers, but that is straightforward to arrange. The end result is a resilient infrastructure, typically taking advantage of all of the hardware, but able to switch to one server only, fully automatically, if required.
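One common way to arrange that replication is DRBD, which mirrors a block device between the two servers over the network (DRBD is a suggestion here, not something the setup above requires; database-level replication is an alternative). A minimal resource definition, with hypothetical host names and addresses, might look like:

```
resource r0 {
    protocol C;            # synchronous: a write completes only once
                           # it has reached both servers' disks
    device   /dev/drbd0;
    disk     /dev/sdb1;
    on alpha { address 10.0.0.1:7789; }
    on beta  { address 10.0.0.2:7789; }
}
```

Pacemaker can then manage the DRBD resource alongside the services, promoting the surviving server’s copy of the data if the other server fails.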
The switchover is seamless and transparent to users, save for a few seconds while it takes place. This approach gives a number of advantages:
- it provides a degree of hardware fault tolerance
- both the database and the web service can be manually moved onto one server, freeing the other for maintenance, configuration or other work without impacting the business
- the servers can be of a lower specification (for example, dual power supplies are not so important)
Not all roses
The downside, of course, is that you need two servers. However, they need not be twice the cost of one because the specification may be lower, as mentioned above. You’ll also need to accommodate both servers in your rack or data centre.
It’s Not Just The Servers
When designing a high availability configuration, ensure that there are no single points of failure outside of the servers:
- Include multiple network switches configured in such a way that the failure of any one doesn’t prevent connectivity to or between the servers.
- Use Uninterruptible Power Supplies (UPS) for all equipment. Ensure that the failure of any one UPS can be tolerated by the infrastructure.
- If hosting this equipment in your own server room, ensure that there is redundant air conditioning.
An alternative to hosting your own High Availability infrastructure is to take advantage of the redundant nature of providers such as Amazon Web Services. Such solutions are not appropriate for all applications, but where they make sense they can be very cost effective. Be aware, however, that the nature of “cloud computing” is that the server instances your application runs on can disappear without notice at any time. To ensure resilience, the design of a cloud infrastructure must be very different from a conventional hardware infrastructure.
The entire point of a High Availability infrastructure is that there is no single point of failure. That means that if an element of the infrastructure does fail, it won’t be noticeable to users – but you must be aware so that the problem can be resolved. All Linux servers should be comprehensively monitored, but with a high availability infrastructure, effective monitoring is essential if resilience is to be maintained.
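As a trivial illustration of the kind of check a monitoring system might run, this shell function scans the output of `pcs status resources` for stopped resources (a sketch only; in practice you would wire something like this into a proper monitoring platform such as Nagios or Zabbix rather than run it by hand):

```shell
# Reads `pcs status resources` output on stdin and reports whether
# any cluster resource is stopped. Feed it the real output with:
#   pcs status resources | check_resources
check_resources() {
    if grep -q 'Stopped'; then
        echo "ALERT: a cluster resource is stopped"
    else
        echo "OK"
    fi
}
```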
The bottom line
Like so many IT decisions, this is essentially a business decision rather than a technical one, and it should be approached on that basis. The High Availability option that’s right for you should be determined by your business requirements, and your IT support people should be able to advise you accordingly.
If you’d like some help with High Availability Linux Systems, contact us today:
- call us on 01600 483 484 or
- email firstname.lastname@example.org
Could this Linux for Business article be improved? Let us know in the comments below.