
There is a familiar pattern in many NHS IT teams. A system behaves oddly, an integration stalls, a patch needs careful handling, or a long-forgotten server suddenly becomes very important. Everyone knows who to call. It is the person who has been there long enough to remember how it was built, why it was built that way, and which parts must never be touched on a Friday afternoon.

Every organisation has people like this. They are often brilliant, committed, and quietly responsible for keeping far more services running than anyone outside the team fully appreciates. They know the quirks, the dependencies, the undocumented workarounds and the systems that require a little more care than the asset register might suggest.

The trouble is that, over time, this knowledge can become a form of hidden infrastructure. It keeps things working, but it also creates risk. When too much operational understanding sits with one person, the organisation has a resilience issue, not just a staffing issue.

The Risk That Doesn’t Show Up On a Dashboard

When NHS organisations think about infrastructure risk, attention naturally goes to visible threats: cyber attacks, hardware failures, unsupported systems, supplier outages, power issues and backup failures. These are important and they deserve serious planning.

Key-person dependency is different. It rarely announces itself until it matters. Monitoring tools can tell you when a server is under load or a service has stopped responding, but they cannot tell you that only one engineer understands the history behind that server, the integration it supports, or the reason a particular update has been deferred.

That is why this risk can remain invisible for years. As long as the right person is available, the environment appears stable. Incidents are fixed. Patches are managed. Workarounds are remembered. Maintenance windows are navigated successfully. From the outside, it looks like everything is under control.

And often, it is. Until that person is on leave, off sick, overloaded, promoted into another role, or no longer with the organisation.

Why This Matters in NHS Linux Environments

Linux sits quietly beneath many of the services NHS teams rely on every day. It may support clinical systems, integration engines, databases, reporting platforms, imaging workflows, authentication services, cloud workloads or business intelligence environments. In some cases, the Linux estate is well documented, actively maintained and supported by several experienced people. In others, it has grown organically over many years, with knowledge spread unevenly across the team.

This is not a criticism of NHS IT teams. Quite the opposite. Most have kept complex environments running through budget pressure, staffing gaps, supplier change, legacy constraints and constant demand for transformation. The fact that so much works as well as it does is testament to the skill and commitment of the people involved.

But that same commitment can mask the underlying fragility. When the most experienced Linux specialist becomes the person who knows “how it really works”, their expertise stops being just an asset and starts becoming a dependency.

For Trusts, that dependency can affect clinical system resilience. For CSUs, it can put shared services and data platforms under pressure. For national bodies, it can create operational risk across large-scale infrastructure and cloud environments. The setting changes, but the underlying issue is the same: critical knowledge should not live in one person’s head.

Recruitment Is Not Always the Whole Answer

The obvious response is to hire another Linux engineer. Sometimes that is absolutely the right thing to do. Internal capability matters, and NHS organisations benefit from having people who understand their systems, their governance, their users and their priorities.

The difficulty is that experienced Linux specialists are not always easy to find. The NHS is competing with private sector employers, cloud providers, consultancies and technology firms that can often move faster and pay more. Even when the right person is recruited, it takes time for them to learn the particular shape of an NHS environment. No job description can capture the full reality of years of technical decisions, supplier arrangements, historical workarounds and clinical dependencies.

That is why the goal should not simply be “find another person”. The stronger goal is to reduce the amount of critical knowledge that depends on any one person in the first place.

From Individual Heroics to Shared Resilience

The strongest infrastructure teams are not the ones that rely on heroics. They are the ones that make heroics less necessary.

That means turning specialist knowledge into something the organisation can use consistently. It means documenting the systems that matter, creating runbooks for known failure patterns, automating routine tasks where possible, standardising patching and configuration, and making sure escalation routes are clear before something goes wrong.

This is where Site Reliability Engineering (SRE) principles can be useful, even if the organisation does not formally adopt the label. At its simplest, SRE is about making systems more reliable by reducing repetitive manual work, improving visibility, and building operational practices that do not depend on memory or goodwill.

For an NHS team, that might mean starting with a straightforward question: if our most experienced Linux specialist were unavailable for a month, which systems would make us nervous?

The answer is often very revealing. It may point to a clinical application with unclear dependencies, a reporting platform that lacks proper documentation, an integration engine that has not been patched because everyone is worried about breaking it, or a server that no one wants to touch because it has simply “always been there”.
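One way to make that question concrete is to keep a small inventory of who could confidently handle each system alone, then ask what is left uncovered when a named person is away. The following is only a minimal sketch in Python. The `System` record, the function names and the example system and engineer names are all hypothetical, and a real inventory would more likely live in an asset register or CMDB than in a script.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """One entry in a hypothetical knowledge inventory."""
    name: str
    # People who could confidently patch or restore this system unaided.
    confident_engineers: set[str] = field(default_factory=set)


def single_person_systems(systems: list[System]) -> list[System]:
    """Systems that at most one engineer could confidently handle alone."""
    return [s for s in systems if len(s.confident_engineers) <= 1]


def exposure_if_absent(systems: list[System], engineer: str) -> list[str]:
    """Names of systems with nobody confident if `engineer` is unavailable."""
    return sorted(
        s.name for s in systems
        if not (s.confident_engineers - {engineer})
    )


# Illustrative data only: these systems and people are invented.
estate = [
    System("integration-engine", {"alex"}),
    System("reporting-db", {"alex", "sam"}),
    System("legacy-print-server", set()),
]

for name in exposure_if_absent(estate, "alex"):
    print(f"No confident cover for {name} if alex is away")
```

In this made-up estate, the integration engine and the legacy print server are flagged, while the reporting database is not, because a second engineer can cover it. The value lies less in the script than in the conversation needed to fill in the inventory honestly.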

Once those areas are visible, they can be managed. Not overnight, and not through grand transformation, but through practical steps that steadily reduce risk.

What Good Looks Like

There is no single model that works for every NHS organisation. Some will build more capability in-house. Some will recruit dedicated Linux staff. Some will work with an external specialist for support, maintenance or emergency cover. Many will use a blend of all three.

The important thing is that the model is deliberate. If a Trust, CSU or national NHS body is relying on one or two specialists to keep critical Linux systems running, that should be recognised as an operational risk and addressed in the same way as any other resilience concern.

Good practice usually involves a few common elements: clear ownership, accurate documentation, patching that follows an agreed process, proactive monitoring, tested recovery steps, and access to experienced Linux support when internal teams need backup. None of this is glamorous. It is the sort of quiet operational discipline that only becomes visible when it is missing.
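Those elements can also be tracked as a simple per-system checklist, so that gaps are written down rather than remembered. The sketch below is one possible way to do that in Python, not a prescribed method. The key names in `GOOD_PRACTICE` are our own shorthand for the elements listed above, and the example systems are invented.

```python
# Shorthand keys for the good-practice elements described above (assumed names).
GOOD_PRACTICE = (
    "clear_ownership",
    "accurate_documentation",
    "agreed_patching_process",
    "proactive_monitoring",
    "tested_recovery",
    "backup_support",
)


def gaps(record: dict[str, bool]) -> list[str]:
    """Elements missing, or not recorded as in place, for one system."""
    return [item for item in GOOD_PRACTICE if not record.get(item, False)]


def gap_report(estate: dict[str, dict[str, bool]]) -> dict[str, list[str]]:
    """Map each system with at least one gap to its gaps, most gaps first."""
    found = ((name, gaps(record)) for name, record in estate.items())
    with_gaps = [(name, missing) for name, missing in found if missing]
    # sorted() is stable, so systems with equal gap counts keep input order.
    return dict(sorted(with_gaps, key=lambda pair: -len(pair[1])))


# Illustrative data only: anything not listed is treated as not in place.
example_estate = {
    "reporting-db": dict.fromkeys(GOOD_PRACTICE, True),
    "integration-engine": {"clear_ownership": True, "proactive_monitoring": True},
}

for system, missing in gap_report(example_estate).items():
    print(f"{system}: missing {', '.join(missing)}")
```

Treating an unrecorded element as a gap is a deliberate choice here: if nobody can say whether recovery has been tested, it is safer to assume it has not.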

And that is rather the point. Reliable infrastructure should not require drama.

A Safer Way Forward

The NHS will always depend on skilled people. Technology does not run itself, and no amount of automation removes the need for experienced engineers who understand the environment. But skilled people should be supported by processes, documentation, monitoring and external backup where appropriate.

Key-person risk is not a sign that a team has failed. It is usually a sign that capable people have been carrying too much for too long.

The most dangerous Linux server in your Trust is not necessarily the oldest or the least patched. It is the one that only one person understands. The sooner that knowledge is shared, documented and supported, the safer the organisation becomes.

Is Key-Person Risk Hiding in Your Linux Estate?

Most organisations have at least one system that only one person truly understands.

The challenge is identifying those dependencies before they become a problem.

If this article has struck a chord, we’d be happy to have a conversation about your Linux environment, the resilience challenges you’re facing, and the practical steps other NHS organisations are taking to reduce operational risk.

Whether you’re considering developing internal capability, improving documentation and processes, or exploring external support options, an independent discussion can often help clarify priorities.

Book a short conversation with a Tiger engineer to talk through your Linux challenges.