Pilots use them. Surgeons use them. Here at Tiger Computing, we use them, and I use one when I go on holiday. Maybe you use them, too.
I’m talking about checklists. So simple, yet such a clever idea. The nature of life is such that we carry out repetitive actions frequently. We pack for the overnight business trip; we go away with our families on holiday; we have our daily or weekly routines; some of us even fly aircraft from time to time. Every one of those activities can benefit from a checklist.
Simple But Powerful
Just like wine (and, indeed, ourselves), checklists improve with age. Let me give you an example. Here at Tiger Computing, we have a “Server Build” checklist, which we use when we have built a new server for a client. Here’s part of it:
- Attach a TCL Serial Number sticker
- Network ports labelled
- System labelled
- Outbound mail tested
There are over 40 items on the list altogether. A few years ago, we built a server that was going to be used as a firewall, sitting between the client’s Internet connection and their office network (the LAN). We’d built the server, run through the Server Build checklist, and taken it to the client. When we installed it, however, we couldn’t get it to connect to the local network. We could access the Internet without a problem, but not the client’s PCs. It turned out that the second network port, the one that connected to the LAN, was faulty. During the build process, we’d only used what was to become the Internet connection, so the problem hadn’t been noticed. We’d labelled the LAN port, as the checklist said, but never tested it. We added a new item to our Server Build checklist:
- Test all network ports
That improved, for all time, our Server Build checklist. We shouldn’t ever arrive on site with a server and find that one of the network ports is faulty. That particular checklist is on revision 21, and the first version of it in our Wiki is dated 2009 – but I recall we had an earlier version of that checklist printed out and stuck on the wall, so it’s been around for a while. Hopefully all really important things are on it now, but that doesn’t mean it can’t still be tweaked.
One of our other checklists is the “Data Centre Visit” checklist, and one of the items on there is “Keys to rack”. I’d tell you the story of how that got added to the list, but it’s just too embarrassing…
It’s well known that pilots use checklists, but perhaps less widely known just how much they use them. Here’s a taster of the different checklists used simply to get airborne:
- Before Engine Start
- After Engine Start
- Before Take Off
- After Take Off / Climb
Each stage of the flight has an accompanying checklist, and pilots have them for two very simple reasons: human beings tend to skip over task sequences that they “know”, and the checklist improves safety. One really simple example from the “Before Engine Start” checklist: “Parking Brake – As Required”. One can imagine the results of starting the engines on your Airbus A320 with the brakes off.
What Does This Have To Do With Linux?
Checklists prevent silly mistakes, improve system reliability and improve security. The following activities may benefit from a checklist:
- Adding a new user
- An employee leaving
- Adding a new server to your infrastructure
- Appointing a new IT supplier
- Routine backup checks
Each of the above requires a sequence of actions, and a checklist will help ensure consistency and integrity.
To Do List versus Checklist
In the past, here at Tiger Computing we have conflated To Do lists with checklists: it’s easy to use the Server Build Checklist as a To Do list. The aircraft checklist items above are a combined list of instructions and checks.
We’ve largely moved away from that approach now because we see them as separate, albeit related, activities. It’s one thing to check and, if necessary, apply the parking brake before starting the engines: that adds one or two seconds. Our Server Build Checklist has one item, “All packages up to date”, to ensure that the server is fully updated before we install it. If the person running the checklist finds that not all packages are up to date, they need to break from the checklist and update the system. That takes them away from the checklist for too long: it breaks the flow. Instead, they “fail” the server and pass it back for rectification.
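The "check only, never fix" idea can be sketched in a few lines of Python. This is a minimal illustration, not Tiger Computing's actual tooling: the names `CheckItem` and `run_checklist` are hypothetical, and the lambdas are stand-ins for real checks. The point is that the runner only records failures; it never tries to put anything right.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class CheckItem:
    """One checklist line: a description plus a read-only check."""
    description: str
    check: Callable[[], bool]  # should only observe, never change the system


def run_checklist(items: List[CheckItem]) -> Tuple[bool, List[str]]:
    """Run every check and return (passed, descriptions of failed items).

    Nothing is fixed here: a failed item is only recorded, so the
    server can be passed back for rectification.
    """
    failed = [item.description for item in items if not item.check()]
    return (not failed, failed)


# Hypothetical stand-ins for real checks; the results are made up.
server_build = [
    CheckItem("Network ports labelled", lambda: True),
    CheckItem("All packages up to date", lambda: False),
    CheckItem("Outbound mail tested", lambda: True),
]

passed, failed = run_checklist(server_build)
print("PASS" if passed else "FAIL: " + ", ".join(failed))
```

Because every item is run even after one fails, the person doing the rectification gets the full list of problems in one go rather than one at a time.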
Part of this approach stems from automating systems, a practice we heartily endorse. We use a configuration management system, Puppet, to build (and manage) servers, and the checklist’s role is now merely to confirm that Puppet has done its job. Despite best intentions, though, sometimes we find that the checklist is also a To Do list.
One such checklist is the “employee leaving” checklist. That has items on it such as:
- Disable account in LDAP
- Revoke GPG keys
- Revoke VPN certificates and update Certificate Revocation List
It’s too easy to go to LDAP and disable the user’s account, then come back to the checklist and revoke VPN certificates, inadvertently omitting the (very important) “Revoke GPG keys” step. So, this particular checklist has some items in red (including the three above). Once the checklist has been run by one of our staff, it is passed to another staff member who checks – and only checks – the red items. This is quick and easy, and they are less likely to miss an item. If the checklist fails, the work is passed back to the original staff member to correct.
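The second-person check of the red items could be modelled along these lines. Again, this is only a sketch under stated assumptions: `Item`, `red_items` and `unconfirmed_red_items` are hypothetical names, and the "Archive home directory" item is invented purely to show a non-red entry. A red item that the reviewer has not explicitly confirmed is treated as outstanding.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Item:
    description: str
    red: bool = False  # red items get an independent second check


def red_items(checklist: List[Item]) -> List[Item]:
    """The subset of the checklist the second staff member looks at."""
    return [item for item in checklist if item.red]


def unconfirmed_red_items(checklist: List[Item],
                          confirmed: Dict[str, bool]) -> List[str]:
    """Red items the reviewer has not confirmed (missing counts as not done)."""
    return [item.description
            for item in red_items(checklist)
            if not confirmed.get(item.description, False)]


employee_leaving = [
    Item("Disable account in LDAP", red=True),
    Item("Revoke GPG keys", red=True),
    Item("Revoke VPN certificates and update Certificate Revocation List", red=True),
    Item("Archive home directory"),  # hypothetical non-red item, for illustration
]

# The reviewer's findings: the GPG step was skipped, so it is absent here.
second_check = {
    "Disable account in LDAP": True,
    "Revoke VPN certificates and update Certificate Revocation List": True,
}

outstanding = unconfirmed_red_items(employee_leaving, second_check)
print(outstanding)  # ['Revoke GPG keys']
```

If `outstanding` is non-empty, the checklist fails and the work goes back to the original staff member.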
Checklists have an undoubted role to play in many areas of life, and IT systems management is no exception. If you’d like to know more, I can recommend the very readable “The Checklist Manifesto: How to Get Things Right” by Atul Gawande, a surgeon who has spearheaded a move to make checklists an integral part of hospital operations.