In IT you are assured that at some point things will not go as expected. There are software bugs, hardware failures and yes, even "human errors". Here's a little story about what NOT to do...
Some time ago, when I was working for a large organisation, they used removable storage disks for their large computers. These disks weighed several kilos and had to be dismounted and stopped before an operator could open the unit, place the disk cover over the disk, turn the cover until it engaged, then remove the disk and place it into a locking base. So it is not an operation that can be done quickly.
Now one night, an operator saw that there were errors coming from one of the disks. An acceptable procedure is to ask the Operating System to "swap" it. The OS determines a disk that is not being used. It then stops both drives and informs the operator of the drives to swap. The operator then uses the above-mentioned technique to swap the disks.
This is fine for "soft" errors, such as a slight problem with read/write head alignment. However, operators are also trained to look for signs of a "head crash" before moving a disk to another unit. A head crash means that the read/write heads, which skim close to the surface of the dozen or so platters that make up a disk, have come into physical contact with the surface of the disk. This is definitely a "hard error". The disk is unusable and the unit should be checked by a technician.
Back to the story... So the operator swaps the disks. He then gets another error on the original disk. So he swaps it again, and again, and again. By the time he has finished he has wrecked at least half a dozen disks and potentially the same number of drive units - because he didn't check for a head crash on the first swap. Each time he moved the damaged disk to a new unit, its damaged surface could damage that unit's read/write heads. And each time he put a fresh disk into a unit whose heads were already damaged, those heads could cause a head crash on the fresh disk.
So what can we do?
Processes and procedures need to be in place to prevent problems and to guide people in resolving them. With regard to software bugs, there are techniques to build in "defensive code" that gracefully handles potential problems instead of failing catastrophically. Providing log information to assist people in resolving issues is essential. The software development life cycle (SDLC) must have checks and balances at different stages to pick up potential errors before the next stage begins. The earlier a problem/defect is detected, the cheaper and easier it is to correct. I have seen development teams push functionality into production knowing that it has errors and leaving it up to production support to fix. Who is best equipped to fix the errors: the developers, or the people who have only read documentation on how the functionality should work?
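To make "defensive code" a bit more concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything from the systems in the story: the function name `parse_error_count`, the logger name `disk_ops` and the idea of reading an error count from a text field are all assumptions. The point is only that bad input is logged and turned into a safe "unknown" value instead of crashing the caller.

```python
import logging

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("disk_ops")


def parse_error_count(raw):
    """Return a non-negative error count, or None if the input is unusable.

    Instead of letting a bad value raise and take the caller down,
    the problem is logged and a safe "unknown" result is returned.
    """
    try:
        count = int(raw)
    except (TypeError, ValueError):
        # Covers None, empty strings, and text that is not a whole number.
        logger.warning("Unreadable error count %r; treating as unknown", raw)
        return None
    if count < 0:
        logger.warning("Negative error count %d; treating as unknown", count)
        return None
    return count
```

The caller still has to decide what "unknown" means (raise an alert, retry, ask a human), but it gets to make that decision calmly, with a log entry that says what went wrong.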
With hardware, we need to establish a monitoring regime to highlight potential problems as early as possible. Each hardware device will have statistics available from the vendor on Mean Time Between Failures (MTBF). Across your installed base of a given device, this lets you estimate how many failures to expect over a period of time. You also need to know the Mean Time To Repair (MTTR): how long does it typically take to fix a given type of hardware device? Of course, there are the whole areas of backing up data, disaster recovery and business continuity - but I won't cover them in this post.
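As a rough illustration of how these vendor figures turn into planning numbers, here is a small Python sketch. It assumes a constant failure rate (so expected failures is simply total device-hours divided by MTBF) and uses the common steady-state availability formula of MTBF divided by MTBF plus repair time. The function names and the sample figures are made up for the example; real fleets rarely behave this neatly.

```python
def expected_failures(device_count, mtbf_hours, period_hours):
    """Estimate failures across a fleet, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    # Total device-hours in the period divided by hours between failures.
    return device_count * period_hours / mtbf_hours


def availability(mtbf_hours, repair_hours):
    """Steady-state fraction of time a device is expected to be working."""
    if mtbf_hours <= 0 or repair_hours < 0:
        raise ValueError("mtbf_hours must be positive, repair_hours non-negative")
    return mtbf_hours / (mtbf_hours + repair_hours)


# Made-up figures: 200 drives, 500,000-hour MTBF, one year (8,760 hours).
yearly_failures = expected_failures(200, 500_000, 8_760)
# A device that runs 99 hours between failures and takes 1 hour to repair.
uptime_fraction = availability(99, 1)
```

With those made-up numbers, the estimate works out to about three and a half drive failures a year, which is the sort of figure that tells you how many spares to keep on the shelf.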
Common Sense
- Eliminate problems/defects as early as possible
- Have clear procedures for people to follow (& educate them)
- Know your software and hardware - monitor it
- Be prepared before something goes wrong - it's too late to "wing it" in the midst of a catastrophe!