Tuesday, September 29, 2009

Something's Wrong...

In IT you are assured that at some point things will not go as expected.  There are software bugs, hardware failures and yes, even "human errors".  Here's a little story about what NOT to do...

Some time ago when I was working for a large organisation, they used removable storage disks for their large computers.  These disks weighed several kilos and had to be de-mounted and stopped before an operator could open the unit and place the disk cover over the disk, turn the cover until engaged, then remove the disk and place it into a locking base.  So the operation is not one that is done quickly.

Now one night, an operator saw that there were errors coming from one of the disks.  An acceptable procedure is to ask the Operating System to "swap" it.  The OS determines a disk that is not being used.  It then stops both drives and informs the operator of the drives to swap.  The operator then uses the above-mentioned technique to swap the disks.

This is fine for "soft" errors, where there may be a slight problem with read/write head alignment, etc.  However, operators are also trained to look for signs of a "head crash" before moving a disk to another unit.  A head crash means that the read/write heads, which skim close to the surface of the dozen or so platters that make up a disk, have come into physical contact with the surface of the disk.  This is definitely a "hard error".  The disk is unusable and the unit should be checked by a technician.

Back to the story...  So the operator swaps the disks.  He then gets another error on the original disk.  So he swaps it again, and again, and again.  By the time he has finished he has wrecked at least half a dozen disks and potentially the same number of drive units - because he didn't check for a head crash on the first swap.  Each time he moved the damaged disk to a new unit, its damaged surface could damage the read/write heads of that unit.  And each time he put a good disk into the drive unit that had the head crash, the damaged read/write heads could cause a head crash on the good disk.

So what can we do?
Processes and procedures need to be in place to prevent problems and to guide people in resolving them.  With regard to software bugs, there are techniques to build in "defensive code" that gracefully handles potential problems instead of failing catastrophically.  Providing log information to assist people in resolving issues is essential.  The software development life cycle (SDLC) must have checks and balances at each stage to pick up potential errors before the next stage begins.  The earlier a problem/defect is detected, the cheaper and easier it is to correct.  I have seen development teams push functionality into production knowing that it has errors and leaving it up to production support to fix.  Who is best equipped to fix the errors, the developers or the people who have read documentation on how the functionality should work?
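
To make the "defensive code" and logging points a little more concrete, here is a minimal sketch in Python.  The function name, the logger name and the idea of an "error count" reading are all my own illustrative assumptions, not something taken from a real system.  It simply shows bad input being logged and replaced with a safe default rather than crashing the caller.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("disk_ops")  # hypothetical logger name


def parse_error_count(raw_value, default=0):
    """Convert a raw reading to a non-negative int without raising.

    Bad input is logged with enough context to diagnose it later,
    and a safe default is returned instead of crashing the caller.
    """
    try:
        count = int(raw_value)
    except (TypeError, ValueError):
        log.warning("Unreadable error count %r; using default %d", raw_value, default)
        return default
    if count < 0:
        log.warning("Negative error count %d; using default %d", count, default)
        return default
    return count
```

Whether a default is the right response depends on the situation; sometimes the safest thing is to stop and ask a person, as the operator in the story should have done.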

With hardware, we need to establish a monitoring regime to highlight potential problems as early as possible.  Each hardware device will have statistics available from the vendor on Mean Time Between Failures (MTBF).  This tells you how many failures to "expect" over your installed base of these hardware devices.  You also need to know the Mean Time To Repair (MTTR): how long does it typically take to fix a given type of hardware device?  Of course, there are the whole areas of backing up data, disaster recovery and business continuity - but I won't cover them in this post.
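
As a rough illustration of how these figures can be used, the sketch below estimates expected failures per year across an installed base, and the availability of a single device.  The numbers (200 drives, 100,000 hours between failures, 8 hours to fix) are made up for the example, and the calculation assumes failures occur at a steady average rate, which real hardware only approximates.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_failures_per_year(installed_units, mtbf_hours):
    """Rough expected failure count across an installed base in one year."""
    return installed_units * HOURS_PER_YEAR / mtbf_hours


def availability(mtbf_hours, repair_hours):
    """Fraction of time a device is usable, given MTBF and mean time to fix."""
    return mtbf_hours / (mtbf_hours + repair_hours)


# Hypothetical figures: 200 drives, MTBF of 100,000 hours, 8 hours to fix.
print(expected_failures_per_year(200, 100_000))  # 17.52
print(round(availability(100_000, 8), 5))        # 0.99992
```

So even with a very reliable device, a large installed base means you should plan for a steady trickle of failures each year.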

Common Sense
  • Eliminate problems/defects as early as possible
  • Have clear procedures for people to follow (& educate them)
  • Know your software and hardware - monitor it
  • Be prepared before something goes wrong - it's too late to "wing it" in the midst of a catastrophe!

Friday, September 18, 2009

Interface Simplification

One of the common problems I see in large (and not so large) organisations is the proliferation of interfaces.  Let's not get hung up on the type of interface in terms of the enabling technology, but rather look at the complexity that builds over years and even decades.

Why does it happen?
Organisations either buy or build applications.  Then there is a decision to move information between the applications.  An interface is born!  Let's call the first 2 applications A & B.  Then another application (called C) wants to exchange information with A, but not quite the same information as B does.  So another interface is created - the A-C interface.

You can see how, over the years, this network of inter-connectivity between applications can reach quite daunting complexity.  Now when you change an application, as well as making sure the change doesn't break the application's current functionality, you must also ensure that all of the interfaces involving this application continue to function correctly.  Then, when you move a change into production, you have to ensure that all changed applications are updated in unison.  If just 1 application change fails, you have to back out ALL changes - think of the cost and possible disruption to your business!
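
To put a number on that growth, here is a small sketch.  It counts the worst case, where every pair of applications ends up with its own direct interface; real organisations rarely link every pair, so treat it as an upper bound rather than a prediction.

```python
def max_point_to_point_interfaces(app_count):
    """Upper bound on distinct application pairs that could be linked directly."""
    return app_count * (app_count - 1) // 2


for n in (3, 10, 50):
    print(n, max_point_to_point_interfaces(n))
# 3 3
# 10 45
# 50 1225
```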

How to simplify
Start with one application, preferably one with a lot of interfaces.  Identify the interfaces where the application provides outgoing information/data.  Analyse the data needs of all recipients, with the goal of creating a single, wider feed of information that includes every data element any recipient requires.

For example, let's say an application has 3 outgoing interfaces to recipient applications called X, Y & Z.  X requires data elements D1, D2, D3 & D4.  Y requires D1, D2, D5 & D9.  And lastly, Z requires D1, D3, D6, D7 & D8.  By combining the needs of all recipients, we create an outgoing interface with data elements D1 through to D9.  We now have 1 interface instead of 3 - but there's more to do.
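
The combining step in this example is just a set union.  Here is a minimal sketch using the X, Y & Z requirements above; the variable names are my own.

```python
# Data elements each recipient needs, taken from the example above.
requirements = {
    "X": {"D1", "D2", "D3", "D4"},
    "Y": {"D1", "D2", "D5", "D9"},
    "Z": {"D1", "D3", "D6", "D7", "D8"},
}

# The single, wider feed is the union of every recipient's needs.
combined_feed = sorted(set().union(*requirements.values()))
print(combined_feed)
# ['D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9']
```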

We need to ensure that the recipient applications only receive what they are expecting.  There are 2 approaches to this: change each recipient application to extract only the data elements it requires out of the new, wider interface, or create an interface "hub" to do this work.
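
Here is a minimal sketch of the "hub" approach, assuming the wide feed arrives as a simple record of named data elements.  The function name and the sample values are hypothetical; a real hub would also deal with formats, delivery and error handling.

```python
# Data elements each recipient expects (from the X, Y & Z example).
requirements = {
    "X": {"D1", "D2", "D3", "D4"},
    "Y": {"D1", "D2", "D5", "D9"},
    "Z": {"D1", "D3", "D6", "D7", "D8"},
}


def route_through_hub(wide_record, requirements):
    """Return, per recipient, only the data elements that recipient expects."""
    return {
        recipient: {name: wide_record[name] for name in sorted(needed)}
        for recipient, needed in requirements.items()
    }


# Hypothetical wide record carrying D1 through D9.
wide_record = {f"D{i}": f"value-{i}" for i in range(1, 10)}

deliveries = route_through_hub(wide_record, requirements)
print(deliveries["Y"])
# {'D1': 'value-1', 'D2': 'value-2', 'D5': 'value-5', 'D9': 'value-9'}
```

The appeal of the hub is that adding a new recipient becomes a new entry in the requirements table, rather than a change to the sending application.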

In Conclusion
The more complexity you have, the more work you need to do to regain control over your interfaces.  However, the benefits are worth it: reduced development costs when changes are made, reduced testing costs, faster delivery of new recipients for existing information feeds, etc.  There are other factors to consider, such as standardising interface formats, using XML, etc - but I'll save that for another day!

Cheers, Pete

Wednesday, September 16, 2009

First post and welcome

Pete here.  I started out in IT when it was called "data processing" back in the late 1970s.  I began my IT career working for a bank on some pretty old iron.

The intention of this blog is to provide my views on many IT-related themes.  I'll try to shed some light on common problems and misconceptions.  I'll also inject some personal anecdotes of IT "incidents" that I've been involved with over the years.

Starting off, my aim is to blog on a weekly basis.

So - welcome to "IT for the common man"!