Monday, September 3, 2007

Thoughts On Complexity

Everyone knows that computer systems are increasingly complex, and that it's important to manage that complexity through effective documentation, change control procedures and so on. But usually people (well, people like me, anyway) think of this complexity "in the large," as something that applies to "IT in general" and not necessarily to one's own self.

You see where this is going, don't you?

Before joining Configuresoft a month ago, I was an independent consultant, working with my clients (usually software companies looking for expert, specialized assistance) on all sorts of interesting and diverse projects. That work required me to replicate my clients' various IT environments in my own home office; otherwise (and this was generally a deal-breaker when I considered taking on a new project) I would not be able to work from home. And I like working from home[1] quite a lot, so my home office network grew, and grew. I have a large collection of virtual machines, most of which are dormant at any given time, and through these I can replicate whatever software I need to build and test against, from Oracle 8i to Oracle 10g, from SQL Server 7.0[2] to SQL Server 2008, and so on. But I also have a surprisingly large number of "real" servers that are required to let me do my work, including file servers, domain controllers and mail servers.

This complexity grew a few months back when I upgraded my mail system to Exchange Server 2007. I bought two new 64-bit servers and installed them as a mailbox server and a client access server, basically trying to "do it right." The servers have been up and running great ever since. That is, until...

At this point, if I were telling the story correctly, that would be the end of the prologue, which would take place during "present time" and then chapter one would start off years before, as the plot lines that lead to disaster start being woven together, and then only hundreds of pages later would it become obvious how the looming dread in the prologue related to the apparently idyllic beginnings of it all...

But thankfully I will simply cut to the chase.

After installing the two new Exchange servers and making "a few other minor changes" to the network, I neglected to update my documentation[3] to reflect the new reality of things. I was "too busy." And so I paid the price this past week.

Late Wednesday night (while I was out of state on business travel) one of my domain controllers went down. I still don't know why. But I had my bases covered, and had another domain controller running, so everything should have been fine. Or so I thought. Instead, I lost email. Imagine that - being thousands of miles from home and not having email. (The soundtrack plays the eerie cry of winds at midnight, the scratch of branches on the moonlit windowpane, the howling of wolves in the distance, but closing in...) I got home late Friday night, but didn't get email back online until a few minutes ago.

Why is that? Why did it take the whole darned weekend to get a mission-critical service back online?

Although part of the answer is priorities (after being on the road for a week, family takes priority over any IT anything), and another part of the answer is the interaction between Exchange Server and Active Directory (why does every ^&@()^&% operations master need to be online for the %@*$)%&#( Information Store to start???), the real answer lies in poorly managed complexity.
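(As an aside on that dependency: if you ever need to check which domain controller currently holds which operations-master role, each role is recorded as the fSMORoleOwner attribute on a well-known directory object. The rough Python sketch below, using the ldap3 library, shows the idea; the server name, credentials and distinguished names are made-up placeholders rather than my real domain.)

```python
# Rough sketch: read Active Directory's FSMO role owners via LDAP (ldap3 library).
# Host name, credentials and DNs are illustrative placeholders, and a
# single-domain forest is assumed (so the Configuration container sits
# directly under the domain naming context).
from ldap3 import ALL, BASE, NTLM, Connection, Server

DOMAIN_DN = "DC=example,DC=local"                    # placeholder domain
CONFIG_DN = "CN=Configuration," + DOMAIN_DN

# Each operations-master role is stored as fSMORoleOwner on a specific object.
ROLE_OBJECTS = {
    "PDC emulator":          DOMAIN_DN,
    "RID master":            "CN=RID Manager$,CN=System," + DOMAIN_DN,
    "Infrastructure master": "CN=Infrastructure," + DOMAIN_DN,
    "Domain naming master":  "CN=Partitions," + CONFIG_DN,
    "Schema master":         "CN=Schema," + CONFIG_DN,
}

server = Server("dc01.example.local", get_info=ALL)  # placeholder DC
conn = Connection(server, user="EXAMPLE\\admin", password="***",
                  authentication=NTLM, auto_bind=True)

for role, dn in ROLE_OBJECTS.items():
    found = conn.search(dn, "(objectClass=*)", search_scope=BASE,
                        attributes=["fSMORoleOwner"])
    if found and conn.entries:
        # fSMORoleOwner holds the DN of the owning DC's NTDS Settings object.
        print(role + ": " + conn.entries[0].fSMORoleOwner.value)
    else:
        print(role + ": could not read " + dn)
```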

I didn't manage change well.

I don't have a CMDB.

I don't have proactive monitoring in place.

What was I thinking?

I suppose I was thinking that these things were really only important for "real" enterprises, for those organizations where they had teams of people managing huge data centers.[4] I never really articulated this thought to myself before, but now I know it was definitely there. And it certainly cost me - four days without email, and a good 10 to 12 hours of active troubleshooting time wasted.

Where am I going? What's my point? My point (I think) is that the benefits provided by MOF and ITIL are not just for the big boys; they're for everyone, including me. For years I've followed MSF guidance on my own personal pet projects - I wouldn't even think of starting development work on a one-man project without first defining my requirements and doing some sort of rough design, and yes, even the code I wrote alone was checked into SCC early on. As a developer, all of this was so painfully obvious to me, how could I even consider doing anything else?

Well, thanks to the pain of the last few days, I think I've finally seen the light on MOF the way I saw it years ago on MSF. I honestly don't know what actions I'll take to address the risk inherent in what I've been doing with my home office network, but actions I must take, and soon. Likely I'll start off by getting my documentation up to date, but beyond that it will boil down to priorities and figuring out how to get the best benefit through the least effort.
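To give myself a concrete starting point for that "least effort" idea: even something as small as the rough sketch below, run as a scheduled task, would count as proactive monitoring at my scale - it just tries each important server on one well-known port and mails an alert (to an outside mailbox, since the inside one may be the very thing that's down). The hostnames, ports and addresses here are made-up placeholders, not my actual machines.

```python
# Minimal availability check - a sketch, not a finished monitoring system.
# All hostnames, ports and email addresses below are illustrative placeholders.
import smtplib
import socket
from email.message import EmailMessage

# Hypothetical host/port pairs: LDAP on the domain controllers, HTTPS and SMTP
# on the Exchange servers, SMB on the file server.
CHECKS = [
    ("dc01.example.local", 389),
    ("dc02.example.local", 389),
    ("exch-mbx.example.local", 443),
    ("exch-cas.example.local", 25),
    ("files.example.local", 445),
]

def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main():
    failures = [host + ":" + str(port)
                for host, port in CHECKS if not is_reachable(host, port)]
    if failures:
        msg = EmailMessage()
        msg["Subject"] = "Home office check failed: " + ", ".join(failures)
        msg["From"] = "monitor@example.local"
        msg["To"] = "me.elsewhere@example.com"   # external mailbox on purpose
        msg.set_content("These servers did not answer:\n" + "\n".join(failures))
        with smtplib.SMTP("smtp.example.net") as relay:  # external relay, placeholder
            relay.send_message(msg)

if __name__ == "__main__":
    main()
```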

And what's the moral of the story?

Don't let this happen to you. No matter how small your IT operations are, you can still benefit from managing them using best practices, and can still pay a big price for not doing so.

And now I need to see how much of my holiday weekend still remains...

[1] At first I wasn't sure it would work out, but after five years I think I've got it down pretty well. I'm more productive here than anywhere else when working on "real" development tasks, and can generally keep my travel to a minimum, reserving it for those occasions when face-to-face communication is essential.
[2] Thankfully not any more, but it wasn't too long ago...
[3] I actually do keep a decent set of network and server documentation, and am generally pretty good at keeping things up to date.
[4] You know the ones - the data centers where you have to wear coats and you get to walk on those really cool floors with all of the wires underneath - I love those floors! ;-)
