Every once in a while, you encounter a recurring issue that, no matter what you try, manifests itself over and over again - sometimes even on a daily basis. How the issue gets remediated depends largely on the person assigned to the task.
You might be puzzled as to why I’d write about something like this, but it’s a situation I see constantly - one I like to refer to as “overthinker syndrome”. What do I mean by this? Here’s the theory. Some people are very analytical when it comes to problem solving. Couple that with technical knowledge and you could end up with a situation where something relatively simple gets blown out of all proportion, because the scenario played out in the mind is often much further from reality than you’d expect. And the technical reasoning is usually to blame.
Sometime around 2007, a colleague noticed that the Exchange Server (2003, wouldn’t you know) would suddenly reboot halfway through a backup job. Rightly so, he wanted to investigate and asked me if this would be ok. Anyone with an ounce of experience knows that functional backups are critical in the event of a disaster - none more so than I - so, obviously, I gave the go-ahead. One bright spark in my team suggested a reboot of the server, which immediately prompted the response
“…it’s rebooting itself every day, so how will that help?”
The investigation
Joking aside, we’ve all heard the “have you rebooted” question touted at some point during helpdesk discussions, but this one was different. A system rebooting itself is usually symptomatic of an underlying issue somewhere, and my team member was ready for the task ahead. Stepping up to the plate, he asked if it was ok to install some monitoring software on the server. Usually, installing additional software components on a production server without testing first is a non-starter, but seeing as we needed to get this resolved as quickly as possible to reinstate the nightly backup (which, incidentally, hadn’t run successfully for three days by now), I approved it without question. There’s a leap of faith at this point, as you could cause more problems than you set out to resolve in the first place, but, as with anything related to information technology, sometimes you have to accept an element of risk.
The software itself was for the RAID controller and motherboard. The assigned technician had already decided the problem was related to something along the lines of a faulty RAM module, or perhaps an issue with the controller itself. My thoughts already leaned elsewhere at this point - if the server reboots itself at exactly the same time every day, then there is an established pattern which should be investigated first. It’s a logical approach, but it’s a common trait for technical support staff to sometimes think outside of the box - or in this case, outside of the building. Not wanting to push my opinion, or trample on anyone’s toes, I decided to remain quiet and see just how far this would go before intervention was required.
In this case, not very far. The following morning, after another unannounced nightly reboot, the error “the previous shutdown at [insert time and date here] was unexpected” showed up in the event log. No real surprises there and, once again, at exactly the same time as the previous night. I asked my technician for an update, and he informed me that he believed the memory was faulty and somehow causing the server to blue screen and reboot. That was a reasonable response, so I commended him on his research and findings, but also reminded him to perform a manual backup so that we at least had something to revert to in the event of a failure. Later that afternoon, the same tech approached me and said that he had ordered some replacement memory, and wanted to arrange downtime to fit it. Trying to keep a poker face and remain passive, I agreed, and the memory was replaced the same evening at around 10pm. At 2am the following morning, kaboom! The server rebooted itself again.
Not wanting to admit defeat, our courageous tech suggested that the problem could be due to the system overheating. Another fair point, but not a realistic one, as you’d see this in the event log as a thermal shutdown. I willingly entertained this, and allowed investigations into the CPU temperature to begin - after another manual backup. Unsurprisingly, the temperature data returned no smoking gun, so that was abandoned. The next port of call was to reapply the service pack. Now, I’ll admit that this used to fix a multitude of issues under Windows NT Server (particularly Service Pack 4), but not under Windows 2003. I declined this for obvious reasons - if you reapply the service pack, you run the risk of overwriting key DLL files, which could (and often will) render Exchange inoperable. Not being prepared to introduce an unnecessary risk into what was already becoming something of a showcase, I suggested that we look elsewhere.
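For what it’s worth, the “same time every day” pattern is easy to check once you have the times of the unexpected shutdowns to hand. Here’s a minimal sketch in Python, assuming the timestamps have already been copied out of the event log by hand. The function name, the five-minute tolerance and the sample dates are all made up for illustration - they are not from the actual incident.

```python
from datetime import datetime


def same_time_every_day(timestamps, tolerance_minutes=5):
    """Return True if every event falls within `tolerance_minutes` of the
    others when only the time of day is compared.

    Hypothetical helper for illustration. It does not handle events that
    straddle midnight (e.g. 23:59 and 00:01).
    """
    if len(timestamps) < 2:
        return False  # one event is not a pattern
    minutes_of_day = [ts.hour * 60 + ts.minute for ts in timestamps]
    return max(minutes_of_day) - min(minutes_of_day) <= tolerance_minutes


# Made-up sample data: three nights of unexpected shutdowns around 2am.
unexpected_shutdowns = [
    datetime(2007, 3, 12, 2, 0),
    datetime(2007, 3, 13, 2, 1),
    datetime(2007, 3, 14, 2, 0),
]

if same_time_every_day(unexpected_shutdowns):
    print("Reboots follow a schedule - look for something that runs on a timer.")
else:
    print("No obvious daily pattern - a hardware fault is more plausible.")
```

If the check comes back positive, the more likely culprit is something that runs on a timer rather than a component failing at random, which is where I’d start looking.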
The exasperation
The final (and honestly more realistic) suggestion was to enable verbose logging in Exchange. This is actually a good idea, but only if you suspect that the information store could be the issue. Given the evidence, I wasn’t convinced. If there was corruption in the store, or on any of the disks, this would show itself randomly through the day and wouldn’t wait until 2am. Not wanting to come across as condescending, I agreed, but at the same time set a deadline for escalation. I wasn’t overly concerned about the backups, as these were being completed manually each day whilst the investigations were taking place. Neither was I concerned about what could be seen at this point as wasting someone’s time when you think you may already have the answer to what now seemed to be an impossible problem.
This is where experience will eclipse any formal qualifications hands down. Those with university degrees may scoff at this, but those with substantially analytical thinking patterns seem to avoid logic like the plague and go off on a wild tangent, looking for a dramatically technical explanation and solution to a problem that is much simpler than you’d expect. Hence the title of this article - Avoid the “bulldozer to find a china cup” scenario.
After witnessing another pained expression on the face of my now exasperated and exhausted tech, I said “let’s get a coffee”. In agreement, he followed me to the kitchen and then asked me what I thought the problem could be. I said that if he wanted my advice, it would be to step back and look at this problem from a logical angle rather than a technical one. The confused look I received was priceless - the guy must have really thought I’d lost the plot. After what seemed like an eternity (although in reality only a few seconds), he asked me what I meant by this. “Come with me”, I said. Finishing his coffee, he diligently followed me to the server room. Once inside, I asked him to show me the Exchange Server.
Puzzled, he correctly pointed out the exact machine. I then asked him to trace the power cables and tell me where they went.
As with most server rooms, locating and identifying cables can be a bit of a challenge after equipment has been added and removed, so this took a little longer than we expected. Eventually, the tech traced the cables back to
…an old-looking UPS with a red light illuminated on the front, like a prop from a Terminator film.
The realisation
Suddenly, the real cause of this issue dawned on the tech like a morning sunrise over the Serengeti. The UPS that the Exchange Server was unexpectedly connected to had a faulty battery. The UPS was conducting a self-test at 2am each morning and, because the bypass test failed owing to the burnt battery, the connected server lost power and started back up once the offending equipment left bypass mode and went online.
Where is this going, you might ask? Here’s the moral of this particular story, and of many others like it
- Just because a problem involves technology, it doesn’t mean that the answer has to be a complex technical one
- Logic and common sense have a part to play in all of our lives
- Sometimes, it makes more sense just to step back, take a breath, and see something for what it really is before deciding to commit
- It’s easy to allow technical expertise to cloud your judgement - don’t fall into the trap of using a sledgehammer to break an egg
- You cannot buy experience - it’s earned, gained, and leaves an indelible mark
Let’s hear your views. Did you ever come across a situation where, no matter what you tried, nothing worked? Did the solution turn out to be much simpler than you’d ever have thought?