Recovering from a server disk array failure in 1998

Blog
    Why is it that all outages seem to happen at 5:30pm on a Friday afternoon? Back in the day during 1998, when DEC (yeah, I’m old - shoot me) was still mainstream and Windows NT Server 4.0 was the latest and greatest, I was working for a commodity trading firm in the West End as an IT Manager. The week had gone by with the usual activity - nothing too major to report apart from the odd support issue and the usual plethora of invoices that needed to be approved. Suddenly, one of my team emerged from the comms room and informed me that they had spotted a red light on one of the disks sitting in the Exchange server. I asked which disk it was, and said we’d need to get a replacement.

    For those who haven’t been in this industry as long as me, DEC (Digital Equipment Corporation) was a major player in previous years, but around 1998 started to struggle - it was then acquired by Compaq (who were themselves acquired by Hewlett Packard in 2002). This server was a beast - a DEC Server 5000 the size of an under-the-counter fridge, with a Mylex DAC960 RAID controller. It was so large, it had wheels with brakes. And, like a washing machine, it was incredibly heavy. I’m sure the factories that manufactured servers in the 90s used to pour concrete into them just for a bit of fun…

    Here’s a little glimpse for nostalgia purposes
    [image: decserver5000.webp]
    Those who remember DEC and its associated Mylex DAC960 RAID controller will also recall that the RAID5 implementation was less than flawless. In modern RAID deployments, if a disk is marked as faulty or defunct, the controller effectively blacklists it - if that disk were removed and then reinserted, it would simply be rejected, so its bad blocks could never be copied back into the array and cause corruption.

    Well, that’s how modern controllers work. Unfortunately, the DAC960 was one of those boards that, when coupled with old firmware and the NT operating system, created the perfect storm. It was relatively well documented at the time that plugging a faulty drive back into an array could cause corruption and spell disaster. My enterprising team member had spotted the red light on the drive, then decided to eject it from the array. For some unknown reason, instead of taking it back to his desk to order a replacement, he reinserted it into the array. Now, for those of you who actually remember the disks that went inside a DEC Server 5000, you’ll know that these things were like bricks in plastic containers. They were around 3 inches in height, about 6 inches long, and quite heavy. These drives even had an eject clip on each side, meaning you had to press both sides of the disk carrier and then slide out the drive before it could be fully removed. Inserting a replacement drive required much the same effort (except in reverse), and provided a satisfyingly secure “clunk” as the interface of the drive made contact with the RAID controller bus.

    No sooner had I said the words

    “…please tell me you didn’t plug that disk back in……”

    to my team member, our central helpdesk number lit up like a Christmas tree in Times Square with users complaining they couldn’t get into email. I literally ran into the comms room and found the server with all drive bays lit solidly as if suspended in its own cryogenic state. For the sake of illustration, a standard RAID5 configuration looks like the below. Essentially, the “p” component is parity. This is redundancy information calculated from the data in each stripe and spread across all disks that are members of the array. In the event that one disk fails, the data can still be reconstructed from the remaining drives plus the parity, meaning it is still accessible - with a reduction in performance. The data is written across the disks in one write, as a stripe (set).
    [image: raid5_ok.webp - a healthy RAID5 array]
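    For anyone who has never had to think about what parity actually does, here’s a minimal sketch of the idea (illustrative only - a real controller such as the DAC960 works on disk blocks with rotating parity, not four-byte strings): the parity stripe is the XOR of the data stripes, so any single missing stripe can be rebuilt from the survivors - unless the parity itself has been poisoned.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR corresponding bytes of several equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data disks holding one stripe each (real stripes are KB-sized blocks)
d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
p = xor_blocks([d1, d2, d3])          # parity stripe

# Disk 2 fails: its contents can be rebuilt from the survivors plus parity
assert xor_blocks([d1, d3, p]) == d2  # degraded, but recoverable

# If the parity/stripe data has been overwritten with garbage (the effect of
# re-inserting a faulty disk on the DAC960), the "rebuilt" data is garbage too
bad_p = b"\x00\x00\x00\x00"
assert xor_blocks([d1, d3, bad_p]) != d2
```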
    At this point I’d already realised that the array had been corrupted by the returning faulty disk, and the bad stripe information was now resident on all the remaining drives. Those who understand RAID will know that if one drive in a RAID5 set fails, the remaining drives still form a resilient array - but not if they are all corrupted. What I am alluding to here is shown below. The stripe was now unreadable; therefore, none of the disks were accessible.
    [image: raid5_broken.png.webp - a RAID5 array with a corrupted stripe]
    The server had completely frozen up and would not respond. I’m no fan of force powering a server off in the best of circumstances, but what choice did we have?

    The server was powered off, then turned back on again. I really was hoping that this was just a system freeze and a reboot would make all our problems go away. The less naïve, more experienced part of me dragged my legs towards the backup storage area (yes, we had a rotation pool of 2 weeks on site and 2 weeks off), and started collecting the previous day’s backup from the safe. As it stood, this was clearly the next logical step. Upon restart, we were met with the below shortly after NTOSKRNL completed its checks
    [image: bsod.webp]
    (Not the actual BSOD of course - camera phones didn’t exist in 1998 - but as close as it gets)

    Anyone familiar with the Windows operating system will have bumped into this at some point in their career - more commonly known by the acronym BSOD (Blue Screen Of Death). Either way, it’s never a good sign when you are trying to recover a system. One of the best messages displayed by a BSOD is

    IRQL_NOT_LESS_OR_EQUAL

    I say “best” with a hint of sarcasm of course, as this message is completely useless and doesn’t mean anything to anyone as such. As the internet back in 1998 was still in its infancy, gaining a decent insight wasn’t as simple or clear cut as it is today. Looking at the problem from a sensible angle, it was fairly obvious that the DAC960 controller had either failed completely, or couldn’t read the disks and had caused the crash. Deciding not to invest too much time in getting this system back to life, I fired up its dormant sister (yes, we had two fridges :)) which started with no issues. This secondary server was originally purchased to split the load of the mailboxes across two servers for resilience purposes, but that never came to fruition owing to a backlog of other projects that were further up the chain of importance. Had this exercise taken place, only 50% of the office would have been impacted - typical.

    With the server started, we then began the process of installing Exchange. Don’t get too excited - this was Exchange 5.5 and didn’t have any formal link to Active Directory, so it was never going to be the case of installing Exchange in disaster recovery mode, then playing back the database. Nope. This was going to be a directory restore first, followed by the Information Store.

    With Exchange installed and the previous service packs and hotfixes applied (early versions of Exchange had a habit of not working at all after a restore unless the patching level was the same), BackupExec 6.2 (yes, I know) was set to restore to an alternative Exchange server, and the tapes loaded into the robotic arm cradle. In hindsight, it would have been a better option to install BackupExec on the Exchange server itself, and connect the tape drive to the SCSI connector. However, can you ever find a cable when you really need one? In any case, the server was SCSI2 whereas the loader was SCSI1. This should have set alarm bells ringing at the time, but with the restore started, we went back to our seats - I then began the task of explaining to senior management the cause of the outage and what we were doing to resolve the problem. As anyone with experience of Microsoft systems knows, attempting to predict the time to restore or copy anything (especially back in the 90s) wasn’t a simple task, as Windows had a habit of either exaggerating the time, or sitting there not responding for ages.

    Rather like a 90s Wikipedia, NT wasn’t known for its accuracy.

    I called home and solemnly declared I was in for a long night. It’s never easy explaining the reasons why, or attempting to justify to family members why you need to work late, but that’s another story. Checking on the progress of the restore, we were averaging speeds of around 2Mbps! Cast your mind back to 1998 and think of the surrounding technology. Back in the (not so good) old days, modern switching technology and 10Gbps networks were non-existent. We were stuck with old 3Com 10Mbps hubs and an even slower Frame Relay connection (256k with 128k ISDN backup) as the gateway. To make matters worse, our internet connection was based on dialup technology using a SHIVA LanRoverE. Forget 1Gb fibre - this thing dished out an awesome [sic] 33.6k, or even 56k if you were using ISDN. Web pages taking 20 seconds or so to load was commonplace - downloading drivers was an absolute nightmare, as you can imagine.

    Back to the restore. Having performed the basic math, and given the size of the databases (around 70Gb on a DLT 40 that was compressed to 80Gb), this was going to take over 24 hours. If you think about how hubs used to work, this meant that the 10Mbps speed of the device was actually shared across all 24 ports. This effectively reduces the port speed to 0.42Mbps - and that really depends on what the other ports are doing at the time. The restore rate remained at around 2Mbps for hours, and rather than everyone sit there watching water evaporate, I sent home the remaining staff and told them to be on standby for the entire weekend. I really couldn’t stomach food at this point, and ended up working into the night on other open tasks in an effort to catch up. I ended up falling asleep at my desk around 2am, and then being woken by the sound of my mobile (a Nokia of course) ringing. Looking at the clock, it was 5am. Checking the restore, it had progressed to the information store itself and was around 60% completed. After another 15 hours in the office, the restore finally completed.
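    For the curious, the back-of-the-envelope arithmetic above boils down to a couple of lines (70Gb and 2Mbps are the rounded figures from the story; real-world throughput fluctuates, so treat the result as a rough illustration rather than a prediction):

```python
# A 10Mbps hub shares its bandwidth across all active ports
per_port_mbps = 10 / 24
print(f"Worst-case share per port: {per_port_mbps:.2f} Mbps")   # ~0.42 Mbps

# Naive transfer time for ~70GB at an observed ~2Mbps
hours = (70 * 8 * 1000) / 2 / 3600
print(f"Naive restore estimate: {hours:.0f} hours")             # comfortably over 24 hours
```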

    Having restarted all of the Exchange services, even the information store came up, which really was good news. However, browsing through the mailboxes I noticed that only a quarter of the 250+ I was expecting were listed. Not knowing much about the Exchange back end at the time, I contacted a so-called Exchange specialist based in Switzerland (in case you’re wondering, we were a Swiss-headquartered entity, and all external support came from there). This Exchange specialist informed me that the backup hadn’t completed properly, and that a set of commands needed to be run in BackupExec to resolve this. Of course, this also meant that the restore process had to be restarted - there goes another 24+ hours, I thought to myself. With the new “settings” applied and the restore process restarted, I decided that I wasn’t going to sit in the office for another day waiting for the restore to complete, and so I called one of my team to come in and occupy the watchtower.

    Getting hold of someone was much more difficult than I had imagined. After letting the remainder of the team go, they had all forged an exodus to the nearest door like iron filings to a magnet. So much for team ethic, I thought. Eventually, I managed to get hold of a colleague who, after much griping, agreed to come into the office. I wouldn’t have minded as much had he not lived less than 15 minutes away, but that’s another story. My colleague arrived around 30 minutes later, and I left the office. Getting home wasn’t a simple task. In the UK, there are often engineering works taking place over the weekend - particularly on the tube, and in most cases on local rail providers too - mine included. What should have taken about 2 hours maximum took 4, and by the time I got home, I flopped into bed exhausted. Needless to say this didn’t go down particularly well with my wife, who had last seen me the previous morning - especially as, after 3 hours of restlessness and a general inability to sleep, I was called by senior management and asked to go back in.

    By now, my already frustrated wife’s temperature went from 36.9c to an erupting volcano equivalent in less than a split second. I fully appreciated her response, but I was young (well, younger), eager to impress, and also had a sense of ownership. After a somewhat heated exchange, I left for the office. I arrived in much the same time as it took me to get home in the first place, and found that the restore was of course still running. My colleague made some half baked excuse that he needed to leave the office as he had a “family emergency”. Not really in the mood to argue this, I let him leave. I then got on a conference call with the consultant we had been using. Unsurprisingly, the topic of the restore time came up.

    “…You have a very slow network…” said the consultant.

    “…No s**t Sherlock…”  I thought. “…Do you honestly think I’m sitting here for my health ? …”

    I politely “agreed”.

    Eventually, the restore process completed. With a sudden feeling of euphoria, I went back into the comms room to start the services and… to my dismay, found once again that only a third of the recipients appeared in the directory. The term “FFS” didn’t go anywhere near being an accurate portrayal of my response. I was brutally upset. Hopelessly crushed. On the verge of losing it… (ok, perhaps that’s overkill). There had to be a reason for this. Something we’d missed, or just didn’t understand. I went looking for answers on a 1998 version of Yahoo (actually, I think it may have been Lycos), and found an article relating to the DS/IS Consistency Adjuster in Exchange 5.5 - this isn’t the exact resource I found, but it goes a long way towards describing the fundamental process. The upshot is that the consistency adjuster needed to be run to synchronise the now-orphaned mailboxes with the directory service. This entire process took a couple of hours - whilst that seems inconceivable to even the extreme Luddite, this was 1998, with SCSI1 drives, a Pentium II processor, and 512MB of RAM.

    After the process completed (which incidentally looked like this)
    [image: dsisadjuster.webp]
    I could then see all mailboxes! After performing several somersaults around the office (just kidding, but I can tell you I felt like doing it), I confirmed with a 25% random user test that I had access to mailboxes. Unfortunately, I couldn’t see any new mail arriving, but that was only due to a stalled mail connector on the server in Switzerland that received external mail. After a quick reboot of this gateway, mail began to flow. After around an hour of testing, I was happy that everything was working as expected. As for the consultant who had just wasted hours of my life, it’s just as well he wasn’t in the same country as me, let alone the same room. I went home elated - to an extremely angry wife. She’s since forgiven me of course, and looking back now, I really appreciate why - she was looking out for me, and concerned - I just didn’t appreciate that at the time.

    Come Monday morning, users were back into email with everything working as expected. An emergency Exchange backup had been run, and I was in the process of writing up my postmortem report for senior management. Then I got a phone call. Anyone remember a product by Fenestrae called Faxination? This was integrated with Exchange 5.5, and had stopped working since the crash. The head of operations demanded that this be resolved as a priority… Another late night… another argument at home, but that’s a story for another day.


  • Link vs Refresh


    @pobojmoks Do you see any errors being reported in the console ? At first guess (without seeing the actual code or the site itself), I’d say that this is AJAX callback related

  •

    @kurulumu-net CSS styling is now addressed and completed.

  •

    Every once in a while, you encounter a repetitive issue that, no matter what you try to do to resolve it, manifests itself over and over again - sometimes even on a daily basis. Much of how the issue is remediated really depends on the person assigned to the task.

    You might be puzzled as to why I’d write about something like this, but it’s a situation I see constantly - one I like to refer to as “over thinker syndrome”. What do I mean by this? Here’s the theory. Some people are very analytical when it comes to problem solving. Couple that with technical knowledge and you could land up in a situation where something relatively simple gets blown out of all proportion, because the scenario played out in the mind is often much further from reality than you’d expect. And the technical reasoning is almost always to blame. Sometime around 2007, a colleague noticed that the Exchange Server (2003, wouldn’t you know) would suddenly reboot half way through a backup job. Rightly so, he wanted to investigate, and asked me if this would be ok. Anyone with an ounce of experience knows that functional backups are critical in the event of a disaster - none more so than I - so obviously, I gave the go-ahead. One bright spark in my team suggested a reboot of the server, which immediately prompted the response

    “…it’s rebooting itself every day, so how will that help ?”

    The investigation

    Joking aside, we’ve all heard the “have you rebooted” question touted at some point during helpdesk discussions, but this one was different. A system rebooting itself is usually symptomatic of an underlying issue somewhere, and my team member was ready for the task ahead. Stepping up to the plate, he asked if it was ok to install some monitoring software on the server. Usually, installing additional software components on a production server without testing first is a non-starter, but seeing as we needed to get this resolved as quickly as possible to reinstate the nightly backup (which incidentally hadn’t run successfully for 3 days by this point), I provided approval to proceed without question. There’s a leap of faith at this point, as you could cause more problems than those you actually set out to resolve in the first place, but, as with anything related to information technology, sometimes you have to accept an element of risk. The software itself was actually for the RAID controller and motherboard. The assigned technician had already decided it was related to something along the lines of a faulty RAM module, or perhaps an issue with the controller itself. My thoughts already leaned elsewhere at this point - if the server reboots itself at exactly the same time every day, then there is an established pattern which should be investigated first. It’s a logical approach, but it’s a common trait for technical support staff to sometimes think outside of the box - or in this case, outside of the building. Not wanting to push my opinion, or trample on anyone’s toes, I decided to remain quiet and see just how far this would go before intervention was required.

    In this case, not very far. The following morning, after another unannounced nightly reboot, the error “the previous shutdown at [insert time and date here] was unexpected” showed up in the event log. No real surprises there, and once again, at exactly the same time as the previous night. I asked my technician for an update, and he informed me that he believed the memory was faulty and was somehow causing the server to blue screen and reboot. That was actually a reasonable response, so I commended him on his research and findings, but also reminded him to perform a manual backup so that we at least had something to revert to in the event of a failure. Later that afternoon, the same tech approached me and said that he had ordered some replacement memory, and wanted to arrange downtime to fit it. Trying to keep a poker face and remain passive, I agreed, and the memory was replaced the same evening around 10pm. At 2am the following morning, kaboom! - the server rebooted itself again. Not wanting to admit defeat, our courageous tech suggested that the problem could be due to the system overheating. Another fair point, but not realistic, as you’d see this in the event log as a thermal shutdown. I willingly entertained this, and allowed investigations into the CPU temperature to begin - after another manual backup. Unsurprisingly, the temperature data returned no smoking gun, so that was abandoned. The next port of call was to reapply the service pack. Now, I’ll admit that this used to fix a multitude of issues under Windows NT Server (particularly Service Pack 4), but not under Windows 2003. I declined this for obvious reasons - if you reapply the service pack, you run the risk of overwriting key DLL files that could (and often will) render Exchange inoperable. Not being prepared to introduce an unprecedented risk into what was already becoming something of a showcase, I suggested that we look elsewhere.

    The exasperation

    The final (and honestly, more realistic) suggestion was to enable verbose logging in Exchange. This is actually a good idea, but only if you suspect that the information store could be the issue. Given the evidence, I wasn’t convinced. If there was corruption in the store, or on any of the disks, it would show itself randomly through the day and wouldn’t wait until 2am. Not wanting to come across as condescending, I agreed, but at the same time set a deadline for escalation. I wasn’t overly concerned about the backups, as these were being completed manually each day whilst the investigations were taking place. Neither was I concerned at what could be seen at this point as wasting someone’s time when you think you may have the answer to what now seemed to be an impossible problem. This is where experience will eclipse any formal qualifications hands down. Those with university degrees may scoff at this, but those with substantially analytical thinking patterns seem to avoid logic like the plague and go off on a wild tangent looking for a dramatically technical explanation and solution to a problem when it’s much simpler than you’d expect. Hence the title of this article - Avoid the “bulldozer to find a china cup” scenario. After witnessing another pained expression on the face of my now exasperated and exhausted tech, I said “let’s get a coffee”. In agreement, he followed me to the kitchen and then asked me what I thought the problem could be. I said that if he wanted my advice, it would be to step back and look at this problem from a logical angle rather than a technical one. The confused look I received was priceless - the guy must have really thought I’d lost the plot. After what seemed like an eternity (although in reality only a few seconds) he asked me what I meant by this. “Come with me”, I said. Finishing his coffee, he diligently followed me to the server room. Once inside, I asked him to show me the Exchange Server. Puzzled, he correctly pointed out the exact machine. I then asked him to trace the power cables and tell me where they went.

    As with most server rooms, locating and identifying cables can be a bit of a challenge after equipment has been added and removed, so this took a little longer than we expected. Eventually, the tech traced the cables back to

    …an old looking UPS that had a red light illuminated at the front like it had been a prop in a Terminator film.

    The realisation

    Suddenly, the real cause of this issue dawned on the tech like a morning sunrise over the Serengeti. The UPS that the Exchange Server was unexpectedly connected to had a faulty battery. The UPS was conducting a self test at 2am each morning, and because that test failed owing to the dead battery, the connected server lost power - then started back up once the offending equipment left bypass mode and went back online.

    Where is this going, you might ask? Here’s the moral of this particular story (and many others like it)

    - Just because a problem involves technology, it doesn’t mean that the answer has to be a complex technical one
    - Logic and common sense have a part to play in all of our lives. Sometimes, it makes more sense just to step back, take a breath, and see something for what it really is before deciding to commit
    - It’s easy to allow technical expertise to cloud your judgement - don’t fall into the trap of using a sledgehammer to break an egg
    - You cannot buy experience - it’s earned, gained, and leaves an indelible mark

    Let’s hear your views. Did you ever come across a situation where no matter what you tried, nothing worked ? Did the solution turn out to be much simpler than you’d have ever thought ?

  •

    [image: expert.webp]
    One thing I’ve seen a lot of over my career is the “expert” myth being touted on LinkedIn and Twitter. Originating from psychologist K. Anders Ericsson, who studied the way people become experts in their fields, and later discussed by Malcolm Gladwell in the book “Outliers”, the claim is that “to become an expert it takes 10,000 hours (or approximately 10 years) of deliberate practice”. This paradigm (if you can indeed call it that) has been adopted by several so-called “experts” - mostly those within the Information Security and GDPR fields. This article isn’t about GDPR (for once), but about those who consider themselves “experts” by virtue of the acronym. Prior to its implementation, nobody should have proclaimed themselves a GDPR “expert”. You cannot be an expert in something that wasn’t actually legally binding until May 25 2018, nor, in my view, can you have invested sufficient time since its inception to be an expert. GDPR is a vast universe, and you can’t claim to know all of it.

    Consultant ? Possibly, yes. Expert ? No.

    The associated sales campaign isn’t much better, and can be aligned to the children’s book “Chicken Licken”. For those unfamiliar with this concept, here is a walkthrough. I’m sure you’ll understand why I chose a children’s story in this case, as it seems to fit the bill very well. What I’ve seen over the last 12 months has been nothing short of amazing - but not in the sense of outstanding. I could align GDPR here to the PPI claims furore - for anyone unfamiliar with what this “uprising” is, here’s a synopsis.

    The “expert” fallacy

    Payment Protection Insurance (PPI) is the insurance sold alongside credit cards, loans and other finance agreements to ensure payments are made if the borrower is unable to make them due to sickness or unemployment. The PPI scandal has its roots set back as far as 1998, although compensatory payments did not officially start until 2011 once the review and court appeal process was completed. Since the deadline for PPI claims has been announced as August 2019, the campaign has become intensively aggressive, with, it would seem, thousands of PPI “experts”. Again, I would question the authenticity of such a title. It seems that everyone is doing it, therefore, it must be easy to attain (a bit like the CISSP then). I witnessed the same shark pool of so called “experts” before, back in the day when Y2K was the latest buzzword on everyone’s lips. Years of aggressive selling campaigns and similarly, years of FUD (Fear, Uncertainty, Doubt - more effectively known as complete bulls…) caused an unprecedented spike that allowed companies and consultants (several of whom had never been heard of before) to suddenly appear out of the woodwork and assume the identity of “experts” in this field. In reality, it’s not possible to be a subject matter expert in a particular field or niche market unless you have extensive experience. If you compare a weapons expert to a GDPR “expert”, you’ll see just how weak this paradigm actually is. A weapons expert will have years of knowledge in a field, and could probably tell you which gun discharged a bullet just by looking at the expended shell casing. I very much doubt a self styled GDPR expert can tell you what will happen in the event of an unknown scenario around the framework and the specific legal rights (in terms of the individual who the data belongs to) and implications for the institution affected. How can they when nobody has even been exposed to such a scenario before ? This makes a GDPR expert in my view about as plausible as a Brexit expert specialising in Article 50.

    What defines an expert ?

    The focal point here is in the comparison. A weapons expert can be given a gun and a sample of shell casings, then asked to determine whether the suspected weapon actually fired the supplied ammunition or not. Using a process of proven identification techniques, the expert can then determine if the gun provided is indeed the origin. This information is derived using established identification techniques - the indentations and markings left on the shell casing by the weapon from which the bullet was expelled, together with velocity, angle, and speed measurements obtained from firing it. The impact of the bullet and exit damage is also analysed to determine a match based on material and critical evidence. Given the knowledge and experience level required to produce such results, how long do you think it took to reach this unrivalled plateau? An expert isn’t solely based on knowledge. It’s not solely based on experience either. In fact, it’s a deep mixture of both. Deep in the sense of comprehension of the subject matter, and how to execute that same understanding, along with real life experience, to obtain the optimum result. Here’s an example. An information technology expert should be able to:

    - Identify and eliminate potential bottlenecks
    - Address security concerns
    - Design high availability
    - Factor in extensible scalability
    - Consider risk to adjacent and disparate technology and conduct analysis
    - Ensure that any design proposal meets both the current criteria and beyond
    - Understand the business need for technology and be able to support it

    If I leveraged external consultancy for a project, I’d expect all of the above and probably more from anyone who labels themselves an expert - or, for that matter, an architect. Sadly, I’ve been disappointed on numerous occasions throughout my career where it became evident very quickly that the so-called expert (who, I hasten to add, was in most cases earning more an hour than I do in a day) hired for his “expertise and superior knowledge” in fact knew far less than I did about the same topic.

    How long does it really take to become an expert ?

    I’ve been in the information technology and security field since I was 16. I’m now 47, meaning 31 years of experience (well, almost 31, as this year isn’t over yet). If you consider that experience is acquired during an 8 hour day, and use the following equation to determine the number of years needed to reach 10,000 hours

    10000 / 8 / 365 = 3.4246575342 - for the sake of simple mathematics, let’s say 3.5 years.

    However, in the original claim it’s 10 years - which works out at roughly 2.7 hours of deliberate practice per day - making the expert title, when aligned to GDPR, even more unrealistic. As the regulation was adopted on 27 April 2016, the elapsed time period isn’t even enough to cover the 3.5 years in the first calculation, let alone the second. The reality here is that no amount of time invested in anything is going to make you an expert if you do not possess the prerequisite skills and a thorough understanding, based on previous events, to supplement and bolster that initial investment. I could spend 10,000 hours practicing a particular sport - yet effectively suck at it because my body (if you’ve met me, you’d know why) isn’t designed for the activity I’m asking it to perform. Just because I’ve spent 10,000 hours reading about something doesn’t make me an expert by any stretch of the imagination. If I calculated the hours spanned over my career, I would arrive at the below. I’m basing this on an 8 hour day when in reality, most of my days are in fact much longer.

    31 x 365 x 8 = 90,520 hours

    Even when factoring in vacation based on 4 weeks per year (subject to variation, but I’ve gone for the mean average),

    31 x 28 X 8 = 6,944 hours to subtract

    This is only fair, as you are not (supposed to be) working when on holiday. Even with this subtraction, the total is still 83,576 hours. Does my investment make me an expert? I think so, yes - based on the fact that 31 years dedicated to one area would indicate a high level of experience and professional standard - both of which I constantly strive to maintain. Still think 10,000 hours invested makes you an expert? You decide! What are your views around this?
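    If you want to play with the figures yourself, the sums above boil down to a few lines:

```python
# Years needed to accumulate 10,000 hours at 8 relevant hours per day
print(10_000 / 8 / 365)                 # ~3.42 years

# Career hours: 31 years x 365 days x 8 hours
career_hours = 31 * 365 * 8             # 90,520
# Minus 4 weeks (28 days) of holiday per year
holiday_hours = 31 * 28 * 8             # 6,944
print(career_hours - holiday_hours)     # 83,576
```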

  •

    One of the most important safety nets in IT Operations is contingency. Every migration needs a rollback plan in the event that things don’t quite go the way you’d expect, and with a limited timeline to implement a change - or in some cases a complete migration - the rollback process is an essential component. Without a plan to revert all changes back to their previous state, your migration is destined for failure from the outset. No matter how confident you are (I’ve yet to meet a project manager who doesn’t build in redundancy or rollback in one form or another), there is always going to be something you’ve missed, or a change that produces undesirable results.

    It is this seemingly innocent change that can have a domino effect on your migration - unless you have access to a replica environment, the result of the change cannot be realistically predicted. Admittedly, it’s a simple enough process to clone virtual machines to test against, but that’s of no consequence if your change relates to those conducted at hardware level. A classic example of this is a firewall migration. Whilst it would be possible to test policies to ensure their functionality meets the requirement of the business, confirming VPN links for example isn’t so straightforward - especially when you need to rely on external vendors to complete their piece of the puzzle before you can continue. Unless you’re deploying technology into a greenfield site, you do not have the luxury of testing a VPN into a production network during business hours. Based on this, you have a couple of choices

    1. You perform all testing off hours by swapping in the replacement equipment, and perform end-to-end testing. Once you are satisfied everything works as it should, you put everything back the way you found it, then schedule a date for the migration.
    2. You configure the firewall using a separate subnet, VLAN, and other associated networking elements, meaning the two environments run in parallel.

    But which path is the right one? Good question. There’s no hard and fast rule as to which option you go for - although option 2 is more suited to a phased migration approach, whilst option 1 is more aligned to “big bang” - in other words, moving everything at the same time. Option 2 is good for testing, but may not reflect reality, as you are not targeting the same configuration. As a side note, I’ve often seen situations where residual configuration from option 1 has been left behind, meaning you either land up with a conflict of sorts, or black hole routing.

    Making use of a rollback

    This is where the rollback plan bridges the gap. If you find yourself in a situation where you either run out of time, or cannot continue owing to physical, logical or external constraints, then you would need to invoke your rollback plan. It’s important to note at this stage that the project plan should include a point where progress is reviewed and assessed, and if necessary, the rollback is executed. My personal preference is at around 40% of the allocated time window - all relevant personnel should reconvene and provide status updates around their areas of responsibility, and give a synopsis of any issues - and be fully prepared to elaborate on these if the need arises. If the responsible manager feels that the project is at risk of overrunning its stated time frame, or cannot be completed within that window, he or she needs to exercise the authority to invoke the rollback plan. When setting the review interval, you should also consider the amount of time required to revert all changes and perform regression testing.

    Rollback provides the ideal opportunity to put everything back how it was before you started on your journey - but it does depend on two major factors. Firstly, you need to allocate a suitable time period for the rollback to be completed within. Secondly, unless you have a list of changes that were made to hardware - inclusive of configuration, patching, and a myriad of others, how can you be sure that you’ve covered everything ?

    Time after time I see the same problem - something gets missed, and turns out to be fundamental on Monday morning when the changes haven’t been cross checked.

    So what should a contingency plan consist of ?

    One surefire way to ensure that configurations are preserved prior to making changes is to create backups of running configs - 2 minutes now can save you 2 days of troubleshooting when you can’t remember which change caused your issue. For virtual machines, this is typically a snapshot that can be restored later should the need arise. A word to the wise though - don’t leave the machine running on snapshot for too long, as this can rapidly deplete storage space. It’s not a simple process to recover a crashed VM that has run out of disk space.

    Keep version and change control records up to date - particularly during the migration. Any change that could negatively impact the remainder of the project should be examined and evaluated, and if necessary, removed from the scope of works (provided this is a feasible step - sometimes negating a process is enough to make a project fail)

    Document each step. I can’t stress the importance of this enough. I understand that we all want to get things done in a timely manner, but will you realistically remember all the changes you made in the order they were implemented ?
    Use differential tools to examine and easily highlight changes between two configurations. There are a number of free tools on the internet that do this. If you’re using a Windows environment, a personal favourite of mine is WinMerge. Using a diff tool can separate the wood from the trees quickly, and provides a simple overview of changes - very useful in the small hours, I can assure you.
    Working on a switch or firewall ? Learn how to use the CLI. This is often superior in terms of power and usually contains commands that are not available from the GUI. Using this approach, it’s perfectly feasible to bulk load configuration, and also back it out using the same mechanism.
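    On the diff point above - you don’t even need a dedicated tool for a quick before-and-after view. A few lines of Python’s standard difflib will highlight exactly what changed between two saved configs (the file names here are purely illustrative):

```python
import difflib
from pathlib import Path

# Configs captured before and after the change window (illustrative names)
before = Path("fw01_pre_migration.cfg").read_text().splitlines()
after = Path("fw01_post_migration.cfg").read_text().splitlines()

# Lines prefixed with + or - are the changes you need to account for
for line in difflib.unified_diff(before, after,
                                 fromfile="pre-migration",
                                 tofile="post-migration", lineterm=""):
    print(line)
```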

    What if your rollback plan doesn’t work? Unfortunately, there is absolutely no way to simulate a rollback during project planning, and this is often made worse by many changes being made at once to multiple systems. It’s not that the rollback doesn’t work - it’s almost always a case of settings being reverted before they should be. In most cases, this has the knock-on effect of denying yourself access to a system - and it’s always in a place where there are no local support personnel to assist - at least, not immediately. For every migration I have completed over my career, I’ve always ensured that there is an alternative route to reach a remote device should the primary path become inaccessible. For firewalls, this can be a blessing - particularly as they usually permit access on the public interfaces.

    However, delete a route inadvertently and you are toast - you lose access to the firewall full stop - get out of that one. What would I do in a situation like this where the firewall is located in Asia, for example, and you are in London? Again - contingency. You can’t remove a route on a firewall if it was created automatically by the system. In this case, a VLAN or directly connected interface will create its own dynamic route, and should still be available. If dealing with a remote firewall, my suggestion here would be Out Of Band Management (OOBM), but not a device connected directly to the firewall itself, as this presents a security risk if not configured properly. A personal preference is a locally connected laptop in the remote location that uses either independent WiFi or a 3G / MiFi presence. Before the migration starts, establish a WebEx or GoToMeeting session (don’t forget to disable UAC here, as that can shoot you in the foot), and arrange for a network cable to be plugged into the switch fabric, or directly into the device. Direct is better if you can spare the interface, as it removes potential routing issues. Just configure the NIC on the remote machine with an address in the same subnet as the interface you’re connected to, and you’re golden.

    I’ve used the above as a get out of jail free card on several occasions, and I can assure you it works.

    So what are the takeaways here ?

    The most important aspect is to be ready with a response - effectively a “plan B” for when things go wrong. Simple planning in advance can save you having to book a flight, or foot the expense of a local IT support firm with no prior knowledge of your network - and there’s the security aspect as well: you’d need to provide the password for the device, which immediately necessitates a change once the remediation is complete. In summary

    - Thoroughly plan each migration, and allow time for contingency steps. You may not need them, and if you don’t, then you effectively gain time that could be used elsewhere.
    - Have an alternative way of reaching a remote device, and ensure that any necessary third party vendors are going to be available during your maintenance window should this be required.
    - Take regular config backups of all devices. You don’t necessarily need an expensive tool for this - I actually designed a method to make this work using Linux, a TFTP server, and a custom bash script (a rough sketch of the same idea is shown below) - let me know if you’d like a copy 🙂
    - Regularly analyse (automated diff) changes between configurations. Any changes that are undocumented or not previously approved are a cause for alarm and should be investigated.
    - Ensure that you have adequate documentation, and the steps necessary to recover systems in the event of failure.
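    On the config backup point - the original was a bash script feeding a TFTP server, and this isn’t that script. Purely as a sketch of the same idea, here’s a rough Python equivalent using the third-party netmiko library to pull and timestamp a running config over SSH (the device details are placeholders):

```python
from datetime import datetime
from pathlib import Path
from netmiko import ConnectHandler   # pip install netmiko

# Placeholder device details - substitute your own inventory
device = {
    "device_type": "cisco_ios",
    "host": "192.0.2.1",
    "username": "backup",
    "password": "changeme",
}

conn = ConnectHandler(**device)
config = conn.send_command("show running-config")
conn.disconnect()

# Timestamped copy, so successive backups can be diffed later
backup_dir = Path("config_backups")
backup_dir.mkdir(exist_ok=True)
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
(backup_dir / f"{device['host']}_{stamp}.cfg").write_text(config)
```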

    Any thoughts or questions ? Let me know !

  •

    [image: netsecurity.jpg]
    I read an article by Glenn S. Gerstell (Mr. Gerstell is the general counsel of the National Security Agency) with a great deal of interest. That same article is detailed below.

    The National Security Operations Center occupies a large windowless room, bathed in blue light, on the third floor of the National Security Agency’s headquarters outside of Washington. For the past 46 years, around the clock without a single interruption, a team of senior military and intelligence officials has staffed this national security nerve center.

    The center’s senior operations officer is surrounded by glowing high-definition monitors showing information about things like Pentagon computer networks, military and civilian air traffic in the Middle East and video feeds from drones in Afghanistan. The officer is authorized to notify the president any time of the day or night of a critical threat.

    Just down a staircase outside the operations center is the Defense Special Missile and Aeronautics Center, which keeps track of missile and satellite launches by China, North Korea, Russia, Iran and other countries. If North Korea was ever to launch an intercontinental ballistic missile toward Los Angeles, those keeping watch might have half an hour or more between the time of detection to the time the missile would land at the target. At least in theory, that is enough time to alert the operations center two floors above and alert the military to shoot down the missile.

    But these early-warning centers have no ability to issue a warning to the president that would stop a cyberattack that takes down a regional or national power grid or to intercept a hypersonic cruise missile launched from Russia or China. The cyberattack can be detected only upon occurrence, and the hypersonic missile, only seconds or at best minutes before attack. And even if we could detect a missile flying at low altitudes at 20 times the speed of sound, we have no way of stopping it.

    Something I’ve been saying all along is that technology alone cannot stop cyber attacks. Often referred to as a “silver bullet”, or “blinky lights”, this kind of thinking provides the misconception that by purchasing that new, shiny device, you’re completely secure. Sorry folks, but this just isn’t true. In fact, cyber crime, and its associated plethora of hourly attacks, is evolving at an alarming rate - much faster than you’d like to believe.

    You’d think that for all the huge technological advances we have made in this world, the almost daily plethora of corporate security breaches, high profile data loss, and individuals being scammed every day would have dropped down to nothing more than a trickle - even to the point where they became virtually non-existent. We are making huge progress with landings on Mars, autonomous space vehicles, artificial intelligence, big data, machine learning, and essentially reaching new heights on a daily basis thanks to some of the most creative minds in this technological sphere. But somehow, we have lost our way, stumbled and fallen - mostly on our own sword. But why ?

    Just like the Y2K gold rush in the late 90s, information security has become the next big thing, with companies ranging from a few employees as startups to enterprise organisations touting their services and platforms as best in class, and the next “must have” tool in the blue team’s already bulging arsenal. Tools that on their own in fact have little effect unless they are combined with something else equally expensive to run. We’ve spent so much time focusing on everything from which SIEM solution we need to what will be labelled the ultimate silver bullet capable of eliminating the threat of attack once and for all that, in my opinion, we have lost sight of the original goal - with regulatory requirements and best practice pushing us towards products and services that either require additional staff to manage, or are incredibly expensive to deploy and ultimately run. Supposedly in an effort to simplify the management, analysis, and processing of millions of logs per hour, we’ve created even more platforms to ingest this data in order to make sense of it.

    In reality, all we have created is a shark infested pool where larger companies consume up and coming tech startups for breakfast to ensure that they do not pose a threat to their business model / gravy train, therefore enabling them to dominate the space even further with their newly enhanced reach.

    How did we get to this ? What happened to thought process and working together in order to combat the threat that increases on an hourly basis ? We seem to be so focused on making sure that we aren’t the next organisation to be breached that we have lost the art of communication and the full benefit of sharing information so that it assists others in their journey. We’ve become so obsessed with the daily onslaught of platforms that we no longer seem to have the time to even think, let alone take stock and regroup - not as an individual, but as a community.

    There are a number of ”communities” that offer “free” forums and products under the open source banner, but sadly, these seem to be turning into paid-for products at a rate of knots. I understand people need to live and make money, but if awareness was raised to the point where users wouldn’t click links in phishing emails, fall for the fake emergency wire transfer request from the CEO, or be suddenly tempted by the latest offer in terms of cheap technology then we might - just might - be able to make the world a better place. In order to make this work, we first need to remove the stigma that has become so ingrained by the media and set in stone like King Arthur’s Excalibur. Let’s first start with the hacker / criminal parallel. They aren’t the same thing folks.

    Nope. Not at all. Hackers are those people who find ingenious ways of getting into networks and infrastructure that you never even knew existed, trick you into parting with sensitive information (then inform you as to where you went wrong), and most importantly, educate you so that you and your network are far more secure against real attacks and real criminals. These people exist to increase your awareness, and by definition, security footprint - not use it against you in order to steal. Hackers do like to wear hoodies as they are comfortable, but you won’t find one using gloves, wearing a balaclava or sunglasses, and in some cases, they actually prefer desktops rather than laptops.

    The image being portrayed here is one perpetuated by the media, and it has certainly been effective - but not in a positive way. The word “hacker” is now synonymous with criminals, where it really shouldn’t be. One defines security, whereas the other sets out to break it. If we locked up all the hackers on this planet, we’d only have the blue team remaining. It’s the job of the red team (hackers) to see how strong your defences are. Hackers exist to educate, not infiltrate (at least, not without asking for permission first :))

    I personally have lost count of how many times I’ve sat in meetings where a sales pitch around a security platform is touted as a one stop shop or a Swiss army knife that can protect your entire network from a breach. Admittedly, there’s some great technology on the market that performs a variety of functions to protect your estate, but they all fail to take into consideration the weakest link in any chain - users. Irrespective of bleeding edge “combat platforms” (as I like to refer to them), criminals are becoming very adept in their approach, leveraging techniques such as social engineering. It should come as no surprise for you to learn that this type of attack can literally walk past your shiny new defence system as it relies on the one vulnerability you cannot predict - the human. Hence the term “hacking humans”.

    I’m of the firm opinion that if you want to outsmart a criminal, you have to think like one. Whilst newfangled platforms are created to assist in the fight against cyber crime, they are complex to configure, suffer from alerting bloat (far too many emails so you end up missing the one where your network is actually being compromised), or are simply overwhelming and difficult to understand. Here’s the thing. You don’t need (although they do help) expensive bleeding edge platforms with flashing lights to tell you where weak points lie within your network, but you do need to understand how a criminal can and will exploit these. A vulnerability cannot be leveraged if it no longer exists, or even better, never even existed to begin with.

    And so, on with the mission, and the real reason as to why I created this site. I’ve been working in information technology for 30 years, and have a very strong technical background in network design and information security.

    What I want to do is create a communication, information, and awareness sharing platform. I created the original concept of what I thought this new community should look like in my head, but it’s taken a while to finally develop it, get people interested, and on board. To my mind, those from inside and outside of the information security arena will pool together, share knowledge, raise awareness, and - probably most important - harness this new-found force and drive change forward.

    The breaches we are witnessing on a daily basis are not going to simply stop. They will increase dramatically in their frequency, and will get worse with each incident.

    Let’s stop the “hackers are criminals” myth, start using our own unique talents in this field, and make a community that

    - is able to bring effective change
    - treats everyone as equals

    The community, once fully established, could easily be the catalyst for change - both in perception, and inception.

    Why not wield the stick for a change instead of being beaten with it, and work as a global virtual team instead ?

    Will you join me ? In case I haven’t already mentioned it, this initiative has no cost - only gains. It is entirely free.

  •

    [image: security1.webp]
    The recent high profile breaches impacting organisations large and small are a testament to the fact that no matter how you secure credentials, they will always be subject to exploit. Can a password alone ever be enough? In my view, it’s never enough. The enforced minimum should be a password combined with at least a secondary factor. Regardless of how “secure” you consider your password to be, it really isn’t in most cases – it just “complies” with the requirement being enforced.

    Here’s a classic example. We take the common password of “Welcome123” and put it into a password strength checker
    [image: password1.png - password strength checker result]
    According to the above, it’s “strong”. Actually, it isn’t. It’s only considered this way because it meets the complexity requirements, with 1 uppercase letter, at least 8 characters, and numbers. What’s also interesting is that a tool sponsored by Dashlane considers the same password as acceptable, taking supposedly 8 months to break
    [image: password2.png - Dashlane-sponsored estimate]
    How accurate is this ? Not accurate at all. The password of “Welcome123” is in fact one of the passwords contained in any penetration tester’s toolkit – and, by definition, also used by cyber criminals. As most of this password combination is in fact made up of a dictionary word, plus sequential numbers, it would take less than a second to break this rather than the 8 months reported above. Need further evidence of this ? Have a look at haveibeenpwned, which will provide you with a mechanism to test just how many times “Welcome123” has appeared in data breaches
    [image: hibp.png - haveibeenpwned results for “Welcome123”]
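    If you’d rather check programmatically, the haveibeenpwned Pwned Passwords API supports k-anonymity lookups - only the first five characters of the password’s SHA-1 hash ever leave your machine, and the matching is done locally. A minimal sketch:

```python
import hashlib
import urllib.request

def pwned_count(password: str) -> int:
    """Return how many times a password appears in the Pwned Passwords corpus."""
    sha1 = hashlib.sha1(password.encode()).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    # Only the 5-character prefix is sent to the API
    with urllib.request.urlopen(f"https://api.pwnedpasswords.com/range/{prefix}") as resp:
        for line in resp.read().decode().splitlines():
            candidate, count = line.split(":")
            if candidate == suffix:
                return int(count)
    return 0

print(pwned_count("Welcome123"))   # a depressingly large number
```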

    Why are credentials so weak ?

    My immediate response to this is that for as long as humans have habits, and create scenarios that enable them to easily remember their credentials, then this weakness will always exist. If you look at a sample taken from the LinkedIn breach, those passwords that occupy the top slots are arguably the least secure, but the easiest to remember from the human perspective. Passwords such as “password” and “123456” may be easy for users to remember, but on the flip side, weak credentials like this can be broken by a simple dictionary attack in less than a second.
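    To illustrate just how quick that is: the LinkedIn hashes were reportedly unsalted SHA-1, so cracking the weakest entries amounts to nothing more than hashing a wordlist and comparing. A toy example (the wordlist and target hash are purely illustrative):

```python
import hashlib

# Toy wordlist - real attacks use lists with millions of entries
wordlist = ["password", "123456", "letmein", "welcome", "linkedin", "qwerty"]

# A "leaked" unsalted SHA-1 hash (here simply sha1("linkedin") for demonstration)
leaked_hash = hashlib.sha1(b"linkedin").hexdigest()

for word in wordlist:
    if hashlib.sha1(word.encode()).hexdigest() == leaked_hash:
        print(f"Cracked: {word}")
        break
```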

    Here’s a selection of passwords still in use today – hopefully, yours isn’t on there
    [image: passwordlist.jpeg - a selection of commonly used passwords]
    We as humans are relatively simplistic when it comes to credentials and associated security methods. Most users who do not work in the security industry have little understanding, desire to understand, or patience, and will naturally choose the route that makes their life easier. After all, technology is supposed to increase productivity and make tasks easier to perform, right? Right. And it’s this exact vulnerability that a cyber criminal will exploit to its full potential.

    Striking a balance between the security of credentials and ease of recall has always had its challenges. A classic example is that networks, websites and applications nowadays typically have password policies in place that only permit the use of a so-called strong password. Given the required assortment of letters, numbers, non-alphanumeric characters, uppercase and lowercase, the password itself is probably “secure” to an acceptable extent, although the method of storing the credentials often isn’t. A shining example of this is the culture of writing down sensitive information such as credentials. I’ve worked in some organisations where users have actually attached their password to their monitor. Anyone looking for easy access into a computer network is onto an immediate winner here, and unauthorised access or a full blown breach could occur within an alarmingly short period of time.

    Leaked credentials and attacks from within

    You could argue that you would need access to the computer itself first, but in several historical breach scenarios, the attack originated from within. In this case, it may not be an active employee, but someone who has access to the area where that particular machine is located. Any potential criminal has the credentials – well, the password itself, but what about the username? This is where a variety of techniques can be used for username discovery – in fact, most of them non-technical – and worryingly simple to execute. Think about what is usually on a desk in an office. The most obvious place to look for the username would be on the PC itself. If the user had recently logged out, or locked their workstation, then on a Windows network that would give you the username, unless a group policy was in place to hide it. Failing that, most modern desk phones display the name of the user. On Cisco devices, under Extension Mobility, is the ID of the user. It doesn’t take long to find this. Finally, there’s the humble business card. A potential criminal can look at the email address format, remove the domain suffix, and potentially predict the username. Most companies tend to leverage the username in email addresses, mainly thanks to SMTP address templates and policies – certainly true in on-premises Exchange environments or Office 365 tenants.

    The credentials are now paired. The password has been retrieved in clear text, and by using a simple discovery technique, the username has also been acquired. Sometimes a criminal can get extremely lucky and acquire credentials with minimal effort. Users have a habit of writing down things they cannot easily recall, and in some cases the required information is divulged without much effort on the part of the criminal. Sounds absurd and far-fetched, doesn’t it ? Get into your office early, or work late one evening, and take a walk around the desks. You’ll be unpleasantly surprised at what you find. Amongst the plethora of personal effects such as used gym towels and footwear, I guarantee you will find information that could be of significant use to a criminal – not necessarily credentials themselves, but enough information to extract them via another route. But who would be able to use such information ?

    Think about this for a moment. You generally come into a clean office in the mornings, so cleaners have access to your office space. I’m not accusing anyone of anything unscrupulous or illegal here, but you do need to be realistic. This is the 21st century, and it is a threat you need to factor into your overall cyber security policy and strategy. Far too much focus is placed on securing the perimeter network, and not enough on the threat that lies within. A criminal could get a job as a cleaner at a company and spend time collecting intelligence about vulnerabilities waiting to be exploited. Another example of “instant intelligence” is the network topology map. Some of us are not blessed with huge screens and need to make do with one ancient 19″ monitor, or perhaps two. As topology maps can be quite complex, it’s advantageous to print them in A3 format to make them easier to digest, and you may also need to print copies of the same document for meetings. The problem is: what do you do with that copy once you have finished with it ?

    How do we address the issue ? Is there sufficient awareness ?

    Yes, there is. Throwing a printed topology map in the bin isn’t the answer, as it can easily be recovered. The information contained in most topology maps is extensive, and it is a goldmine to a criminal looking for intelligence about your network layout. Anything like this is classified information and should be shredded at the earliest opportunity. Perhaps one of the worst offences I’ve ever personally witnessed was a member of the IT team opening a password file, then walking away from their desk without locking their workstation. To prove a point about how easily credentials can be inadvertently leaked, I took a photo with a smartphone, then showed the offender what I’d managed to capture a few days later. “Slightly embarrassed” didn’t go anywhere near covering it.

    I’ve been an advocate of securing credentials for some time, and recently read about Bill Burr, the author of “NIST Special Publication 800-63”. Now retired, he has openly admitted that the advice he originally provided was, in fact, incorrect:

    “Much of what I did I now regret,” said Mr Burr, who advised people to change their password every 90 days and use obscure characters.

    “It just drives people bananas and they don’t pick good passwords no matter what you do.”

    The overall security of credentials and passwords

    However, bearing in mind that this supposed “advice” has long been the accepted norm in terms of password security, let’s look at the accepted standards from a well-known auditing firm.

    It would seem that Sarbanes-Oxley Section 404 dictates that regular changes of credentials are mandatory as part of the overarching controls. Any organisation regulated by the SEC (for example) would be within the scope of this requirement, and so the argument for not regularly changing your password becomes “invalid” under the act’s definition and narrative. My overall point here is that, in the financial services industry at least, the move away from clearly bad password advice is negated by a severely outdated set of controls that require you to enforce a password-change cycle and remain in compliance with it. In addition, there are a vast number of sites and services that force password changes on a regular basis, with no regard for what is now extensive research on password generation.

    The argument that forcing users to change a password on a frequent basis actually weakens its security is an interesting one that deserves intense discussion and real-world examples. But even if your password really is strong (as I mentioned previously, there are variations that are not secure at all, yet are considered strong because they meet a complexity requirement), the simple mutations users make when forced to change it could render it vulnerable. I took a simple lowercase phrase:

    mypasswordissimpleandnotsecureatall

    1564764311-893646-nonillion.png
    The actual testing tool can be found here. So, does a potential criminal have 26 nonillion years to spare ? Any cyber criminal with only basic skills won’t need a fraction of that time: this password is made up of simple dictionary words, is all lowercase, and could in reality be broken in seconds by a wordlist-based attack.
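
    The arithmetic behind headline figures like that is easy to approximate. The sketch below uses an assumed guessing rate purely for illustration; it shows why a character-by-character brute-force estimate looks astronomical for a long lowercase phrase, while a wordlist-based attack only has to combine a handful of common words:

    ```python
    # Naive brute-force estimate versus a dictionary-based view of the same phrase
    phrase = "mypasswordissimpleandnotsecureatall"

    charset_size = 26            # lowercase letters only
    guesses_per_second = 1e10    # assumed offline cracking rate, for illustration

    keyspace = charset_size ** len(phrase)
    years = keyspace / guesses_per_second / (60 * 60 * 24 * 365)
    print(f"Naive brute-force estimate: {years:.2e} years")

    # A dictionary attack doesn't work character by character - it combines whole
    # words. This phrase is just nine common words, so the effective search space
    # collapses dramatically.
    words = ["my", "password", "is", "simple", "and", "not", "secure", "at", "all"]
    print(f"Dictionary words in the phrase: {len(words)}")
    ```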

    My opinion ? Call it how you like – the password is here to stay for the near future at least. The overall strength of the password, or of the scheme used to store it (MD5, bcrypt, SHA-1 and so on), is irrelevant when an attacker can use established and proven techniques such as social engineering to obtain it. That said, the addition of 2FA or a salt dramatically increases password security, as does limiting the number of unsuccessful attempts permitted before the associated account is locked. This is a topic that interests me a great deal. I’d love to hear your feedback and comments.


    @justoverclock yes, completely understand that. It’s a haven for criminal gangs and literally everything is on the table. Drugs, weapons, money laundering, cyber attacks for rent, and even murder for hire.

    Nothing, it seems, is off limits. The dark web is truly a place where the only limitation is the amount you are prepared to spend.