Performing a migration? Ensure you have a back-out plan

Blog
  • One of the most important safety nets in IT Operations is contingency. Every migration needs a rollback plan in the event that things don’t quite go the way you’d expect, and with a limited timeline to implement a change, or in some cases, a complete migration, the rollback process is one that is an essential component. Without a plan to revert all changes back to their previous state, your migration is destined for failure from the outset. No matter how confident you are (I’ve yet to meet a project manager who doesn’t build in redundancy or rollback in one form or another) there is always going to be something you’ve missed, or a change that produces undesirable results.

    It is this seemingly innocent change that can have a domino effect on your migration - unless you have access to a replica environment, the result of the change cannot be realistically predicted. Admittedly, it’s a simple enough process to clone virtual machines to test against, but that’s of no consequence if your change relates to those conducted at hardware level. A classic example of this is a firewall migration. Whilst it would be possible to test policies to ensure their functionality meets the requirement of the business, confirming VPN links for example isn’t so straightforward - especially when you need to rely on external vendors to complete their piece of the puzzle before you can continue. Unless you’re deploying technology into a greenfield site, you do not have the luxury of testing a VPN into a production network during business hours. Based on this, you have a couple of choices

    1. You perform all testing off hours by switching equipment for the replacement, and perform end to end testing. Once you are satisfied everything works as it should, you put everything back the way you found it, then schedule a date for the migration.
    2. You configure the firewall using a separate subnet, VLAN, and other associated networking elements meaning the two environments run symmetrically

    But which path is the right one ? Good question. There’s no hard and fast rule to which option you go for - although option 2 is more suited to a phased migration approach whilst option 1 is more aligned to “big bang” - in other words, moving everything at the same time. Option 2 is good for testing, but may not reflect reality as you are not targeting the same configuration. As a side note, I’ve often seen situations where residual configuration from option 1 has been left behind, meaning you either land up with a conflict of sorts, or black hole routing.

    Making use of a rollback

    This is where the rollback plan bridges the gap. If you find yourself in a situation where you either run out of time, or cannot continue owing to physical, logical or external constraints, then you would need to invoke your rollback plan. It’s important to note at this stage that part of the project plan should include a point where the progress is reviewed and assessed, and if necessary, the rollback is executed. My personal preference is within around 40% of the allocated time window - all relevant personnel should reconvene and provide status updates around their areas of responsibility, and give a synopsis of any issues - and be fully prepared to elaborate on these if the need arises. If the responsible manager feels that the project is at risk of overrunning it’s started time frame, or cannot be completed within that window, he or she needs to exercise authority to invoke the rollback plan. When setting the review interval, you should also consider the amount of time required to revert all changes and perform regression testing.

    Rollback provides the ideal opportunity to put everything back how it was before you started on your journey - but it does depend on two major factors. Firstly, you need to allocate a suitable time period for the rollback to be completed within. Secondly, unless you have a list of changes that were made to hardware - inclusive of configuration, patching, and a myriad of others, how can you be sure that you’ve covered everything ?

    Time after time I see the same problem - something gets missed, and turns out to be fundamental on Monday morning when the changes haven’t been cross checked.

    So what should a contingency plan consist of ?

    One surefire way to ensure that configurations are preserved prior to making changes is to create backups of running configs - 2 minutes now can save you 2 days of troubleshooting when you can’t remember which change caused your issue.  For virtual machines, this is typically a snapshot that can be restored later should the need arise. A word to the wise though - don’t leave the machine running on snapshot for too​ long as this can rapidly deplete storage space. It’s not a simple process to recover a crashed VM that has run out of disk space.

    Keep version and change control records up to date - particularly during the migration. Any change that could negatively impact the remainder of the project should be examined and evaluated, and if necessary, removed from the scope of works (provided this is a feasible step - sometimes negating a process is enough to make a project fail)

    Document each step. I can’t stress the importance of this enough. I understand that we all want to get things done in a timely manner, but will you realistically remember all the changes you made in the order they were implemented ?
    Use differential tools to examine and easily highlight changes between two configurations. There are a number of free tools on the internet that do this. If you’re using a Windows environment, a personal favourite of mine is WinMerge. Using a diff tool can separate the wood from the trees quickly, and provides a simple overview of changes - very useful in the small hours, I can assure you.
    Working on a switch or firewall ? Learn how to use the CLI. This is often superior in terms of power and usually contains commands that are not available from the GUI. Using this approach, it’s perfectly feasible to bulk load configuration, and also back it out using the same mechanism.

    What if your rollback plan doesn’t work ? Unfortunately, there is absolutely no way to simulate a rollback during project planning, and this is often made worse by many changes being made at once to multiple systems. It’s not that the rollback doesn’t work - it’s usually always a case of settings being reverted before they should be. In most cases, this has the knock on effect of denying yourself access to a system - and it’s always in a place where there are no local support personnel to assist - at least, not immediately. For every migration I have completed over my career, I’ve always ensured that there is an alternative route to reach a remote device should the primary path become inaccessible. For firewalls, this can be a blessing - particularly as they usually permit access on the public interfaces.

    However, delete a route inadvertently and you are toast - you lose access to the firewall full stop - get out of that one. What would I do in a situation like this where the firewall is located in Asia for example, and you are in London ? Again - contingency. You can’t remove a route on a firewall if it was created automatically by the system. In this case, a VLAN or directly connected interface will create it’s own dynamic route, and should still be available. If dealing with a remote firewall, my suggestion here would be Out Of Band Management (OOBM), but not a device connected directly to the firewall itself, as this presents a security risk if not configured properly. A personal preference is a locally connected laptop in the remote location that uses either independent WiFi or a 3G / Mifi presence. Before the migration starts, establish a WebEx or GoToMeeting session (don’t forget to disable UAC here as that can shoot you in the foot), and arrange for a network cable to be plugged into switch fabric, or directly. Direct is better if you can spare the interface, as it removes potential routing issues. Just configure the NIC on the remote machine with an address in the same subnet add the interface you’re connected to, and you’re golden.

    I’ve used the above as a get out of jail free card on several occasions, and I can assure you it works.

    So what are the takeaways here ?

    The most important aspect is to be ready with a response - effectively a “plan b” when things go wrong. Simple planning in advance can save you having to book a flight, or foot the expense of a local IT support firm with no prior knowledge of your network - there’s the security aspect as well; you’d need to provide the password for the device which immediately invokes a change once the remediation is complete. In summary

    • Thoroughly plan each migration and allow time for contingency steps. You may not need them, and if you don’t, then you effectively gain time that could be used elsewhere.
    • Have an alternative way of reaching a remote device, and ensure necessary third party vendors are going to be available during your maintenance window should this be necessary.
    • Take regular config backups of all devices. You don’t necessarily need an expensive tool for this - I actually designed a method to make this work using Linux, a TFTP server, and a custom bash script - let me know if you’d like a copy 🙂
    • Regularly analyse (automated diff) configuration changes between configurations. Any changes that are undocumented or previously approved are a cause for alarm and should be investigated
    • Ensure that you have adequate documentation, and steps necessary to recover systems in the event of failure

    Any thoughts or questions ? Let me know !


  • 9 Votes
    12 Posts
    98 Views

    @crazycells said in ION brings clients back online after ransomware attack:

    you know, they believe the world revolves around them

    Haha, yes. And they invented sex.

  • Link vs Refresh

    Solved Customisation
    20
    8 Votes
    20 Posts
    502 Views

    @pobojmoks Do you see any errors being reported in the console ? At first guess (without seeing the actual code or the site itself), I’d say that this is AJAX callback related

  • Blog Setup

    Solved Customisation
    17
    8 Votes
    17 Posts
    422 Views

    Here is an update. So one of the problems is that I was coding on windows - duh right? Windows was changing one of the forward slashes into a backslash when it got to the files folder where the image was being held. So I then booted up my virtualbox instance of ubuntu server and set it up on there. And will wonders never cease - it worked. The other thing was is that there are more than one spot to grab the templates. I was grabbing the template from the widget when I should have been grabbing it from the other templates folder and grabbing the code from the actual theme for the plugin. If any of that makes sense.

    I was able to set it up so it will go to mydomain/blog and I don’t have to forward it to the user/username/blog. Now I am in the process of styling it to the way I want it to look. I wish that there was a way to use a new version of bootstrap. There are so many more new options. I suppose I could install the newer version or add the cdn in the header, but I don’t want it to cause conflicts. Bootstrap 3 is a little lacking. I believe that v2 of nodebb uses a new version of bootstrap or they have made it so you can use any framework that you want for styling. I would have to double check though.

    Thanks for your help @phenomlab! I really appreciate it. I am sure I will have more questions so never fear I won’t be going away . . . ever, hahaha.

    Thanks again!

  • Nodebb as blogging platform

    General
    10
    5 Votes
    10 Posts
    315 Views

    @qwinter I’ve extensive experience with Ghost, so let me know if you need any help.

  • 2 Votes
    6 Posts
    252 Views

    @kurulumu-net CSS styling is now addressed and completed.

  • 0 Votes
    1 Posts
    205 Views

    expert.webp
    One thing I’ve seen a lot of over my career is the “expert” myth being touted on LinkedIn and Twitter. Originating from psychologist K. Anders Ericsson who studied the way people become experts in their fields, and then discussed by Malcolm Gladwell in the book, “Outliers“, “to become an expert it takes 10,000 hours (or approximately 10 years) of deliberate practice”. This paradigm (if you can indeed call it that) has been adopted by several so called “experts” - mostly those within the Information Security and GDPR fields. This article isn’t about GDPR (for once), but mostly those who consider themselves “experts” by virtue of the acronym. Prior to it’s implementation, nobody should have proclaimed themselves a GDPR “expert”. You cannot be an expert in something that wasn’t actually legally binding until May 25 2018, nor can you have sufficient time invested to be an expert since inception in my view. GDPR is a vast universe, and you can’t claim to know all of it.

    Consultant ? Possibly, yes. Expert ? No.

    The associated sales campaign isn’t much better, and can be aligned to the children’s book “Chicken Licken”. For those unfamiliar with this concept, here is a walkthrough. I’m sure you’ll understand why I choose a children’s story in this case, as it seems to fit the bill very well. What I’ve seen over the last 12 months had been nothing short of amazing - but not in the sense of outstanding. I could align GDPR here to the PPI claims furore - for anyone unfamiliar with what this “uprising” is, here’s a synopsis.

    The “expert” fallacy

    Payment Protection Insurance (PPI) is the insurance sold alongside credit cards, loans and other finance agreements to ensure payments are made if the borrower is unable to make them due to sickness or unemployment. The PPI scandal has its roots set back as far as 1998, although compensatory payments did not officially start until 2011 once the review and court appeal process was completed. Since the deadline for PPI claims has been announced as August 2019, the campaign has become intensively aggressive, with, it would seem, thousands of PPI “experts”. Again, I would question the authenticity of such a title. It seems that everyone is doing it, therefore, it must be easy to attain (a bit like the CISSP then). I witnessed the same shark pool of so called “experts” before, back in the day when Y2K was the latest buzzword on everyone’s lips. Years of aggressive selling campaigns and similarly, years of FUD (Fear, Uncertainty, Doubt - more effectively known as complete bulls…) caused an unprecedented spike that allowed companies and consultants (several of whom had never been heard of before) to suddenly appear out of the woodwork and assume the identity of “experts” in this field. In reality, it’s not possible to be a subject matter expert in a particular field or niche market unless you have extensive experience. If you compare a weapons expert to a GDPR “expert”, you’ll see just how weak this paradigm actually is. A weapons expert will have years of knowledge in a field, and could probably tell you which gun discharged a bullet just by looking at the expended shell casing. I very much doubt a self styled GDPR expert can tell you what will happen in the event of an unknown scenario around the framework and the specific legal rights (in terms of the individual who the data belongs to) and implications for the institution affected. How can they when nobody has even been exposed to such a scenario before ? This makes a GDPR expert in my view about as plausible as a Brexit expert specialising in Article 50.

    What defines an expert ?

    The focal point here is in the comparison. A weapons expert can be given a gun and a sample of shell casings, then asked to determine if the suspected weapon actually fired the supplied ammunition or not. Using a process of proven identification techniques, the expert can then determine if the gun provided is indeed the origin. This information is derived by using established identity techniques from the indentations and markings in the shell casing created by the gun barrel from which the bullet was expelled, velocity, angle, and speed measurements obtained from firing the weapon. The impact of the bullet and exit damage is also analysed to determine a match based on material and critical evidence. Given the knowledge and experience level required to produce such results, how long do you think it took to reach this unrivalled plateau ? An expert isn’t solely based on knowledge. It’s not solely based on experience either. In fact, it’s a deep mixture of both. Deep in the sense of the subject matter comprehension, and how to execute that same understanding along with real life experience to obtain the optimum result. Here’s an example   An information technology expert should be able to

    Identify and eliminate potential bottlenecks Address security concerns, Design high availability Factor in extensible scalability Consider risk to adjacent and disparate technology and conduct analysis Ensure that any design proposal meets both the current criteria and beyond Understand the business need for technology and be able to support it

    If I leveraged external consultancy for a project, I’d expect all of the above and probably more from anyone who labels themselves as an expert - or for that fact, an architect. Sadly, I’ve been disappointed on numerous occasions throughout my career where it became evident very quickly that the so called expert (who I hasten to add is earning more an hour than I do in a day in most cases) hired for his “expertise and superior knowledge” in fact appears to know far less than I do about the same topic.

    How long does it really take to become an expert ?

    I’ve been in the information technology and security field since I was 16. I’m now 47, meaning 31 years experience (well, 31 as this year isn’t over yet). If you consider that experience is acquired during an 8 hour day, and used the following equation to determine the amount of years needed to reach 10,000 hours

    10000 / 8 / 365 = 3.4246575342 - for the sake of simple mathematics, let’s say 3.5 years.

    However, in the initial calculation, it’s 10 years (using the basis of 90 minutes invested per day) - making the expert title when aligned to GDPR even more unrealistic. As the directive was adopted on the 27 April 2016, the elapsed time period isn’t even enough to carry the first figure cited at 3.5 years, irrespective of the second. The reality here is that no amount of time invested in anything is going to make your an expert if you do not possess the prerequisite skills and a thorough understanding based on previous events in order to supplement and bolster the initial investment. I could spend 10,000 practicing a particular sport - yet effectively suck at it because my body (If you’ve met me, you’d know why) isn’t designed for the activity I’m requesting it to perform. Just because I’ve spent 10,000 hours reading about something doesn’t make me an expert by any stretch of the imagination. If I calculated the hours spanned over my career, I would arrive at the below. I’m basing this on an 8 hour day when in reality, most of my days are in fact much longer.

    31 x 365 x 8 = 90,520 hours

    Even when factoring in vacation based on 4 weeks per year (subject to variation, but I’ve gone for the mean average),

    31 x 28 X 8 = 6,944 hours to subtract

    This is only fair as you are not (supposed to be) working when on holiday. Even with this subtraction, the total is still 83,578 hours. Does my investment make me an expert ? I think so, yes - based on the fact that 31 years dedicated to one area would indicate a high level of experience and professional standard - both of which I constantly strive to maintain. Still think 10,000 hours invested makes you an expert ? You decide ! What are your views around this ?

  • 0 Votes
    1 Posts
    124 Views

    dc1.webp
    Why is it that all outages seem to happen at 5:30pm on a Friday afternoon ? Back in the day during 1998 when DEC (yeah, I’m old - shoot me) was still mainstream and Windows NT Server 4.0 was the latest and greatest, I was working for a commodity trading firm in the West End as an IT Manager. The week had typically gone by with the usual activity - nothing too major to report apart from the odd support issue and the usual plethora of invoices that needed to be approved. Suddenly, one of my team emerged from the comms room and informed me that they had spotted a red light on one of the disks sitting in the Exchange server. I asked which disk it was, and said we’d need to get a replacement.

    For those who haven’t been in this industry for a years (unlike me) DEC (Digital Equipment Corporation) was a major player in previous years, but around 1998 started to struggle - it was then acquired by Compaq (who later on down the line in 2002 were acquired themselves by Hewlett Packard). This server was a beast - a DEC server 5000 the size of an under the counter fridge with a Mylex DAC960 RAID controller. It was so large, it had wheels with brakes. And, like a washing machine, was incredibly heavy. I’m sure the factory that manufactured servers in the 90’s used to pour concrete in them just for a bit of fun…

    Here’s a little glimpse for nostalgia purposes
    decserver5000.webp
    Those who remember DEC and it’s associated Mylex DAC960 RAID controller will also recall that the RAID5 incarnation was less than flawless. In modern RAID deployments, if a disk was marked as faulty or defunct, the controller would effectively blacklist the disk meaning that if it were to be removed then reinserted, any bad blocks would not be copied into the array hence causing corruption - it would be rejected.

    Well, that’s how modern controllers work. Unfortunately, the DAC960 controller was one of those boards that when coupled with old firmware and the NT operating system created the perfect storm. It was relatively well documented at the time that plugging a faulty drive back into an array could cause corruption and spell disaster. My enterprising team member had spotted the red light on the drive, then decided to eject it out of the array. For some unknown reason, instead of taking it back to his desk to order a replacement, he reinserted the it back into the array. Now, for those of you that actually remember the disks that went inside a DEC server 5000, you’ll know that these things were like bricks in plastic containers. They were around 3 inches in height, about 6 inches long, and quite heavy. These drives even had a eject clip on each side meaning that you had to press both sides of the disk carrier and then slide out the drive before it could be fully removed. Inserting a replacement drive required much the same effort (except in reverse), and provided a satisfyingly secure “clunk” as the interface of the drive made contact with the RAID controller bus.

    No sooner had I said the words

    “…please tell me you didn’t plug that disk back in……”

    to my team member, our central helpdesk number lit up like a Christmas tree in Times Square with users complaining they couldn’t get into email. I literally ran into the comms room and found the server with all drive bays lit solidly as if suspended in its own cryogenic state. For sake of schematics, a standard RAID5 configuration looks like the below. Essentially, the “p” component is parity. This is the stripe that contains information about the array and is spread across all disks that are members. In the event that one fails, the data is still held across the remaining drives, meaning still accessible - with a reduction in performance. The data is written across the disks in one write like a stripe (set).
    raid5_ok.webp
    At this point I’d already realised that the array had been corrupted by the returning faulty disk, and the bad stripe information was now resident on all the remaining drives. Those who understand RAID will know that if one drive in a RAID5 set fails, you still have the other remaining drives as a resilient array - but not if they are all corrupted. What I am alluding to here is shown below. The stripe was now unreadable, therefore, none of the disks were accessible
    raid5_broken.png.webp
    The server had completely frozen up and would not respond. I’m no fan of force powering a server off in the best of circumstances, but what choice did we have ?

    The server was powered off, then turned back on again. I really was hoping that this was just a system freeze and a reboot would make all our problems go away. The less naïve and experienced part of me dragged my legs towards the backup storage area (yes, we had a rotation pool of 2 weeks on site and 2 weeks off), and started collecting the previous day’s backup from the safe. As it stands, this was clearly the next logical step. Upon restart, we were met with the below shortly after NTOSKERNEL completed it’s checks
    bsod.webp
    (Not the actual BSOD of course - camera phones didn’t exist in 1998 - but as close as it gets)

    Anyone familiar with the Windows operating system will have bumped into this at some point in their career, and by the more commonplace acronym BSOD (Blue Screen Of Death). Either way, it’s never a good sign when you are trying to recover a system. One of the best messages displayed by a BSOD is

    IRQL_NOT_LESS_OR_EQUAL

    I say “best” with a hint of sarcasm of course as this message is completely useless and doesn’t mean anything to anyone as such. As the internet back in 1998 was fairly infantile, gaining a decent insight wasn’t as simple or clear cut as it is today. Looking at the problem from a sensible angle, it was fairly obvious that the DAC960 controller had either failed completely, or couldn’t read the disks and caused the crash. Deciding not to invest too much time in getting this system back to life, I fired up it’s dormant sister (yes, we had two fridges :)) which started with no issues. This secondary server was originally purchased to split the load of the mailboxes across two servers for resilience purposes, but never came to fruition owing to a backlog of other projects that were further up the chain of importance. Had this exercise have taken place, only 50% of the office would have been impacted - typical.

    With the server started, we then began the process of installing Exchange. Don’t get too excited - this was Exchange 5.5 and didn’t have any formal link to Active Directory, so it was never going to be the case of installing Exchange in disaster recovery mode, then playing back the database. Nope. This was going to be a directory restore first, followed by the Information Store.

    With Exchange installed and the previous service packs and hotfixes applied (early versions of Exchange had a habit of not working at all after a restore unless the patching​ level was the same), BackupExec 6.2 (yes, I know) was set to restore to an alternative Exchange server, and the tapes loaded into the robotic arm cradle. In hindsight, it would have been a better option to install BackupExec on the Exchange server itself, and connect the tape drive to the SCSI connector. However, can you find a cable when you really need one ? In any case, the server was SCSI2 when the loader was SCSI1. This should have set alarm bells ringing at the time, but with the restore started, we went back to our seats - I then began the task of explaining to senior management about the cause of the outage and what we were doing to resolve the problem. As anyone with experience of Microsoft systems knows, attempting to predict the time to restore or copy anything (especially back in the 90’s) wasn’t a simple task, as Windows had a habit of either exaggerating the time, or sitting there not responding for ages.

    Rather like a 90’s Wikipedia, NT wasn’t known for it’s accuracy.

    I called home and solemnly declared I was in for a long night. It’s never easy explaining the reasons why or attempting to justify the reasons you need to work late to family members, but that’s another story. Checking on the progress of the restore, we were averaging speeds of around 2Mbps ! Cast your mind back to 1998 and think of the surrounding technology. Back in the (not so good) old days, modern switching technology and 10Gbps networks were non existent. We were stuck with old 3Com 10Mbps hubs and an even slower Frame Relay connection (256k with 128k ISDN backup) as the gateway. To make matters worse, our internet connection was based on dialup technology using a SHIVA LanRoverE. Forget 1Gb fibre - this thing dished out an awesome [sic] 33.6k speed or even 56k if you were using ISDN. Web Pages loading in about 20 seconds was commonplace - downloading drivers was an absolute nightmare as you can imagine.

    Back to the restore. Having performed the basic math, and given the size of the databases (around 70Gb on a DLT 40 that was compressed to 80Gb), this was going to take over 24 hours. If you think about how hubs used to work, this meant that the 10Mbps speed of the device was actually shared across all 24 ports. This effectively reduces the port speed to 0.42Mbps - and that really depends on what the other ports are doing at the time. The restore rate remained at around 2Mbps for hours, and rather than everyone sit there watching water evaporate, I sent home the remaining staff and told them to be on standby for the entire weekend. I really couldn’t stomach food at this point, and ended up working into the night on other open tasks in an effort to catch up. I ended up falling asleep at my desk around 2am, and then being woken by the sound of my mobile (a Nokia of course) ringing. Looking at the clock, it was 5am. Checking the restore, it had progressed to the information store itself and was around 60% completed. After another 15 hours in the office, the restore finally completed.

    Having restarted all of the Exchange services, even the information store came up, which really was good news. However, browsing through the mailboxes I noticed that only a quarter of the 250+ I was expecting were listed. Not knowing much about the Exchange back end at the time, I contacted a so-called Exchange specialist based in Switzerland (in case you’re wondering, we were a Swiss headquartered entity, and all external support came from there). This Exchange specialist informed me that the backup hadn’t completed properly, and a set of commands needed to be run in BackupExec to resolve this. Of course, this also meant that the restore process had to be restarted - there goes another 24+ hours, I thought to myself. With the new “settings applied” and the restore process restarted, I decided that I wasn’t going to sit in the office for another day waiting for the restore to complete, and so I decided to call one of my team to come in and occupy the watchtower.

    Getting hold of someone was much more difficult than I had imagined. After letting the remainder of the team go, they all forged an exodus to the nearest door like iron filings to a magnet. So much for team ethic I thought. Eventually, I managed to get hold of a colleague who, after much griping, agreed to come into the office. I wouldn’t have minded as much if he didn’t live less than 15 minutes away, but that’s another story. My colleague arrived around 30 minutes later, and then I left the office. Getting home wasn’t a simple task. In the UK, there are often engineering works taking place over the weekend - particularly on the tube, and in most cases, local rail providers also - mine included. What should have taken about 2 hours maximum took 4, and by the time I got home, I flopped into bed exhausted. Needless to say this didn’t go down particularly well with my wife who saw me last on the previous morning - especially as after 3 hours of restlessness and a general inability to sleep, I was called by senior management - and was asked to go back in.

    By now, my already frustrated wife’s temperature went from 36.9c to an erupting volcano equivalent in less than a split second. I fully appreciated her response, but I was young (well, younger), eager to impress, and also had a sense of ownership. After a somewhat heated exchange, I left for the office. I arrived in much the same time as it took me to get home in the first place, and found that the restore was of course still running. My colleague made some half baked excuse that he needed to leave the office as he had a “family emergency”. Not really in the mood to argue this, I let him leave. I then got on a conference call with the consultant we had been using. Unsurprisingly, the topic of the restore time came up.

    “…You have a very slow network…” said the consultant.

    “…No s**t Sherlock…”  I thought. “…Do you honestly think I’m sitting here for my health ? …”

    I politely “agreed”.

    Eventually, the restore process completed. With a sudden feeling of euphoria, I went back into the comms room to start the services and… to my dismay, found once again that only a third of the recipients appeared in the directory. The term “FFS” didn’t go anywhere near being an accurate portrayal of my response. I was brutally upset. Hopelessly crushed. On the verge of losing it… (ok, perhaps that’s overkill). There had to be a reason for this. Something we’d missed, or just didn’t understand. I went looking for answers on a 1998 version of Yahoo (actually, I think it may have been Lycos), and found an article relating to the DS/IS Consistency Adjuster in Exchange 5.5 - this isn’t the exact resource I found, but it goes a long way to describe the fundamental process. The upshot is that the consistency adjuster needed to be run to synchronise the once orphaned mailboxes with the directory service. This entire process took​ a couple of hours - whilst that seems inconceivable to even the extreme Luddite, this is 1998 with SCSI1 drives, a Pentium II Processor, and 512Mb ram.

    After the process completed (which incidentally looked like this)
    dsisadjuster.webp
    I could then see all mailboxes ! After performing several somersaults around the office (just kidding here, but I can tell you I felt like doing it), I confirmed with a 25% random user test that I had access to mailboxes. Unfortunately, I couldn’t see any new mail arriving, but that was only due to a stalled mail connector on the server in Switzerland that received external mail. After a quick reboot of this gateway, mail began to flow. After around an hour of testing, I was happy that everything was working as expected. As for the consultant who had just wasted hours of my life, it’s just as well he wasn’t in the same country as me, let alone room. I went home elated - to an extremely angry wife. She’s since forgiven me of course, and now looking back, I really appreciate why - she was looking out for me, and concerned - I just didn’t appreciate that at the time.

    Come Monday morning, users were back into email with everything working as expected. An emergency Exchange backup had been run, and I was in the process of writing up my postmortem report for senior management. I then got a phone call. Anyone remember a product by Fenstrae called Faxination ? This was peered with Exchange 5.5, and had stopped working since the crash. The head of operations demanded that this was resolved as a priority… Another late night… another argument at home, but that’s a story for another day.

  • 0 Votes
    1 Posts
    138 Views

    bg-min-dark.webp
    It’s a common occurrence in today’s modern world that virtually all organisations have a considerable budget (or a strong focus on) information and cyber security. Often, larger organisations spend millions annually on significant improvements to their security program or framework, yet overlook arguably the most fundamental basics which should be (but are often not) the building blocks of any fortified stronghold.

    We’ve spent so much time concentrating on the virtual aspect of security and all that it encompasses, but seem to have lost sight of what should arguably be the first item on the list – physical security. It doesn’t matter how much money and effort you plough into designing and securing your estate when you consider how vulnerable and easily negated the program or framework is if you neglect the physical element. Modern cyber crime has evolved, and it’s the general consensus these days that the traditional perimeter as entry point is rapidly losing its appeal from the accessibility versus yield perspective. Today’s discerning criminal is much more inclined to go for a softer and predictable target in the form of users themselves rather than spend hours on reconnaissance and black box probing looking for backdoors or other associated weak points in a network or associated infrastructure.

    Physical vs virtual

    So does this mean you should be focusing your efforts on the physical elements solely, and ignoring the perimeter altogether ? Absolutely not – doing so would be commercial suicide. However, the physical element should not be neglected either, but instead factored into any security design at the outset instead of being an afterthought. I’ve worked for a variety of organisations over my career – each of them with differing views and attitudes to risk concerning physical security. From the banking and finance sector to manufacturing, they all have common weaknesses. Weaknesses that should, in fact, have been eliminated from the outset rather than being a part of the everyday activity. Take this as an example. In order to qualify for buildings and contents insurance, business with office space need to ensure that they have effective measures in place to secure that particular area. In most cases, modern security mechanisms dictate that proximity card readers are deployed at main entrances, rendering access impossible (when the locking mechanism is enforced) without a programmed access card or token. But how “impossible” is that access in reality ?

    Organisations often take an entire floor of a building, or at least a subset of it. This means that any doors dividing floors or areas occupied by other tenants must be secured against unauthorised access. Quite often, these floors have more than one exit point for a variety of health and safety / fire regulation reasons, and it’s this particular scenario that often goes unnoticed, or unintentionally overlooked. Human nature dictates that it’s quicker to take the side exit when leaving the building rather than the main entrance, and the last employee leaving (in an ideal world) has the responsibility of ensuring that the door is locked behind them when they leave. However, the reality is often the case instead where the door is held open by a fire extinguisher for example. Whilst this facilitates effective and easy access during the day, it has a significant impact to your physical security if that same door remains open and unattended all night. I’ve seen this particular offence repeatedly committed over months – not days or weeks – in most organisations I’ve worked for. In fact, this exact situation allowed thieves to steal a laptop left on the desk in an office of a finance firm I previously worked at.

    Theft in general is mostly based around opportunity. As a paradigm, you could leave a £20 note / $20 bill on your desk and see how long it remained there before it went missing. I’m not implying here that anyone in particular is a thief, but again, it’s about opportunity. The same process can be aligned to Information security. It’s commonplace to secure information systems with passwords, least privilege access, locked server rooms, and all the other usual mechanisms, but what about the physical elements ? It’s not just door locks. It’s anything else that could be classed as sensitive, such as printed documents left on copiers long since forgotten and unloved, personally identifiable information left out on desks, misplaced smartphones, or even keys to restricted areas such as usually locked doors or cupboards. That 30 second window could be all that would be required to trigger a breach of security – and even worse, of information classed as sensitive. Not only could your insurance refuse to pay out if you could not demonstrate beyond reasonable doubt that you had the basic physical security measures in place, but (in the EU) you would have to notify the regulator (in this case, the ICO) that information had been stolen. Not only would it be of significant embarrassment to any firm that a “chancer” was able to casually stroll in and take anything they wanted unchallenged, but significant in terms of the severity of such an information breach – and the resultant fines imposed by the ICO or SEC (from the regulatory perspective – in this case, GDPR) – at €20m or 4% of annual global (yes, global) turnover (if you were part of a larger organisation, then that is actually 4% of the parent entity turnover – not just your firm) – whichever is the highest. Of equal significance is the need to notify the ICO within 72 hours of a discovered breach. In the event of electronic systems, you could gain intelligence about what was taken from a centralised logging system (if you have one – that’s another horror story altogether if you don’t and you are breached) from the “electronic” angle of any breach via traditional cyber channels, but do you know exactly what information has taken residence on desks ? Simple answer ? No.

    It’s for this very reason that several firms operate a “clean desk” policy. Not just for aesthetic reasons, but for information security reasons. Paper shredders are a great invention, but they lack AI and machine learning to wheel themselves around your office looking for sensitive hard copy (printed) data to destroy in order for you to remain compliant with your information security policy (now there’s an invention…).

    But how secure are these “unbreakable” locks ? Despite the furore around physical security in the form of smart locks, thieves seem to be able to bypass these “security measures” with little effort. Here’s a short video courtesy of ABC news detailing just how easy it was (and still is in some cases) to gain access to hotel rooms using cheap technology, tools, and “how-to” articles from YouTube.

    Surveillance systems aren’t exempt either. As an example, a camera system can be rendered useless with a can of spray paint or even something as simple as a grocery bag if it’s in full view. Admittedly, this would require some previous reconnaissance to determine the camera locations before committing any offence, but it’s certainly a viable prospect of that system is not monitored regularly. Additionally, (in the UK at least) the usage of CCTV in a commercial setting requires a written visible notice to be displayed informing those affected that they are in fact being recorded (along with an impact assessment around the usage), and is also subject to various other controls around privacy, usage, security, and retention periods.

    Unbreakable locks ?

    Then there’s the “unbreakable” door lock. Tapplock advertised their “unbreakable smart lock” only to find that it was vulnerable to the most basic of all forced entry – the screwdriver. Have a look at this article courtesy of “The Register”. In all seriousness, there aren’t that many locks that cannot be effectively bypassed. Now, I know what you’re thinking. If the lock cannot be effectively opened, then how do you gain entry ? It’s much simpler than you think. For a great demonstration, we’ll hand over to a scene from “RED” that shows exactly how this would work. The lock itself may have pass-code that “…changes every 6 hours…” and is “unbreakable”, but that doesn’t extend to the material that holds both the door and the access panel for the lock itself.

    And so onto the actual point. Unless your “unbreakable” door lock is housed within fortified brick or concrete walls and impervious to drills, oxy-acetylene cutting equipment, and proximity explosive charges (ok, that’s a little over the top…), it should not be classed as “secure”. Some of the best examples I’ve seen are a metal door housed in a plasterboard / false wall. Personally, if I wanted access to the room that badly, I’d go through the wall with the nearest fire extinguisher rather than fiddle with the lock itself. All it takes is to tap on the wall, and you’ll know for sure if it’s hollow just by the sound it makes. Finally, there’s the even more ridiculous – where you have a reinforced door lock with a viewing pane (of course, glass). Why bother with the lock when you can simply shatter the glass, put your hand through, and unlock the door ?

    Conclusion

    There’s always a variety of reasons as to why you wouldn’t build your comms room out of brick or concrete – mostly attributed to building and landlord regulations in premises that businesses occupy. Arguably, if you wanted to build something like this, and occupied the ground floor, then yes, you could indeed carry out this work if it was permitted. Most data centres that are truly secure are patrolled 24 x 7 by security, are located underground, or within heavily fortified surroundings. Here is an example of one of the most physically secure data centres in the world.

    https://www.identiv.com/resources/blog/the-worlds-most-secure-buildings-bahnhof-data-center

    Virtually all physical security aspects eventually circle back to two common topics – budget, and attitude to risk. The real question here is what value you place on your data – particularly if you are a custodian of it, but the data relates to others. Leaking data because of exceptionally weak security practices in today’s modern age is an unfortunate risk – one that you cannot afford to overlook.

    What are your thoughts around physical security ?