In one of my lectures at university, the lecturer said something about strategic planning that has always stayed with me. That was:
"Many businesses fail to plan, but few businesses plan to fail."
Well, I am saying that you should have a plan for when things fail: your Disaster Recovery Plan (DRP). A DRP is something that any company that has production IT systems should have in place.
A DRP is not just about having a backup plan for data. It goes further, addressing what you do with the data you have backed up and what hardware you put it on. It takes into account the likely costs and associated business issues that come with downtime. It is not an IT document but rather a business document.
Let me provide a real-life example, which highlights the value of having a plan in place and how "Murphy's Law" can apply at times:
On Tuesday 13th June 2006, the ineedhits.com website suffered a number of hours of downtime whilst we implemented our own disaster recovery plan at an individual server level. Whilst the overall circumstances are quite complex, they are best understood by looking at the timeline below:
4pm Friday 9th June: ineedhits' production SQL server reported the failure of a drive in its RAID 5 array. Those of you with an IT background will know that RAID 5 offers a level of redundancy that allows one disk in the array to fail without any issue.
A new disk was ordered and, under our agreement with our hardware manufacturer, would be delivered on the next business day.
Unfortunately, the next business day was Tuesday 13th June 2006, due to a local public holiday in the state where our data centre is located, which is a different state from ineedhits' head office.
2pm Monday 12th June: The same server reported a high probability of failure on one of the "mirrored" drives within this machine. A call was logged raising the urgency of the replacement drive(s); however, the public holiday again slowed progress.
12 noon Tuesday 13th June: A second drive in the array reported failure. The machine stopped responding. The maintenance banner was placed on the site whilst we examined our options.
Our plan called for a full copy of the database to be copied down to our alternative data centre via a secure VPN tunnel. Even with a high-speed link, this took multiple hours, finishing in the very early hours of the morning. In the meantime, we double-checked the security and patch levels on our backup SQL server and brought them up to date.
Wednesday 14th June 2006: The restore was completed on Wednesday morning and site connectivity was restored.
The first hardware technician replaced one of the failed drives in the array. Unfortunately, this person was a Tier 1 support technician and did not have a great deal of experience or knowledge.
Thursday 15th June 2006: A more experienced Tier 2 support engineer arrived and replaced the SCSI backplane, as well as the failed mirror drive. He used his initiative and brought the backplane with him, as two drives failing in a server less than 4 months old (from a name-brand vendor) is highly unusual.
A rebuild of the array was commenced.
Friday 16th June 2006: The rebuild of the array completed but showed corruption of the data on the drive.
The decision was made to rely on backups and continue running from our alternative data centre until the reliability of the main production server could be assured.
As such, a 60-hour "stress test" was applied to this server over the weekend.
Monday 19th June 2006: With confidence in the server restored after it passed the stress test, the entire process that began on Tuesday 13th June and finished on Wednesday 14th June had to be reversed.
Tuesday 20th June: All systems appear to be up and running. However, if you are experiencing an issue, I strongly urge you to contact the ineedhits customer care team, and they will gladly assist.
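For readers without an IT background, the RAID 5 behaviour that drove this timeline can be illustrated with a small sketch. This is a simplified illustration of XOR parity, not how our controller is implemented: the block contents and the helper names `xor_blocks` and `rebuild_stripe` are made up for the example, and a real array rotates parity across disks and works on much larger blocks. It shows why one failed drive is survivable and why a second failure is not.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe(stripe):
    """Rebuild a stripe in which failed disks are marked as None.

    `stripe` holds one block per disk (data blocks plus one parity block).
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("more than one disk lost: parity cannot rebuild the stripe")
    rebuilt = list(stripe)
    if missing:
        # XOR of all surviving blocks equals the single missing block.
        survivors = [block for block in stripe if block is not None]
        rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


# Hypothetical three-disk stripe: two data blocks plus their parity.
data_a = b"ORDERS"
data_b = b"CLIENT"
parity = xor_blocks([data_a, data_b])

# One failed disk: the lost block is recovered from the survivors.
recovered = rebuild_stripe([data_a, None, parity])

# Two failed disks: there is nothing left to rebuild from.
try:
    rebuild_stripe([None, None, parity])
    two_disk_loss_recoverable = True
except ValueError:
    two_disk_loss_recoverable = False
```

In the sketch, the first call gets the missing block back, while the second raises an error. That second case is the position we found ourselves in at noon on Tuesday 13th June, which is why the plan fell back to a separate copy of the database rather than the array itself.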
I would like to stress that ineedhits' data has not been compromised by an external party. All data remains in a secure, encrypted state. Thanks to having our plan in place, we were able to proceed with an acceptable amount of downtime and minimal disruption. It is always highly regrettable when our site is unavailable.
For that - and perhaps most importantly - I'd like to apologize to all our customers and site visitors for the inconvenience that this downtime may have caused. If you have any questions about orders you placed between Sunday 11th June and Wednesday 15th June 2006, please contact our customer care team.
Some hints / lessons learnt with DRP:
Do not fall into the trap of thinking that it won't happen to you or that DRP is not for small businesses. It is! If you plan to fail, you will also plan to get back up and running!
Posted by Warren Duff at 6:03 AM GMT
I did face the problem of a disk crash, and I never thought it would be so difficult to get to a solution. After much effort, I sent it to Disk Doctors Labs Inc, where my disk was recovered.
Hi Robin,
We looked at that as an option, but the cost of doing that was extremely high. Each of the disks in the array would have had to be provided to the data recovery expert, who then charges "per meg" of data recovered. With the cost of IDE hard drives coming down, I strongly recommend people use mirrored drives in their machines or even back up to USB thumb drives. Simple and cheap.
Warren