Friday, May 22, 2009

Trying to visit

And you thought

YOU had problems

Customer ID: 1*******

Contract ID: 8******

Dear John Glenn,

The web server that your site is hosted on has been offline due to an APC power failure affecting 3 server racks. While all the other affected servers came back online, this web server did not recognize its RAID subsystem.

RAID stands for "Redundant Array of Independent Disks" and is a technology that employs the simultaneous use of two or more hard disk drives to achieve greater levels of reliability and performance.

Your website is stored across the RAID system twice, on different hard drives, so if one of the hard drives fails your website will continue to run. The failed hard drive is replaced and the data that was on the drive is copied again from the other drives within the RAID. This is known as rebuilding the RAID, and it normally happens seamlessly without any effect on the web hosting server or your website. This is a daily task performed in our data centers and is standard for large data storage systems such as those used in the web hosting environment.

In this instance, we replaced the failed drive with a new drive and the RAID started to rebuild. While this was happening the rebuild process failed, corrupting all the data within the RAID set. This should not happen and we have open tickets with the RAID manufacturer to understand what went wrong in this case and to ensure that they can prevent this in the future.

Our system administrators do not rely on the RAID system as our only source of backup. We run a rolling backup of the live system to external backup servers to ensure that in a case like this we have a restore solution.

After the RAID corruption occurred, our engineers analyzed the situation and found that the only solution left to us was to recover the data from our backup systems. At this point the RAID was reinitialized, ready to receive data; this process itself takes several hours to perform.

We are currently copying and restoring the data from our backup systems to the web hosting server that your site runs from. The restore process takes time and will not be finished before Saturday afternoon.

Since the system problems began we have had a dedicated team of administrators working around the clock to monitor the copy of data from our backups and to ensure that all settings are restored so that your website will run again.

We apologize for any inconvenience and thank you for your patience. We will update you again as soon as there is additional information available.

Sincerely, 1&1 Internet Inc.

For the record, 1&1 normally has excellent uptime.

Also "For the Record": RAID assemblies are useless if the box in which they are installed "goes away." Yes, it happens. Given that nothing is perfect (except you and me, and I'm not sure about you <g>), backup media is a must, and that media needs to be proven: was the WRITE error-free? If it is a tape, has anyone tested a READ on a drive other than the one on which the tape was written?
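For file-based backups, that READ test might look something like the Python sketch below. It is only an illustration under stated assumptions: the function names sha256_of and backup_matches are made up for this example and do not come from 1&1 or any backup product, and it assumes both the original and the backup copy can be read as ordinary files.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Read the file back in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large backup files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup_copy):
    """Return True only if the backup reads back identical to the original."""
    return sha256_of(original) == sha256_of(backup_copy)
```

For the check to mean anything, the backup copy should be read back on a different drive or machine than the one that wrote it, for the same reason a tape should be tested on a second drive.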

Nothing is simple. It's our job as risk managers to anticipate 99% of the potential "gotchas."

John Glenn, MBCI, SRP
Enterprise Risk Management/Business Continuity practitioner
Ft. Lauderdale FL
Planner @

1 comment:

Hawker said...

Yeah, I got the same e-mail about my web site.
Funny they have been down for almost three weeks now. How does it take 3 weeks to restore a server and back up?
And why does my website go down for 1-3 weeks every year in May? And why, when I e-mailed support, did I get a different lame no-information reply?
I don't trust any of this.