Years ago, I was tasked with creating a very large network share. I decided to build a Linux box with six 1.5 TB drives in a RAID. At the time, it was a hefty cost. So, when we were planning this whole thing out, we decided that there would be no backup, since buying tapes or building a second machine was cost prohibitive. Yes, it was a risk, but an acceptable one. To counter the
And yesterday the machine failed.
Now, I don't have another Linux box with six SATA ports on it, so I made a trip to Microcenter and purchased some handy SATA to USB devices in order to get five drives running. That way I could run in degraded mode and mount the filesystem as read-only so I could get the data off the drives. I discovered that one of the things I picked was actually IDE to USB, and so I made trip #2 to Microcenter. After that, I was wiring things together and one of the enclosures failed to work. Trip #3. At least they're really nice at the returns counter.
I plugged the drives into a USB hub, then plugged the hub and some additional destination drives into my laptop. I'm recovering at a mere 20 MB/s, so it will take a long time, but at least the drives weren't full when I started.
So, here I am, pondering the things that went well and the things that were terrible about this strategy, and I have to say that I am quite pleased with how everything is panning out. I figure I should give you an overview of the pieces I considered while building the system and how well they worked for me during this time of failure. It might keep my mind off the fact that I'm recovering my RAID over a hodgepodge of cabling, my kids are staring at the flashing lights, I'm pretty sure one of the enclosures has touchy wiring that makes it motion sensitive, and there's a thunderstorm coming. I wish I had plugged all this into a UPS before I started.
I knew that I'd be building a custom system with more space than all of the servers, NAS devices and desktops at my current place of business combined. When it eventually failed, how would I get the data off the machine? Have a backup plan. Mine was really to regenerate the information through a very long and painful process, because I could not afford to double my costs.
To mitigate the chance of loss, I decided that I could always afford one more drive for the RAID to use for the "R" part (redundant). Two drives would have to fail before I lost any data.
When you purchase the drives for your array, try to get them from different batches. Hard drives manufactured at the same time tend to fail at about the same time. I didn't do this myself because of time constraints, but you should do what you can.
Alerts were set up to monitor the drives and let me know immediately if the data was at risk. I'd just go out and buy a new hard drive and add it to the RAID to recover. Not a big deal... as long as the other five drives stayed running.
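For a rough idea of what such an alert checks, here is a minimal sketch, not the monitoring I actually ran. It assumes a Linux software RAID whose status appears in `/proc/mdstat` in the usual `[6/5] [UUUUU_]` form; the function name and the sample text are made up for illustration.

```python
import re

# Matches the "[expected/active] [UUUUU_]" status at the end of an array's detail line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Matches the line that starts an array entry, e.g. "md0 : active raid5 ...".
ARRAY_RE = re.compile(r"^(md\d+)\s*:")


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active)} for arrays that are missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded[current] = (expected, active)
            current = None
    return degraded


# Hypothetical sample: a six-drive array running on five members.
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb1[0] sdc1[1] sdd1[2] sde1[3] sdf1[4]
      7325679680 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUUU_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real machine you would feed it the contents of `/proc/mdstat` from a scheduled job and send yourself a message whenever the result is not empty.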
If my machine died, part of the recovery plan was to go out and purchase USB adapters for the drives. At the time, those were a little expensive, but they have since come down greatly in price. I also figured that USB 3 might be everywhere by the time something failed, so I could get better recovery speeds.
One big thing to avoid is setting up a hardware-based RAID array. Yes, it offloads the RAID work to a dedicated device, but benchmarks show that software-based RAID isn't very expensive computationally. Another advantage of software RAID is that you can use multiple channels on the board to fetch and store information instead of passing everything through a single controller. Lastly, you avoid proprietary RAID formats. That last point is a huge hurdle.
If you do use a hardware RAID card, I strongly suggest you buy no fewer than two at the exact same time and confirm that they have the same firmware on them. I've experienced, and heard of, people having trouble recovering a RAID with a newer card, a different model, or even a card with a minor firmware change. If your one controller dies, you will need a backup controller that can get the data off the RAID; otherwise you've got a lot of useless disks.
Now, compare these problems to software RAID. If I keep a CD of the distribution I used to make the RAID, I'll be able to install it again and recover. Plus, newer versions of the software can usually still read arrays created by older ones. Years ago I used mdadm to set up the RAID, and today I used the current mdadm version to recover the data from the drives. No hassle at all.
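To make that concrete, here is a sketch of the kind of commands a degraded, read-only recovery involves. It only builds and prints the commands rather than running them. The device names, the `/dev/md0` array name and the `/mnt/recovery` mount point are all hypothetical, and these are not necessarily the exact commands I typed.

```python
import shlex


def recovery_commands(members, array="/dev/md0", mountpoint="/mnt/recovery"):
    """Build (but do not run) commands to assemble an array read-only and mount it."""
    # --readonly keeps md from writing to the members; --run starts the array
    # even though fewer drives are present than when it was last active.
    assemble = ["mdadm", "--assemble", "--readonly", "--run", array, *members]
    # Mount the filesystem read-only as well, so nothing on it gets modified.
    mount = ["mount", "-o", "ro", array, mountpoint]
    return [assemble, mount]


if __name__ == "__main__":
    # Hypothetical device names: five of the six original members.
    drives = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1", "/dev/sdf1"]
    for cmd in recovery_commands(drives):
        print(shlex.join(cmd))
```

Printing first and running by hand is deliberate: with drives hanging off USB adapters, the device names can change between boots, so it is worth checking them before anything touches the array.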
Since you are investing all this time and energy in making a bulletproof system, you probably want to put it on a UPS to help your hardware last longer. The local power grid goes through brownouts, power outages, spikes and has lots of noise from adjacent buildings, blenders, fluorescent lights and other computers. A UPS stops that and conditions the power so your hardware doesn't get beaten up nearly as much. I have a feeling that something like that fried the big computer so that it can only stay on for two minutes at a time, which is why I'm trying to recover this data with my laptop.
I've worked at places where the backup job appeared to be running for months but never actually wrote any data to disk. We were able to recover some of the data, painfully (RAID failure there as well), but it also taught us to try restoring files from our backups every now and then. Acrobats test that their net will hold their weight before they trust it with their lives. Your data is depending on you; test your "safety net" backups before you rely on them.
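One simple way to test a restore is to pull a directory back out of the backup and compare it, file by file, against the live copy. The sketch below does that with SHA-256 checksums. The function names are my own invention, and it assumes both trees are mounted locally and the live copy has not changed since the backup ran.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or file_digest(src) != file_digest(restored):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every file in the source tree came back intact. Anything else is a file to investigate before you need that backup for real.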
Keep an eye on the current safety of your systems. Set up monitoring to ensure the health of your system is consistently good. Backups are good, redundancy is good. Plan for failure and test your failure plans when you can.
Thankfully my drives were not full, otherwise I'd be spending about 110 hours recovering them. As it is, I have perhaps another 12 hours to go. The hardest part is that I'm juggling data onto drives that are significantly smaller, but I would much rather have my data than try to regenerate it!
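If you want to sanity-check an estimate like that for your own array, the arithmetic is short. This sketch assumes one drive's worth of capacity goes to redundancy, decimal units (1 TB = 10^12 bytes), and a steady transfer rate, which a USB hub will not really give you. The function names are only illustrative.

```python
TB = 10**12  # decimal terabyte, as drive vendors count it
MB = 10**6   # decimal megabyte


def usable_bytes(drive_count, drive_bytes, parity_drives=1):
    """Capacity left for data after setting aside the redundancy drives."""
    return (drive_count - parity_drives) * drive_bytes


def transfer_hours(total_bytes, bytes_per_second):
    """Hours needed to copy total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    full = usable_bytes(6, 1.5 * TB)
    print(f"{transfer_hours(full, 20 * MB):.0f} hours for a full array")
```

For six 1.5 TB drives at 20 MB/s this comes out a little over 100 hours, in the same ballpark as my figure. By the same arithmetic, 12 hours at that rate is a bit under 0.9 TB left to copy.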