Preventing data loss

January 21st, 2009

OK, so it's not a forest fire, but you get the point.

I’ve always been the sort of person who saves receipts. Until very recently, I had an entire filing box full of receipts, warranty cards and software license keys. This system bothered me in much the same way that keeping paper copies of research papers did: they can’t be searched (did I file the receipt for my coffee table under C for coffee table, T for table, I for Ikea, or F for furniture?), and they take up space in my apartment, where space is pretty limited. There’s also the problem of a paper receipt’s relative fragility: there’s only one copy of it, and it can be burned, crumpled, crushed, soaked, faded into unreadability, and so on.

The solution was pretty straightforward: I scanned everything in the box, gave the files descriptive names and let Spotlight index them. To provide some additional protection against data loss, I burned the newly-scanned PDFs onto two separate DVDs and mailed one to my parents. This got me thinking. Sure, this way it’s much less likely that my important documents will be destroyed in the event of a disaster. Really, though, how safe is my data? More importantly, how safe can my data really be?

Some studies suggest that the lifespan of recordable CDs and DVDs is quite short (in the neighborhood of three to five years), which means that any “permanent” storage on DVDs would need to be re-burned on about that interval to remain readable. Now, given that most of what I burned on those DVDs doesn’t need to be kept for more than five years, that shouldn’t really be a problem. Also, the discs are kept in relatively controlled conditions: no direct sunlight, and the ambient temperature where they are doesn’t get much colder than about 55°F or much hotter than about 100°F. But what about things like photos and home videos? Stuff that is irreplaceable and would need to be stored over a period of, say, decades?

Every data retention scheme you can name has its breaking point, some condition that could irretrievably kill your data. In my opinion, the important factors are the likelihood and inevitability of the failure and whether the failure can be proactively avoided. Let’s take a look at a few ways to store data, sorted from least paranoid to “tin-foil hat” paranoid.

Single disk:

  • Data is lost when: the disk fails.
  • How likely/inevitable is this?: All disks will fail; it’s just a matter of when.
  • Proactive avoidance?: Unless you go to some outside storage source, none.

Scarily, this is what most people are relying on. Professional drive recovery companies might be able to get your data off the disk, but it will cost a pretty penny.

RAID array (multiple, redundant disks):

  • Data is lost when:
    • All redundant disks fail before the array can be rebuilt. Especially heinous if your disks all croak of the same hardware problem, or your power supply kills them by catching on fire.
    • (Hardware RAID only) The RAID controller fails and no replacement or equivalent controller can be found.
  • How likely/inevitable is this?: With enough diligence (and enough backup disks), this shouldn’t be too likely.
  • Proactive avoidance?: More redundant disks, software RAID, quick monitoring and early detection of drive errors, always having a spare drive or two handy.
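
To get a feel for why “shouldn’t be too likely” is a fair bet, here is a minimal sketch in Python of the rebuild-window risk for a simple mirror. It assumes things the list above does not promise: disks fail independently and at a constant rate, and the 5% annual failure rate and 24-hour rebuild window are made-up illustration values, not measurements. Disks that all croak of the same hardware problem break the independence assumption, so treat the result as a best case.

```python
import math

HOURS_PER_YEAR = 8760.0


def p_fail_within(hours, annual_failure_rate):
    """Probability that one disk fails within `hours`.

    Assumes a constant failure rate (exponential lifetime), which is a
    simplification; real disks do not fail at a constant rate.
    """
    # Convert the annual failure probability into a per-hour hazard rate.
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * hours)


def p_mirror_lost_during_rebuild(n_disks, rebuild_hours, annual_failure_rate):
    """Probability that an n-way mirror loses data after its first disk dies.

    Data is lost only if every surviving disk also fails before the rebuild
    finishes. Assumes the survivors fail independently of each other.
    """
    survivors = n_disks - 1
    return p_fail_within(rebuild_hours, annual_failure_rate) ** survivors


if __name__ == "__main__":
    # Hypothetical inputs: 5% annual failure rate, 24-hour rebuild window.
    for n in (2, 3):
        risk = p_mirror_lost_during_rebuild(n, 24, 0.05)
        print(f"{n}-disk mirror: {risk:.2e}")
```

Under those assumptions a two-disk mirror comes out at roughly 0.014% per rebuild, and every extra redundant disk multiplies the risk by that same small factor again. Stretching `rebuild_hours` (say, because no spare drive was handy) pushes the number up, which is the argument for quick detection and keeping a spare around.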

Burning to CD/DVD:

  • Data is lost when: the disc becomes unreadable (crushed, melted, thoroughly scratched, hacked into un-spinnable chunks, or some other loss of structural integrity)
  • How likely/inevitable is this?: Some might argue that it will happen eventually, but it’s a lot less likely than your hard disk failing, especially if you don’t expect to touch the data itself frequently.
  • Proactive avoidance?: More copies, stored in more places. Re-burning periodically.

Backup to off-site computer/external disk that’s kept somewhere else:

  • Data is lost when: both the system holding the original and the system holding the backup fail before an additional backup can be brought online
  • How likely/inevitable is this?: Unless you’re not paying attention to your backup for long periods of time or there’s a nuclear war, you’re probably OK. The amount of time you have to react depends, of course, on the storage setup of the two machines.
  • Proactive avoidance?: More than one backup, as widely distributed geographically as you can afford.

Cloud Storage (Amazon’s S3, others):

  • Data is lost when: hopefully never. Companies providing cloud storage services spend a lot of money making sure that this doesn’t ever happen. Amazon’s SLA for S3 doesn’t even mention data loss, just the minimum amount of time they guarantee that the service will be available per year.
  • How likely/inevitable is this?: Much like the nuclear war scenario, if S3 loses data it will probably be on the news, at least in Silicon Valley.
  • Proactive avoidance?: Multiple storage clouds, owned by multiple vendors. This is truly the peak of tinfoil-hat paranoia.

In my opinion, there are only two real downsides to the cloud storage approach. The first is that upload bandwidth, for most residential Internet customers in the United States, just plain sucks, and uploading a non-trivial amount of data is painfully slow. Fiber to the home (and a technology-friendly executive branch) might change this, but it won’t happen in the immediate future. Second, S3 costs money, both to upload data and to store it there. If you’re not willing (or able) to pay the bill, this isn’t for you.
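
To put a rough number on “painfully slow”, here is a back-of-the-envelope sketch in Python. The 100 GB library size and the 1 Mbps upload speed are made-up illustration values, not measurements of any real connection, and the function name is hypothetical. It also assumes the link is saturated the whole time with no protocol overhead, so real uploads would take longer.

```python
def upload_hours(size_gb, upload_mbps):
    """Hours needed to upload `size_gb` gigabytes at `upload_mbps` megabits/s.

    Assumes 1 GB = 10**9 bytes and a link that stays fully saturated with
    no protocol overhead, so this is an optimistic lower bound.
    """
    bits = size_gb * 8e9                   # bytes -> bits
    seconds = bits / (upload_mbps * 1e6)   # megabits/s -> bits/s
    return seconds / 3600.0


if __name__ == "__main__":
    hours = upload_hours(100, 1.0)  # hypothetical: 100 GB over 1 Mbps
    print(f"{hours:.0f} hours, or about {hours / 24:.1f} days")
```

With those example numbers the upload takes about 222 hours, a bit over nine days of continuous uploading. Doubling the upload speed halves the time, which is why faster residential uplinks would change the picture so much.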

This is by no means an exhaustive list, and there are a lot of hybrid approaches. I back up my computer’s hard drives to external drives periodically, and those drives spend most of their time in my desk drawer. My various media (photos, music, digital copies of DVDs) are stored on a software RAID-1 in the media center PC. As an additional layer of protection, all my photos are on Flickr, all my music is synchronized with my iPod fairly regularly and my DVDs are, well, DVDs.

This setup, while pretty decent, is by no means foolproof. One major concern that I haven’t talked about here is data corruption, and I’ve got pretty much zero defense against that. Once Apple releases Snow Leopard, hopefully I’ll be able to transfer all my data over to ZFS volumes. I could nerd out over ZFS for a whole other post, but the short of it is that ZFS makes far fewer assumptions about data’s integrity and comes pretty close to eliminating the data corruption problem.
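
In the meantime, a poor man’s defense against silent corruption is to record a checksum for every file and re-check it later. The sketch below is not ZFS and is no substitute for it: it only detects that a file’s bytes changed, it cannot repair anything, and it cannot tell corruption apart from a deliberate edit. The function names and the idea of keeping a manifest are assumptions made for illustration; only Python’s standard `hashlib` and `os` modules are used.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def compare_manifests(old, new):
    """Return (changed, missing) relative to the older manifest.

    changed: files present in both whose digests differ.
    missing: files in the old manifest that are no longer present.
    """
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    missing = sorted(p for p in old if p not in new)
    return changed, missing
```

The idea would be to build a manifest right after scanning or importing files, store it alongside the backups, and later rebuild it and pass both to `compare_manifests`. Anything that shows up as changed or missing without having been touched on purpose is a candidate for restoring from another copy.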


4 Responses to “Preventing data loss”

  1. Allen on 21 Jan 2009 at 9:40 pm

    Data loss is one of the most disastrous issues; it’s only when we lose data that we realize the importance of backups. I lost my data while loading a system file, when an error abruptly occurred: “STOP: C0000221 unknown hard error \”. I found it very difficult to resolve this error. I researched a lot to find its root cause but could not succeed. My system was corrupted and I could not access any of my useful data. Then a friend suggested a tool called Stellar Phoenix File Recovery Software, which offers a comprehensive solution. The software recovered all of my data and files, and it can also recover data lost in cases like formatting the hard drive. You may want to try it; maybe it will help you recover your data as it helped me.

  2. Streeter on 22 Jan 2009 at 12:34 am

    One might say that I’m really paranoid. I take full advantage of Time Machine on Leopard. I’ve got an external USB disk that Time Machine syncs to. Then I’ve got my Linux box with a software RAID-5 array, exposed as an AFP mount that Time Machine syncs to while I’m on my home network. And then, for the offsite storage, I’ve got an online backup service that works in the background. It took a long time to get all the files up to the online storage, but it makes me feel a lot better. I’m not losing any data anytime soon.

  3. Allison on 06 Feb 2009 at 8:04 am

    Recordable CDs and DVDs only have a 3-5 year lifespan?? That doesn’t seem very long. And I’ve had burned CDs that I made easily 7 years ago, and they work fine. Perhaps I misunderstood you?

    And if that IS the case, how does that technology differ from CDs and DVDs that can’t be recorded onto (e.g. music and movies)?

  4. Alex on 06 Feb 2009 at 10:27 am

    You’re right, that really doesn’t seem like that long at all, does it? This guy at IBM seems to think differently though:

    http://www.networkworld.com/news/2006/011006-ibm-storage.html

    Now, it’s just one guy’s opinion, but it’s still a pretty interesting thing to think about.