Posted: October 16th, 2011 | Author: Alex | Filed under: Advice (Unsolicited) | No Comments »
Until about a week ago, I used password-less SSH key pairs. I would keep a private key on a machine and stick its corresponding public key in authorized_keys files for my accounts everywhere else so that I would be able to log into any machine from any other without using a password. I figured the only way something really bad could ever happen is if someone were to get ahold of one of my private keys – and what are the odds of that happening, right?
Turns out Murphy’s Law applies here too. Somehow, someone got ahold of one of my private keys last week. I won’t go into the gory details of how, mostly because I don’t know exactly how they did it. I spent a couple days last week frantically changing every password I have and regenerating all of my key pairs. I then nuked the offending machine from orbit.
To say the whole event was unsettling is the mother of all understatements. The fact of the matter is that I have no idea what they did, what they took or didn’t take, what files they accessed … it’s that lack of information that’s the most terrifying. I’m not a computer security expert by any means, but I’m not a complete layman; I took a lot of precautions on the server in question to make sure this wouldn’t happen, and it did anyway.
This whole event triggered a lot of research and soul-searching. Here are some thoughts/recommendations that have come out of that process.
One of the blogs I read on the subject said that password-less SSH keys are “like credit cards without PIN numbers”. The analogy is pretty appropriate, I think. Let my mistake serve as an example – just don’t use them. They just aren’t safe.
You should generate separate, strong key pairs for each machine you use. I used 1024-bit DSA keys before, but I’m using 4096-bit RSA key pairs everywhere now. 4096-bit keys have modest performance overheads relative to 2048-bit RSA keys (ssh-keygen’s default), and should still be really hard to compromise by brute force for the next couple decades barring some major theoretical advance or access to an enormous botnet. Nothing’s foolproof, of course, but it’s a start.
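As a rough sketch of the per-machine key advice, something like the following would generate a 4096-bit RSA key pair named after the machine it lives on. The function name make_host_key and the id_rsa_<host> naming scheme are assumptions made up for illustration; -t, -b, -C and -f are standard ssh-keygen options.

```shell
# Sketch: one 4096-bit RSA key pair per machine.
# make_host_key and the id_rsa_<host> file naming are illustrative assumptions.
make_host_key() {
    key_dir=$1      # directory to write the key pair into
    host=$2         # label for the machine this key belongs to
    shift 2         # any remaining arguments are passed straight to ssh-keygen
    mkdir -p "$key_dir" && chmod 700 "$key_dir"
    # With no extra arguments, ssh-keygen prompts for a passphrase.
    ssh-keygen -t rsa -b 4096 -C "$(id -un)@$host" -f "$key_dir/id_rsa_$host" "$@"
}

# Example (interactive; prompts for a passphrase):
#   make_host_key "$HOME/.ssh" "$(hostname)"
```

Letting ssh-keygen prompt for the passphrase keeps it out of your shell history, which is why the example passes no extra arguments.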
The keys you generate should have distinct passwords. That is, keys’ passwords should be both distinct from your normal password for that machine and distinct from one another.
Use ssh-agent to save yourself the headache of providing a password every time. The web is littered with tutorials on how to use ssh-agent, so I won’t talk about that here. These days, I start an ssh-agent process when I log in, and make sure it gets terminated when I log out. See this GitHub gist for the appropriate incantations to make this happen. On OS X, it’s even easier; ssh-agent has been nicely integrated into Keychain and launchd since Leopard. Really, there’s no excuse not to use ssh-agent.
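The gist itself isn’t reproduced here, so the following is only a minimal sketch of the general idea (start an agent at login, kill it at logout), not the gist’s contents. It assumes a Bourne-style login shell; the function names and the choice of startup file are assumptions.

```shell
# Sketch for a login shell's startup file (e.g. ~/.profile; the file is an assumption).

# Start an agent only if this session doesn't already have one.
start_agent() {
    if [ -z "$SSH_AUTH_SOCK" ]; then
        # ssh-agent -s prints Bourne-shell commands that export
        # SSH_AUTH_SOCK and SSH_AGENT_PID; eval applies them here.
        eval "$(ssh-agent -s)" >/dev/null
    fi
}

# Kill the agent recorded in SSH_AGENT_PID, if any, and unset its variables.
stop_agent() {
    if [ -n "$SSH_AGENT_PID" ]; then
        eval "$(ssh-agent -k)" >/dev/null
    fi
}

# In the startup file:
#   start_agent
#   trap stop_agent EXIT    # terminate the agent when the login shell exits
#   ssh-add                 # asks once for each key's passphrase
```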
Finally, minimize your attack surface. If a service can run in a separate user account with no privileges, it should. If a service must run as root, run it in a sandbox. If a port doesn’t absolutely have to be open, it shouldn’t be open. If a machine can still do its job while not being world-routable, make it non-world-routable.
Personally, I’m done running my own publicly-visible servers. Unless you’re in the business of running and securing servers (and have the know-how to keep them secure), I just don’t see how it’s worth the trouble.
Posted: September 24th, 2011 | Author: Alex | Filed under: Advice (Unsolicited), School | 1 Comment »
It looks like it’s about time for school to start again. Inspired by Justine’s post, I’ve decided that it’s time for yet another set of unsolicited advice for new students. This is the start of my fifth (oh jeez) year as a graduate student, so I feel that I can share some things that I didn’t quite grok in the early part of my graduate career. I apologize if any of these pieces of advice are cliched or obvious. This is particularly geared toward students in the systems and networking sub-disciplines of computer science, since that’s what I know. YMMV.
I’ll start with the one that all first-year graduate students hear and most completely fail to act on: grades don’t matter as long as they’re good enough. By this I mean that, as long as you pass, your grade in a graduate-level course does not matter at all. Nobody will ever look at your grades in graduate coursework, for internships, jobs or otherwise.
This will be really hard for you to accept, because you have been in the business of performing well in classes your entire life. Resist the temptation to spend more time than absolutely necessary on coursework. Make every effort to make every course project you do relevant to your research or publishable in some way. Time spent on your research is time spent productively. Time spent on anything else is time you should be spending on research (or, heaven forbid, actually enjoying yourself outside of work).
Graduate school is an emotional rollercoaster. You will have really good weeks. Who’s-the-man, major-results-every-day, high-fives-all-around weeks. If you’re anything like me, you’ll also have weeks when you feel like you haven’t gotten anything done. This is completely normal. If it happens more than once or twice in a row, take some time to step back and reconsider what you’re doing or how you’re doing it.
Some of your papers will be rejected. Some of them will be rejected several times in a row. Some might never even see the light of day. This does not mean that you’re a failure as a graduate student or that your research is garbage. You probably aren’t and it’s probably not.
The thing that is hard to come to grips with coming out of college is that papers aren’t accepted or rejected based on some objective rubric. A great deal of the selection process is very unscientific. Program committees are made up of people, and everyone has their own opinions and biases. You might just have caught a reviewer on a bad day.
Treat every failed submission as a learning experience. Act on the legitimate complaints, ignore the inscrutable, bizarre and mean-spirited ones, and move on. Most importantly, don’t let it reflect on your opinions of yourself or your work. It doesn’t do you or anybody else any good. The only thing you can do is consider any constructive criticism and produce the highest quality work you are capable of producing. As long as you keep doing that, you’ll do fine.
Don’t be afraid to discard an idea you’ve been working on for a while or a piece of code that took you a long time to write if it’s clear that you’re going in the wrong direction. At the same time, don’t be too quick to abandon an idea if it doesn’t work out immediately.
Write down everything you try. If you run an experiment for a paper, write down how you ran it, when you ran it, and what the results were. In general, take good notes. They will save you a ton of time down the road.
There will be times during your career as a graduate student when you’ll ask yourself, “Why, oh why didn’t I just take that job at Large Software Company X out of college, with its hefty salary and reasonable hours?” The answer, hopefully, is that you wanted to gain a depth of understanding in a portion of your field and advance the state of the art. Eventually, probably when you start to see a tangible endpoint, you’ll feel like you’ve done that. Hang in there.
Posted: September 3rd, 2011 | Author: Alex | Filed under: Advice (Unsolicited), Computers | Comments Off
Last week, I talked about the bathtub curve and what it can tell you about bad hard drive reviews. I’m going to expand on that a little this week and talk about how replacing your drive doesn’t necessarily mean you’re solving the problem. Then we’ll briefly touch on another common source of consumer angst, hard drive sizes.
Correlated Failures
A common pattern in one-star hard drive reviews is the following:
First drive failed, sent it back. Replacement failed two weeks later. You computer people are all monsters. I’m going back to using a typewriter.
If you buy a drive from a company and it hits the wrong end of the bathtub curve, they will usually replace it. This is basically what hard drive warranties are for: they prevent customers on the wrong side of the bathtub from getting screwed over. Unfortunately, they will probably just pull the next hard drive box off the wall and send you that one. Those two drives probably arrived at their warehouse on the same shipping pallet, which probably means that they were manufactured and left the factory at approximately the same time. If there was an unnoticed defect in that particular production batch, you’re much more likely to see the same problem on the replacement that you had with the original.
Incidentally, this is why you should never buy multiple instances of the same drive at the same time if you’re planning on building a RAID array with them; correlated failures might come back and bite you in a big way.
Drive Sizes Lie to You
Stop me if you’ve heard this complaint before:
I bought a 500GB hard drive, but it’s only got 465.7GB of space! I want my 34.3GB back!
I talked about this last year in the context of Wolfram Alpha. The short answer is that drive manufacturers advertise capacities in powers of ten (1GB = 10^9 bytes), while operating systems have traditionally reported them in powers of two (1GiB = 2^30 bytes) and still labeled the result “GB”.
Operating system vendors seem to be converging on lying to their customers rather than confusing them; Apple’s Disk Utility, for example, gives capacities in powers of two and units in powers of ten (500GB when it’s really 500 GiB). In my opinion, this is like setting the value of pi to 3.2; not only does it mask the problem, it hides some of the fundamental truths underlying it.
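The arithmetic behind the “missing” space in the complaint can be checked with a one-liner. This is just a sketch; the function name gb_to_gib is made up for illustration.

```shell
# Convert an advertised capacity in decimal gigabytes (10^9 bytes)
# into binary gibibytes (2^30 bytes), rounded to one decimal place.
gb_to_gib() {
    awk -v gb="$1" 'BEGIN { printf "%.1f\n", gb * 1000000000 / (1024 * 1024 * 1024) }'
}

gb_to_gib 500    # prints 465.7
```

The drive really does hold 500 billion bytes; the 34.3 “lost” gigabytes are purely a difference in units.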
Posted: August 27th, 2011 | Author: Alex | Filed under: Advice (Unsolicited), Computers | Comments Off
Last week, one of my external drives failed, and another indicated that it’s about to die by failing a read and causing my RAID volume to degrade. Neither of these failures was surprising; both drives were well outside of their warranty periods. The way these drives failed and the (sadly ongoing) quest to replace them have brought up a couple of things that I’ll talk about here.
Failed drives mean shopping for replacements. When it comes to external hard drives, we seem to be presented with a multitude of choices, none of which are good. Judging by reviews on NewEgg, external consumer-grade hard drives are some combination of:
- Unreliable
- Slow
- Feature-poor
- Plagued with awful customer support
I was surprised at how many of the one- and two-star reviews for hard drives on NewEgg (and virtually everywhere else that sells drives) display some of the same common misconceptions. It’s a sad indicator that as an industry, we still haven’t figured out how to make computers anything less than magical and inscrutable to the average consumer. I’m going to lay out a couple of those misconceptions in the next couple of posts. They’ve doubtlessly been rehashed elsewhere, but these are things that deserve repeating.
The Bathtub Curve
If you were to plot failure rate of hard drives versus time on a graph, the graph would probably look like the blue line in the graph below (thanks, Wikipedia!):

This blue line is what’s referred to in reliability engineering as a bathtub curve, because its shape is evocative of a bathtub. In plain English, the bathtub curve basically says:
- Things that are shipped with defects usually fail early.
- Things that work as designed still eventually wear out.
- In the middle, anything can happen, but failure is less likely.
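The three points above can be turned into a toy model by adding three failure-rate terms: one that falls over time, one that stays constant, and one that rises. This is only an illustrative sketch; the function name and every coefficient are invented, and none of it is measured drive data.

```shell
# Toy bathtub curve: failure rate = early-failure term + constant term + wear-out term.
# All coefficients are invented for illustration, not measured from real drives.
failure_rate() {
    awk -v t="$1" 'BEGIN {
        early   = 0.02 / sqrt(t)       # shipped defects: high at first, falls off quickly
        random  = 0.02                 # constant background rate
        wearout = 0.0005 * t * t * t   # wear: negligible at first, grows with age
        printf "%.4f\n", early + random + wearout
    }'
}

# Print the modeled failure rate at a few ages (in years).
for years in 0.1 1 2 4 6; do
    printf '%s years: %s\n' "$years" "$(failure_rate "$years")"
done
```

With these made-up numbers the rate is highest at the very start and at the very end, and lowest in between, which is the bathtub shape.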
Many one-star NewEgg reviews I came across were some variant of:
Drive fails after X days of use. What a piece of crap. I’m never buying from this company again.
These are people who have unfortunately hit the wrong end of the bathtub curve.
Why does this happen? Well, some of it has to do with manufacturing; with something this intricate there will inevitably be defects, regardless of how much quality assurance you put into it. Some of it might have to do with what happens to the drives during shipping. Sometimes there is actually a systemic defect in a particular model or production batch that goes undetected by quality assurance; this usually results in a class action lawsuit months or years down the road.
The best bet, as I’ve stated here several times in the past, is to never assume that the drive will last another day. I was shocked at the number of times I read a review like this:
Bought this drive and it died three days later. Now 50,000 photos of my cat Muffins are gone. I hate you, Seagate, and so does Muffins.
So you bought this drive, and copied your photos to it, and then … you deleted the originals?! I’ve said it before and I’ll say it again: if there’s only one copy, it is only a matter of time before you lose that data.
Next week: why the replacement for your failed drive is more likely to fail, and why hard drive manufacturers are lying to you.
Posted: August 10th, 2011 | Author: Alex | Filed under: Esoteric Tips | Comments Off
I was looking around for software for doing screencasts so that I could make a short tutorial video for my students. I was a little bewildered by the lack of good, free screencast software for OS X; surely there must be something that can capture audio and video from the screen. Turns out there is – good ol’ trusty QuickTime Player 10.
If you select New Screen Recording from the File menu, it will present you with a window with a prominently-featured “record” button. Simply configure your audio input and hit the button and you’re recording. It’s CPU intensive and doesn’t have a very high framerate, but it certainly worked well enough for my purposes.
Posted: August 7th, 2011 | Author: Alex | Filed under: Advice (Unsolicited), School | Comments Off
Sometimes I am reminded of how long it’s been since I took my first programming course in college. I came across a post called “SICP is Under Attack” written by an undergrad from the class of 2015 named Vedant Kumar. In this post, he points out that CS61A (the introductory CS course at Berkeley) is switching from Scheme to Python, although they won’t be switching entirely away from the course textbook, Structure and Interpretation of Computer Programs (commonly abbreviated SICP), which (as Mr. Kumar points out) is an incredibly good introductory computer science textbook.
I won’t go into detail on why they’re making the switch. Brian Harvey, the prof who has taught the course for the last 25 years (although he didn’t teach it when I took it), has already written about this at length on his website. I reacted to news of 61A’s transition away from Scheme the same way that I reacted to finding out that the Shakey’s Pizza near my parents’ house where I went for team parties as a kid had been bulldozed and replaced by a Walgreens. I recognize that things can’t stay the way we remember them forever, but it’s still a little sad. The news did get me reflecting on that course and that time in my life. If this thing had visuals, this is where the flashback ripple effect would kick in.
I took CS61A in the fall of 2003. Before that point, I had very little practical experience with programming (I had futzed around with HyperCard and Object Logo, and done a little bit of PHP programming, but hadn’t programmed anything “real” at that point). I remember there being talk among my fellow incoming freshmen about an entrance exam, in which you’d have to demonstrate your knowledge of this technique called recursion that I’d never heard of before. Almost everyone I met that week didn’t seem that concerned; they’d programmed tons before, or done well on the CS AP exam, or even taken the CS AP exam. I was convinced that I would get destroyed by the entrance exam and be forced to take some sort of remedial course and then everything would be ruined forever. This sort of thing happened to me a lot during that first year; that little voice that keeps saying “These people are way smarter than you and you don’t belong here” just would not shut up.
My friend Naren told me years later that when he met me that first day in 61A, I “looked super angry”. Far from it – I was terrified. It turns out that there was no entrance exam, much to my relief. It also turns out that the course was to be taught in a language that hardly anybody had ever heard of before. Some of the savvier students asked, “Why Scheme? Why not just use Common Lisp?” The point, the profs explained, was that they weren’t teaching you a programming language. They were going to teach you how to think like a computer scientist, and that meant teaching you recursion. Teaching you recursion is easier, they said, if one of the only constructs the language supports is recursion.
61A was somewhat of a trial by fire. It was a lot of work and it was not by any means easy. At the time, I bitched and moaned about it a great deal. I missed our triple-overtime win against USC studying for a 61A midterm (current undergrads, learn from my mistakes!). However, it introduced me to a wide swath of the fundamentals of computer programming and computer science – recursion, complexity analysis, debugging, logic languages, dynamic programming, the list goes on. It also helped me figure out that I wasn’t that much better or worse at this whole programming thing than anyone else there, and that there hadn’t been some horrible mistake in the admissions office.
I understand the desire to “modernize” the curriculum, and I am an enormous fan of Python, but for a course like 61A the choice of language doesn’t really matter all that much. The purpose of the course is to teach the fundamentals, and to give you an introduction to what computer science is all about so you can figure out if it’s the right major for you. As long as the course accomplishes those goals, nothing major will have changed. Regardless of what textbook they’re using by default, I think that SICP will remain an important volume for that course in the same way that The Art of Computer Programming is important, not for the language that the examples are written in but for its exceptional quality as a teaching, learning, and reference tool.
Posted: July 16th, 2011 | Author: Alex | Filed under: Esoteric Tips | Comments Off
In an effort to practice what I preach, I’ve been spending 15 minutes a day documenting code. I’ve been using Doxygen for all my C++ documentation. One thing that I wanted to do early on was easily identify all the files where I was missing documentation and, if possible, the particular places where documentation was missing. I think I’ve managed to do it (or at least come close).
In my Doxyfile, I’ve got the following lines:
WARNINGS = YES
WARN_IF_UNDOCUMENTED = YES
WARN_IF_DOC_ERROR = YES
WARN_LOGFILE = doxygen_warn.log
This means that Doxygen will generate warnings for undocumented members as well as if the documentation is incomplete or isn’t syntactically correct, and that it will log all warnings to the file doxygen_warn.log.
In order to get a list of undocumented files, I look for warnings about missing documentation, extract the filenames for each warning, and make sure that the list of files doesn’t contain any duplicates:
grep "is not documented" doxygen_warn.log | awk '{print $1}' | sed -E 's/:[0-9]+://g' |
sort | uniq
In order to further expand that list to include the undocumented members in each file, I do something similar, except I extract the filename and information about the undocumented member instead of just the filename:
grep "is not documented" doxygen_warn.log | sed -E 's/is not documented\.//g' |
cut -d' ' -f 1,3- | sed -E 's/:[0-9]+:/ : /g' | sort
It’s kind of quick and dirty, but it seems to do the job. These commands run automatically whenever I generate documentation. The goal (unreasonable though it may be) is to bring the list of undocumented files down to zero.
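To rerun both reports easily, the two pipelines could be wrapped in shell functions along these lines. This is a sketch: the function names are made up, and it assumes the warning format the pipelines above imply, i.e. each line looks like "file:line: warning: ... is not documented."

```shell
# List each file that has at least one undocumented member (no duplicates).
undocumented_files() {
    grep "is not documented" "$1" | awk '{print $1}' | sed -E 's/:[0-9]+://g' | sort -u
}

# List "file : member description" for every undocumented member.
undocumented_members() {
    grep "is not documented" "$1" | sed -E 's/ is not documented\.//g' |
        cut -d' ' -f 1,3- |              # drop field 2, the word "warning:"
        sed -E 's/:[0-9]+: / : /g' |     # replace ":<line>: " with " : "
        sort
}

# Example usage after running doxygen:
#   undocumented_files doxygen_warn.log
#   undocumented_members doxygen_warn.log
```

Piping undocumented_files through wc -l gives a single number to track on the way down to zero.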
Posted: April 24th, 2011 | Author: Alex | Filed under: Advice (Unsolicited) | Comments Off
When I was in college, I wrote notes in a collection of spiral notebooks. I also had a three-ring binder with a divider tab per class where I put things like graded midterms and assignments. After classes were over, I’d donate the midterms to the HKN exam archive and recycle everything else. As my classes shifted more toward reading a lot of papers, the papers would go in their own separate three-ring binder, with a divider tab per class.
What a waste.
In retrospect, I see a lot of this exercise as more psychological than actually helpful. I spent a lot of time in college feeling overwhelmed (whether or not I was actually overwhelmed is an open question), and the act of collecting and organizing this mountain of paper into something resembling order was a way to feel more in control of the situation. Also, taking notes certainly helped me stay focused in class. But the fact is, much of that mountain of paper was pretty much useless.
First, it was really hard to actually find anything. Trying to find all the information I had collected on a topic, or the definition for a particular term, usually involved doing a scan through a few weeks of notes or a glossary lookup in the course’s textbook, despite the fact that I had things fairly well organized.
Second, I didn’t actually need most of that data. My move from taking a notebook full of notes per class to a few dozen pages spoke more to my increased ability to separate what was important to capture from lecture from information that I could grab from lecture slides or the textbook. Keeping exams was helpful, but only because it told me what I didn’t understand well enough; the actual content of my answers wasn’t really all that helpful.
Third, my handwriting is terrible. Especially when I’m writing fast, which is what I’m doing when I’m writing notes or writing down answers for an exam. Many was the time where I looked at my notes and said, “Well, here’s where the stuff I’m looking for is, but I can’t make heads or tails of what I was writing here.”
I realized too late (after I had mostly stopped taking classes and had shifted into doing research full-time) that the only result of capturing all of this information by sticking it in binders and putting them on bookshelves was more full bookshelves.
It was then that I decided to adopt the following guiding principle for every piece of information I will ever need to look at more than once:
If I can’t search it, I will never find it again.
This immediately applies to things like notes. I found that it’s easier to use paper notes than to lug a full-sized laptop around, but with the emergence of tablets that’s becoming a decreasingly relevant problem. These days, I hardly write anything on paper anymore; I type significantly faster than I write, I can actually read what I wrote afterwards, and most importantly everything I type is instantly searchable.
Any piece of paper I need to keep (bills, receipts, etc.) gets scanned. I’m not as good at passing scanned papers through OCR as I should be (I’m still trying to figure out an effective OCR solution), but I make sure to name these scanned documents descriptively so that I can search by description later. Another huge benefit of this approach is that everything is backed up. I’ve lost notebooks for classes in the middle of the quarter before. It really sucks.
In the last post, I went over how I handle organizing and taking notes on research papers; a big component of making that process work effectively is the ability to do full-text search on papers. Related work searches get so much easier when you can just straight-up search through reams of papers at once. One of my favorite Google Talk features is the ability to search past instant message conversations; it’s the one value-add feature that keeps me coming back to GTalk. GMail makes e-mail searchable (then again, so does basically everything else these days), which is always really helpful.
This “make everything searchable” principle has really helped me keep on top of the mountain of data that I receive and/or generate every day.
Posted: February 21st, 2011 | Author: Alex | Filed under: Advice (Unsolicited), Computers | Comments Off
In this post, I’ll focus on the practical side of backups.
Last time, I asserted that in order for a backup to really be a backup, your data has to be automatically replicated on two different drives using two separate filesystems on two different computers that are geographically separated, and one of those backups needs to be able to go back in time by at least 24 hours.
Satisfying all of these criteria at once usually isn’t free, but it doesn’t have to be hard, and you’re probably closer to a workable solution than you think.
In this post, I’ll examine a few possible solutions and point out some non-obvious ones. This isn’t meant to be comprehensive, but rather serves to give a general flavor of the state of backup solutions.
Built-In Solutions
OS X’s Time Machine fits some of our criteria for backups, but falls short in others. You can back up to two different drives, the filesystems are distinct, and you’re able to move the Time Machine drive back in time if needed. Backing up to other computers with Time Machine is possible, but it’s unsupported and not very reliable (at least in my experience).
Using a network-enabled USB drive or a Time Capsule is essentially equivalent to backing up to another computer (they’re practically little computers themselves), but that costs a good deal of additional money. Unless you really know what you’re doing and are willing to take the time to make it work (and keep it working), making remote backups work with the Time Capsule is not really feasible.
Although I’ve never used it personally, Windows 7’s Backup and Restore feature appears to be feature-for-feature equivalent to Time Machine, but without Apple’s high-gloss glittery front-end. If you have a Professional or above license, it can back up to network shares, which is an improvement over Time Machine but requires you to pay more for the OS itself, which is kind of a drag.
You can use rsync by itself on pretty much any platform or with any one of a plethora of (usually OS-specific) front-ends. rsync can push files to pretty much anywhere and it supports incremental backups, so you could definitely satisfy all your backup demands with rsync, although it would take a little bit of work to get everything set up.
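As a minimal sketch of what “a little bit of work” might look like, the function below keeps dated snapshots using rsync’s --link-dest option: files that haven’t changed are hard-linked to the previous snapshot, so every snapshot looks complete but only changes take up new space. The function name snapshot_backup and the directory layout (a "latest" symlink next to the snapshots) are assumptions, and the destination is assumed to be a local path such as a mounted external drive.

```shell
# Sketch: dated, hard-linked snapshots with rsync.
# snapshot_backup and the "latest" symlink layout are illustrative assumptions.
snapshot_backup() {
    src=$1                                  # directory to back up
    dest=$2                                 # backup root (local path)
    stamp=${3:-$(date +%Y-%m-%d_%H%M%S)}    # snapshot name; defaults to a timestamp
    mkdir -p "$dest" || return 1
    dest=$(cd "$dest" && pwd)               # make absolute so --link-dest resolves reliably
    if [ -d "$dest/latest" ]; then
        # Unchanged files become hard links into the previous snapshot.
        rsync -a --link-dest="$dest/latest" "$src/" "$dest/$stamp/"
    else
        # First run: plain full copy.
        rsync -a "$src/" "$dest/$stamp/"
    fi && ln -sfn "$stamp" "$dest/latest"   # point "latest" at the new snapshot
}

# Example: snapshot_backup "$HOME/Documents" /Volumes/Backup/documents
```

Because older snapshots stay around until you delete them, this layout also lets you go back in time. Pushing to another machine would use rsync’s user@host:path syntax, but the mkdir and ln steps would then need to run over ssh, which this sketch doesn’t attempt.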
Sneakernet
If you’ve got a USB drive and are willing to lug it back and forth, there’s a relatively inexpensive way to come pretty close to an optimal backup solution. If you leave your USB drive at work, take it home and do backups every Monday night and bring the drive back to work on Tuesday, you’ve got your bases mostly covered. The problem here, of course, is that you have to remember to take the drive home with you, your backup granularity is kind of coarse (if your drive dies, you lose at most a week’s worth of stuff), and there’s a small window of vulnerability when your USB drive is at home. You get geographic distance for free, though.
Enter the Cloud
There are several companies that have recently started to offer so-called “cloud backup” services that provide you with some amount of storage space to which you can back up. Notable companies in this space include Mozy, Backblaze and CrashPlan. With cloud backup services, you easily satisfy all of our desirable backup properties simultaneously (unless you happen to live next to one of their data centers, of course), but it will usually cost you and doing the initial backup over the wide-area Internet may take weeks or months. Most services will ship you an external hard drive to which you can do your initial backup, but you have to eat the cost of a hard drive (~$100-150) for the privilege of writing to the drive and mailing it right back.
In my opinion, the standout favorite contender in this space is CrashPlan for one simple reason – they allow two computers running CrashPlan to back up an unlimited amount of data to each other for free. So if you and your friend both want to run backups, you can back up to each other.
Unexpected Surprises
If you care about your photos, you’ll want them backed up. If you share your photos on a site like Facebook or Flickr, you’re most of the way to an ideal backup of those photos. The only major drawbacks here are that restoring your photos isn’t trivial (you have to re-download them, although there are applications that will help automate that process) and you might incur a loss in quality when the site scales your image down. If you don’t mind those things though, these are great inexpensive ways to back up.
“But what about our time travel requirement?” you might ask. If you’re editing photos, you might care about reverting to a previous edit. Most of the time though, you take pictures, upload them and never modify them again. Static data like pictures or music, where individual items never change but the set of items is expected to grow larger, is easier to back up because as long as you never delete anything the time travel requirement isn’t necessary.
My Setup
I have three computers that I care about – my desktop, my laptop and my home theatre PC. My desktop has an OS X partition and a Windows 7 partition and my laptop runs Debian in a VM, so I need to back up five filesystems in total. The HTPC has external storage drives that hold movies and music.
I admit that I break my own rules a bit – the external media drive on the HTPC is a RAID 1 with no other backups. I know, scary right?
Every system runs CrashPlan, even the Linux VM on my laptop. All systems back up to two places. The first is an old external drive attached to the HTPC. The second is a workstation under my desk at UCSD. Since I had an extra drive lying around, my desktop’s OS X partition also runs a Time Machine backup on a second internal drive.
That about covers it. Next week, something not related to backups!
Posted: February 12th, 2011 | Author: Alex | Filed under: Advice (Unsolicited) | Comments Off
I can’t overstate the importance of doing backups. Your data is important, and it should be protected. For the next couple of posts, I’m going to do a deep dive into backup strategy and how the changing face of computing is changing the way we have to think – or in some cases not think – about backups.
Why bother with backups?
Simple: someday, your hard drive will die. Backups are not like insurance; what you are preparing for is not a hypothetical situation. This is where I start to sound like a jabbering paranoid, but trust me, it really sucks to lose data.
I wrote a post a couple of years ago that examined a couple of different ways to do backups. I’ve changed my opinions somewhat since then, so hopefully this adds something to the points raised in that post.
What is a backup, really?
It’s easier to define what a backup is by defining what a backup isn’t.
I should preface this by saying that anything is better than nothing. You may not be able to feasibly satisfy a backup by my definition for all of your data, but the more of these points you’re able to hit, the better off you’ll be. With every layer of protection you put in place, the probability of losing data goes down. It never quite goes to zero, but it gets pretty close.
Your data is not truly backed up if:
- Your data isn’t stored on two different drives.
- Your data isn’t stored on two different file systems.
- Your data isn’t stored on two different computers.
- Your data isn’t stored in two geographically distant locations.
- Your data isn’t duplicated automatically.
- You can’t make at least one of the copies go back in time.
Two different drives is a no-brainer: if a hard drive fails, the data on that drive is probably gone forever. Putting your data on two different drives allows your data to survive the failure of one of them.
I say “probably gone forever” because I’m purposefully ignoring filesystem-level recovery systems like Norton Ghost and drive-level recovery services like Disk Doctors. I’m doing this because they cost lots of money and time and are not necessarily guaranteed to work.
Storing your data on two different file systems protects against stupid mistakes and computer compromise. If someone logs into your computer and deletes your data or a software bug hoses your file system, even if that data was stored on a RAID 1 array, it’s gone. It’s better to treat a RAID 1 array as a really robust single drive rather than as a complete backup solution by itself.
Storing your data on two different computers is an extension of the same protective strategy. If someone logs in to one computer and deletes data on all the drives on that computer, that data is gone unless it was stored somewhere else. Similarly (and somewhat more commonly) if your power supply goes kaboom and melts your computer and all the copies of that data are inside that one computer, that data is gone.
If you only go this far, you’re pretty well-protected against the loss of a single drive or a single machine. The next layer of protection, backing up in multiple places that are separated by distance, protects you against catastrophe, i.e. your house is destroyed or you get robbed.
If you’ve gotten this far and are diligent, the only things that are likely to kill your data are a major natural disaster or lots of different things failing at once.
The next point, automatic backup, is really a question of convenience. Personally, I don’t want to spend all day babysitting different copies of my data. You should find a backup solution that works automatically and reliably in the background and, more importantly, tells you quickly when it’s not working. The whole situation falls apart if you think you’re doing backups but you’re not.
The last point, time travel, is something that I think a lot of people overlook. If you do something stupid like accidentally delete data that you wanted to keep, it’s possible that the automatic backup system you’ve got in place will back up your mistake and delete the backed-up copies of the data as well. Going back to an arbitrary past copy of your data isn’t really all that necessary in my opinion, but you should at least be able to go back in time 24 hours. Most backup systems allow you to do this, at least to some degree.
Paranoia In Action
Next time, I’ll give some practical examples, differentiate static data from dynamic data, and show how your data is probably easier to back up than you think.