Dealing with WordPress Database Inconsistency

May 20th, 2009

I just finished wrestling with a problem with WordPress. Editing a really old page tickled some sort of internal inconsistency in the database, which created a zombie page that could neither be edited nor deleted. I really wasn’t looking forward to crawling around in the database, so I did some googling and came up with Lester Chan’s brilliant WP-DBManager. One click of the “Repair” button allowed me to deliver the deathblow to the zombie page, saving me a boatload of time in the process. Thanks, Lester!

Preventing data loss

January 21st, 2009

OK, so it's not a forest fire, but you get the point.

I’ve always been the sort of person that saves receipts. Until very recently, I had an entire filing box full of receipts, warranty cards and software license keys. This system bothered me in much the same way that keeping paper copies of research papers did: they can’t be searched (did I file the receipt for my coffee table under C for coffee table, T for table, I for Ikea, F for furniture?), and they take up space in my apartment, where space is pretty limited. There’s the additional problem of a paper receipt’s relative fragility: there’s only one copy of it, and it can be burned, crumpled, crushed, soaked, can fade into unreadability, and so on. The solution was pretty straightforward; I scanned everything in the box, gave them descriptive file names and let Spotlight index them.  To provide some additional protection against data loss, I burned two copies of the newly-scanned PDFs onto two separate DVDs and mailed one to my parents. This got me thinking; sure, this way it’s much less likely that my important documents will be destroyed in the event of a disaster. Really though, how safe is my data? More importantly, how safe can my data really be?

Some studies suggest that the lifespan of recordable CDs and DVDs is quite short (in the neighborhood of three to five years), which means that any “permanent” storage on DVDs would need to be refreshed on about that interval to remain current. Now, given that most of what I burned on those DVDs doesn’t need to be kept for more than five years, that shouldn’t really be a problem. Also, these are kept in relatively controlled conditions – no direct sunlight, ambient temperature where they are doesn’t get much colder than about 55 or much hotter than about 100. But what about things like photos, home videos? Stuff that is irreplaceable and would need to be stored over a period of, say, decades?

Every data retention scheme you can name has its breaking point, some condition that could irretrievably kill your data. In my opinion, the important factors are the likelihood and inevitability of the failure and whether the failure can be proactively avoided. Let’s take a look at a few ways to store data sorted from least paranoid to “tin-foil hat” paranoid.

Single disk:

  • Data is lost when: the disk fails.
  • How likely/inevitable is this?: All disks will fail, it’s just a matter of when.
  • Proactive avoidance?: Unless you go to some outside storage source, none.

Scarily, this is what most people are relying on. Professional drive recovery companies might be able to get your data off the disk, but it will cost a pretty penny.

RAID array (multiple, redundant disks):

  • Data is lost when:
    • All redundant disks fail before the array can be rebuilt. Especially heinous if your disks all croak of the same hardware problem, or your power supply kills them by catching on fire.
    • (Hardware RAID only) RAID controller fails, no replacement or equivalent controller can be found
  • How likely/inevitable is this?: With enough diligence (and enough backup disks), this shouldn’t be too likely.
  • Proactive avoidance?: More redundant disks, software RAID, quick monitoring and early detection of drive errors, always having a spare drive or two handy.

Burning to CD/DVD:

  • Data is lost when: the disk becomes unreadable (crushed, melted, thoroughly scratched, hacked into un-spinnable chunks, or some other loss of structural integrity)
  • How likely/inevitable is this?: Some might argue that it will happen eventually, but it’s a lot less likely than your disk failing, especially if you don’t expect to touch the data itself frequently.
  • Proactive avoidance?: More copies, stored in more places. Re-burning periodically.

Backup to off-site computer/external disk that’s kept somewhere else:

  • Data is lost when: both the storage system storing the original and the storage system storing the backup fail before an additional backup can be brought online
  • How likely/inevitable is this?: Unless you’re not paying attention to your backup for long periods of time or there’s a nuclear war, you’re probably OK. The amount of time you have to react depends, of course, on the storage setup of the two machines.
  • Proactive avoidance?: More than one backup, as widely distributed geographically as you can afford.
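As a rough back-of-envelope illustration of why extra copies help: if each copy fails independently with probability p over some period, the chance of losing every copy at once is p raised to the number of copies. The 5% figure below is made up purely for illustration, and real failures are often correlated (same batch of disks, same building), so treat this as an optimistic sketch rather than a real estimate.

```python
def chance_all_copies_lost(per_copy_failure: float, copies: int) -> float:
    """Probability that every copy fails, assuming failures are independent."""
    if not 0.0 <= per_copy_failure <= 1.0:
        raise ValueError("per_copy_failure must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return per_copy_failure ** copies


if __name__ == "__main__":
    # Hypothetical 5% chance that any single copy dies in a given year.
    for n in (1, 2, 3):
        print(n, "copies:", chance_all_copies_lost(0.05, n))
```

The independence assumption is exactly what geographic distribution is trying to buy: copies in different places are less likely to die of the same cause.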

Cloud Storage (Amazon’s S3, others):

  • Data is lost when: hopefully never. Companies providing cloud storage services spend a lot of money making sure that this doesn’t ever happen. Amazon’s SLA for S3 doesn’t even mention data loss, just the minimum amount of time they guarantee that the service will be available per year.
  • How likely/inevitable is this?: Much like the nuclear war scenario, if S3 loses data it will probably be on the news, at least in Silicon Valley.
  • Proactive avoidance?: Multiple storage clouds, owned by multiple vendors. This is truly the peak of tinfoil-hat paranoia.

In my opinion, there are only two real downsides to the cloud storage approach. The first is that upload bandwidth, for most residential Internet customers in the United States, just plain sucks, and uploading a non-trivial amount of data is painfully slow. Fiber to the home (and a technology-friendly executive branch) might change this, but it won’t happen in the immediate future. Second, S3 costs money, both to upload data and to store it there. If you’re not willing (or able) to pay the bill, this isn’t for you.

This is by no means an exhaustive list, and there are a lot of hybrid approaches. I back up my computer’s hard drives to external drives periodically, and those drives spend most of their time in my desk drawer. My various media (photos, music, digital copies of DVDs) are stored on a software RAID-1 in the media center PC. As an additional layer of protection, all my photos are on Flickr, all my music is synchronized with my iPod fairly regularly and my DVDs are, well, DVDs.

This setup, while pretty decent, is by no means foolproof. One major concern that I haven’t talked about here is data corruption, and I’ve got pretty much zero defense against that. Once Apple releases Snow Leopard, hopefully I’ll be able to transfer all my data over to ZFS volumes. I could nerd out over ZFS for a whole other post, but the short of it is that ZFS makes far fewer assumptions about data’s integrity and comes pretty close to eliminating the data corruption problem.
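ZFS checksums data at the block level. As a much cruder stand-in for the same idea, here is a minimal sketch that records a SHA-256 digest for every file in a folder so that a later run can flag files whose contents have changed. The function names are hypothetical, and this only detects a change: it can't repair anything, and it can't tell silent corruption from a deliberate edit.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Files present in both manifests whose contents no longer match."""
    return sorted(name for name in old if name in new and old[name] != new[name])
```

Saving the manifest somewhere safe and comparing it against a fresh one every so often is enough to notice that a photo you never touched has changed underneath you.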

How to apply to graduate school without going insane

December 19th, 2008

A few friends of mine are applying to graduate school. This was without a doubt one of the most stressful and chaotic things I’ve ever done. Now that I’ve been in grad school for a while, I have a small understanding of how the application process works. My experience is limited, of course, to the programs to which I applied, so your mileage may vary. Also, I applied to ten(!) different graduate programs, so some of these solutions might not apply to you if you’re only applying to one or two.

Getting Organized

One of the most challenging parts of the grad school application process is knowing what everybody wants and keeping it all straight. Just about everybody wants a statement of purpose, but they have different guidelines as to what they want to see (at most X words vs. at most Y pages, single-spaced vs. double-spaced etc). Everyone wants recommendation letters, but they differ as to what sorts of recommendations they expect; do they accept recommendations from people you’ve worked with in an industrial setting, or do they want recommendations strictly from faculty members? The list goes on.

The very first thing I did when applying to graduate programs was to look at their programs’ websites and their application forms and try to answer a few questions:

  • How much is the application fee?
  • What do they say they want in a statement of purpose? (I wrote down exactly what they required there)
  • How many letters of recommendation, and from whom? Do they want them mailed or filed digitally? If mailed, where should I mail them?
  • Do they take GRE scores? If so, do they require a subject GRE? Which one? What are their GRE institution codes? (This last bit is important so you can tell The Testing Mafia where to send your scores)
  • How many copies of my transcript do they need? Do they want it mailed or filed digitally? If mailed, where should I mail them?
  • With which professors would I want to work? What projects have they done or are they doing that I find interesting? (If you can’t answer both of these questions, reconsider applying to this school)

Once I was done making that list (it took me an afternoon, round numbers) I made a folder on my computer called “Graduate School”. Inside that folder, I made two folders, “General Purpose” and “Schools”. The “General Purpose” folder would hold all the information that all schools seemed to want in some form or another – statement of purpose, resume, transcript, extracurricular activities list, work experience, and so on. The “Schools” folder contained a subfolder for each school, and housed not only the application materials for that school but also copies of any confirmation e-mails or webpages I received after completing the application (in case I needed to produce them later for some reason). This über-folder was backed up periodically to a server across town to protect against any major disasters.

The Statement of Purpose

This is the big one. This is the portion of the application that the people for whom you want to do research are likely to read – think of it as your “elevator pitch” for yourself.

Some people say that you should tailor your statement of purpose for each university to which you apply – since I was applying to ten different universities, this tactic didn’t seem feasible. You can say “Oh, a graduate degree from Stanford is the reason I was brought into this world; I will remake the world in Leland Stanford’s image with the help of Professor So-and-so, who I worship as a god among men!” but the people reading your letter won’t buy it. If you’re applying to a program because you’re really interested in X, and a professor in that program is a leader in the field on X, and you’ve had prior research experience in X, then mention it. Otherwise, leave it out because it does you no good.

Be honest, both with yourself and with the university to which you’re applying. I didn’t consider graduate school seriously until my junior year in college, and a lot of the places to which I was applying expected lots of prior research experience from their applicants that I honestly didn’t have. I knew they would notice it, so I came right out and said in my statement (not in so many words, you understand), “Look, I know that I haven’t published anything and that my research experience is kind of slim, but I’m really excited about this stuff and I know that I’ll be able to meet and exceed your expectations”.

Remember that this is not like college applications; your materials are not, by and large, going to be read by some faceless bunch of professional college application readers. The people reading your application are probably among the people whose classes you will take and whose research you will do. More importantly, they will be the ones who will fund you and they want to know that they’ll be getting their money’s worth.

Recommendation Letters

So many “how to get into graduate school” websites say that the key with recommendation letters is to get the ball rolling early, and I agree with them. One thing they don’t tell you often enough (and that they should tell you) is that your college’s letter service is your friend. Professors, especially at “research universities”, usually don’t like being bothered about letters of recommendation by undergraduates. It means that, in addition to writing the letter itself, they have to fill out forms and get envelopes and stamps and it takes time away from their work. Your goal is to be as unobtrusive as possible, and the campus letter service helps tremendously with that.

Letter services offer your recommenders the ability to write their letter once and send it to the letter service office along with some identifying information. The letter service then keeps these letters and (for a nominal fee) ships them off to whomever you want without any further involvement from the recommenders. Another plus of this method is that you don’t have to worry about getting all of your recommenders to send everything in on time – just fill out a form online and you’re good to go.

In short, start early, and if your university has a letter service then use it.

Keeping Track of Deadlines

You may be the most diligent person on Earth and have all your applications done well in advance of the deadline. If you’re like most of us, applying for graduate school isn’t the only thing you’re doing – you’re busy. I wrote down all the deadlines for my applications (which fell in a six-week window between the middle of December and the beginning of February) and put countdown timers on my Google homepage. Every time I opened a browser, there they were, a grid of about a dozen numbers that kept getting smaller. If that won’t keep you on track, nothing will.

I hope that my experiences with this horrible, tedious process will prove useful to someone. Godspeed, applicants.

Victory

November 5th, 2008

I really don’t know what to say. Thank you to everyone who worked so hard to make this amazing turn of events (that I never thought would actually happen) possible. I know that the words “hope” and “change” have been thrown around pretty liberally by both sides in the past few weeks, but I really hope that this is a sign that the insanity of the past few years is slowly coming to an end.

Proposition 8 and its sister propositions in several other states passed, and I think that’s unfortunate, but expected. It shows, I think, that Americans cannot cleanly separate their politics from their religion. I’m sure there will be counter-propositions and counter-counter-propositions ad infinitum until someone decides to amend the Constitution. We’ll just have to wait and see.

I congratulate President-Elect Obama (that’s still sort of surprising to be able to say, isn’t it?) on his well-earned and decisive victory. Now comes the time when you make good on the promises that got you elected. Please don’t let us down.

Update: I don’t want that “cleanly separate politics from religion” bit to be construed as a slight to religion or the religious. I totally understand how deeply religious people must have been of two minds about this issue, and what I meant by that statement was that people can’t vote on something like gay marriage without being influenced by factors like their religious convictions and that this is exactly why this issue will continue to oscillate forever until some decision is made at the national level.

Attacking religious people because you’re against Prop. 8 is just as bad, in my opinion, as attacking non-religious people because you’re for Prop. 8, and I apologize if my remarks were in any way misconstrued.

My Digital Transition

November 1st, 2008

After coming back from Microsoft with another bundle of printed research papers in hand I found that, in the course of a year, I’d amassed a stack of read (and needing-to-be-read) research papers that filled a set of binders almost a foot thick. This would be fine if I had categorized them and knew exactly what went where, but I hadn’t and I didn’t. In fact, I had no idea what was in those binders. Furthermore, on several occasions I’ve found myself saying, “Damn, I know there’s a paper about this that I’ve got in these binders …” and not finding anything after flipping through them for about 10 minutes. “Self,” I said to myself, “there’s got to be a better way!”

I identified four problems that needed to be solved:

  • A foot of paper a year, if it continued growing at that rate, would be as tall as me before I graduated. That just isn’t sustainable.
  • Even if I kept all six feet of paper, it would be impossible to find any one paper in that stack.
  • Often, I’m looking for a particular topic or a group of related papers rather than a single paper.
  • I write notes in my papers, and I’d like to retain the annotating flexibility of scribbling on a paper with a pencil when transitioning to a digital system.

At this point, I think I’ve got all but the last one figured out.

I had considered just getting a Fujitsu ScanSnap document scanner, as it translates paper directly into PDFs, but $500 for a document scanner was way too pricey. One of the things I had going for me was that all of the research papers I’d printed started out as PDFs. The easiest and cheapest solution was to find the original PDFs, index them, add any annotations I’d put on the papers beforehand, and recycle the originals. In the end I chose to not transfer the annotations; frankly, I couldn’t decipher most of them and it would have taken too much time. Armed with Google Scholar I was able to find PDFs (and bibliographic information) for my entire paper stack in about 90 minutes.

Now that all the PDFs were downloaded, I inserted them all into Referencer. Two nice things about Referencer are its ability to store bibliographic data (in the form of BibTeX citations) along with papers and its ability to associate papers with tags, descriptive words or phrases (similar to how del.icio.us does bookmarks). This really helps when searching for papers that fit a given topic and is a lot more flexible than any fixed filing system. It also allows you to give each paper a text “note”; I find that, if I take notes on a paper, this forces me to be more concise and structured in writing my thoughts about it than scribbling on a page’s margins would be.

Now that the papers were digitized and tagged, they needed to be searchable. If there’s one thing that Google and Spotlight have taught me, it’s that if it’s not searchable, I won’t find it. Beagle does an admirable job of indexing the text of all my PDFs automatically.

The last step in the process was to make sure that when(!) my hard drive died I wouldn’t lose all my paper data. This was relatively easy, since UCSD just got a shiny new NetApp file server with a boatload of redundant storage. I’ve set up a cron job to synchronize my home folder to that server every night at midnight.
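The post doesn't say what the cron job actually runs (rsync would be the usual choice), so the following is only a minimal sketch of what a nightly one-way sync could do, assuming the server is mounted as an ordinary directory. The function name is hypothetical.

```python
import shutil
from pathlib import Path


def mirror(source: Path, destination: Path) -> list[Path]:
    """Copy files that are missing from, or newer than, the destination copy.

    One-way only: nothing is ever deleted from the destination.
    Returns the relative paths that were copied.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        relative = src.relative_to(source)
        dst = destination / relative
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue  # destination copy is already up to date
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 preserves modification times
        copied.append(relative)
    return copied
```

A crontab schedule of `0 0 * * *` fires at midnight every day; the command it runs and the two directories passed to `mirror` would be whatever fits your own setup.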

Now I’ve got a much more accessible paper library that’s really easy to maintain. The march toward paperlessness doesn’t stop there, however; just a couple weeks ago I recycled several pounds of software manuals that, if I ever needed them again, I could find online. As a result I’ve got one less storage box in my closet which is a big help given that I live in a pretty small space. I have a feeling my “important documents filing box” is next.

Automatic Everything

August 21st, 2008

This was going to be an extended rant on how people use databases where people shouldn’t use databases, but the more I wrote the more I realized that this had been analyzed quite a bit by many in the systems research community and blogosphere at large, many members of which are far more knowledgeable than I. So I’ll summarize my rant in a paragraph and then move onto more philosophical, “meta”-type comments.

Twitter’s architecture (as much as they’ve shown us) is a Ruby on Rails app backed by a MySQL database. This combination is the Golden Hammer of Web 2.0. A frighteningly large number of web application developers seem to follow the mantra, “If I need to store data, use SQL as a Big-Ass Table (no, not that Big-Ass Table). Who needs high-speed middleware? I’ll write everything in Ruby!” The problem is that schema design is as close to alchemy as CS gets and tuning databases is tedious and hard to do right. If you are writing something that must process tens of thousands of messages a day, do not think you can write it in an interpreted language and have it frequently converse with a database. If you think this will work, you are living in a magical dream world. I’m talking directly to you, Twitter, you poor sad whipping boy of the Web 2.0 universe. Please, for your own sake, rewrite Starling in C or C++ and use a more suitable back-end.

That concludes the synopsis of my multi-page rant of doom. Now, for the meta: if I were to write an essay for NPR’s This I Believe, the following would be that essay.

I believe in telling systems what I want, not how to get it, and having them give it to me as quickly as possible. I believe that programmers are lazy, and that middleware should give them the ability to do the right thing the easy way. I believe in intrinsic scalability and building on sound principles. I believe that the disk is evil and writing to it should be avoided until you have no other choice. I believe in most of what databases do and in the potential of what their descendant systems can and will do.

I believe in the awesome potential of automatic everything.

CD Recommendation of the Epoch

August 14th, 2008

20 Minute Loop uses harmony as a weapon of mass catchiness. I approve. Here’s a taste:

20 Minute Loop – Our William Tell

Messing with Thom Yorke’s head

July 25th, 2008

Radiohead’s latest music video was shot without cameras. Instead, they used a combination of reflected light and lasers to generate clouds of points in 3D. Google was nice enough to provide the rest of the world with some of the 3D point cloud data collected for that music video. A big piece of that data is about 2100 frames of lead singer Thom Yorke’s head. A frame of the original data (when output via Processing) looks like this:

If you look closely, you’ll notice that the point cloud is really noisy around the edges. A simple high-pass filter later and that same frame looks like this:

That’s a little more manageable. I figured, why stop at points when you can have 3D surfaces? One of the more straightforward ways to make a 3D surface out of a bunch of points is to stick a bunch of triangles in between the points, creating what’s called a Delaunay triangulation. This is a really compute-intensive calculation and I don’t exactly have a supercomputer on hand, so I did a lot of fudging and approximation. Even with all that fudging, each of these frames took as much as 5 minutes to render. This process has been running for most of last week while I’ve been at work. That same frame above looks like this when Delaunay-triangulated:

Notice that it’s a little noisy, which is mainly due to some approximation on my part as well as some leftover noise in the point cloud. The video below shows what happens when you sequence all 2100 frames together. Enjoy!
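For anyone wondering what a Delaunay triangulation actually is, here is a minimal brute-force sketch in 2D. This is not the code used for the video (that was done in Processing, with plenty of approximation). It just applies the definition directly: keep every triangle whose circumcircle contains no other input point. One common shortcut for scan data, assumed here, is to triangulate the points' 2D projection and keep each point's depth afterwards. The function names are hypothetical, and if four or more points lie on the same circle this naive version will emit overlapping triangles.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None if collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None  # the three points lie on a line: no circumcircle
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def delaunay(points):
    """Brute-force Delaunay triangulation of a list of 2D points.

    Returns triangles as tuples of indices into `points`.
    """
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        # Keep the triangle only if no other point falls inside its circumcircle.
        if all(
            (px - ux) ** 2 + (py - uy) ** 2 >= r2 - 1e-9
            for n, (px, py) in enumerate(points)
            if n not in (i, j, k)
        ):
            triangles.append((i, j, k))
    return triangles
```

Checking every triple of points against every other point is hopeless for a full frame of scan data, which gives a feel for why the real thing needs smarter algorithms and still takes minutes per frame.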

I own alexrasmussen.com

July 15th, 2008

Booyah! My continued dominance of the ‘alex rasmussen’ Google search is assured!

This is the future – why are my updates still failing?

July 12th, 2008

So anyone who has an iPhone or iPod Touch will be pretty aware that Apple’s update servers basically fell over in response to all the demand today due to the new iPhone firmware. Recently, Firefox’s update servers suffered exactly the same problem. Now I’m sure that these guys have a really expensive load balancer in front of their update server cluster, but why in the world are so many major companies still having all their users go to a single place for updates?

If I want to download an update from Software Update today on my home computers, I have to do it three times - once for my Mac Mini (file server/backup server/media center), once for my laptop and once for my tower. The actual update binary is, in most cases, identical. If I wanted to only download the update once, I’d have to find where Software Update keeps the update’s installer file, copy it to the other machines and run it there. In some cases I have to download tens or hundreds of megabytes of file that could easily be transferred over my home network, saving both my time and the update provider’s money.

The thing that’s the most irritating about this is that it’s a completely solved problem. Blizzard, for example, distributes updates to World of Warcraft over BitTorrent. My roommate just started playing WoW again and had to install a patch (~2 GB) on two of his computers. He downloaded and installed the patch on the first computer, which took about an hour and a half. The download-and-install process for the second computer took all of about five minutes because the computer automatically recognized that a source for the update existed on its local network and downloaded the file peer-to-peer from the other machine.

Imagine if everyone interested in downloading the iPhone patch could download it not only from Apple but from each other. After the first few hundred downloads (which would have to pull directly from Apple) most of the remaining transfer would be peer-to-peer. If iTunes needs to authenticate the phone with Apple before installing, that’s fine; the load on the servers from authorization would be far lower and of a much shorter duration than the load from patch downloading. Security, of course, is an issue with BitTorrent-esque downloads, but there are relatively straightforward ways to deal with that.
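One of those straightforward ways, sketched minimally: the vendor publishes a small list of per-piece hashes over a channel you already trust, and the client refuses any piece from a peer whose hash doesn't match. BitTorrent itself keeps a hash per piece in the torrent metadata; the function names and the choice of SHA-256 below are assumptions for illustration only.

```python
import hashlib


def split_into_pieces(data: bytes, piece_size: int) -> list[bytes]:
    """Cut the update into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def make_manifest(data: bytes, piece_size: int) -> list[str]:
    """What the vendor would publish: one SHA-256 digest per piece."""
    return [
        hashlib.sha256(piece).hexdigest()
        for piece in split_into_pieces(data, piece_size)
    ]


def piece_is_valid(index: int, piece: bytes, manifest: list[str]) -> bool:
    """Accept a piece from an untrusted peer only if its digest matches."""
    if not 0 <= index < len(manifest):
        return False
    return hashlib.sha256(piece).hexdigest() == manifest[index]
```

The manifest is tiny compared to the update, so serving it centrally costs the vendor almost nothing, while the heavy lifting of moving the pieces themselves can happen between peers.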

I’m just saying it’s about time that someone did something about this, because it’s getting a little ridiculous.
