Analyzing The Comic-Con Registration Meltdown
Posted: November 22nd, 2010 | Author: Alex | Filed under: Advice (Unsolicited), Computers, Opinions (Uninformed)
On November 1st at 9 AM, online registration for 4-day passes to San Diego Comic Con began. By 9:05 AM, the massive volume of registering attendees caused the registration system to become inaccessible. By 10:30 AM, Comic Con International closed the registration site down, claiming that it would re-open in three weeks.
At 6 AM PST this morning, registration re-opened. By 6:05, the registration site was once again inaccessible due to overwhelming traffic volume. A lucky few managed to get all the way through the registration process, but most of us were left repeatedly retrying until, at 7:30 AM, the registration site was once again closed.
In the grand scheme of things, this was not a disaster. However, it inconvenienced a large number of people, myself included. In my opinion, both of these problems were entirely preventable.
I do research in the design of scalable, high-performance large-scale systems. Many people in my field work on technologies that are designed to prevent exactly the kind of failure to operate at scale that occurred this morning. While I am by no means an expert, I feel that I can speak from an informed position about scalability issues like this one. In this post, I’m going to speculate on what happened, and what might be done to fix the problem inexpensively.
Disclaimer: I don’t work for Event Planning International Corporation (EPIC) or Comic Con International (CCI), and I was only an external observer of their registration meltdown, so I don’t know exactly what occurred. I’ve seen problems like this documented enough times that I think I can guess what really happened from that documentation and personal experience. I also don’t work for Amazon or any of the other companies mentioned favorably in this post, although I have clearly consumed vast quantities of the cloud computing Kool-Aid.
I’m going to assume some things about EPIC’s architecture. In particular, I’ll assume they have a single well-provisioned database system and a handful of web servers, possibly with a load balancer sitting in front of them and distributing requests evenly among them. This is how most small websites usually look.
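To make the "distributing requests evenly" part concrete, here is a minimal sketch of round-robin balancing, which is one common way to do it. The server names are made up, and I have no idea whether EPIC's balancer (if it had one) worked this way.

```python
from itertools import cycle

# Hypothetical web server names; the real setup is unknown.
WEB_SERVERS = ["web1", "web2", "web3"]


def make_round_robin(servers):
    """Return a function that hands out servers in strict rotation."""
    rotation = cycle(servers)
    return lambda: next(rotation)


pick_server = make_round_robin(WEB_SERVERS)

# Six incoming requests get spread evenly: each server receives two.
assignments = [pick_server() for _ in range(6)]
print(assignments)
```

Note that the balancer only spreads work across the web servers. Every one of them still talks to the same single database, which matters for what follows.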
“What we’ve got here is a failure to communicate.”
In my mind, the timeline of the meltdown looked something like this:
6 AM: CCI posts a link to the registration site on its homepage. Frantically refreshing nerds see the green box and click it, which presents them with step one of five: enter your name and e-mail address.
Over the next 90 seconds or so, several hundred people open connections to the registration site. Since the page is just vanilla HTML and maybe a little PHP or JavaScript, the web server(s) cache the first page and everything runs fairly smoothly: it is all being served out of memory, and all is right with the world.
People start clicking the “Next” button to proceed to step two. Each click of the “Next” button causes an HTTP POST request to be sent to a PHP script housed on one of the web servers. This script inserts some data into the database indicating that a person whose name is X and whose e-mail address is Y has reserved an attendee slot for the next few minutes – this is mainly there so that you don’t register more people than you can fit in the convention hall. While this is happening for the first few hundred requests, several thousand more registrants are about to hit the “Next” button and start this process for their registrations. The database server starts fielding thousands of connections and inserting thousands of rows into the registration lock table at once. It starts to run out of memory and starts swapping, or it hits 100% CPU utilization, or its disks are seeking all over the place. It gets slower and slower. Eventually, it effectively stops returning responses to queries.
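For the curious, here is roughly what I imagine that slot-reservation step looks like, sketched in Python with an in-memory SQLite database rather than whatever PHP and database EPIC actually runs. The table layout, the five-minute hold, and the function names are all my own guesses.

```python
import sqlite3

HOLD_SECONDS = 300  # assumed length of "the next few minutes"


def open_db():
    """Create an in-memory database with a hypothetical lock table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE registration_lock ("
        " email TEXT PRIMARY KEY, name TEXT NOT NULL, expires_at REAL NOT NULL)"
    )
    return conn


def reserve_slot(conn, name, email, now, capacity):
    """Try to hold one attendee slot; return True if the hold was granted."""
    with conn:  # one transaction: purge expired holds, count, insert
        conn.execute("DELETE FROM registration_lock WHERE expires_at <= ?", (now,))
        (held,) = conn.execute("SELECT COUNT(*) FROM registration_lock").fetchone()
        if held >= capacity:
            return False  # every slot is currently held by someone else
        conn.execute(
            "INSERT OR REPLACE INTO registration_lock VALUES (?, ?, ?)",
            (email, name, now + HOLD_SECONDS),
        )
        return True


conn = open_db()
results = [
    reserve_slot(conn, "X", "x@example.com", now=0, capacity=2),
    reserve_slot(conn, "Y", "y@example.com", now=1, capacity=2),
    reserve_slot(conn, "Z", "z@example.com", now=2, capacity=2),    # hall is full
    reserve_slot(conn, "Z", "z@example.com", now=301, capacity=2),  # earlier holds expired
]
print(results)
```

Every click costs the database a delete, a count, and an insert inside a transaction. That is nothing for one user and a great deal for several thousand of them arriving in the same few seconds.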
So now the database server is effectively hosed. The web servers continue to issue database queries anyway, and they have an increasing backlog of PHP scripts waiting for their database queries to return. The web servers’ memories fill up with session state, the PHP scripts’ stacks, TCP connection state, etc. Eventually, the web servers run out of memory, they start swapping and their performance essentially drops to zero. Requests for page two begin to time out. “500: internal server error” begins popping up on the screens of frustrated nerds around the globe. These users furiously hit “Refresh” hoping that the website will come back to life, which creates new requests and only makes the problem worse.
At this point, sysadmins are running around like their hair is on fire trying to get the problem under control. They try every trick in the book, but nothing works. Frantic phone calls are made. Servers are powered off in hopes that demand will recede if the server is inaccessible for a period of time. Demand does not recede.
After about an hour, the decrease in volume from users giving up releases sufficient system resources for some lucky individuals to be able to advance to step 2: enter your address and phone number. Hundreds to thousands of users do just that and then click “Next” on step 2’s page within about 30 seconds of one another. This issues a flood of HTTP POSTs to another PHP script that is supposed to insert the information contained in the POST into the database and associate it with the name and e-mail address that the user entered in step 1. The problem returns with a vengeance and the servers fall over again. Few users make it to step 3 successfully.
7:30 AM: CCI orders the site closed, claiming that they will be “investigating their registration options”. EPIC (presumably) loses a high-profile client.
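The part of this timeline where the web servers run out of memory is really just arithmetic: if requests arrive faster than the database finishes them, the pile of waiting scripts grows until something gives. Here is a back-of-the-envelope model. Every number in it is invented for illustration, and the function name is mine.

```python
def seconds_until_out_of_memory(arrivals_per_s, completions_per_s,
                                mb_per_request, memory_mb):
    """Seconds before waiting requests fill memory, or None if the backlog never grows."""
    growth = arrivals_per_s - completions_per_s  # net new waiting requests per second
    if growth <= 0:
        return None  # the database keeps up, so nothing piles up
    return memory_mb / (growth * mb_per_request)


# Invented numbers: 500 requests/s arriving, a struggling database finishing 100/s,
# 2 MB of session and script state per waiting request, 16,000 MB of RAM.
meltdown = seconds_until_out_of_memory(500, 100, 2, 16_000)

# Same load, but a database that can finish 600 requests/s.
healthy = seconds_until_out_of_memory(500, 600, 2, 16_000)

print(meltdown, healthy)
```

With these made-up inputs the web tier lasts about 20 seconds. Users hitting “Refresh” push the arrival rate up, and a swapping database pushes the completion rate down, so the real spiral is steeper than this simple model suggests.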
Cloud Computing to the Rescue
So, what happened here? Fundamentally, EPIC’s servers were not sufficiently well-provisioned to handle the load presented to them by SDCC’s registrants. The servers couldn’t handle the strain, and so they ground to a halt.
How can we solve problems like this? One way is to buy more and more well-provisioned computers and spread the load across them until the load on a given machine becomes manageable. Unfortunately, this typically involves a lot of up-front and long-term costs: you need to buy the computers, find some place to put them where they’ll remain cool and dry, and fix them when they break. Additionally, when your servers are talking to a database, the server hosting the database must be doubly well-provisioned, often at significant additional cost. Oracle makes a lot of its money selling ridiculously well-provisioned database servers the size of a refrigerator for hundreds of thousands of dollars apiece.
Getting enough computers to get the job done does not need to be expensive, however. Pay-as-you-go “Infrastructure as a Service” (IaaS) systems – one of the many classes of systems classified under the blanket term “cloud computing” by the world’s IT pundits – were designed to solve this exact problem.
I’m going to focus on Amazon’s EC2 and RDS here for the sake of example and because they’re the most popular services of their kind, but many other IaaS offerings exist. Joyent and Media Temple are two other great examples of IaaS companies with a strong presence in the marketplace, especially among startups that need affordable, scalable hosting solutions.
Let’s suppose that you need to register 24,000 people (~20% of attendees at last year’s con) and expect peak demand to last around 24 hours. You buy time on 50 “high-memory extra-large” EC2 instances (17.1 GB of RAM, dual-core processors) to use as web servers and ten “high-memory quadruple extra large” RDS database servers (68 GB of RAM, 8-core processors, “high I/O capacity”) to do your query processing. You reserve them on-demand, which essentially means you pay more per hour but you only pay for what you use. Let’s suppose conservatively that you’ll store 20 GB of data on these database servers (that’s almost 1 MB of data per user, which I’m guessing is more than enough to store basic registration information) and that you’ll read and write every byte of it once in 24 hours. Let’s assume that the bank’s credit card transaction processing servers will scale to meet the load; after all, they handle millions of transactions a day routinely (and are extremely well-provisioned because of it).
If you use all of it – that means all 50 web servers and all 10 database servers for 24 hours straight – it’ll cost you about $950 at the end of the month. If you only use the database instances (for 24 hours straight, and read, write and store 20 GB), it’ll cost about $320. (This is according to AWS’ cost calculator.) That’s far less than the retail cost of the parts in even one of the aforementioned refrigerator-sized database servers.
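If you want to sanity-check those figures, the arithmetic is short. The sketch below works backwards from my two totals to the per-hour rates they imply. It assumes the web servers account for the whole difference between the two totals and that 1 GB is 1,000 MB. These are implied numbers, not Amazon's published prices.

```python
WEB_INSTANCES = 50
DB_INSTANCES = 10
HOURS = 24
TOTAL_COST = 950.0    # estimate quoted above, USD
DB_ONLY_COST = 320.0  # estimate quoted above, USD (includes 20 GB of storage and I/O)
REGISTRANTS = 24_000
DATA_GB = 20

web_hours = WEB_INSTANCES * HOURS  # instance-hours of web server time
db_hours = DB_INSTANCES * HOURS    # instance-hours of database time

# Assumption: whatever is not database cost is web server cost.
web_rate = (TOTAL_COST - DB_ONLY_COST) / web_hours
db_rate = DB_ONLY_COST / db_hours

mb_per_user = DATA_GB * 1000 / REGISTRANTS
cost_per_registrant = TOTAL_COST / REGISTRANTS

print(f"web servers: {web_hours} instance-hours at about ${web_rate:.3f} each")
print(f"databases:   {db_hours} instance-hours at about ${db_rate:.2f} each")
print(f"data per registrant: about {mb_per_user:.2f} MB")
print(f"cost per registrant: about ${cost_per_registrant:.3f}")
```

The last line is the one I find most striking: the whole setup works out to roughly four cents per registrant.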
Need more machines? Buy time on more machines for a few tens of dollars; Amazon’s magical network of services can even balance the load across them for you automatically if you ask it nicely. Don’t need all ten of those database servers you reserved? Only turn on five of them and leave the others alone. Demand starting to die down after 18 hours instead of 24? Start shutting machines down. Magical.
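Here is a small sketch of what "only pay for what you use" means for the bill. The hourly rate is hypothetical; I picked $1.33 because that is roughly what my $320 database estimate works out to when spread over 10 servers for 24 hours. The function name and the schedules are mine.

```python
DB_RATE = 1.33  # hypothetical USD per database instance-hour


def on_demand_cost(schedule, hourly_rate):
    """Cost of a schedule given as (instances_running, hours) phases."""
    return sum(instances * hours * hourly_rate for instances, hours in schedule)


# All ten database servers for the full 24 hours.
full = on_demand_cost([(10, 24)], DB_RATE)

# Only five servers, all shut down once demand dies off after 18 hours.
trimmed = on_demand_cost([(5, 18)], DB_RATE)

# Ten servers for a 6-hour peak, then five for the next 12 hours.
tapered = on_demand_cost([(10, 6), (5, 12)], DB_RATE)

print(f"full: ${full:.2f}  trimmed: ${trimmed:.2f}  tapered: ${tapered:.2f}")
```

With owned hardware the first number is the only option, and you pay it whether or not anyone shows up.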
This kind of flexibility and “magic” provisioning of hardware is what makes cloud computing such a compelling way to solve website scalability problems like the one plaguing Comic Con, and it’s one of the reasons why the research and industrial communities alike are so excited about it. I hope that situations like this one will encourage more companies to leverage these sorts of technologies to deploy more scalable websites.
