Scripting best practices

Posted: May 15th, 2011 | Filed under: Computers

According to sloccount, TritonSort has almost 9,000 lines of scripts. 95% of those lines are Python; the remaining 5% are bash and awk scripts. They do everything from setting up our testbed’s resources to monitoring experiments and computing statistics over results. In the process of writing, re-writing and iterating over all those scripts, I’ve distilled a few hard-won lessons about what works and what doesn’t when it comes to writing them.

A lot of this is going to be Python-specific, since that’s what most of my scripts are written in. However, I think this advice can be applied pretty readily to your favorite scripting language.

A giant, snarly Bash script is almost never the answer. Tools like grep, sed and awk are extremely powerful, and I do most of my ad-hoc text analysis by chaining these tools together with pipes. Unfortunately, anything more complicated than a for loop in Bash tends to get messy really quickly. Also, scripts like this that snarf in unstructured text tend to be rather brittle; if the format of your input data changes over time, your scripts tend to break in interesting ways.

Treat your scripts like libraries. It’s almost never a good idea to stick everything your script is doing in global scope. Instead, make the actual body of the script a function and write a few lines of main() boilerplate that takes in options and arguments and calls that function. Once the main body of your script is a function, you can just import that function somewhere else when you want to compose scripts together, which will make your life a lot easier down the road.
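Here’s a minimal sketch of that layout. The script itself (a line counter) and the names count_lines and main are made up for illustration; the point is only the shape: one importable function plus a few lines of boilerplate.

```python
#!/usr/bin/env python
"""Count the lines in each file named on the command line."""
import argparse


def count_lines(paths):
    """Return a dict mapping each path to the number of lines in that file."""
    counts = {}
    for path in paths:
        with open(path) as f:
            counts[path] = sum(1 for _ in f)
    return counts


def main(argv=None):
    # Boilerplate: parse options and arguments, then hand off to the function.
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("paths", nargs="*", help="files to count lines in")
    args = parser.parse_args(argv)
    for path, count in sorted(count_lines(args.paths).items()):
        print("%s\t%d" % (path, count))


if __name__ == "__main__":
    main()
```

Assuming this is saved as count_lines.py, another script can do "from count_lines import count_lines" and work with the returned dict directly instead of shelling out and parsing text.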

Script functions should return (at least) semi-structured data. If your script produces unstructured text, at some point you’re going to have to parse it. That can get messy really fast. If you want a script’s results to be human-readable, make the script function return some data structure and have main() print it. Better yet, have a second function in the script that prints a readable version of the data structure the script function spits out, or make the data structure a class that overrides __repr__ or __str__.
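As a rough sketch, assuming a script that summarizes experiment runtimes (RunStats and compute_stats are hypothetical names, not anything from TritonSort), the split between data and presentation might look like this:

```python
class RunStats(object):
    """Summary statistics for a list of experiment runtimes, in seconds."""

    def __init__(self, count, mean, minimum, maximum):
        self.count = count
        self.mean = mean
        self.minimum = minimum
        self.maximum = maximum

    def __str__(self):
        # Human-readable view; callers that want the numbers use the fields.
        return "%d runs: mean %.2f s (min %.2f s, max %.2f s)" % (
            self.count, self.mean, self.minimum, self.maximum)


def compute_stats(runtimes):
    """Return a RunStats for a non-empty list of runtimes in seconds."""
    return RunStats(
        count=len(runtimes),
        mean=sum(runtimes) / float(len(runtimes)),
        minimum=min(runtimes),
        maximum=max(runtimes))


if __name__ == "__main__":
    print(compute_stats([12.0, 15.0, 18.0]))
```

Other scripts read stats.mean and friends directly; only the person at the terminal ever sees the formatted string.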

Make your output portable. If you expect that a program written in another language is going to have to consume a script’s output, it’s a good idea to make that output easy for the consuming program to read. If you’re just dumping out a list of numbers, by all means just dump that list of numbers with one number per line, but for anything more complicated than that you’ll want at least some metadata telling you what all this stuff you’re dumping actually is.

We’ve been starting to use JSON more and more since it’s got reasonably good support across a bunch of languages, is brain-dead simple to parse and is reasonably structured without much of the extra bloat that XML imposes. If you’ve got a really complicated configuration file that needs to be validated, XML might be a better choice, but most of the time you really just want key/value pairs and some limited support for nesting and lists and JSON does that just fine. I’ve also heard that YAML is awesome, but I’ve never used it.
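For what it’s worth, dumping results as JSON from Python takes only a few lines with the standard json module. The field names here ("experiment", "units", "runtimes") are just an illustrative layout, not a format we actually use:

```python
import json


def results_to_json(experiment_name, runtimes_sec):
    """Return a JSON string describing one experiment's runtimes."""
    record = {
        "experiment": experiment_name,
        "units": "seconds",  # metadata so the consumer knows what the numbers are
        "runtimes": runtimes_sec,
    }
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(results_to_json("example_run", [12.0, 15.0, 18.0]))
```

Any language with a JSON parser can read that back without knowing anything about the script that produced it.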

Document, document, document. I know I’ve been on a bit of a documentation kick lately, but seriously, Future You will thank Present You for telling him what exactly it is that putTheThingInThePlaceWhereStuffGoes.py does, what input format it expects, etc. Along the same lines, don’t use names like that one. Give your scripts descriptive names.

Hopefully this deters you from making some of the same mistakes we did. Happy scripting.


On (Lack of) Documentation

Posted: May 7th, 2011 | Filed under: Computers, Ranting

I am really starting to get irritated with the lack of documentation present in some “production-ready” open source projects. Issues related to lack of documentation have hamstrung me multiple times in the last few months and it’s really starting to get on my nerves.

If you’re writing a library, your documentation is just as important as your code. The simple fact is that your library, regardless of how elegant or fast or awesome it is, is completely useless unless it’s got decent documentation. Decent documentation falls under a number of categories – all of these categories are important.

Thorough, up-to-date user-facing documentation: This means tutorials and example code, but it also means things like wikis that can change as users start to expose common traps and pitfalls. The documentation should change as the code changes, which means it should be auto-generated whenever possible.

Helpful exceptions and assertions: Don’t just assert(b == "foo"); actually attach a meaningful message to the assertion so that I know why the assertion was made and what it means if it failed. If you can give me a permalink to a page telling me what I’m doing wrong, so much the better. And don’t just give me “b isn’t foo. Something’s wrong.” That doesn’t give me any information. Similarly with exceptions: throwing a GenericException without any accompanying message or stack trace makes me want to punch you (seriously, I’ve seen this happen many times and it’s really irritating).

Also, please give me a stack trace. If a failed assertion doesn’t give me a stack trace and indistinguishable copies of the same assertion appear in 20 different places, the only way I’m going to know which assertion just failed is to hook a debugger to the program and try to reproduce the error. That sucks, and sometimes it isn’t even possible (if the problem is non-deterministic or the situation that causes it to happen is rare).
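To make that concrete, here’s a small Python sketch of what I mean. The functions set_mode and parse_port and the ConfigError class are invented for this example. In Python, a failed assert or an uncaught exception already prints a stack trace, so the message is the part you have to supply yourself.

```python
def set_mode(mode):
    """Validate and return a mode string."""
    valid_modes = ("read", "write")
    # Say what was received, what was expected, and what to check.
    assert mode in valid_modes, (
        "set_mode() got mode=%r but expected one of %r; "
        "check the value passed by the caller" % (mode, valid_modes))
    return mode


class ConfigError(Exception):
    """Raised when a configuration value is unusable."""


def parse_port(value):
    """Convert a config string to a TCP port number."""
    try:
        port = int(value)
    except ValueError:
        raise ConfigError("port must be an integer, got %r" % (value,))
    if not 0 < port < 65536:
        raise ConfigError("port must be between 1 and 65535, got %d" % port)
    return port
```

Both failure paths tell the reader which value was bad and what a good one looks like, which is usually enough to fix the problem without opening a debugger.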

In-code documentation: I don’t necessarily believe that you should have a line of comments for every line of code you write; requirements that rigid lead to a lot of “This line adds 2 and 7 together” comments that just make the code harder to read. If the code gets messy, write some inline comments explaining at a high level what the code is supposed to do. Your users will thank you and when Future You looks at the code that Past You wrote, he might have a chance of understanding what it was Past You was thinking.

Be helpful: Many things about the usage of a library may seem perfectly obvious to you because you wrote the library. To a new user, some things may not be so clear. So many times in mailing lists and message boards I see threads that look like this:

User: “Here’s a code block; it’s throwing some random error. Anyone know why?”

Developer (in Comic Book Guy voice): “I do not understand why you users are so stupid. Clearly you must initialize the host key container before initializing the SSL session but after initializing the session transport. Worst. Users. Ever.”

Or this:

User: “Here’s a code block; it’s throwing some random error. Anyone know why?”

Developers: *years of silence*

This is a great way to lose existing users and discourage new ones from using your library.

Commit messages are a part of internal documentation: Documentation is just as useful for other library developers as it is for users. Commit messages are a great deal more important for developers than they are for users, but they’re part of your documentation nonetheless. I heard a great quote relating to this in a post on source control by Troy Hunt: “Write every commit message like the next person who reads it is an axe-wielding maniac who knows where you live”.

tl;dr: Your documentation will never be perfect. It will probably never even be great, unless you’ve got people dedicated to working on documentation. Despite this, small improvements can make a big difference. Popular libraries become popular because they effectively solve a problem that a lot of users have and because it’s easier for users to use that library than it is to solve the problem themselves. Good library design and talented programmers make the first part happen; the second part can’t happen without good documentation.


App of the Moment: Bowtie

Posted: May 4th, 2011 | Filed under: Useful Software

I synchronize my music to my iPhone and carry it around with me. I listen to it in the car, at work, in the grocery store, while doing laundry … it probably gets a good 6-8 hours a day of use. At work, I’m faced with what I thought was a very obscure problem. I want to be able to control music playback on my iPhone from my computer.

It’s not that my phone isn’t sitting next to me on my desk when I’m working. I could just reach over, double-tap the home button and press the screen to switch tracks. That requires my hands leaving the keyboard, though, and reaching over to switch tracks has probably cost me hours of accrued time debt over the years (sort of like how they say that you spend entire days of your life in total tying your shoes).

It’s irritating enough that I can’t share the music library on my phone with the local network; that would solve the problem right there. Unfortunately, Apple hasn’t seen a reason to implement that feature. I could also synchronize my iTunes library between my desktop and laptop, but unfortunately Apple hasn’t made that automatic and painless enough yet, and I’ve tried various other techniques (rsync, third-party apps, hosting the library on networked storage, you name it) without success.

For the longest time, I figured that I’d just have to deal with it (horrible, I know). Yesterday, the blog One Thing Well pointed me at an application called Bowtie.

The desktop version of Bowtie gives you basic control of iTunes (play/pause, next and previous tracks) with keyboard shortcuts and will show you the currently-playing track in a customizable little desktop widget. That isn’t very unique by itself; there are dozens of iTunes remote control apps of various maturity and feature-richness out there, and they’ve been around for years. Where Bowtie distinguishes itself is in the $0.99 companion app for the iPhone. Pair the iPhone application and the desktop application together, and you can control the iPhone’s music playback using the same keyboard shortcuts you use to control iTunes.

Pairing requires that the phone and the computer can see each other on the network (I’m not sure of the implementation details, but it probably relies on Bonjour). I use a wired connection at my desk (because the building’s wireless network is flaky), so I set up network sharing and connected my iPhone to the shared network and that has worked flawlessly so far.

Overall I’ve been really happy with it. If you’re running into the same first-world problem that I am, it’s more than worth the $0.99.


The Importance of Searchability

Posted: April 24th, 2011 | Filed under: Advice (Unsolicited)

When I was in college, I wrote notes in a collection of spiral notebooks. I also had a three-ring binder with a divider tab per class where I put things like graded midterms and assignments. After classes were over, I’d donate the midterms to the HKN exam archive and recycle everything else. As my classes shifted more toward reading a lot of papers, the papers would go in their own separate three-ring binder, with a divider tab per class.

What a waste.

In retrospect, I see a lot of this exercise as more psychological than actually helpful. I spent a lot of time in college feeling overwhelmed (whether or not I was actually overwhelmed is an open question), and the act of collecting and organizing this mountain of paper into something resembling order was a way to feel more in control of the situation. Also, taking notes certainly helped me stay focused in class. But the fact is, much of that mountain of paper was pretty much useless.

First, it was really hard to actually find anything. Trying to find all the information I had collected on a topic, or the definition for a particular term, usually involved doing a scan through a few weeks of notes or a glossary lookup in the course’s textbook, despite the fact that I had things fairly well organized.

Second, I didn’t actually need most of that data. My move from taking a notebook full of notes per class to a few dozen pages spoke more to my increased ability to separate what was important to capture from lecture from information that I could grab from lecture slides or the textbook. Keeping exams was helpful, but only because it told me what I didn’t understand well enough; the actual content of my answers wasn’t really all that helpful.

Third, my handwriting is terrible. Especially when I’m writing fast, which is what I’m doing when I’m writing notes or writing down answers for an exam. Many was the time where I looked at my notes and said, “Well, here’s where the stuff I’m looking for is, but I can’t make heads or tails of what I was writing here.”

I realized too late (after I had mostly stopped taking classes and had shifted into doing research full-time) that the only result of capturing all of this information by sticking it in binders and putting them on bookshelves was more full bookshelves.

It was then that I decided to adopt the following guiding principle for every piece of information I will ever need to look at more than once:

If I can’t search it, I will never find it again.

This immediately applies to things like notes. I found that it’s easier to use paper notes than to lug a full-sized laptop around, but with the emergence of tablets that’s becoming a decreasingly relevant problem. These days, I hardly write anything on paper anymore; I type significantly faster than I write, I can actually read what I wrote afterwards, and most importantly everything I type is instantly searchable.

Any piece of paper I need to keep (bills, receipts, etc.) gets scanned. I’m not as good at passing scanned papers through OCR as I should be (I’m still trying to figure out an effective OCR solution), but I make sure to name these scanned documents descriptively so that I can search by description later. Another huge benefit of this approach is that everything is backed up. I’ve lost notebooks for classes in the middle of the quarter before. It really sucks.

In the last post, I went over how I handle organizing and taking notes on research papers; a big component of making that process work effectively is the ability to do full-text search on papers. Related work searches get so much easier when you can just straight-up search through reams of papers at once. One of my favorite Google Talk features is the ability to search past instant message conversations; it’s the one value-add feature that keeps me coming back to GTalk. GMail makes e-mail searchable (then again, so does basically everything else these days), which is always really helpful.

This “make everything searchable” principle has really helped me keep on top of the mountain of data that I receive and/or generate every day.


Software I Use Daily: Mendeley

Posted: April 17th, 2011 | Filed under: Useful Software

Almost two and a half years ago, I wrote a post here about my efforts to transfer my piles of research papers into digital form. At the time, I was running a combination of Referencer and Beagle, using Referencer to keep things organized and Beagle to make it all searchable. Unfortunately, this solution didn’t work out as well as I’d hoped. The main reason for this was the problem of manual cross-platform synchronization for both the papers themselves and all the various metadata associated with them. I didn’t want to waste time figuring out how best to keep everything synchronized between my desktop and laptop (one of the reasons I use my laptop exclusively for day-to-day development now), so I lived with that solution for about a year.

At some point in late 2009, I was introduced to Mendeley by some friends of mine in the CSE department. It’s like they took my wish list for a paper management system and implemented it. It’s fantastic. Here’s why:

Written by researchers, for researchers. It’s very clear that this application was written by people who have had to deal with academic papers a great deal. It strategically attacks so many pain points associated with dealing with large paper volumes that I can’t help but think that the entire design process was guided by researchers that were fed up with the current state of affairs.

It’s cross-platform. It works for OS X, Windows and Linux. And when I say “works”, that doesn’t mean “the Linux version barely works and the OS X version has a wonky UI”, which is true of a lot of cross-platform software in my experience.

Flexible organization. Like organizing with folders? It’s got folders. Like using tags instead? It’s got tags. Want to use both? Sure, go crazy.

Free, effortless synchronization. You can synchronize up to 500 MB of papers (both metadata and data) to Mendeley’s servers for free. For $5/month, that increases to 3.5 GB. I’m currently at 365 MB and I’m storing 520 papers, so those 500 MB of space will go a long way. In my experience, synchronization between Mendeley instances “just works”, even across platforms.

Embedded notes and annotations. This is the killer feature for me. There’s nothing too complex here; just highlights, the ability to stick a note at a point in the text, and a dedicated notes region per-paper with basic formatting. The key here is that those notes synchronize across platforms and are actually readable everywhere.

It’s social (groan). It seems that in the new bubble, every piece of software you write has to have some social aspect to it. Thankfully, Mendeley manages to do this in a reasonably well-scoped and tasteful way. Your papers’ bibliographic information is sent to Mendeley, and they use that information to better recommend metadata for new papers that other users import. You can also share papers with other Mendeley users through “shared collections” (limited to 10 people in the free version), which is really useful for study groups and research teams that have to refer to the same pieces of literature. You can also track how many people are reading papers that you wrote and stroke your inner narcissist.

Mendeley is one of those applications I wish I had known about years ago. If you’re looking for a solution to your paper management problem, I encourage you to give it a shot.


It’s Thesis Proposal Week!

Posted: April 12th, 2011 | Filed under: Random

I’m currently working on my thesis proposal, and am hence entirely snowed under until next Tuesday. One of these days I’m going to give myself a buffer a couple weeks long …

I’ll fill space this week with this:

First, congratulations to my friend Imran on a successful thesis defense!

Second, if you’re using Terminal.app, stop and use iTerm 2 instead. It’s an amazing piece of software.


An Introduction to Chip Music

Posted: April 5th, 2011 | Filed under: Music

Chip music (or chiptune, there seems to be some philosophical debate as to what to call it) is music that’s made with vintage sound chips, typically those used in old video game systems. Some tend to dismiss chip music as a simple nostalgia play aimed at twenty-something gamers, like those “You have died of dysentery” Oregon Trail shirts they sell at Hot Topic. Really though, there’s more to it than that. Much, much more.

When people talk about art, one of the things that comes up over and over again is the idea of limitations breeding creativity. For whatever reason, some people tend to be more creative when their toolkit is more restricted. Give someone a limitless number of Photoshop brushes, and they’ll spend all day playing with the brushes and no time actually making anything. Give them a pencil, and they have no choice but to use that pencil. Chip music is a lot like that pencil.

Today’s music, electronic music in particular, is almost absurdly engineered. Everything is really high-gloss, passed through enough filters and equalizers and post-processors to fill a small studio. Anyone can reproduce virtually any instrument ever made, that sounds like it’s being played in a room of whatever dimension or acoustic quality you want, instantly. You can make instruments that never existed, could never exist, easily on any consumer-grade laptop.

If you’re writing music that’s going to be played back on an NES, you get five sound channels: two pulse (or square) waves with variable duty cycles, a triangle wave, a noise channel (that just outputs fuzz at varying pitches) and a DPCM channel that plays back (very lo-fi, very short) samples. In a typical arrangement, you’d use the DPCM channel and/or the noise channel for drums, the triangle channel for a bassline, and the pulse waves for a lead with some basic two-note harmony.
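If “duty cycle” doesn’t mean much to you, it’s just the fraction of each period that the wave spends high. Here’s a toy Python sketch of one period of a pulse wave; pulse_wave is a made-up helper for illustration and has nothing to do with how the NES hardware actually generates sound.

```python
def pulse_wave(num_samples, duty_cycle):
    """Return one period of a pulse wave as a list of +1/-1 samples.

    duty_cycle is the fraction of the period spent high; 0.5 gives a
    square wave, and smaller values give a thinner, buzzier tone.
    """
    high_samples = int(round(num_samples * duty_cycle))
    return [1 if i < high_samples else -1 for i in range(num_samples)]


if __name__ == "__main__":
    print(pulse_wave(8, 0.5))   # half high, half low: a square wave
    print(pulse_wave(8, 0.25))  # high for a quarter of the period
```

Changing the duty cycle changes the timbre without changing the pitch, which is a big part of how two pulse channels can sound like different instruments.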

We got some of the most iconic music of my generation from that simple chip. People have been making custom music for hardware this simple since the early days – indeed, the guys who wrote the music to games like Super Mario Brothers and Castlevania were some of the earliest chip musicians. There was also an underground music scene for platforms like the Amiga and the Commodore 64 that grew largely from the “demoscene”, a bunch of hackers trying all the dirty tricks in the book and competing with each other to get their hardware to do ridiculous things. Remember kids, no CDs, RAM measured in KB, 256 colors … and the things that they made these machines do are incredible.

In recent years, the widespread availability and affordable price of old game consoles coupled with a resurgence of music composition software development for these platforms has caused an explosion in the production of chip music. While it’s a lot more popular than it was even a few years ago and is starting to peek into the mainstream, it’s still very much the stuff of hobbyists and vintage hardware geeks.

That said, there are a lot of people making a lot of really good music out there. A lot of it is free, and the vast majority of what isn’t free is released through small labels or by the artists themselves.

If you want a place to start, check out these guys:

GOTO80: A Swedish artist and prolific chip musician, one of the elder statesmen of the chip music scene. Check out Breakfast for a bat-shit insane music video and a song with an infectious hook. Then look through the rest of his stuff.

Virt: You know how some people say that they want to write music for video games when they grow up? Well Jake Kaufman actually stuck with it, and he’s really, really good at it. His FX albums are fantastic and he contributed part of the soundtrack to the upcoming XBLA/WiiWare release Retro City Rampage (which I can’t wait to play, by the way). FX3 was on constant rotation in my car during my Google internship.

Alex Mauer: Writes in all different styles, for tons of different platforms, and does it all with an incredible amount of polish. 9999 is worth the purchase, if only for a song written for the Amiga called “OJ Finds the Real Killers”, a brilliant homage to the music of 1980s buddy-cop movies.

Anamanaguchi: they make straight-ahead face-melting rock with two guitars, a bass, some drums and an NES. They also did the soundtrack to the Scott Pilgrim video game. This is all you need to know.

EvilWezil: a friend of mine and author of some of the most original and innovative chip music I’ve ever come across. Never have you rocked out so hard to a song written in 9/8 time.

These are but a few among many talented artists out there. If any of this strikes a chord with you (so to speak), I encourage you to browse the stuff that labels like Pause and 8bitpeoples are putting out. Also, check out 8bc for a firehose of free chip music.


Break week for NSDI

Posted: March 28th, 2011 | Filed under: Random

Frankly, I didn’t think I’d be able to last this long without skipping a week.

I’ve been putting in absurd hours working on preparations for this year’s sort benchmark submission and the talk I’m giving on Wednesday at NSDI (the USENIX Symposium on Networked Systems Design and Implementation), so I haven’t been able to prepare anything.

In lieu of content, if you want to read our NSDI paper, it’s available online at tritonsort.eng.ucsd.edu.


Kindle: First Impressions

Posted: March 21st, 2011 | Filed under: Computers, Opinions (Uninformed)

I decided to buy a Kindle last week, for a few reasons. I really wanted to start getting into e-books; they’re cheaper than hardcover for new releases now, I don’t have to wait for them to get delivered and they don’t take up space. Although I probably would have preferred an iPad if money were no object, I’m not really willing to spend $500 on a tablet when I already have a laptop.

I bought the WiFi-only model (didn’t really see myself needing the 3G version) and have been fiddling around with it for a couple of days now. My first impressions are pretty favorable.

I’m really surprised at how fast the e-ink display refreshes. I’d played around with an earlier-generation Kindle and a Sony e-book reader for all of about a minute years ago, and was really turned off by the refresh speed on the display. No such problems with the latest-gen Kindle, at least when it comes to reading and menu navigation; there are times when I find myself getting ahead of it, but most of that is off of my typical operating path.

The Amazon marketing hype on the display isn’t too far off; it really does look a lot like paper and is pretty easy to read without a light on (although I’m trying to save my eyes by not reading in dim light these days). I haven’t tried it in direct sunlight yet.

The quality of e-books on the Kindle varies depending on what you’re reading. Some publishers didn’t really put a lot of effort into annotating things like chapters in their books, which makes navigation a challenge; Kindle’s navigation works by jumping to “locations” rather than pages, so you often have to search for the location corresponding to a page rather than the page itself if you don’t have a bookmark handy. I’ve only had this problem for the freebies on the Kindle store; the books I’ve actually paid for have pretty well-groomed metadata.

The fact that the Kindle doesn’t support the EPUB standard and instead uses its own DRMed format is irritating, certainly, but I feel like I can live with it, especially since converting EPUB to their proprietary format is supposed to be fairly straightforward. I’m pretty confident that eventually they’ll do the same thing the iTunes Music Store did and drop DRM entirely, or at least support EPUB natively with a software update.

I’m pretty happy with the Kindle so far. Does anyone else out there have one of these things? Any tips and tricks I should know about?


Pi

Posted: March 14th, 2011 | Filed under: Random

Today is the 14th of March, which some of the dorkier among us call Pi Day because the date (3/14) corresponds to the first three digits of pi.

Recall that pi (\pi ) is the ratio of the circumference of a circle to its diameter. There are a number of surprising things about pi, perhaps the most well-known of which is that it is an irrational number. Irrational numbers can’t be expressed as the ratio of two integers; that is to say, there are no two integers m and n such that \pi = \frac{m}{n}. A side-effect of this is that the decimal digits of pi never end and never repeat. Humans with a lot of free time are capable of memorizing a few tens of thousands of digits of pi, and computers have calculated pi to about 5 trillion digits.

The thing that interests me most about pi is that it shows up everywhere, even in places where it’s not immediately apparent that it should be involved. Whether the ancient Egyptians knew it or not, the ratio of the perimeter of the Great Pyramid of Giza to its height is approximately 2\pi. The ratio of the length of a river to the straight-line distance from its source to its mouth has been shown to approach pi. So many things in the natural world seem to involve circles (or spheres) that maybe it’s no surprise that pi is so ubiquitous.

One of pi’s most elegant appearances relates to complex numbers. Complex numbers are numbers of the form z = a + b i, where a and b are real numbers and i is \sqrt{-1}. They show up all over the place in physics and mathematics.

You can think of each complex number as a point (a,b) on a two-dimensional plane called the complex plane, where a and b denote the real and imaginary parts of the number. Since you’ve got a point in a two-dimensional plane, you can form a circle based on it – just draw a radius from (a,b) to the origin (0,0), sweep that line around 360 degrees and boom, circle.

You can even use this radial line to re-define the point (a,b) in polar coordinates, so that rather than being defined by its real and imaginary parts, each complex number is represented by the angle that its radial line to the origin makes with the x-axis and the length of its radial line (see picture). So now instead of defining the point as (a,b), you define it as (r,\varphi ) such that

z = a + b i = r (\cos(\varphi) + i \sin(\varphi))
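If you want to poke at this conversion yourself, Python’s standard cmath module will do it for you. This is just a quick sketch; to_polar and from_polar are names I made up for the example.

```python
import cmath
import math


def to_polar(a, b):
    """Return (r, phi) for the complex number a + bi, with phi in radians."""
    return cmath.polar(complex(a, b))


def from_polar(r, phi):
    """Return the complex number r * (cos(phi) + i*sin(phi))."""
    return r * complex(math.cos(phi), math.sin(phi))


r, phi = to_polar(1.0, 1.0)  # r is sqrt(2), phi is pi/4 (45 degrees)
z = from_polar(r, phi)       # back to 1 + 1i, up to floating-point error
```

The round trip only matches up to floating-point error, so compare with a small tolerance rather than with ==.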

Now it turns out that Euler’s Formula states that any point on the unit circle in the complex plane (the circle with r = 1) can be described as Euler’s number – e \approx 2.718 – raised to an imaginary power. In particular,

e^{i\varphi} = \cos(\varphi) + i \sin(\varphi)

So any complex number can be written as z = r \cdot e^{i \varphi} for some values of r and \varphi. The really amazing thing here comes when we make \varphi = \pi. Since \cos(\pi) = -1 and \sin(\pi) = 0, Euler’s formula yields

e^{i\pi} = -1

or

e^{i \pi} + 1 = 0

which relates – in a single, extremely elegant equation – five of the most important constants in all of mathematics.
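If you’d rather see it than take my word for it, here’s a quick numerical check in Python. The helper names euler_lhs and euler_rhs are mine, and since floating-point arithmetic is inexact, the check is “equal to within a tiny tolerance” rather than exactly equal.

```python
import cmath
import math


def euler_lhs(phi):
    """Left side of Euler's formula: e^(i*phi)."""
    return cmath.exp(1j * phi)


def euler_rhs(phi):
    """Right side of Euler's formula: cos(phi) + i*sin(phi)."""
    return complex(math.cos(phi), math.sin(phi))


# The two sides agree at a handful of angles...
for phi in (0.0, math.pi / 4, math.pi / 2, math.pi):
    assert abs(euler_lhs(phi) - euler_rhs(phi)) < 1e-12

# ...and at phi = pi, e^(i*pi) + 1 is zero up to rounding error.
identity = cmath.exp(1j * math.pi) + 1
print(abs(identity) < 1e-12)
```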

Happy Pi Day, everybody.