Scripting best practices
Posted: May 15th, 2011 | Author: Alex | Filed under: Computers
According to sloccount, TritonSort has almost 9,000 lines of scripts. 95% of those lines are Python; the remaining 5% are Bash and awk. They do everything from setting up our testbed’s resources to monitoring experiments and computing statistics over results. In the process of writing, rewriting and iterating on all those scripts, I’ve distilled a few hard-won lessons about what works and what doesn’t.
A lot of this is going to be Python-specific, since that’s what most of my scripts are written in. However, I think this advice can be applied pretty readily to your favorite scripting language.
A giant, snarly Bash script is almost never the answer. Tools like grep, sed and awk are extremely powerful, and I do most of my ad-hoc text analysis by chaining these tools together with pipes. Unfortunately, anything more complicated than a for loop in Bash tends to get messy really quickly. Also, scripts like this that snarf in unstructured text tend to be rather brittle; if the format of your input data changes over time, your scripts tend to break in interesting ways.
Treat your scripts like libraries. It’s almost never a good idea to stick everything your script is doing in global scope. Instead, make the actual body of the script a function and write a few lines of main() boilerplate that takes in options and arguments and calls that function. Once the main body of your script is a function, you can just import that function somewhere else when you want to compose scripts together, which will make your life a lot easier down the road.
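As a rough sketch of that layout: the script below, its count_lines function and its command-line arguments are all made up for illustration and are not taken from TritonSort. The real work lives in a plain function, and main() only parses arguments and prints.

```python
import argparse


def count_lines(paths):
    """Return a dict mapping each path to the number of lines it contains."""
    counts = {}
    for path in paths:
        with open(path) as f:
            counts[path] = sum(1 for _ in f)
    return counts


def main(argv=None):
    # main() is just boilerplate: parse options, call the function, print.
    parser = argparse.ArgumentParser(description="Count lines in files.")
    parser.add_argument("paths", nargs="*", help="files to count")
    args = parser.parse_args(argv)
    for path, n in count_lines(args.paths).items():
        print("{}\t{}".format(n, path))


if __name__ == "__main__":
    main()
```

If this were saved as count_lines.py (a hypothetical file name), another script could write `from count_lines import count_lines` and get the dict back directly, without spawning a process or parsing printed output.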
Script functions should return (at least) semi-structured data. If your script produces unstructured text, at some point you’re going to have to parse it. That can get messy really fast. If you want a script’s results to be human-readable, make the script function return some data structure and have main() print it. Better yet, have a second function in the script that prints a readable version of the data structure the script function spits out, or make the data structure a class that overrides __repr__ or __str__.
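One way this might look, assuming a hypothetical script that summarizes latency samples (the class name, its fields and the sample data are invented for this sketch): the script function returns an object, and the object overrides __str__ so that main() can simply print it.

```python
class LatencySummary:
    """Structured result that also knows how to print itself."""

    def __init__(self, samples):
        if not samples:
            raise ValueError("need at least one sample")
        self.count = len(samples)
        self.mean = sum(samples) / self.count
        self.maximum = max(samples)

    def __str__(self):
        # Human-readable view; callers that want the numbers use the fields.
        return "{} samples, mean {:.2f}, max {:.2f}".format(
            self.count, self.mean, self.maximum)


def summarize_latencies(samples):
    """The 'script function': returns data, prints nothing."""
    return LatencySummary(list(samples))


def main():
    # Illustrative input only; a real script would read its samples somewhere.
    print(summarize_latencies([1.0, 2.0, 3.0]))


if __name__ == "__main__":
    main()
```

Run directly, this prints `3 samples, mean 2.00, max 3.00`. Code that imports summarize_latencies reads .count, .mean and .maximum instead and never has to parse that string.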
Make your output portable. If you expect that a program written in another language is going to have to consume a script’s output, it’s a good idea to make that output easy for the consuming program to read. If you’re just dumping out a list of numbers, by all means just dump that list of numbers with one number per line, but for anything more complicated than that you’ll want at least some metadata telling you what all this stuff you’re dumping actually is.
We’ve been using JSON more and more, since it has reasonably good support across a bunch of languages, is brain-dead simple to parse, and is reasonably structured without the extra bloat that XML imposes. If you’ve got a really complicated configuration file that needs to be validated, XML might be a better choice, but most of the time you really just want key/value pairs and some limited support for nesting and lists, and JSON does that just fine. I’ve also heard that YAML is awesome, but I’ve never used it.
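A minimal sketch of output with "at least some metadata", using Python's standard json module. The field names (description, units, values) and the sample numbers are assumptions chosen for the example, not a format TritonSort actually uses.

```python
import json


def dump_results(values, description, units):
    """Serialize results plus a little metadata saying what they are."""
    record = {
        "description": description,
        "units": units,
        "values": list(values),
    }
    # sort_keys keeps the output stable from run to run, which helps diffs.
    return json.dumps(record, indent=2, sort_keys=True)


def load_results(text):
    """Parse a record produced by dump_results back into a dict."""
    return json.loads(text)


if __name__ == "__main__":
    # Illustrative data only.
    print(dump_results([12.5, 13.1, 11.9], "sort time per run", "seconds"))
```

Any language with a JSON library can read this back, and the description and units fields tell the consumer what the numbers mean without a separate README.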
Document, document, document. I know I’ve been on a bit of a documentation kick lately, but seriously, Future You will thank Present You for telling him what exactly it is that putTheThingInThePlaceWhereStuffGoes.py does, what input format it expects, etc. Along the same lines, don’t use names like that one. Give your scripts descriptive names.
Hopefully this deters you from making some of the same mistakes we did. Happy scripting.