A few steps toward cleaner, better-organized code

This post follows from a discussion with my Bozeman lab on code management (see my very ugly slides with more details, especially on using git and github through RStudio, here).

Developing good coding habits takes a little time and thought, but the pay-off potential is high for (at least) these three reasons.

  • Well-maintained code is usually more reproducible than poorly-maintained code.
  • Clean code is often easier to run on a cluster than poorly-maintained code (for example, it’s easier to ship clean simulations and analyses away to a “big” computer, and save computing time on your local machine).
  • It’s likely that code, as well as data, will soon be a required component of publication, and we might as well be ready.

Since biologists and ecologists receive limited code management training, many grad students are unaware of established code and project management tools, and wind up reinventing the wheel. In this post, I go through a few simple steps to improve code management and keep analyses organized. Most of my examples pertain to R, since that’s my language of choice, but these same principles can be applied to most languages. Here’s my list, with more details on each element (and strategies for implementing them) below.

  1. Find a style guide you like, and use it
  2. Use a consistent filing system for all projects
  3. Functionalize and annotate analyses
  4. Use an integrated development environment like RStudio
  5. Use version control software like git, SVN, or Hg and incorporate github or bitbucket into your workflow

Like much of science, programming benefits from a little sunlight. If you’re serious about improving your code, get someone else to look at it and give you feedback (in fact, for R users, I can probably be that person; shoot me an email)!

1. Find a style guide you like, and use it

Most programming languages (R, Python, C++, etc.) have a particular set of conventions dictating code appearance. These conventions are documented in style guides, like this one for R. The point of a style guide is to keep code readable, and there’s solid research sitting behind most of the advice. In my opinion, these are the two most important stylistic suggestions for R:

  • Spacing: put a space around all binary operators (=, +, <-, -), and after all commas
  • Line breaks: break lines at 80 characters (you can set RStudio to display a vertical reminder line under Tools -> Global Options -> select Code Editing on the left, and then set “Show margins” to 80).

Following the suggested indentation specs is also a good idea, especially if you work (or plan to eventually work) in Python or other languages in which indentation and spacing are interpreted pieces of the code syntax.
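As a small illustration of the spacing and line-break suggestions above, here is the same calculation written twice. The variable names and numbers are made up for this example.

```r
# Hard to read: no spaces around operators or after commas, and two
# statements crammed onto one line that runs past 80 characters
host.weights<-c(61.2,58.9,70.4,66.1);mean.weight<-sum(host.weights)/length(host.weights)

# Easier to read: spaces around binary operators and after commas,
# one statement per line, and every line under 80 characters
host.weights <- c(61.2, 58.9, 70.4, 66.1)
mean.weight <- sum(host.weights) / length(host.weights)
```

Both versions do exactly the same thing; only the second is pleasant to debug six months later.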

2. Use a consistent filing system for all projects

Because of the ebbs and flows of data collection, teaching, and travel, it’s not unusual for scientists to juggle five to ten different projects simultaneously. One way to overcome the resulting chaos is to keep a consistent directory structure for all projects.

I keep every project (for me, “project” is usually synonymous with “manuscript”) in a folder named Research.  The Research folder has subdirectories (a.k.a. subfolders) for each project, and the folders are named to reflect their topics. Inside each project folder, I have exactly the same set of folders. These are

  • Data: contains all datasets associated with this project
  • Code: contains all code files required for this project
  • Documentation: contains manuscript drafts, notes, presentations associated with this project
  • Figures: contains all project figures

Collaborative projects (reviews, analyses with multiple analysts, or projects where I’m working as a consultant) have an additional “Communications” folder. I often use subdirectories within these folders, but their composition varies.
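One way to keep the structure consistent is to let R build it. The function below is only a sketch: the name CreateProjectFolders and its arguments are hypothetical, and it assumes the folder names listed above.

```r
CreateProjectFolders <- function(project.name, research.dir = "Research",
                                 collaborative = FALSE) {
  # this function creates the standard set of subfolders for one project.
  #
  # Args
  # project.name = name of the new project folder
  # research.dir = path to the top-level folder that holds all projects
  # collaborative = if TRUE, also create a "Communications" folder
  #
  # Returns
  # project.path = path to the new project folder (returned invisibly)
  #
  subfolders <- c("Data", "Code", "Documentation", "Figures")
  if (collaborative) {
    subfolders <- c(subfolders, "Communications")
  }
  project.path <- file.path(research.dir, project.name)
  for (subfolder in subfolders) {
    # recursive = TRUE also creates research.dir and project.path if needed;
    # showWarnings = FALSE keeps R quiet when a folder already exists
    dir.create(file.path(project.path, subfolder), recursive = TRUE,
               showWarnings = FALSE)
  }
  return(invisible(project.path))
}
```

Calling CreateProjectFolders("MyNewProject") from the directory that holds Research would then set up a new project in one line. Existing folders are left untouched.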

3. Functionalize and annotate analyses

Code for data analysis generally does one of the following: loads data, cleans data, generates functionality (reads in source functions), or runs analysis. Some programmers advocate having a separate script allocated specifically to each of these processes (I found these suggestions, and particularly this embedded link, very useful). I haven’t fully integrated this idea into my workflow yet, but I like it.

Functions are like little machines that take a set of inputs (or “arguments”), do some stuff based on those inputs, and return a set of outputs (or “values”).

There are at least two good reasons to functionalize. First, it facilitates error checking, and second, it improves code readability. My very basic rule of thumb is that if it takes me more than about five lines of code to get from inputs to outputs, it’s probably worth functionalizing.

Here’s an example of a function in R:

 
MyFunction <- function(input.in, input.out) {
  # this function returns a vector of integers from 
  # "input.in" to "input.out".
  #
  # Args
  # input.in = first integer in sequence
  # input.out = last integer in sequence
  #
  # Returns
  # k = vector of integers from input.in to input.out
  #
  k <- seq(from = input.in, to = input.out)
  return(k)
}

I strongly recommend using a header structure like I’ve used here, that specifically lists function purpose, arguments, and returns. It’s a good policy to save each function in its own file (I give those files the same name as the function they contain, so the file containing this function would be named MyFunction.R). To load the function into a different file, use R’s source command.

# head of new file.
# source in all necessary functions (including MyFunction) first
source("MyFunction.R")

my.function.test <- MyFunction(input.in = 2, input.out = 7)
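The error-checking benefit of functions is easier to see with a concrete case. The sketch below is a variant of MyFunction with a few input checks added. The name MyCheckedFunction and the particular checks are assumptions made for illustration.

```r
MyCheckedFunction <- function(input.in, input.out) {
  # this function returns a vector of integers from
  # "input.in" to "input.out", and stops with an error
  # if either input is not a single whole number.
  #
  # Args
  # input.in = first integer in sequence
  # input.out = last integer in sequence
  #
  # Returns
  # k = vector of integers from input.in to input.out
  #
  stopifnot(is.numeric(input.in), length(input.in) == 1)
  stopifnot(is.numeric(input.out), length(input.out) == 1)
  stopifnot(input.in == round(input.in), input.out == round(input.out))
  k <- seq(from = input.in, to = input.out)
  return(k)
}

checked.test <- MyCheckedFunction(input.in = 2, input.out = 7)
# A call like MyCheckedFunction(input.in = "a", input.out = 7) stops
# right away with an error instead of failing somewhere downstream.
```

Because the checks live inside the function, every script that calls it gets them for free.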

 

Good code designers plan their code structure ahead of time (e.g., I want this big output; to get it, I will need to work through these five subprocesses; each subprocess gets its own function, and a wrapper function integrates them all into a single entity that I will eventually call in one line).

In my experience, this is often not how biologists write code. Instead, we often write our analyses (simulations or samplers or whatever) in giant, free-flowing scripts. To get from that programming style to functionalized code, I recommend breaking code into functions after-the-fact in a few existing projects. Doing this a few times helped me learn to identify relevant subroutines at the project’s outset and write in functions from the beginning. Here are a couple examples of relatively common disease ecology programming tasks, with off-the-cuff suggestions about reasonable subroutines.

Example 1: Simulating a discrete-time, individual-based SIR process
I know from the beginning that I’ll likely want “Births” and “Deaths” functions, a “GetInfected” function that moves individuals from S to I, and a “Recover” function that moves individuals from I to R (e.g., a function for each major demographic process). If each process gets its own function, I know exactly where to go to incorporate additional complexity (like density dependence, pulsed births, age-specific demographic and disease processes, etc.). I wrap all these functions in a single function that calls each subroutine in sequence.
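A rough sketch of that layout might look like the following. Everything specific here is an assumption made for illustration: the argument names, the frequency-dependent transmission term, the order of events within a time step, and the wrapper name SimulateSIR. Each individual is stored as one entry ("S", "I", or "R") in a character vector.

```r
Births <- function(status, birth.prob) {
  # each individual present produces one susceptible newborn
  # with probability birth.prob
  n.births <- rbinom(n = 1, size = length(status), prob = birth.prob)
  return(c(status, rep("S", n.births)))
}

Deaths <- function(status, death.prob) {
  # each individual dies with probability death.prob, regardless of status
  survives <- runif(length(status)) >= death.prob
  return(status[survives])
}

GetInfected <- function(status, beta) {
  # moves individuals from S to I; the per-step infection probability
  # depends on the proportion of the population currently infected
  n.total <- length(status)
  if (n.total == 0) {
    return(status)
  }
  infection.prob <- 1 - exp(-beta * sum(status == "I") / n.total)
  new.cases <- status == "S" & runif(n.total) < infection.prob
  status[new.cases] <- "I"
  return(status)
}

Recover <- function(status, gamma) {
  # moves individuals from I to R with probability gamma per step
  recovers <- status == "I" & runif(length(status)) < gamma
  status[recovers] <- "R"
  return(status)
}

SimulateSIR <- function(n.steps, n.susceptible, n.infected, birth.prob,
                        death.prob, beta, gamma) {
  # wrapper: calls each subroutine in sequence at every time step and
  # returns a matrix of S, I, and R counts (one row per time step)
  status <- c(rep("S", n.susceptible), rep("I", n.infected))
  counts <- matrix(0, nrow = n.steps, ncol = 3,
                   dimnames = list(NULL, c("S", "I", "R")))
  for (step in seq_len(n.steps)) {
    status <- Births(status, birth.prob)
    status <- Deaths(status, death.prob)
    status <- GetInfected(status, beta)
    status <- Recover(status, gamma)
    counts[step, ] <- as.vector(table(factor(status,
                                             levels = c("S", "I", "R"))))
  }
  return(counts)
}

set.seed(1)
sir.out <- SimulateSIR(n.steps = 50, n.susceptible = 99, n.infected = 1,
                       birth.prob = 0.01, death.prob = 0.01,
                       beta = 0.5, gamma = 0.1)
```

Adding density dependence would then mean editing only Deaths (or Births), and the wrapper would not need to change.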

Example 2: Writing an MCMC sampler
Usually I know from the beginning what parameters I’m estimating, and I should have a clear plan about how to step through the algorithm (e.g., which updates are Gibbs, which are Metropolis, which control a reversible jump, etc.). In this case, I’d use a separate function to update each element (or block of elements) in the parameter vector. This gives me the flexibility to change some aspect of one update without having to dig through the whole sampler and risk messing up some other piece. Again, I’d use a wrapper to link all the subroutines into one entity.
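As a toy version of that structure, the sketch below estimates the mean and variance of normally distributed data, with one function per parameter update and a wrapper that links them. The model, the priors (a Normal(0, 100^2) prior on mu and an inverse-gamma prior on sigma2), and all function names are assumptions chosen to keep the example short.

```r
UpdateMu <- function(mu, sigma2, y, proposal.sd) {
  # Metropolis update for mu using a symmetric random-walk proposal
  LogPosterior <- function(m) {
    sum(dnorm(y, mean = m, sd = sqrt(sigma2), log = TRUE)) +
      dnorm(m, mean = 0, sd = 100, log = TRUE)
  }
  mu.proposed <- rnorm(1, mean = mu, sd = proposal.sd)
  log.ratio <- LogPosterior(mu.proposed) - LogPosterior(mu)
  if (log(runif(1)) < log.ratio) {
    mu <- mu.proposed
  }
  return(mu)
}

UpdateSigma2 <- function(mu, y, prior.shape, prior.rate) {
  # Gibbs update for sigma2: with an inverse-gamma prior, the full
  # conditional is also inverse-gamma, so we can draw from it directly
  shape <- prior.shape + length(y) / 2
  rate <- prior.rate + sum((y - mu) ^ 2) / 2
  return(1 / rgamma(1, shape = shape, rate = rate))
}

RunSampler <- function(y, n.iter, proposal.sd = 0.5,
                       prior.shape = 0.01, prior.rate = 0.01) {
  # wrapper: steps through the updates in order and stores every draw
  draws <- matrix(NA_real_, nrow = n.iter, ncol = 2,
                  dimnames = list(NULL, c("mu", "sigma2")))
  mu <- mean(y)
  sigma2 <- var(y)
  for (iter in seq_len(n.iter)) {
    mu <- UpdateMu(mu, sigma2, y, proposal.sd)
    sigma2 <- UpdateSigma2(mu, y, prior.shape, prior.rate)
    draws[iter, ] <- c(mu, sigma2)
  }
  return(draws)
}

set.seed(1)
y.sim <- rnorm(100, mean = 3, sd = 2)
draws <- RunSampler(y = y.sim, n.iter = 2000)
```

Swapping the Metropolis step for a different proposal only touches UpdateMu; the variance update and the wrapper stay as they are.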

My rule of thumb is to annotate everything that I can’t quickly regenerate from basic logic. One of the reasons I’ve pushed toward more functionalized code (and git/github) is that it tightens my annotation process: even if all I do is write an appropriate header for the function, my code is already easier to follow than it was in a single giant script.

4. Use an integrated development environment like RStudio

At our roots most of us are scientists, not programmers or developers, and as such we need to keep up with what’s going on in science, not software development. The Open Science movement has invested a lot of effort in helping scientists incorporate modern computational workflows into research. One example project that tries to do this is RStudio. RStudio is an integrated development environment (“IDE”) set up for scientists and analysts. It integrates an R interpreter with LaTeX (a document generating program commonly used in math, computer science, and physics) and git (one flavor of version control software). Because of these integrations, I highly recommend that scientists using R use it through RStudio (this from someone who used the R-GUI for two years and vim-r-plugin for another two). In my mind, the integration is what makes RStudio worthwhile, even for Mac users whose R-GUI isn’t quite so bad.

5. Use version control software like git, SVN, or Hg and incorporate github or bitbucket into your workflow

Version control software solves the problem of ending all your code files in “_today’s_date”. These tools keep track of changes you make, and allow you to annotate precisely what you’re doing. Importantly, they also let you dig discarded code chunks out of the trash.

Several flavors of version control software are now integrated with online platforms. This allows for easy code sharing and cloud backup (and code publication). One platform, github (and regular git run locally), integrates directly with RStudio; another, bitbucket, has unlimited free repositories. Students also have access to five free repos on github — go to education.github.com and request the student pack; in my experience, it helps to send them a reminder email a few weeks after your first request. An aside: women (or, users with female first names) are crazily under-represented on github. Growing the female online programming community is important for science, and good for your code to boot!

Wrap-up
Some of these steps are easier to implement than others. I recommend establishing some protocols for yourself, applying these protocols to new projects, and gradually moving old projects into compliance as needed.

In my experience, adopting some parts of the style conventions (especially those pertaining to spacing and characters-per-line) was easy; adopting others was harder. Start with the low-hanging fruit: any improvement is better than none!

I don’t know many people (and especially not many scientists) who are particularly proud of how their code looks, but having people look at your code can be incredibly instructive. Find a buddy and work together, if you can.

Finally, a few small changes can make a big difference, not only in the reproducibility of your science, but also in your confidence as a programmer. People do ask biologists and ecologists for code samples in the job application process, and a little finesse goes a long way!


Responses to A few steps toward cleaner, better-organized code

  1. jshoyer says:

    Nice post Kezia. Do you use either the devtools or the inlinedocs packages?

    • keziamanlove says:

      I use devtools, but (full disclosure) I’m pretty bad about wrapping my code into actual packages (this is an area I’m working to improve… my biggest hindrance is apathy). inlinedocs is new to me, but looks really user-friendly; thanks for the heads up!
