In the data exploration phase of several bioinformatics projects I have worked on, we looked at the data from many different angles before deciding on appropriate approaches for further analysis. During this phase, the questions were often, "What happens if we vary this parameter, and this parameter as well, and all combinations of the two?", and later, "What happens if we also look at this other parameter?", often for a parameter I hadn't even thought of varying at the outset. And so on.
As this process continues, it takes more and more time to rework the code to carry out the new analyses. It often feels like a delicate balancing act between maintaining enough flexibility that you don't code yourself into a corner, and not spending too much time enabling flexibility you are not sure you will need.
What are the best ways to handle this problem?
Do you keep modifying and refactoring a small collection of ever-more-complex programs?
Do you pipeline a larger number of limited function scripts?
Are object-oriented approaches more appropriate for this type of analysis, or do they require too much design overhead?
Does adherence to a specific development methodology (e.g., Agile) work best?
Are these techniques feasible for a small academic lab with typically one to two people working on a project?
Are there easier ways?
Ryan, great answer. The config file + functional + pipeline method is exactly my approach as well. Do you have any public code that uses ruffus? It would be great to see some real-life examples where it improves the flexibility of the code.
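To make the config file + functional + pipeline idea concrete, here is a minimal stdlib-only Python sketch (the parameter names, values, and `run_analysis` function are made up for illustration; a real setup like ruffus adds dependency tracking on top of this). The point is that sweeping a new parameter means adding one entry to the config, not reworking the analysis code:

```python
from itertools import product

# Hypothetical config: in practice this would live in a JSON/YAML file
# that the pipeline reads, so new parameter sweeps require no code changes.
config = {
    "min_quality": [20, 30],
    "kmer_size": [21, 31],
}

def run_analysis(min_quality, kmer_size):
    # Stand-in for one small, single-purpose analysis step;
    # here it just returns a label identifying the run.
    return f"q{min_quality}_k{kmer_size}"

# Enumerate every combination of the swept parameters and
# push each one through the pipeline step.
results = [
    run_analysis(q, k)
    for q, k in product(config["min_quality"], config["kmer_size"])
]
print(results)  # one result per parameter combination
```

Adding a third parameter later ("What happens if we also vary this?") is then a matter of adding a key to `config` and an argument to the step functions, rather than restructuring the scripts.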
Thanks for the answer! Ruffus looks interesting, and has a good tutorial. I would be very interested in seeing a sample of this set up as well, if you are willing to post one. Thanks again.
I edited my answer to include a link to some example code.
...and keep everything in some kind of version control system, with proper notes to keep track of what you did.