Question

Organizing bioinformatics oriented python scripts

4

9.8 years ago

Pappu ★ 2.1k

I have written several hundred python scripts for bioinformatic analysis. I am wondering if there is any good way to organize them so that I don't waste time writing similar scripts again. I am sorry if the question is off topic.

python • 3.3k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by Pappu ★ 2.1k

1

Maybe more of a comment than an answer... First, I would say put your scripts in one or more repositories under version control, e.g. on GitHub. Of course, this doesn't fix the issue of reinventing the wheel every now and then. So I would also suggest adding README files to the repository and documenting your scripts carefully.

What I tend to do is write standalone scripts for tasks that are fairly general and that I could use again and again. For small scripts that I use in one-off cases (e.g. an R script to plot a particular figure), I use redmine and its wiki pages so I can easily search what I was doing in the past, in case I need to do something similar now.

ADD REPLY • link 9.8 years ago by dariober 15k

0

Hello Pappu!

We believe that this post does not fit the main topic of this site.

I think it's not related to bioinformatics. Search the web for "modular programming" (e.g. http://www.python-course.eu/modules_and_modular_programming.php ) and see @dariober's answer.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

ADD REPLY • link 9.8 years ago by Pierre Lindenbaum 166k

6

As it happens, we have to be careful with this, since bioinformatics only looks like typical programming; it is not.

A typical programmer does not write hundreds of independent programs. Typical programming jobs tend to build large monolithic solutions from many disparate pieces. There it makes sense to modularize it with packages, modules, etc. Those really don't help with hundreds of utility programs.

Bioinformatics is the opposite: we want to build many small solutions, often based on one large monolithic library (say pysam, biopython, etc.).

There is really no other place to go to ask how to organize and code up bioinformatics scripts in particular.

I think the traditional answer of one program per task is both archaic and counterproductive. You end up with a suite composed of hundreds of difficult-to-remember pieces.

The modern bioinformatics zeitgeist is to create a single program that takes many subcommands. Users then only have to remember one program name, and the program itself offers self-discovery of everything else, as samtools, bwa and bedtools do.
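For Python scripts, one way to get that layout is the standard library's `argparse` subparsers. The sketch below is only an illustration: the program name `biotools` and the two subcommands `gc` and `revcomp` are made up for the example, and each function stands in for what would otherwise be a separate script.

```python
import argparse


def gc_content(args):
    """Return the GC fraction of the sequence as a two-decimal string."""
    seq = args.sequence.upper()
    if not seq:
        return "0.00"
    gc = seq.count("G") + seq.count("C")
    return f"{gc / len(seq):.2f}"


def revcomp(args):
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return args.sequence.translate(complement)[::-1]


def build_parser():
    # "biotools" is a hypothetical name for the single entry point.
    parser = argparse.ArgumentParser(
        prog="biotools", description="Small sequence utilities in one program."
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Each former standalone script becomes one subcommand.
    p_gc = subparsers.add_parser("gc", help="GC fraction of a sequence")
    p_gc.add_argument("sequence")
    p_gc.set_defaults(func=gc_content)

    p_rc = subparsers.add_parser("revcomp", help="reverse complement of a sequence")
    p_rc.add_argument("sequence")
    p_rc.set_defaults(func=revcomp)

    return parser


def main(argv=None):
    # argv=None makes argparse read the real command line.
    args = build_parser().parse_args(argv)
    print(args.func(args))
```

Saved as `biotools.py` with `main()` called under the usual `if __name__ == "__main__":` guard, `python biotools.py -h` lists the subcommands and `python biotools.py gc -h` shows the help for one of them, which is where the self-discovery comes from.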

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by Istvan Albert 102k

1

I agree with Istvan about the peculiarity of bioinformatics, so I'd like to see this question re-opened and hear what others think about it.

ADD REPLY • link 9.8 years ago by dariober 15k

0

I think to answer this we need to make the distinction between a script and a program. I don't know if there is a formal difference in the terminology, but I always assumed a script is something you can make quickly (like in a day) while a program takes significantly longer (weeks/months).

So under that definition, a script is something like a shell script (to co-ordinate inputs and outputs, temporary directories, a bit of logic for checking things worked, etc) or an AWK command, which again doesn't take a huge amount of time to write, but is a hassle to write and check. If you have hundreds of these, I'd say logging is the way to go. Associate your outputs with the shell script used to make them. If you want a certain output, find a similar sort of output and find the script used to make it. I'd point you towards the log program I wrote that does this, but it's in no position to be used right now.
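Since the question is about Python scripts, here is a minimal sketch of that idea in Python. It is not the log program mentioned above; the function name `write_provenance` and the `.prov.json` suffix are assumptions made for this example. It writes a small JSON file next to each output that records which script and arguments produced it.

```python
import json
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(output_path, script=None, argv=None):
    """Write a JSON sidecar recording which script produced output_path.

    script and argv default to the running script and its arguments.
    The ".prov.json" suffix is an arbitrary choice for this sketch.
    """
    output_path = Path(output_path)
    record = {
        "output": output_path.name,
        "script": str(script if script is not None else sys.argv[0]),
        "argv": list(argv if argv is not None else sys.argv[1:]),
        "created": datetime.now(timezone.utc).isoformat(),
    }
    # e.g. counts.tsv -> counts.tsv.prov.json in the same directory
    sidecar = output_path.with_name(output_path.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

Calling `write_provenance("counts.tsv")` at the end of a script leaves a `counts.tsv.prov.json` beside the output, so searching those sidecar files for a script name or an argument leads back to the script that made a given result.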

If, however, you mean keeping track of programs under the definition described above, then that's really a different situation entirely. Here you're talking about reusing functionality in various stages of development, and as Istvan points out, that's usually answered by modularizing the code, breaking it up into small units of functionality. Some people love doing this, and are happy to import or require a million things at the beginning of each program they write. Their programs become more like scripts this way - just glue for existing logic.
I personally hate this, because now I have to remember a million pip or npm module names to get anything done, and trust that these modules do what I expect them to do. I'd rather just copy the code out of an old program and paste it into a new one, but then version control becomes a nightmare. As dariober mentions, GitHub would be a good place for your modules, or your programming-language-specific version of pip/npm.

ADD REPLY • link 9.8 years ago by John 13k

Ram · Answer 1 · 2015-10-02

I only have 3 python scripts, they're all filed in my pythonstuff directory :)

However, I have dozens of perl scripts and modules. Code that I reuse in different contexts goes into modules, e.g. APIs to databases, code to deal with plates/wells, some algorithms... The scripts I keep are reusable utilities and are organized by "role/function" in dedicated directories, e.g. format conversion, screen analysis...

If that's not flexible enough, you could also consider using a tagging system like TMSU.