Question

Forum: Which programming language to integrate several open source software into one working pipeline (software)

4

10.3 years ago

benjamin.hebisch ▴ 70

Hey there!

I'm completely new to the programming/coding side of bioinformatics, although I have been using phylogenetic software quite successfully for 2 years.

I'm planning a PhD project in which several open access tools are combined into one working pipeline/workflow/software (similar to MEGA5) to run analyses overnight or over a week or month. Done manually, this workflow cost me 1 year to analyze a bunch of proteins and to get familiar with several tools. In the long run, I want to implement this as an automatic sub-proteome analysis.

Is it possible, in any language, to program a tool which is able to obtain data from a certain database, subject it to software A, control software A, send the output of software A to software B, control software B and so on?

The general workflow is mainly linear or has just one branching point.

MEGA pipeline • 4.0k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by benjamin.hebisch ▴ 70

0

Wow! Many thanks!

I will dig through those programs. Especially those mentioned by Chris Evelo seem to be so user friendly that even wet-lab scientists could work with that :D

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by benjamin.hebisch ▴ 70

Ram · Answer 1 · 2014-08-08

7

10.3 years ago

Cytosine ▴ 460

You can do this with just about any programming language. Pick the one you're most comfortable with.
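As a minimal sketch of that idea in Bash (chosen only because Bash appears elsewhere in this thread), a linear pipeline can be a function that runs each tool in turn and stops at the first failure. The names `step_a`, `step_b` and `run_pipeline` are hypothetical, and the `tr` and `sort` calls are placeholders standing in for real tools; none of this comes from the original question.

```shell
# Hypothetical stand-ins for "software A" and "software B".
# Replace each body with the real command line of your tool.
step_a() { tr '[:lower:]' '[:upper:]' < "$1" > "$2"; }  # A: uppercase the input
step_b() { sort -u "$1" > "$2"; }                       # B: keep unique lines, sorted

# run_pipeline INPUT WORKDIR
# Runs A, feeds A's output to B, and prints the path of the final result.
# The && chain stops the pipeline as soon as one step fails.
run_pipeline() {
  local input=$1 workdir=$2
  mkdir -p "$workdir" &&
    step_a "$input" "$workdir/a.out" &&
    step_b "$workdir/a.out" "$workdir/b.out" &&
    echo "$workdir/b.out"
}
```

Calling `run_pipeline proteins.txt results` (file names are illustrative) would leave every intermediate file in `results/`, which makes it easier to inspect where a run went wrong.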

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.3 years ago by Cytosine ▴ 460

Ram · Answer 2 · 2014-08-08

Did you look at Taverna, KNIME, Bioclipse? You might not want to start a whole new project for something that has been done before and that is even open source, so it could be extended if it doesn't fit your needs completely.

Update. Let me explain a bit more.

People mentioned that basically every programming language or OS can do this, and they are right. Make tools can make it easier for you, since they were made to steer workflows; that is of course true as well. Still, some things are just better at solving specific problems than others, or allow you to use the parallelisation capacities of the system you use. So many people have a favorite, often for good reasons, and there are a lot of correct answers here.

But...

I have seen many instances where using output from one program as input for another was not simple at all. You need file format changes, changes in the database identifiers used, ontology mappings (for instance, mappings between information in study descriptions about tissue function and cell types, to find the studies that can be compared) and conversions from one standardized (?) format to another. Many questions here on Biostar are about such individual steps. The recent question about using pathways in BioPAX was a nice example of how quickly that can become complicated. In practice you often need to use a lot of blocks, and you need glue in between the blocks: things that take, say, BioPAX produced by one tool and produce the SBGN needed by another. Creating those kinds of converters to glue things together can take months of work, and then often they still don't cover the fact that real data also contains format errors. So there is a big advantage to having a toolkit full of blocks and connections between blocks, plus tools that allow you to configure those connections.
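To make "glue between blocks" concrete, here is a deliberately small sketch in Bash and awk. It is not the BioPAX-to-SBGN case, which is far more involved; it assumes a tool that emits FASTA and a downstream tool that wants one `id<TAB>sequence` row per record. The function name `fasta_to_tsv` is hypothetical, and the single sanity check only hints at the format errors real data tends to contain.

```shell
# fasta_to_tsv FILE
# Converts FASTA records to "id<TAB>sequence" lines, joining wrapped sequence lines.
# Fails with a non-zero status if sequence data appears before any ">" header.
fasta_to_tsv() {
  awk '
    /^>/ {                                  # header line: flush the previous record
      if (id != "") print id "\t" seq
      id = substr($1, 2)                    # ID is the first word, without ">"
      seq = ""
      next
    }
    id == "" && NF > 0 {                    # malformed input: data before a header
      print "sequence line before first header" > "/dev/stderr"
      exit 1
    }
    { seq = seq $0 }                        # sequence line: append to current record
    END { if (id != "") print id "\t" seq } # flush the last record
  ' "$1"
}
```

Even this toy converter needs decisions (what counts as the ID, what to do with bad input), which is the kind of detail that makes hand-written glue expensive.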

It is unfortunately not true that those workflow tools are very simple to use. First of all, you need to know what you are doing; in that respect it is useful to understand both the tool and the biology. And then your specific problem will oftentimes still contain some really new steps, which you will have to code. Reusability comes at a price too. You need to document even better than you should anyhow, and ideally you would think more about the things that the next user might encounter. These workflow environments are in part built to force you to do that, which sometimes makes them harder to use than you would expect. But yes, if you collaborate with a wet-lab group, those kinds of tools will be easier for them to use if you supply the patches for their specific problem.

The good thing really is in the reusability and thus in the sharing of solutions and building blocks. That is what a site like myexperiment.org is for.

score 4 · Answer 3 · 2014-08-08

4

10.3 years ago

Pierre Lindenbaum 164k

I would use GNU-Make http://www.gnu.org/software/make/manual/make.html

0

What makes it special?

ADD REPLY • link 10.3 years ago by Medhat 9.8k

1

make is used by millions of software projects to build resources along a dependency tree. Once you have written the Makefile, it will automatically run whichever steps are needed to reach your defined endpoint.

ADD REPLY • link 10.3 years ago by karl.stamm 4.1k

Ram · Answer 4 · 2014-08-08

Use makefiles and GNU make. There's a nice overview of using make to build and control analysis pipelines at Bioinformatics Zen.

Why use makefiles? In brief, just about anything you can run from the command line can be called from a makefile, and if you need to rerun a pipeline, only changed intermediate files trigger rebuilding of related targets (unless you force otherwise). This can reduce the time required to rerun a pipeline, and a consistent build path also reduces the odds of user errors.
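The rebuild rule described above can be illustrated without make itself. The following is a rough Bash emulation of make's timestamp check, not a Makefile and not a replacement for make (it handles one source per target and no dependency tree). The names `run_if_stale` and `to_upper` are hypothetical, and `to_upper` is a placeholder for a real pipeline step.

```shell
# Placeholder pipeline step: writes an uppercased copy of $1 to $2.
to_upper() { tr '[:lower:]' '[:upper:]' < "$1" > "$2"; }

# run_if_stale TARGET SOURCE COMMAND [ARGS...]
# Runs COMMAND only when TARGET is missing or older than SOURCE,
# which is the core of what make decides for each rule.
run_if_stale() {
  local target=$1 source=$2
  shift 2
  if [ ! -e "$target" ] || [ "$source" -nt "$target" ]; then
    "$@" && echo "rebuilt $target"
  else
    echo "up to date: $target"
  fi
}
```

For example, `run_if_stale out.txt in.txt to_upper in.txt out.txt` does the work on the first call and reports `up to date: out.txt` on the second, until `in.txt` changes. make gives you the same behaviour across a whole chain of targets from a declarative description.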

Further, you need very little customization to run makefiles via GNU make, which is a toolkit already on most OSS-based systems found in bioinformatics, and make is agnostic about specialized scripting tools. It doesn't care if you use Python, Java, Perl, bash, etc., and it is robust: it doesn't share their uniquely weird and fragile version and library dependencies. Things aren't going to break if you update a minor Python version, for instance. (Well, a Python script in your pipeline might break, but that's a separate issue.)

score 2 · Answer 5 · 2014-08-08

2

10.3 years ago

Biomonika (Noolean) 3.2k

Galaxy is very popular and convenient for connecting outputs from multiple programs/scripts, e.g. by creating workflows. Look for images when you google "workflow galaxy".

http://galaxyproject.org/

Ram · Answer 6 · 2014-08-08

I do this kind of thing daily with Bash. Big fanboy! Want to do something to every file in a folder?

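# Run blastn on every FASTA file in the parent directory
for f in ../*.fasta
do
  name=$(basename "$f" .fasta)   # strip the directory and the .fasta suffix
  blastn -query "$f" -db someDB -out "$name.output"   # quoted so odd file names survive
done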

Want to do something and then something else and then something else? Pipes everywhere!

blastn -query file.fasta -db someDB -outfmt 6 | cut -f1 | sort -u | grep whatelse

If then, else if then, else?

while IFS= read -r line
do
  chromosome=$(echo "$line" | cut -f1)
  if [ "$chromosome" = "chrX" ]
  then
    someNumber=$(echo "$line" | cut -f8)
  elif [ "$chromosome" = "chrY" ]
  then
    someNumber=$(echo "$line" | cut -f9)
  else
    echo "NA"
  fi
done < inputFile.tsv

Pros of Bash? It's 99% certain you're already using a Bash shell. The built-in tools are perfect for scripting, and you'd be using them anyway, e.g. for processing tab-separated values. Syntax.

Cons? It's very slow in comparison to other languages, although this doesn't matter if you're just piping bits from one program into another. It's the execution time of the programs that matters. Syntax.

Syntax is very simple, but it's easy to make mistakes, for example when the difference between echo $line and echo "$line" matters.
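A small sketch of why that difference matters, assuming a tab-separated line with an empty third field (the sample data and the function names `field4_unquoted` and `field4_quoted` are made up for illustration). Unquoted, the shell splits the line on whitespace and echo rejoins the words with single spaces, so the tabs are gone before cut ever sees them.

```shell
# A tab-separated record whose third field is empty: chrX, 100, (empty), 200
line=$(printf 'chrX\t100\t\t200')

# Unquoted: word splitting collapses the tabs into single spaces.
# cut finds no tab delimiter and therefore prints the whole line.
field4_unquoted() { echo $1 | cut -f4; }

# Quoted: the tabs survive, so cut really returns the fourth field.
field4_quoted() { echo "$1" | cut -f4; }

field4_unquoted "$line"   # prints: chrX 100 200
field4_quoted "$line"     # prints: 200
```

Unquoted variables are also subject to filename expansion, so a field containing `*` can turn into a list of files; quoting avoids that as well.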

Ram · Answer 7 · 2014-08-08

We use Ruffus here to build pipelines with Python. It works very nicely for me, with a sensible system for checking file dependencies. That is, if you change your pipeline halfway through or change an intermediate file, it'll work out where to start the re-analysis from automatically.

I'd echo what Cytosine says though and start from the programming language you're comfortable with.