Question

How To Decide Which Software To Use For Building An Automated NGS Analysis Pipeline

27

13.7 years ago

Mcmahanl ▴ 300

There is so much software available for building automated NGS analysis pipelines. How does one decide which to use? For example, listed below are some of the tools I have come across:

Is there any other software that I should look into?

next-gen sequencing pipeline sequencing • 12k views

ADD COMMENT • link updated 12.4 years ago by Jeremy Leipzig 22k • written 13.7 years ago by Mcmahanl ▴ 300

0

Sometimes the solution is to be aware of only one of these ;-) Now, in all seriousness, thanks for listing all these examples; this will be a great resource.

ADD REPLY • link 13.7 years ago by Istvan Albert 101k

0

Check out https://github.com/LPM-HMS/COSMOS2

ADD REPLY • link 7.8 years ago by egafni ▴ 30

score 10 · Answer 1 · 2011-03-28

I have experience with shell-script-based pipelines and Galaxy. While Galaxy provides a great front end for building pipelines, I have found it slower at running the tasks. One serious drawback of Galaxy is that it stores the results of every intermediate step in all their full uncompressed glory. This, I am sure, partly accounts for the slowdown, as disk write activity is heavy. It also fills up drives, which can be an issue in itself, especially if you are doing a lot of analyses.

Shell scripts can be really flexible and powerful, but they are not as user-friendly. I am sure any scripting language could deliver similar results.
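
The point about intermediate files can be illustrated with a minimal Python sketch. It assumes a toy pipeline of two hypothetical steps (`drop_short` and `to_upper`, which do not come from any real tool): the steps are chained as generators so nothing is written between them, and only the final result goes to disk, gzip-compressed.

```python
import gzip
from pathlib import Path
from typing import Iterable, Iterator


def drop_short(seqs: Iterable[str], min_len: int) -> Iterator[str]:
    """Hypothetical filter step: keep sequences of at least min_len bases."""
    return (s for s in seqs if len(s) >= min_len)


def to_upper(seqs: Iterable[str]) -> Iterator[str]:
    """Hypothetical normalisation step: upper-case every sequence."""
    return (s.upper() for s in seqs)


def run_pipeline(seqs: Iterable[str], out_path: Path, min_len: int = 4) -> int:
    """Stream seqs through both steps and write one gzip-compressed result.

    No intermediate file is created; returns the number of sequences written.
    """
    written = 0
    with gzip.open(out_path, "wt") as out:
        for seq in to_upper(drop_short(seqs, min_len)):
            out.write(seq + "\n")
            written += 1
    return written
```

The trade-off is that a failed run cannot resume from a saved intermediate, so this suits cheap steps better than expensive ones.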

score 6 · Answer 2 · 2011-03-28

13.7 years ago

Ryan Dale 5.0k

I write many of my NGS pipelines using Ruffus. It's really easy to run tasks in parallel. Simple pipelines are correspondingly simple to write, but at the same time it's rich enough to support very complex pipelines, too (e.g., http://www.ruffus.org.uk/gallery.html).

ADD COMMENT • link 13.7 years ago by Ryan Dale 5.0k

score 6 · Answer 3 · 2011-03-29

From what I have heard, Galaxy is the most widely used generic pipeline for NGS. Nonetheless, I guess more people are building their own pipelines from scratch. IMHO, the difficulty of using a generic pipeline comes from the differences between parallelization environments. It is pretty easy if everything runs on the same node, but LSF/SGE/PBS and their different configurations (e.g. memory and runtime limits) make things messy.

If you are the only user of your cluster and have full control, using a generic pipeline may not be hard. A friend of mine built a private cloud and uses Galaxy on it; everything runs smoothly. If you are using nodes that are part of a huge cluster, writing your own pipeline is probably easier. When you control your pipeline, you can also easily avoid inefficient parts, as mentioned by Farhat. Implementing an initial pipeline is not that difficult. It will take time to refine it, but the same is true if you use generic pipeline frameworks.

score 5 · Answer 4 · 2011-03-29

13.7 years ago

Ketil 4.1k

I don't know what you mean by "NGS analysis", but coming from a comp. sci. background, I tend to use 'make' to construct non-trivial pipelines. For our current de novo project, the pipeline consists of primary assembly (Newbler, Celera and CLC), secondary assembly (SSPACE), remapping of reads (bwa index, aln and sampe), quality evaluation (samtools idxstats and flagstat), generating graphs (gnuplot) and so on.

Since many of these steps are time consuming, and since something always fails at some point, make's ability to skip already completed files saves a lot of time, especially with some careful sprinkling of .PRECIOUS. Also, make's -j option means that my pipeline is trivially parallelized.

The downside is that it can be a bit hard to debug, but -r --warn-undefined-variables helps a bit. I'm still missing some way of separating output from subprocesses, especially when running -j 16 :-)
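
The skip-what-is-already-done behaviour described above boils down to a timestamp comparison. Here is a minimal Python sketch of that rule. It is not make's actual implementation, and `is_up_to_date` and `run_step` are hypothetical names chosen for illustration.

```python
from pathlib import Path
from typing import Callable, Sequence


def is_up_to_date(target: Path, deps: Sequence[Path]) -> bool:
    """True if target exists and no dependency is newer than it."""
    if not target.exists():
        return False
    target_mtime = target.stat().st_mtime
    return all(dep.stat().st_mtime <= target_mtime for dep in deps)


def run_step(target: Path, deps: Sequence[Path], action: Callable[[], None]) -> bool:
    """Run action only when target is missing or stale; report whether it ran."""
    if is_up_to_date(target, deps):
        return False
    action()
    return True
```

Rerunning the whole pipeline after a failure then only repeats the steps whose targets are missing or older than their inputs.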

ADD COMMENT • link 13.7 years ago by Ketil 4.1k

2

I've always wondered how people use make this way. Do you have validator scripts that can distinguish a valid output file from garbage?

ADD REPLY • link 13.7 years ago by Jeremy Leipzig 22k

1

Most of my output files are garbage. Maybe I should switch to Haskell.

ADD REPLY • link 13.7 years ago by Jeremy Leipzig 22k

0

Why would an output file be garbage? If anything, make helps against this, since it deletes temporary files when something goes wrong.

But yes, validating results is always a good idea.
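
One caveat worth checking against the GNU make manual: make removes the target of a failed recipe only when the special target .DELETE_ON_ERROR is declared (it does clean up the current target when interrupted by a signal). A half-written file can otherwise look complete to the next run. A minimal Python sketch of one common guard, assuming a hypothetical helper `write_atomically` and a deliberately crude "non-empty" validity check:

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_atomically(target: Path, produce: Callable[[Path], None]) -> None:
    """Let produce() write to a temporary file, then rename it onto target.

    If produce() raises or leaves an empty file, target is never created.
    """
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".partial")
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        produce(tmp)
        # Crude validation (an assumption): an empty output counts as failure.
        if tmp.stat().st_size == 0:
            raise ValueError(f"step produced an empty file for {target}")
        os.replace(tmp, target)  # atomic when both paths are on one filesystem
    finally:
        if tmp.exists():
            tmp.unlink()
```

A real validator would replace the size check with something format-specific, but the temp-then-rename pattern stays the same.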

ADD REPLY • link 13.7 years ago by Ketil 4.1k

0

:-) Yes! Unfortunately, the old rule of garbage in - garbage out is universal, and independent of implementation language...

ADD REPLY • link 13.7 years ago by Ketil 4.1k

score 2 · Answer 5 · 2011-03-31

I have extensive experience with a large custom-developed pipeline that uses a database to coordinate tasks on a private pool of commodity PCs. This approach is very flexible, but it exposes a lot of coordination complexity as workflows become more complex.

It's pretty clear that the more forking and joining you have in your workflow, the higher the complexity regardless of your approach.

Focus on what costs the most. When you're dealing with large amounts of data, storage is cheap but accessing and moving it is not, so the closer the data is to the compute nodes, the higher the throughput.
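
One way to act on the locality point is to stage inputs onto node-local disk before processing. A minimal Python sketch, assuming a hypothetical helper `stage_in` and a scratch directory of your choosing; the same-size check is only a cheap heuristic, not an integrity guarantee:

```python
import shutil
from pathlib import Path


def stage_in(shared_file: Path, scratch_dir: Path) -> Path:
    """Copy shared_file into scratch_dir unless a same-sized copy is already there."""
    scratch_dir.mkdir(parents=True, exist_ok=True)
    local_copy = scratch_dir / shared_file.name
    # Assumption: equal size means the earlier copy is still usable.
    if not local_copy.exists() or local_copy.stat().st_size != shared_file.stat().st_size:
        shutil.copy2(shared_file, local_copy)
    return local_copy
```

Each task then reads from the returned local path, so the shared filesystem is hit once per file rather than once per step.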

score 1 · Answer 6 · 2012-06-29

12.4 years ago

Jeremy Leipzig 22k

We have been building some genotyping pipelines in Pegasus, which is a very heavyweight platform for scientific pipelines, and is apparently NSF-funded through the 2016 Olympics in Rio. Plan accordingly.

Pegasus is very friendly with Condor, although it can be run on other batch systems with some headaches.

The nodes look like this (I've stripped away the angle brackets to conform to BioStar):

job id="ADDRG_01" namespace="align" name="java" version="4.0"
    argument
        -Xmx16g
        -jar ${picardfolder}/AddOrReplaceReadGroups.jar
        INPUT=${filename}.bam
        OUTPUT=${filename}.sorted.bam
        SORT_ORDER=coordinate
        RGPU=
        RGID=1
        RGLB=bar
        RGPL=${platform}
        RGSM=${outputprefix}
        CREATE_INDEX=True
        VALIDATION_STRINGENCY=LENIENT
        TMP_DIR=${picardtemp}
    /argument
    stdout name="${filename}.picardrg.out" link="output"/
    stderr name="${filename}.picardrg.err" link="output"/
/job
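
For readers who want to see the node with its angle brackets, here is a minimal Python sketch that rebuilds the same element with the standard library's `xml.etree.ElementTree`. It only reproduces the tags and attributes shown in the snippet and has not been checked against the Pegasus schema. `build_job` is a hypothetical name, and `TMP_DIR` is written without a space before `=` on the assumption that the space is a typo.

```python
import xml.etree.ElementTree as ET

# Picard arguments copied from the snippet; ${...} placeholders are left as-is.
PICARD_ARGS = [
    "-Xmx16g",
    "-jar", "${picardfolder}/AddOrReplaceReadGroups.jar",
    "INPUT=${filename}.bam",
    "OUTPUT=${filename}.sorted.bam",
    "SORT_ORDER=coordinate",
    "RGPU=",
    "RGID=1",
    "RGLB=bar",
    "RGPL=${platform}",
    "RGSM=${outputprefix}",
    "CREATE_INDEX=True",
    "VALIDATION_STRINGENCY=LENIENT",
    "TMP_DIR=${picardtemp}",  # assumption: no space before '='
]


def build_job() -> ET.Element:
    """Build the job element with the attributes shown in the snippet."""
    job = ET.Element(
        "job",
        {"id": "ADDRG_01", "namespace": "align", "name": "java", "version": "4.0"},
    )
    argument = ET.SubElement(job, "argument")
    argument.text = " ".join(PICARD_ARGS)
    for stream, ext in (("stdout", "out"), ("stderr", "err")):
        ET.SubElement(
            job, stream, {"name": "${filename}.picardrg." + ext, "link": "output"}
        )
    return job


if __name__ == "__main__":
    print(ET.tostring(build_job(), encoding="unicode"))
```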

ADD COMMENT • link 12.4 years ago by Jeremy Leipzig 22k

0

sorry about that ;-) we need to fix XML display pronto - I'll make this a priority

ADD REPLY • link 12.4 years ago by Istvan Albert 101k

1

Also, the main reason this has not been done so far is that I don't think I understand all the implications of properly escaping HTML, nor the conditions under which it should or shouldn't happen. The escaping also needs to interact with the prettifier, and it is not obvious how to do that, so I am afraid that I will open a JavaScript injection security hole with it.