How To Decide Which Software To Use For Building An Automated NGS Analysis Pipeline
6
27
0

Sometimes the solution is to be aware of only one of these ;-) . Now, in all seriousness, thanks for listing all these examples; this will be a great resource.

10
13.7 years ago
Farhat ★ 2.9k

I have experience with shell-script-based pipelines and Galaxy. While Galaxy provides a great front end for building pipelines, I have found it slower at running the tasks. One serious drawback of Galaxy is that it stores the results of every intermediate step in all their full uncompressed glory. This, I am sure, partly accounts for the slowdown, as the disk write activity is heavy. It also fills up drives, which can be an issue in itself, especially if you are doing a lot of analyses.

Shell scripts can be really flexible and powerful, but they are not as user-friendly, although I am sure any kind of scripting language could deliver similar results.

6
13.7 years ago
Ryan Dale 5.0k

I write many of my NGS pipelines using Ruffus. It's really easy to run tasks in parallel. Simple pipelines are correspondingly simple to write, but at the same time it's rich enough to support very complex pipelines, too (e.g., http://www.ruffus.org.uk/gallery.html).

6
13.7 years ago
lh3 33k

From what I have heard, Galaxy is the most widely used generic pipeline for NGS. Nonetheless, I guess more people are building their own pipelines from scratch. IMHO, the difficulty of using a generic pipeline comes from the differences between parallelization environments. It is pretty easy if everything runs on the same node, but LSF/SGE/PBS and their different configurations (e.g. memory and runtime limits) make things messy.

If you are the only user of your cluster and have full control, using a generic pipeline may not be hard. A friend of mine built a private cloud and uses Galaxy; everything runs smoothly. If you are using nodes that are part of a huge cluster, writing your own pipeline is probably easier. When you control your pipeline, you can also easily avoid inefficient parts, as Farhat mentions. You know, implementing an initial pipeline is not that difficult. It will take time to refine it, but the same is true if you use a generic pipeline framework.
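To make the scheduler point concrete, here is a minimal Python sketch of how the same memory and runtime request has to be spelled differently for LSF, SGE, and PBS. The function names are hypothetical, the resource names (`h_vmem`, `h_rt`, `mem`, `walltime`) and units are common defaults rather than anything guaranteed, and many sites rename or restrict them, so treat this as an illustration of the mess rather than a portable submitter.

```python
def format_hms(minutes):
    """Format a runtime limit given in minutes as HH:MM:SS."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


def submit_command(scheduler, script, mem_mb, minutes):
    """Build (but do not run) a submission command for one scheduler.

    Assumption: the resource names and units below are typical defaults;
    a given cluster may be configured differently.
    """
    if scheduler == "lsf":
        # bsub -M is a memory limit (KB assumed here, the unit is
        # site-configurable); -W is a run limit in minutes.
        return ["bsub", "-M", str(mem_mb * 1024), "-W", str(minutes), script]
    if scheduler == "sge":
        resources = f"h_vmem={mem_mb}M,h_rt={format_hms(minutes)}"
        return ["qsub", "-l", resources, script]
    if scheduler == "pbs":
        resources = f"mem={mem_mb}mb,walltime={format_hms(minutes)}"
        return ["qsub", "-l", resources, script]
    raise ValueError(f"unknown scheduler: {scheduler}")
```

A pipeline that keeps this translation in one place can at least be moved between clusters by editing a single function, which is roughly what the generic frameworks try to do for you.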

5
13.7 years ago
Ketil 4.1k

I don't know what you mean by "NGS analysis", but coming from a comp.sci. background, I tend to use 'make' to construct non-trivial pipelines. For our current de novo project, the pipeline consists of primary assembly (newbler, celera and CLC), secondary assembly (SSPACE), remapping of reads (bwa index, aln, and sampe), quality evaluation (samtools idxstats and flagstat), generating graphs (gnuplot) and so on.

Since many of these steps are time consuming, and since something always fails at some point, make's ability to skip already completed files saves a lot of time, especially with some careful sprinkling of .PRECIOUS. Also, make's -j option means that my pipeline is trivially parallelized.

The downside is that it can be a bit hard to debug, but -r --warn-undefined-variables helps a bit. I'm still missing some way of separating output from subprocesses, especially when running -j 16 :-)
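For readers who have not used make this way, the skipping behaviour boils down to a timestamp comparison. The following Python sketch mimics that rule; it is not how make is implemented, and `is_up_to_date` and `run_step` are hypothetical names chosen for the illustration.

```python
import os


def is_up_to_date(target, prerequisites):
    """Return True if target exists and no prerequisite is newer than it."""
    if not os.path.exists(target):
        return False
    target_mtime = os.path.getmtime(target)
    return all(os.path.getmtime(p) <= target_mtime for p in prerequisites)


def run_step(target, prerequisites, action):
    """Run action() only when target is missing or stale.

    Returns True if the step ran and False if it was skipped.
    """
    if is_up_to_date(target, prerequisites):
        return False
    action()
    return True
```

Rerunning the whole pipeline after a failure then only repeats the steps whose outputs are missing or older than their inputs, which is where the time saving comes from.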

2

I've always wondered how people use make this way. Do you have validator scripts that can tell when an output file is not just garbage?

1

Most of my output files are garbage. Maybe I should switch to Haskell.

0

Why would an output file be garbage? If anything, make helps against this: GNU make deletes a partially written target when it is interrupted, and also when a recipe fails if .DELETE_ON_ERROR is set.

But yes, validating results is always a good idea.
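As a rough idea of what such a validator might look like, here is a small Python sketch. It only checks that a file exists, is non-empty, and (optionally) ends with the empty BGZF block that the SAM/BAM specification defines as the end-of-file marker, so a truncated BAM is caught. The name `looks_complete` is hypothetical, and passing this check says nothing about whether the contents are biologically sensible.

```python
import os

# 28-byte empty BGZF block that a complete BAM file ends with
# (the end-of-file marker from the SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200"
    "1b000300" "00000000" "00000000"
)


def looks_complete(path, require_bgzf_eof=False):
    """Cheap sanity check for a pipeline output file.

    Returns False for missing or empty files and, when require_bgzf_eof
    is set, for files that do not end with the BGZF end-of-file marker.
    """
    if not os.path.isfile(path):
        return False
    size = os.path.getsize(path)
    if size == 0:
        return False
    if require_bgzf_eof:
        if size < len(BGZF_EOF):
            return False
        with open(path, "rb") as handle:
            handle.seek(-len(BGZF_EOF), os.SEEK_END)
            return handle.read() == BGZF_EOF
    return True
```

A make recipe could call a script like this after each step and exit non-zero on failure, so that a bad output stops the pipeline instead of feeding the next step.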

0

:-) Yes! Unfortunately, the old rule of garbage in - garbage out is universal, and independent of implementation language...

2
13.7 years ago
Ben Lange ▴ 210

I have extensive experience with a large custom-developed pipeline that uses a database to coordinate tasks on a private pool of commodity PCs. This approach is very flexible, but it exposes a lot of coordination complexity as workflows become more complex.
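To give a feel for database-coordinated tasks, here is a minimal Python sketch using SQLite as the coordinating table. It is an assumption-laden toy, not the pipeline described above: the table layout and the function names `create_queue`, `claim_task`, and `finish_task` are invented for the example, and a real pool of machines would need a networked database rather than a local SQLite file.

```python
import sqlite3


def create_queue(conn, commands):
    """Create the task table and enqueue one pending task per command."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY,"
        " command TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " worker TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (command) VALUES (?)", [(c,) for c in commands]
    )
    conn.commit()


def claim_task(conn, worker):
    """Claim the oldest pending task; return (id, command) or None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, command FROM tasks"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE guards against another worker
        # claiming the same row between the SELECT and the UPDATE.
        cursor = conn.execute(
            "UPDATE tasks SET status = 'running', worker = ?"
            " WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        return row if cursor.rowcount == 1 else None


def finish_task(conn, task_id, ok):
    """Record the outcome of a claimed task."""
    with conn:
        conn.execute(
            "UPDATE tasks SET status = ? WHERE id = ?",
            ("done" if ok else "failed", task_id),
        )
```

Even this toy shows where the coordination complexity creeps in: dependencies between tasks, retries, and workers that die while holding a 'running' task all need extra columns and logic.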

It's pretty clear that the more forking and joining you have in your workflow, the higher the complexity regardless of your approach.

Focus on what costs the most. When you're dealing with large amounts of data, storage is cheap but accessing and moving it is not. So the more localized the data is to the compute nodes the faster the throughput.

1
12.4 years ago

We have been building some genotyping pipelines in Pegasus, which is a very heavyweight platform for scientific pipelines, and is apparently NSF-funded through the 2016 Olympics in Rio. Plan accordingly.

Pegasus is very friendly with Condor, although it can be run on other batch systems with some headaches.

The nodes look like this (I've stripped away the angle brackets to conform to BioStar):

job id="ADDRG_01" namespace="align" name="java" version="4.0"
    argument
        -Xmx16g
        -jar ${picardfolder}/AddOrReplaceReadGroups.jar
        INPUT=${filename}.bam
        OUTPUT=${filename}.sorted.bam
        SORT_ORDER=coordinate
        RGPU=
        RGID=1
        RGLB=bar
        RGPL=${platform}
        RGSM=${outputprefix}
        CREATE_INDEX=True
        VALIDATION_STRINGENCY=LENIENT
        TMP_DIR=${picardtemp}
    /argument
    stdout name="${filename}.picardrg.out" link="output"/
    stderr name="${filename}.picardrg.err" link="output"/
/job
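For anyone who wants to generate nodes like this rather than write them by hand, here is a hedged Python sketch that assembles the same element structure with the standard library's ElementTree. It is plain XML construction, not the Pegasus API; `build_job` is a hypothetical helper, and a real workflow file will need whatever enclosing elements and attributes Pegasus expects around each job.

```python
import xml.etree.ElementTree as ET


def build_job(job_id, jar, picard_args, log_prefix):
    """Return one job node as an XML string.

    picard_args is a dict of KEY=VALUE options; the attribute values
    mirror the node shown above and are otherwise assumptions.
    """
    job = ET.Element(
        "job",
        {"id": job_id, "namespace": "align", "name": "java", "version": "4.0"},
    )
    argument = ET.SubElement(job, "argument")
    parts = ["-Xmx16g", "-jar", jar]
    parts += [f"{key}={value}" for key, value in picard_args.items()]
    argument.text = " ".join(parts)
    # Capture stdout and stderr of the job as workflow outputs.
    ET.SubElement(job, "stdout", {"name": f"{log_prefix}.out", "link": "output"})
    ET.SubElement(job, "stderr", {"name": f"{log_prefix}.err", "link": "output"})
    return ET.tostring(job, encoding="unicode")
```

Building the arguments from a dict also avoids stray whitespace around the equals sign in KEY=VALUE options, which is easy to introduce when editing the XML by hand.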
0

Sorry about that ;-) We need to fix XML display pronto. I'll make this a priority.

1

Also, the main reason this has not been done so far is that I don't think I understand all the implications of properly escaping HTML, or the conditions under which it should or shouldn't happen. The escaping also needs to interact with the prettifier, and it is not obvious how to do that, so I am afraid I will open a JavaScript injection security hole with it.

0

ok thanks - whatever helps people post more code is always good
