Question

Which Bioinformatic Friendly Pipeline Building Framework?

84

10.9 years ago

Carlos Borroto ★ 2.1k

EDIT: You probably want to take a look at https://github.com/pditommaso/awesome-pipeline

Hi,

I'm part of a team involved in a project where we will be running a stable analysis pipeline over a large number of samples.

QC (custom scripts) / Mapping (bwa mem) / Variant Calling (GATK Best Practices).

We would like not to reinvent the wheel and would rather build the pipeline using an established framework. Ideally this framework is not too focused on this particular pipeline, in case we need something else in the future.

I got good information from this previous Biostars post. This is a summary of options from that post:

Don't bother, just write a README
make
Bash
waf (Python)
SCons (Python)
Rake (Ruby)
BioMake, now Skam (Prolog)
Ruffus (Python)
Paver (Python)
Galaxy (Python)
Snakemake (Python)
Anduril

Not mentioned in that post but that I'm also looking into:

bcbio-nextgen (Python)
gkno (Python)
Invoke (Python)
Queue (Scala, Java)

New options after this post was initially written:

NGSANE (Bash)
BigDataScript (bds)
Nextflow (Java)
Bpipe (Groovy)
Omics Pipe (Python)
Cromwell/WDL (Scala)
Toil (Python)

I would love to get the community's opinion on this subject. I'm particularly fond right now of Snakemake, gkno and Invoke. I love Snakemake's simplicity and how close to regular make it is. It seems like Invoke is the current winner in the Python community at large.

gkno seems like exactly what we need, but I'm worried it could get too complex and hard to maintain.

Latest edit: Added Toil.

scripting • 33k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 10.9 years ago by Carlos Borroto ★ 2.1k

6

I've been very happy with snakemake. The cluster support is pretty robust and works on our rather odd PBS system just fine. The author is extremely responsive (bug fixes in minutes to hours, typically).

ADD REPLY • link 10.9 years ago by Sean Davis 27k

3

I can second snakemake. It is very readable and intuitive, works with clusters, and is pretty robust.

ADD REPLY • link 9.4 years ago by Tom ▴ 240

1

I like snakemake because I can whip something up quickly, but it gets very slow when running on hundreds of tasks.

ADD REPLY • link 8.1 years ago by Lynxoid ▴ 230

2

Look at all these Python pipeline frameworks!

Wrappers for subprocess, to wrap Popen, to wrap os.execvp, to finally, and inevitably, run somescript.sh

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by John 13k

2

Might add nextflow...

http://nextflow.io/

ADD REPLY • link 9.2 years ago by Sean Davis 27k

0

I think you are in as good a position as anyone to review these

ADD REPLY • link 10.9 years ago by Jeremy Leipzig 22k

0

Nice list of pipeline tools/frameworks. It can be useful for a lot of people.

Would be nice to add Bpipe as mentioned below.

It is a nice one, well documented and well maintained. (Here is the publication in Bioinformatics)

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by Juke34 8.9k

0

Just a correction: the documentation for Bpipe is now here. I am not affiliated with the tool in any way, other than being a user :)

ADD REPLY • link 8.8 years ago by A. Domingues ★ 2.7k

0

We've recently developed NextflowWorkbench, which builds on Nextflow, but adds a user interface, modular workflow with libraries of processes and a docker IDE. Your workflows can be developed on a laptop/desktop and then run on a cluster or in the cloud. See this recent preprint: http://biorxiv.org/content/early/2016/03/28/041236

ADD REPLY • link 8.7 years ago by fac2003 ▴ 170

score 27 · Answer 1 · 2016-03-24

27

8.8 years ago

Jeremy Leipzig 22k

A review of bioinformatic pipeline frameworks

http://bib.oxfordjournals.org/content/early/2016/03/23/bib.bbw020.full

High-throughput bioinformatic analyses increasingly rely on pipeline frameworks to process sequence and metadata. Modern implementations of these frameworks differ on three key dimensions: using an implicit or explicit syntax, using a configuration, convention or class-based design paradigm and offering a command line or workbench interface. Here I survey and compare the design philosophies of several current pipeline frameworks. I provide practical recommendations based on analysis requirements and the user base.

I wrote this review paper in order to bring some organization to the discussion of pipeline frameworks.

[Image: table from the review rating pipeline frameworks by category]

ADD COMMENT • link 8.8 years ago by Jeremy Leipzig 22k

8

As I pointed out to the author on Twitter, the information in this table seems very arbitrary. If you read the review carefully, it never defines how the number of stars is determined for each category. In essence, these visuals are the opinion of the author, and I think they are misleading. My experience suggests that the category that contains Snakemake, BigDataScript and Nextflow has much better performance than Galaxy/Taverna, but is likely more difficult to use for beginners, pretty much the opposite of what the table shows.

ADD REPLY • link 8.7 years ago by fac2003 ▴ 170

1

Great article, thanks a lot. Glad to see the work behind Toil getting mentioned as well. I stumbled onto Toil a little by accident this summer and have switched over completely. I've coded up a Python library of wrappers for different tools and specific configurations, and it all sits on top of Toil for handling task processing and job allocation/execution. It's quite powerful. Just getting my scripts up and running with Mesos now.

ADD REPLY • link 8.8 years ago by DG 7.3k

1

thanks. all this sounds very complicated - i'm subtracting a half-star

ADD REPLY • link 8.8 years ago by Jeremy Leipzig 22k

1

For Toil? It can be very straightforward. The Toil aspect of writing any code is actually quite simple in itself (although the documentation is currently a little sparse). It's quite easy to write a script of Toil tasks and just submit it. In my case I wanted a system a bit more like bcbio-nextgen in some respects, so that's what all of the additional code I've been working on is for.

ADD REPLY • link 8.8 years ago by DG 7.3k

0

Any links to code? What do you do to get a good Mesos environment up-and-running? I've been using snakemake quite happily, but the common workflow language folks seem quite interested in toil. In addition, toil seems to be a bit more platform agnostic.

ADD REPLY • link 8.8 years ago by Sean Davis 27k

0

Getting mesos itself up and running is pretty straightforward, although I'm no expert and haven't yet done a lot of job submissions with it. I'm also currently debugging any tweaks I may need to make to my toil script as it doesn't seem to be cleanly submitting a job to the whole mesos cluster. But I think that is a configuration issue on my part. Just haven't had a chance to do it yet. I'll post something when I have it up. For getting a mesos cluster going I recommend the mesosphere tutorial: here. I'm working on a physical cluster with no other HPC software running on it, so it is set up like independent machines. The tutorial would also work for cloud instances.

ADD REPLY • link 8.7 years ago by DG 7.3k

0

Nice!! Thank you Jeremy :)

ADD REPLY • link 8.8 years ago by John 13k

0

Jeremy, it's a nice review article. Thank you for posting it here.

ADD REPLY • link 8.7 years ago by gsr9999 ▴ 310

Ram · Answer 2 · 2014-01-23

5

10.9 years ago

Christian ★ 3.1k

If you don't need cluster support, my vote goes to the good old make. Powerful and bug free.

ADD COMMENT • link 10.9 years ago by Christian ★ 3.1k

1

Make is good but not scalable in any way. Cluster support for shared and distributed filesystems (such as Amazon) is pretty much not possible with make.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by ngsbioinformatics ▴ 60

0

This is absolutely not true. Electric Make lets you build a cluster with literally any number of nodes, on physical hardware or in the cloud, and parallelize not just Makefile builds but also any product that divides work by spawning processes. I used to be one of their pre-sales solutions engineers. It's not free, but you get what you pay for. There is also a free community product, but it's limited to your local developer network, max 8 machines, 8 cores per machine. But you'd be surprised how much performance you can get out of a small cluster like that.

ADD REPLY • link 7.7 years ago by flybd5 • 0

1

Appreciate your addition, but the "This is absolutely not true", well, isn't true. @ngsbioinformatics was referring to plain old make, not Electric Make. But it's good to know that the product exists and is capable.

ADD REPLY • link 7.7 years ago by DG 7.3k

Ram · Answer 3 · 2014-01-24

5

10.9 years ago

Johan ▴ 890

I've been working with Queue for about a year and a half now, and have it deployed in production at our core facility. I find that it strikes a good balance between expressiveness and simplicity. It has good cluster support, will of course play really nicely with all the GATK tools and is easy to extend to any command line program you might want to run. If you're interested here is the "fork" that we run: https://github.com/johandahlberg/piper including some pipelines.

ADD COMMENT • link 10.9 years ago by Johan ▴ 890

2

I'm kind of in awe of how much patience you have for this framework. Outside of the Broad itself you are pretty much the only one with a real working Queue pipeline on Github.

Have you thought about abstracting Queue into a DSL for mere mortals?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Jeremy Leipzig 22k

score 4 · Answer 4 · 2015-04-16

I use snakemake; it is a make implemented in python allowing you to use python in your rules. It is made especially for bioinformatics pipelines. Read all about it here: https://bitbucket.org/johanneskoester/snakemake/wiki/Documentation

If you know and like python, this might be the best choice for you.

It is robust, actively developed and open source.

Paper from bioinformatics: Snakemake—a scalable bioinformatics workflow engine
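To illustrate the "make implemented in Python" idea mentioned above, here is a minimal sketch of a make-style rule in plain Python. It is not Snakemake's API or implementation; `needs_update`, `run_rule` and `concatenate` are hypothetical names, and the only logic shown is make's basic test: rebuild an output when it is missing or older than one of its inputs.

```python
import os


def needs_update(output, inputs):
    """Return True if the output is missing or older than any input."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


def run_rule(output, inputs, action):
    """Call action(inputs, output) only when the output is out of date.

    Returns True if the action ran, False if the output was already current.
    """
    if needs_update(output, inputs):
        action(inputs, output)
        return True
    return False


def concatenate(inputs, output):
    """Toy action (an assumption for this sketch): join the inputs into the output."""
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as handle:
                out.write(handle.read())
```

Calling `run_rule("all.txt", ["a.txt", "b.txt"], concatenate)` twice in a row would do the work once and skip it the second time, which is the behaviour a make-like engine builds on. A real workflow engine adds wildcards, a dependency graph and cluster submission on top of this.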

score 2 · Answer 5 · 2017-01-12

2

7.9 years ago

Juke34 8.9k

A really interesting list is available here => https://github.com/pditommaso/awesome-pipeline

ADD COMMENT • link 7.9 years ago by Juke34 8.9k

score 1 · Answer 6 · 2014-01-21

1

10.9 years ago

Neilfws 49k

Another option: NGSANE. Soon to be published.

ADD COMMENT • link 10.9 years ago by Neilfws 49k

Ram · Answer 7 · 2014-10-23

1

10.2 years ago

Milan Simonovic ▴ 20

Add BigDataScript to the list. It's another scripting language to learn, but then it allows you to seamlessly run pipelines locally or on a cluster, manage jobs, make checkpoints during execution, etc. Open sourced and published (2014).

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Milan Simonovic ▴ 20

0

Added. Looks pretty good. Love well documented projects from the beginning. If I have to write a new pipeline, I will make sure to consider BigDataScript.

ADD REPLY • link 10.2 years ago by Carlos Borroto ★ 2.1k

0

Yeah, I added it to my own list when I came across the paper.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by DG 7.3k

Ram · Answer 8 · 2015-04-16

1

9.7 years ago

Yahan ▴ 400

Bpipe is the tool of choice here. Excellent support for threading, easy restarting of jobs that failed at a certain step in the workflow, easy stitching together of different steps, and management of input and output naming.

Amazing it's not in here already

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 9.7 years ago by Yahan ▴ 400

Ram · Answer 9 · 2016-01-28

In the paper Genome Modeling System: A Knowledge Management Platform for Genomics we talk about some of the principles one might think about when creating pipeline management infrastructure for genomics. In that publication we included a list of relevant resources: Genome Analysis Platforms.

Some additional options that perhaps could be included in your very nice list above:

Arvados
DNA Nexus
BaseSpace

score 1 · Answer 10 · 2016-04-08

The Broad recently announced their replacement for Queue, Cromwell/WDL. We just started checking it out and it looks promising. When we did the initial search 2 years ago, we ended up choosing Queue. It worked for us and it was nice to get free advanced scatter-and-gather for GATK tools. However, maintaining Queue scripts in Scala was painful, particularly for non-GATK tools. We recently decided to migrate to Snakemake, our initial runner-up.

With the announcement of Cromwell and the upcoming release of a WDL implementation of the GATK Best Practices, we are reconsidering and may migrate to Cromwell instead.

Ram · Answer 11 · 2014-01-23

0

10.9 years ago

DG 7.3k

I've been trying a few different approaches over the last year or so. Currently my production pipeline is implemented as a makefile, per sample. All of my analysis is being run on a local workstation and not a cluster so it works well for that. I have been developing a data management system (hopefully soon to be written up and published) and am trying out a few more complex approaches there to make it more scalable. For relatively straightforward pipelines I do really like make or snakemake, particularly if this doesn't need to be run on a cluster.

I highly recommend versioning your makefile templates. Any time you make changes, it should be a new version. For all projects/samples, always store the makefile that was used with the data. This means you can always reproduce your data exactly. You should also record which versions of the binaries (BWA, GATK, Picard, etc.) were used.
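One way to follow that advice is to write a small provenance record next to each sample's results. The sketch below is only an illustration, not the system described in this answer: `record_provenance` and the `provenance.json` filename are hypothetical, and the tool versions are passed in by the caller rather than detected.

```python
import hashlib
import json
import shutil
from pathlib import Path


def record_provenance(makefile, sample_dir, tool_versions):
    """Store the makefile with the data and write a manifest describing it."""
    makefile = Path(makefile)
    sample_dir = Path(sample_dir)
    sample_dir.mkdir(parents=True, exist_ok=True)

    # Keep an exact copy of the makefile that produced this sample's data.
    stored = sample_dir / makefile.name
    if stored.resolve() != makefile.resolve():
        shutil.copyfile(makefile, stored)

    manifest = {
        "makefile": stored.name,
        # The checksum makes later edits to the stored copy detectable.
        "makefile_sha256": hashlib.sha256(stored.read_bytes()).hexdigest(),
        # Supplied by the caller, e.g. parsed from each tool's own version output.
        "tool_versions": dict(tool_versions),
    }
    manifest_path = sample_dir / "provenance.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path
```

A call might look like `record_provenance("Makefile", "results/sample_01", {"bwa": "<version>", "gatk": "<version>"})`, where the paths and version strings are placeholders.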

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.9 years ago by DG 7.3k

0

You have one makefile per sample? So if you change the pipeline ...

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.9 years ago by brentp 24k

0

Well, I also have helper scripts, and the pipeline is stored as a template. So if I change the pipeline, I just generate new makefiles from the new template and re-run it on whatever samples I want to re-run it on. I'm currently experimenting with some alternatives, though, in a more robust management system.
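Generating per-sample makefiles from a template could look roughly like the sketch below. It is an assumption about the general approach, not the actual helper scripts: the template contents, the `./align.sh` command and the `write_makefiles` function are all made up for illustration.

```python
from pathlib import Path
from string import Template

# Hypothetical template. "$$" is string.Template's escape for a literal "$",
# so make still sees "$(SAMPLE)", "$<" and "$@" after substitution.
MAKEFILE_TEMPLATE = Template(
    "# pipeline template version: $version\n"
    "SAMPLE := $sample\n"
    "\n"
    "$$(SAMPLE).bam: $$(SAMPLE).fastq\n"
    "\t./align.sh $$< > $$@\n"
)


def write_makefiles(samples, out_dir, version):
    """Write one Makefile per sample under out_dir/<sample>/ and return the paths."""
    out_dir = Path(out_dir)
    written = []
    for sample in samples:
        sample_dir = out_dir / sample
        sample_dir.mkdir(parents=True, exist_ok=True)
        path = sample_dir / "Makefile"
        path.write_text(MAKEFILE_TEMPLATE.substitute(sample=sample, version=version))
        written.append(path)
    return written
```

Stamping the template version into each generated file keeps the link between a sample's results and the pipeline revision that produced them.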

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.9 years ago by DG 7.3k

Ram · Answer 12 · 2014-01-24

0

10.9 years ago

Michele Busby ★ 2.2k

I have been meaning to look into GenePattern. They have some nice stuff set up though I don't know how cluster submission works outside Broad.

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.9 years ago by Michele Busby ★ 2.2k

Ram · Answer 13 · 2015-04-17

0

9.7 years ago

A. Domingues ★ 2.7k

I am looking at Omics Pipe and Bpipe at the moment. The former appears to be relatively easy to implement and the latter is used by our core facility. Decisions, decisions.

Does anyone here have experience with Omics Pipe?

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 9.7 years ago by A. Domingues ★ 2.7k

Ram · Answer 14 · 2015-10-09

0

9.2 years ago

ngsbioinformatics ▴ 60

Of all these pipeline infrastructures, which ones allow you to distribute some parts of the pipeline across compute nodes and run other parts on a single node, as in the GATK Exome Pipeline? You can map the samples on different nodes, but when doing indel realignment or recalibration, it's best to have all the samples on a single node. After that, you can continue processing each sample on the compute nodes. I've only seen BDS and Queue handle this.
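The pattern being asked about (per-sample work in parallel, one step that sees all samples, then per-sample work again) can be sketched in plain Python. This is only an illustration of the control flow: threads on one machine stand in for cluster nodes, and `map_sample`, `joint_step`, `finish_sample` and `run_pipeline` are hypothetical placeholders, not the API of any framework listed here.

```python
from concurrent.futures import ThreadPoolExecutor


def map_sample(sample):
    """Per-sample step (stand-in for mapping one sample's reads)."""
    return f"{sample}.mapped"


def joint_step(mapped):
    """Step that needs every sample at once (stand-in for the single-node stage)."""
    return {name: f"{name}.joint" for name in mapped}


def finish_sample(item):
    """Per-sample step that runs after the joint stage."""
    return f"{item}.final"


def run_pipeline(samples, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Scatter: independent per-sample work runs concurrently.
        mapped = list(pool.map(map_sample, samples))
        # Gather: a single call receives all samples together.
        joint = joint_step(mapped)
        # Scatter again: continue processing each sample separately.
        return list(pool.map(finish_sample, [joint[name] for name in mapped]))
```

In a real framework the scatter stages would be submitted to the scheduler as separate jobs, while the gather stage would be pinned to one node.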

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by ngsbioinformatics ▴ 60

0

Snakemake allows you to specify rules that are to be run locally (localrules). It would be more difficult to script things so that a specific rule gets run on a specific node, but it's possible depending on your scheduler.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 9.2 years ago by Jeremy Leipzig 22k

score 0 · Answer 15 · 2017-06-07

0

7.5 years ago

kaixian110 • 0

How about Toil? It's not suitable for bioinformatics, I think.

ADD COMMENT • link 7.5 years ago by kaixian110 • 0

1

Toil is explicitly written as a bioinformatics pipeline. It is developed by a genomics group after all. I've been using Toil for over a year now in Development and Production environments.

ADD REPLY • link 7.5 years ago by DG 7.3k

score 0 · Answer 16 · 2017-06-19

0

7.5 years ago

pwwang ▴ 40

Another option: pyppl

ADD COMMENT • link 7.5 years ago by pwwang ▴ 40