I'm part of a team involve in a project where we will be running a stable analysis pipeline over a large number of samples.
QC(custom scripts) / Mapping(bwa mem) / Variant Calling(GATK Best Practices).
We would like not to reinvent the wheel and build the pipeline using a stablished framework. Ideally this framework is not too focus in this particular pipeline in case we need something else in the future.
I got good information from this previous Biostars post. This is a summary of options from that post:
I would love to get the community opinion on this subject. I'm particular fun right now of Snakemake, gkno and Invoke. I love Snakemake simplicity and how close to the regular make it is. It seems like Invoke is the current winner around the Python community at large.
gkno seems like exactly what we need, but I'm worry it could get too complex and hard to maintain.
I've been very happy with snakemake. The cluster support is pretty robust and and works on our rather odd PBS system just fine. The author is extremely responsive (bug fixes in minutes to hours, typically).
We've recently developed NextflowWorkbench, which builds on Nextflow, but adds a user interface, modular workflow with libraries of processes and a docker IDE. Your workflows can be developed on a laptop/desktop and then run on a cluster or in the cloud. See this recent preprint: http://biorxiv.org/content/early/2016/03/28/041236
High-throughput bioinformatic analyses increasingly rely on pipeline frameworks to process sequence and metadata. Modern implementations of these frameworks differ on three key dimensions: using an implicit or explicit syntax, using a configuration, convention or class-based design paradigm and offering a command line or workbench interface. Here I survey and compare the design philosophies of several current pipeline frameworks. I provide practical recommendations based on analysis requirements and the user base.
I wrote this review paper in order to bring some organization to the discussion of pipeline frameworks.
As I pointed out to the author on Twitter, the information in this table seems very arbitrary. If you read the review carefully, it never defines how the number of stars is determined for each category. In essence, these visuals are the opinion of the author, and I think are misguiding. My experience suggests that the category that contains Snakemake, BigDataScript and Nextflow have much better performance than Galaxy/Taverna, but are likely more difficult to use for beginners, pretty much the opposite of what the table shows.
Great article, thanks a lot. Glad to see the work behind toil getting mentioned as well. I stumbled onto Toil a little by accident this summer and have switched over completely. I've coded up a python library of wrappers for different tools and specific configurations and it all sits on top of Toil for handling task processing and job allocation/execution. Its quite powerful. Just getting my scripts up and running with Mesos now.
For toil? It can be very straightforward. The toil aspect of writing any code is actually itself quite simple (although the documentation is currently a little sparse). Its quite easy to write a script of toil tasks and just submit it. In my case I wanted a system a bit more like bcbio-nextgen in some respects, so thats all of the additional code I've been working on.
Any links to code? What do you do to get a good Mesos environment up-and-running? I've been using snakemake quite happily, but the common workflow language folks seem quite interested in toil. In addition, toil seems to be a bit more platform agnostic.
Getting mesos itself up and running is pretty straightforward, although I'm no expert and haven't yet done a lot of job submissions with it. I'm also currently debugging any tweaks I may need to make to my toil script as it doesn't seem to be cleanly submitting a job to the whole mesos cluster. But I think that is a configuration issue on my part. Just haven't had a chance to do it yet. I'll post something when I have it up. For getting a mesos cluster going I recommend the mesosphere tutorial: here. I'm working on a physical cluster with no other HPC software running on it, so it is set up like independent machines. The tutorial would also work for cloud instances.
Make is good but not scalable in any way. Cluster support for shared and distributed filesystems (such as Amazon) are pretty much not possible with make.
This is absolutely not true. Electric Make can let you build a cluster of literally any size of nodes, physical hardware or cloud, and parallelize not just Makefile builds but also any product that divides work by spawning processes. I used to be one of teir pre-sales solutions engineers. It's not free, but you get what you pay for. There is also a free community product but it's limited to your local developer network, max 8 machines, 8 cores per machine. But you'd be surprised how much perfomance you can get out of a small cluster like that.
Appreciate your addition but the "This is absolutely not true" well, isn't true. @ngsbioinformatics was referring to plain old make, and not Electric Make. But good to know that that product exists and is capable.
I've been working with Queue for about a year and a half now, and have it deployed in production at our core facility. I find that it strikes a good balance between expressiveness and simplicity. It has good cluster support, will of course play really nicely with all the GATK tools and is easy to extend to any command line program you might want to run. If you're interested here is the "fork" that we run: https://github.com/johandahlberg/piper including some pipelines.
I'm kind of in awe of how much patience you have for this framework. Outside of the Broad itself you are pretty much the only one with a real working Queue pipeline on Github.
Have you thought about abstracting Queue into a DSL for mere mortals?
Add BigDataScript to the list. It's another scripting language to learn, but then it allows you to seamlessly run pipelines locally or on a cluster, manage jobs, make checkpoints during execution, etc. Open sourced and published (2014).
Added. Looks pretty good. Love well documented projects from the beginning. If I have to write a new pipeline, I will make sure to consider BigDataScript.
Bpipe is the tool of choice here. Excellent support for threading, easy restarting of jobs that failed at certain step in the workflow, easy stitching together different steps, management of input and output naming.
Amazing it's not in here already
ADD COMMENT
• link
updated 2.9 years ago by
Ram
44k
•
written 9.6 years ago by
Yahan
▴
400
The Broad recently announced their replacement for Queue, Cromwell/WDL. We just starting checking it out and it looks promising. When we did the initial search 2 years ago, we ended choosing Queue. It worked for us and it was nice to get free advanced scather-and-gather for GATK tools. However, maintaining Queue scripts in Scala was painful, particularly for non-GATK tools. We recently decided migrate to Snakemake, our initial runner up.
With the announcement of Cromwell and the near future release of WDL GATK Best Practices implementation, we are reconsidering migrating to Cromwell.
I've been trying a few different approaches over the last year or so. Currently my production pipeline is implemented as a makefile, per sample. All of my analysis is being run on a local workstation and not a cluster so it works well for that. I have been developing a data management system (hopefully soon to be written up and published) and am trying out a few more complex approaches there to make it more scalable. For relatively straightforward pipelines I do really like make or snakemake, particularly if this doesn't need to be run on a cluster.
I highly recommend versioning your make file templates. Anytime you make changes it should be a new version. For all projects/samples always store the make file that was used with the data. This means you can always reproduce your data exactly. You should also version and indicate what versions of bin files (BWA, GATK, Picard, etc) were used.
ADD COMMENT
• link
updated 2.9 years ago by
Ram
44k
•
written 10.9 years ago by
DG
7.3k
0
Entering edit mode
You have one makefile per sample? So if you change the pipeline ...
ADD REPLY
• link
updated 2.9 years ago by
Ram
44k
•
written 10.9 years ago by
brentp
24k
0
Entering edit mode
Well I also have helper scripts as well, and the pipeline is stored as a template. So If I change the pipeline I just generate new makefiles from the new template and re-run it on whatever samples I want to re-run it on. I'm currently experimenting with some alternatives though in a more robust management system.
ADD REPLY
• link
updated 2.9 years ago by
Ram
44k
•
written 10.9 years ago by
DG
7.3k
I am looking at Omics Pipe and Bpipe at the moment. The former appears to be relatively easy to implement and later is used by our core facility. Decisions, decisions.
Of all these pipeline infrastructures, which allow you to distribute parts of the pipeline to compute nodes and other parts on a single node, such as the GATK Exome Pipeline. You can map the samples on different nodes, but when doing indel realigning or recalibration, its best to have all the samples on a single node. After that, you can continue processing each sample on the compute nodes. I'm only seen BDS and Queue be able to handle this.
Snakemake allows you to specify rules that are to be run locally (localrules). It would be more difficult to script that a specific rule get run on a specific node, but it's possible depending on your scheduler.
Toil is explicitly written as a bioinformatics pipeline. It is developed by a genomics group after all. I've been using Toil for over a year now in Development and Production environments.
I've been very happy with snakemake. The cluster support is pretty robust and and works on our rather odd PBS system just fine. The author is extremely responsive (bug fixes in minutes to hours, typically).
I can second snakemake. It is very readable and intuitive, can work with clusters and pretty robust.
I like snakemake because I can whip something up quickly, but it gets very slow when running on hundreds of tasks.
Look at all these Python pipeline frameworks!
Wrappers for subprocess, to wrap Popen, to wrap os.execvp, to finally, and inevitably, run somescript.sh
Might add nextflow...
http://nextflow.io/
I think you are in as good a position as anyone to review these
Nice list of pipeline tool/framework. It can be useful for lot of person.
Would be nice to add Bpipe as mentioned below.
This is a nice one well documented and well maintained. (Here is the publication in Bioinformatics)
Just a correction: the documentation for bpipes is now here. I am not affiliated with the tool in any way, other than being a user :)
We've recently developed NextflowWorkbench, which builds on Nextflow, but adds a user interface, modular workflow with libraries of processes and a docker IDE. Your workflows can be developed on a laptop/desktop and then run on a cluster or in the cloud. See this recent preprint: http://biorxiv.org/content/early/2016/03/28/041236