Question

Docker - one image to rule them all?

12

9.7 years ago

Manuel ▴ 410

I am in the typical situation of needing a resequencing pipeline (i.e., FastQC, read preprocessing, FastQC again, alignment with BWA, variant calling). I need a stable pipeline with stable tools for the standard tasks (e.g., "single-donor WES variant calling", "trio WES variant calling", and "tumor/normal WES variant calling with somatic filtration"), but I sometimes also need more specialized functionality or more extensive downstream analysis.

I want to use Docker for isolating my tools against the uncertain, changing, and sadly oftentimes unversioned world of Bioinformatics software (I'm looking at you, vt and vcflib, but I'm still very grateful that you are around). What would be your recommendation for a best practice here:

one Docker image for everything, adding tools as I go
one Docker image for each pipeline step (e.g. combining BWA-MEM, samtools, samblaster for the alignment so I can use piping in a front-end script)
one Docker image for the standard stuff, then maybe some images for each additional step.

Does anyone know of a person or organization that has published their Dockerized pipeline in a blog post or elsewhere, in a form that goes beyond toy examples or "here is a Dockerfile for the tool that I wrote/published"?

Cheers!

Docker • 6.3k views

Updated 2.5 years ago by Ram 44k

4

9.7 years ago

Giovanni M Dall'Olio 28k

You can have a look at the NGSeasy pipeline by the KHP Informatics group in London. They have a GitHub repository and a Makefile that installs all the components of the pipeline. Most of the components are in separate containers, which makes installation and updates easier.

Updated 2.5 years ago by Ram 44k

2

9.7 years ago

Amos ▴ 50

Hi Manuel,

Putting everything in one container is one option, but I think it has distinct limitations, namely limited reuse and a large image size. The Docker maxim is "one concern per container", and I think this works well in this context. Because shared layers are reused between images, you don't have too much overhead if you design your images in a hierarchical manner. Of course, putting tools in separate containers means passing input and output between them, either through STDIN/STDOUT or via shared volumes, and this can be a bit fiddly.
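To make the STDIN/STDOUT option concrete, here is a minimal sketch in Python that only builds (it does not execute) the `docker run` commands for an alignment step with one tool per container. The image names and tags (`example/bwa:0.7.12` and so on), the host directory `/scratch/run1`, and the helper names are hypothetical placeholders, not taken from NGSeasy or any published pipeline.

```python
import shlex


def docker_run(image, args, volumes=None, stdin=False):
    """Build (not execute) a `docker run` command for one tool container."""
    cmd = ["docker", "run", "--rm"]
    if stdin:
        # -i keeps STDIN open so the container can read from a pipe
        cmd.append("-i")
    for host_dir, container_dir in (volumes or {}).items():
        cmd += ["-v", f"{host_dir}:{container_dir}"]
    return cmd + [image] + list(args)


def as_shell_pipeline(commands):
    """Join several commands into one shell pipeline string."""
    return " | ".join(shlex.join(c) for c in commands)


# Hypothetical images and paths: one tool per container, chained by pipes.
steps = [
    docker_run(
        "example/bwa:0.7.12",
        ["bwa", "mem", "/data/ref.fa", "/data/r1.fq", "/data/r2.fq"],
        volumes={"/scratch/run1": "/data"},  # shared volume for the inputs
    ),
    docker_run("example/samblaster:0.1.21", ["samblaster"], stdin=True),
    docker_run("example/samtools:1.2", ["samtools", "view", "-b", "-"], stdin=True),
]

pipeline = as_shell_pipeline(steps)
print(pipeline)
```

The printed string could be pasted into a front-end script and redirected to a BAM file. Only the first container needs the shared volume here, because the later ones read from the pipe.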

As Giovanni mentioned, take a look at NGSeasy (disclaimer: I'm one of the authors).

But this is by no means the only game in town: see Nextflow, Rabix, and the work of Michael Barton at Nucleotide.es. Bioboxes is also trying to build a specification for what a bioinformatics container should look like.

Also, a plug for the Docker symposium we are running towards the end of the year, which brings together various groups working in this space (keep an eye on the page, we'll be opening registration in May).

http://core.brc.iop.kcl.ac.uk/events/compbio-docker-symposium-2015/

Regards,
Amos

Updated 2.5 years ago by Ram 44k

0

From looking at the NGSeasy README and source files, it's not yet clear to me how this pipeline can be run in a distributed computing environment (e.g. a Slurm cluster). Any comments on this?

Reply by Christian ★ 3.1k, 9.6 years ago

1

9.7 years ago

matted 7.8k

The bcbio-nextgen pipeline does a good job encapsulating standard alignment tasks and tracking tool versions. They have a fully Dockerized version that is designed to run on AWS. I'll copy a snippet of their README to go over the benefits:

Improved installation: Pre-installing all required biological code, tools and system libraries inside a container removes the difficulties associated with supporting multiple platforms. Installation only requires setting up docker and download of the latest container.
Pipeline isolation: Third party software used in processing is fully isolated and will not impact existing tools or software. This eliminates the need for modules or PATH manipulation to provide partial isolation.
Full reproducibility: You can maintain snapshots of the code and processing environment indefinitely, providing the ability to re-run an older analysis by reverting to an archived snapshot.

Updated 2.5 years ago by Ram 44k

1

9.7 years ago

Jeremy Leipzig 22k

This is a great question.

Typically it's one process per container, which is why there is Docker Compose (previously known as Fig). However, Compose is geared toward spawning long-running instances (databases, web servers), not pipelining, so a new framework might be necessary.

Updated 2.5 years ago by Ram 44k

Ram · Accepted Answer · 2015-04-23

Illumina BaseSpace (http://basespace.illumina.com) actually uses Docker to run the tools in its cloud-based pipelines. They keep it at the tool level.

Docker was initially built and optimized to run one process efficiently, so it is advisable to keep containers at the tool level. Also, if you think about pipeline building, you probably want a Lego-like toolbox from which you can assemble different workflows out of different (often the same) tools.

The way Illumina solved this is to give each Docker container an input folder and an output folder. In BaseSpace you can also publish your own tools as Docker containers.
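The input-folder/output-folder convention can be sketched as follows, again in Python and again only building the commands. The mount points `/input` and `/output`, the image names, the working directory layout, and the adapter sequence are assumptions made for illustration; they are not BaseSpace's actual paths or API.

```python
from pathlib import PurePosixPath


def tool_step(image, command, input_dir, output_dir):
    """Build a `docker run` command with a read-only input and a writable output mount."""
    return [
        "docker", "run", "--rm",
        "-v", f"{input_dir}:/input:ro",   # assumed mount point, read-only
        "-v", f"{output_dir}:/output",    # assumed mount point, writable
        image, *command,
    ]


def chain(workdir, steps):
    """Chain tools so that each step's output folder is the next step's input folder.

    `steps` is a list of (name, image, command) tuples.
    """
    work = PurePosixPath(workdir)
    current_input = work / "raw"
    commands = []
    for name, image, command in steps:
        output_dir = work / name
        commands.append(tool_step(image, command, current_input, output_dir))
        current_input = output_dir
    return commands


# Hypothetical tool images; the same images could be reused in other workflows.
steps = [
    ("trim", "example/cutadapt:1.8",
     ["cutadapt", "-a", "AGATCGGAAGAGC", "-o", "/output/trimmed.fq", "/input/sample.fq"]),
    ("qc", "example/fastqc:0.11",
     ["fastqc", "--outdir", "/output", "/input/trimmed.fq"]),
]

for cmd in chain("/scratch/run1", steps):
    print(" ".join(cmd))
```

Because every tool sees the same two folders, a workflow is just an ordered list of tool images, which is what makes the Lego-like recombination possible.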