Aim of this tutorial
Do you know one of these situations?
- During the installation of a new bioinformatics tool you have to solve many dependency problems.
- Only a very old version of the tool is available through the package manager of your OS.
- Trying out a new program breaks a running and tested workflow.
- You are tired of entering the same commands again and again.
- You asked a colleague to test your commands and workflow on their PC, but they ran into the problems above.
These problems can be resolved quite easily using conda and snakemake. This tutorial gives a brief introduction to these two programs and shows how they can work together to create a workflow that one can easily share and deploy across different platforms with reproducible results.
Introduction to conda
What is conda?
This is what the homepage of conda says:
Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer.
In other words: one can have different versions of a program installed side by side on the local computer. During installation one can specify which versions are needed. Such an environment can be exported and imported on a different computer, which makes it possible to have the same configuration, including program versions, there. One can test new software without breaking any other running programs.
Installing conda
The installation process I describe here is for Linux, but for Windows and macOS it should be quite similar. Go to the download page, download the appropriate version and execute this file without root privileges with $ sh ./Miniconda3-latest-Linux-x86_64.sh
The only difference between the python 2.7 and the python 3.6 version is which python is the default in a new environment; one can always set it explicitly. I would recommend using the 64-bit python 3.6 version.
After completing the installation you will find a folder named miniconda3 in your home folder.
For conda to run properly within your shell, one has to make changes to the .bashrc (if you're using bash) or .zshrc.local (if you are using zsh).
In these files look for lines like this: export PATH="/home/<username>/miniconda3/bin:$PATH"
and remove or comment them out. Instead insert this line: . /home/<username>/miniconda3/etc/profile.d/conda.sh
Close and restart your terminal.
There are several repositories - called channels - available that hold the install instructions for the packages one would like to install. The most important one in bioinformatics is the channel bioconda. One needs to add this channel to conda's config, so it automatically searches there for packages. This is done by:
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
The order of adding these channels is important!
Using conda
I don't want to go into details on how to use conda; these can be read on the manual page or the cheat sheet.
After installation, conda has already created a first environment called base. One can get a list of all environments by:
$ conda env list
For our bioinformatics project let's create a new environment called bio and install the latest versions of bwa, samtools and picard.
$ conda create --name bio bwa samtools picard
conda will resolve all necessary dependencies. After completion, our new environment should be listed in the output of $ conda env list.
You can start this new environment by
$ conda activate bio
Let's see where the executable from bwa is located:
$ which bwa
/home/<username>/miniconda3/envs/bio/bin/bwa
We now want to add bcftools to our environment. To see if a package is available and which versions exist, type:
$ conda search bcftools
One can install the latest version by:
$ conda install bcftools
Or set the version explicitly:
$ conda install bcftools=1.7
Once we have a complete environment we'd like to share with someone, we can export the settings and the list of installed packages:
$ conda env export -n bio > bio.yml
To create an environment based on that file do this:
$ conda env create -f bio.yml
If you want to leave an active environment, deactivate it by
$ conda deactivate
Introduction to snakemake
What is snakemake?
This is what the homepage of snakemake says:
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language.
In other words: we can create rules like "alignment", "markduplicates", "variantcalling" using placeholders for filenames, and define what the input and output files look like. After that, we tell snakemake to do "variantcalling" for all fastq files in a specific folder. It will then figure out which of these rules have to be run, and in which order.
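As a minimal illustration of this idea (with hypothetical rule and file names, not yet the tutorial's workflow), a chain of rules in a Snakefile could look like this:

```python
# Hypothetical two-rule sketch: asked for "results/a.txt", snakemake sees
# it does not exist, matches it against the output pattern of rule "copy"
# (binding {name} to "a"), and runs that rule first.
rule all:
    input:
        "results/a.txt"

rule copy:
    input:
        "data/{name}.raw"
    output:
        "results/{name}.txt"
    shell:
        "cp {input} {output}"
```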
Installing snakemake
There are several ways described in the manual to install snakemake. In this tutorial we will install it in the bio environment we've created before:
$ conda install -n bio snakemake
if you are outside the environment, or
$ conda activate bio
$ conda install snakemake
to first go into the environment and install it there.
Creating and running a workflow
Imagine the following situation:
After an NGS run we will have a folder containing two fastq files per sample (Read 1 and Read 2). For each of these samples we'd like to perform an alignment, sort the bam file, deduplicate, index the resulting file, and copy the bam and index file into a folder called Alignment within the folder of fastq files.
As this is not a one-time job, we'd like to generalize all these steps. Doing so, we will be able to repeat them on similar data.
This is where snakemake
comes into play.
The data
Let's first have a look at the data and files we need. I will then go through them and explain.
Our folder containing the data looks like this:
$ tree /media/RunData/tutorialData/
/media/RunData/tutorialData
├── Sample1_S1_L001_R1_001.fastq.gz
├── Sample1_S1_L001_R2_001.fastq.gz
├── Sample2_S2_L001_R1_001.fastq.gz
├── Sample2_S2_L001_R2_001.fastq.gz
├── Sample3_S3_L001_R1_001.fastq.gz
└── Sample3_S3_L001_R2_001.fastq.gz
Create a file named Snakefile. This will contain all the rules we need in our workflow:
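A sketch of what this Snakefile could look like, reconstructed from the walk-through further below - the rule names, the config keys, the wildcard pattern and the temp() outputs follow the description, while the exact shell command lines (bwa mem, samtools sort/index, picard MarkDuplicates) are illustrative assumptions:

```python
# Sketch only - command lines are illustrative, not the original post's code.
configfile: "config.json"

# Extract the sample names from the fastq file names:
SAMPLES, = glob_wildcards(config['data'] + "/{id}_L001_R1_001.fastq.gz")

# Target rule: collect the deduplicated bam and index files for all
# samples and copy them into the Alignment folder.
rule move:
    input:
        bam=expand("{sample}.bam", sample=SAMPLES),
        bai=expand("{sample}.bai", sample=SAMPLES)
    shell:
        "mkdir -p {config[data]}/Alignment && "
        "cp {input.bam} {input.bai} {config[data]}/Alignment/"

# Align the paired reads and sort the result; the sorted bam is only an
# intermediate file, so it is marked temp().
rule align_sort:
    input:
        r1=config['data'] + "/{sample}_L001_R1_001.fastq.gz",
        r2=config['data'] + "/{sample}_L001_R2_001.fastq.gz"
    output:
        temp("{sample}.sorted.bam")
    shell:
        "bwa mem {config[genome]} {input.r1} {input.r2} "
        "| samtools sort -o {output} -"

# Mark duplicates and index the resulting bam file.
rule markduplicates:
    input:
        "{sample}.sorted.bam"
    output:
        bam="{sample}.bam",
        bai="{sample}.bai"
    shell:
        "picard MarkDuplicates I={input} O={output.bam} M={output.bam}.metrics "
        "&& samtools index {output.bam} {output.bai}"
```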
And a file named config.json
. This file will hold the path to the data and the reference genome we like to align against. After completion of the workflow definition, this will be the only file we have to edit to run the workflow again on other data.
{
    "data": "/media/RunData/tutorialData",
    "genome": "/media/Databases/Genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFASTA/genome.fa"
}
A quick walk-through the Snakefile
snakemake extends the python language by adding syntax to define rules and other control structures. By default it reads our file from the top, executing every piece of python code it finds, until it reaches the first rule definition. It then has a look at the input files and checks which other rules need to run to produce these files.
Besides the input directive there are others we can/must define, like output or shell. I don't want to go into details here; I recommend having a look at the tutorial on snakemake's homepage.
At the very beginning we load our config.json
. Within this file we have defined in which folder our data is located and against which reference genome we like to align.
The fastq files follow a specific pattern, so we can extract the sample's name from them. This is done by SAMPLES, = glob_wildcards(config['data']+"/{id}_L001_R1_001.fastq.gz").
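What glob_wildcards does here can be sketched in plain Python (using the re module instead of snakemake; the file names are those from the data folder shown above):

```python
import re

# File names as in the tutorial's data folder:
files = [
    "Sample1_S1_L001_R1_001.fastq.gz",
    "Sample1_S1_L001_R2_001.fastq.gz",
    "Sample2_S2_L001_R1_001.fastq.gz",
    "Sample2_S2_L001_R2_001.fastq.gz",
    "Sample3_S3_L001_R1_001.fastq.gz",
    "Sample3_S3_L001_R2_001.fastq.gz",
]

# The {id} wildcard corresponds to a capture group; only R1 files match
# the pattern, so every sample name appears exactly once:
pattern = re.compile(r"^(?P<id>.+)_L001_R1_001\.fastq\.gz$")
samples = [m.group("id") for f in files if (m := pattern.match(f))]
print(samples)  # ['Sample1_S1', 'Sample2_S2', 'Sample3_S3']
```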
The first rule in the Snakefile ("move") is our target rule. snakemake will replace the placeholder {sample} with each sample name it found in the folder in the step before. So we will then have Sample1_S1.bam, Sample1_S1.bai, Sample2_S2.bam, Sample2_S2.bai and so on. snakemake will check whether these files already exist; if not, it tries to find a rule that describes how to generate them, and will find it in the rule markduplicates. This is repeated until the files given in input are found in the filesystem. So in our case the generated workflow will be align_sort, markduplicates and move.
You see that some of the defined output files are marked as temporary (by temp(filename)). Doing this, snakemake will clean up these files once they are no longer used by any other rule.
Run the workflow
Make sure you are in the conda environment we've created before. If not already done, step into it.
$ conda activate bio
Open the config.json and edit it so it fits your needs:
$ nano config.json
First let's try a dry-run and print out the commands that will be used:
$ snakemake -np
If everything looks fine, start the workflow:
$ snakemake
Sharing the workflow
Let's summarize what we have done so far. We've created an environment which contains snakemake and the programs we need to run. We've created a Snakefile
which defines our workflow. And we have a config file for this workflow.
So sharing is straightforward. We have to export our environment definition and put this file along the others. On the new location one can build the identical environment and start running the workflow.
Let's export our environment:
$ conda activate bio
$ conda env export > bio.yml
If you take a look inside bio.yml you will see that it now contains information about each program that was installed. For sharing you can tidy this up a bit, keeping only the necessary programs along with their versions. In our case the file could look like this:
name: bio
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- bwa=0.7.17=ha92aebf_3
- picard=2.18.11=0
- samtools=1.9=h46bd0b3_0
- snakemake=5.2.1=0
On the new location you can import it now by:
$ conda env create -f bio.yml
Step into the environment:
$ conda activate bio
And run snakemake:
$ snakemake
It is good practice to put your Snakefile (and other files used in the workflow) together with the environment definition into a git repository. Doing so, you have version control over the workflow and the programs used in it.
Multiple environments
Let's go a little bit further. Imagine you want to use a program whose dependencies are incompatible with the other programs. snakemake depends on python3; maybe you want to use a program that still depends on python2.
Let's assume we want to extend the workflow above by a step that removes soft-clipped bases from our aligned reads. A program that can do this is bamutils removeclipping. If you try to install it via conda in our existing environment, conda will tell you it can't, because it needs python2 while snakemake needs python3.
Luckily we can provide an environment file to each rule in snakemake. snakemake will then create this environment and execute the rule in it. bamutils is part of ngsutils. Let's create a file called bio2.yml:
name: bio2
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- ngsutils=0.5.9=py27_1
And this is what our new Snakefile looks like:
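The relevant addition is a new rule carrying the conda directive; a sketch of that rule (the exact bamutils command form is an illustrative assumption) could be:

```python
# New rule between align_sort and markduplicates; markduplicates would now
# take the clipping-removed bam as its input. The conda directive points
# to the environment file we just created.
rule removeClipping:
    input:
        "{sample}.sorted.bam"
    output:
        temp("{sample}.noclip.bam")
    conda:
        "bio2.yml"
    shell:
        "bamutils removeclipping {input} {output}"
```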
Pay attention to the conda directive in the rule removeClipping. If we want to run snakemake now, we have to tell it to use this:
$ snakemake --use-conda
The first time you start it, it will take some time, as snakemake needs to download the packages and create the environment. The next time you start it, everything runs as fast as usual.
Summary and notes on the workflow
snakemake lets us create workflows for recurrent tasks. In combination with conda we can create an environment we can share, which allows us to get reproducible results on other computers. Using the conda directive in the rule definition and running snakemake with the --use-conda parameter makes it possible to have more than one environment per workflow, and to use programs whose dependencies cannot otherwise be resolved by conda.
If we store all these files in a git repository, we additionally have versioning of the workflow and the programs used.
The workflow I described here is just for demonstration purposes; it is not optimized for performance. Furthermore, ngsutils is a very outdated package, and removing soft-clipped bases is normally not needed.
fin swimmer
Really great! - thank you finswimmer.
Hmm. So I suppose this is great for running many different linear workflows on your local machine.
But if you wanted to put a workflow system (that supports parallel execution and containers) into production on a cluster - what would you use?
Snakemake actually works on clusters and submits parallel jobs.
Read more in this tutorial https://hpc-carpentry.github.io/hpc-python/17-cluster/
I have a question. If you have a script that already gets the files needed to run the analysis, but you have to pass multiple parameters:
How do you pass this many parameters to an argparse script file?
Thanks