Aim of this tutorial
Do you know one of these situations?
- During the installation of a new bioinformatics tool you have to solve many dependency problems.
- Only a very old version of the tool is available through the package manager of your OS.
- Trying out a new program breaks a running and tested workflow.
- You are tired of entering the same commands again and again.
- You asked a colleague to test your commands and workflow on their PC, but they ran into the problems above.
These problems can be resolved quite easily using conda and snakemake. This tutorial gives a brief introduction to these two programs and shows how they can work together to create a workflow that one can easily share and deploy across different platforms with reproducible results.
Introduction to conda
What is conda?
This is what the homepage of conda says:
Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer.
In other words: one can have different versions of a program installed side by side on the local computer. During installation one can specify which versions are needed. Such an environment can be exported and imported on a different computer, which makes it possible to have the same configuration, including program versions, there. One can test new software without breaking any other running programs.
Installing conda
The installation process I describe here is for Linux, but for Windows and macOS it should be quite similar. Go to the download page, download the appropriate version and execute this file without root privileges with $ sh ./Miniconda3-latest-Linux-x86_64.sh
The only difference between the python 2.7 and the python 3.6 version is which python is the default in a new environment; one can always set it explicitly. I would recommend using the 64-bit python 3.6 version.
After completing the installation you will find a folder named miniconda3 in your home folder.
For conda to run properly within your shell, one has to make changes to the .bashrc (if you're using bash) or .zshrc.local (if you are using zsh).
In these files look for lines like this: export PATH="/home/<username>/miniconda3/bin:$PATH"
and remove or comment them out. Instead insert this line: . /home/<username>/miniconda3/etc/profile.d/conda.sh
Close and restart your terminal.
There are several repositories - called channels - available that hold the install instructions for the packages one would like to install. The most important one in bioinformatics is the channel bioconda. One needs to add this channel to conda's config, so it automatically searches there for packages. This is done by:
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
The order of adding these channels is important!
Using conda
I don't want to go into details on how to use conda; these can be read on the manual page or the cheat sheet.
After installation, conda has already created a first environment called base. One can get a list of all environments by:
$ conda env list
For our bioinformatics project let's create a new environment called bio and install the latest versions of bwa, samtools and picard.
$ conda create --name bio bwa samtools picard
conda will resolve all necessary dependencies. After completion, our new environment should be listed in the output of $ conda env list.
You can start this new environment by
$ conda activate bio
Let's see where the executable from bwa is located:
$ which bwa
/home/<username>/miniconda3/envs/bio/bin/bwa
We now want to add bcftools to our environment. To see if a package is available and which versions exist, type:
$ conda search bcftools
One can install the latest version by:
$ conda install bcftools
Or set the version explicitly:
$ conda install bcftools=1.7
Once we have a complete environment we'd like to share with someone, we can export the settings and the list of installed packages:
$ conda env export -n bio > bio.yml
To create an environment based on that file do this:
$ conda env create -f bio.yml
If you want to leave an active environment, deactivate it by
$ conda deactivate
Introduction to snakemake
What is snakemake?
This is what the homepage of snakemake says:
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language.
In other words: we can create rules like "alignment", "markduplicates", "variantcalling" using placeholders for filenames, and define what the input and output files look like. After that, we tell snakemake to do "variantcalling" for all fastq files in a specific folder. It will then figure out which of these rules have to be run, and in which order.
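As a minimal illustration of this idea (with hypothetical rule and file names, not yet the tutorial's workflow), a chain of rules in a Snakefile could look like this:

```python
# Hypothetical two-rule sketch: asked for "results/a.txt", snakemake sees
# it does not exist, matches it against the output pattern of rule "copy"
# (binding {name} to "a"), and runs that rule first.
rule all:
    input:
        "results/a.txt"

rule copy:
    input:
        "data/{name}.raw"
    output:
        "results/{name}.txt"
    shell:
        "cp {input} {output}"
```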
Installing snakemake
There are several ways described in the manual to install snakemake. In this tutorial we will install it in the bio environment we've created before:
$ conda install -n bio snakemake
if you are outside the environment, or
$ conda activate bio
$ conda install snakemake
to first go into the environment and install it there.
Creating and running a workflow
Imagine the following situation:
After an NGS run we will have a folder containing two fastq files per sample (Read 1 and Read 2). For each of these samples we'd like to perform an alignment, sort the bam file, deduplicate, index the resulting file, and copy the bam and index file into a folder called Alignment within the folder of fastq files.
As this is not a one-time job, we'd like to generalize all these steps. Doing so, we will be able to repeat them on similar data.
This is where snakemake
comes into play.
The data
Let's first have a look at the data and files we need. I will then go through them and explain.
Our folder containing the data looks like this:
$ tree /media/RunData/tutorialData/
/media/RunData/tutorialData
├── Sample1_S1_L001_R1_001.fastq.gz
├── Sample1_S1_L001_R2_001.fastq.gz
├── Sample2_S2_L001_R1_001.fastq.gz
├── Sample2_S2_L001_R2_001.fastq.gz
├── Sample3_S3_L001_R1_001.fastq.gz
└── Sample3_S3_L001_R2_001.fastq.gz
Create a file named Snakefile. This will contain all the rules we need in our workflow:
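A sketch of what this Snakefile could look like, reconstructed from the walk-through further below - the rule names, the config keys, the wildcard pattern and the temp() outputs follow the description, while the exact shell command lines (bwa mem, samtools sort/index, picard MarkDuplicates) are illustrative assumptions:

```python
# Sketch only - command lines are illustrative, not the original post's code.
configfile: "config.json"

# Extract the sample names from the fastq file names:
SAMPLES, = glob_wildcards(config['data'] + "/{id}_L001_R1_001.fastq.gz")

# Target rule: collect the deduplicated bam and index files for all
# samples and copy them into the Alignment folder.
rule move:
    input:
        bam=expand("{sample}.bam", sample=SAMPLES),
        bai=expand("{sample}.bai", sample=SAMPLES)
    shell:
        "mkdir -p {config[data]}/Alignment && "
        "cp {input.bam} {input.bai} {config[data]}/Alignment/"

# Align the paired reads and sort the result; the sorted bam is only an
# intermediate file, so it is marked temp().
rule align_sort:
    input:
        r1=config['data'] + "/{sample}_L001_R1_001.fastq.gz",
        r2=config['data'] + "/{sample}_L001_R2_001.fastq.gz"
    output:
        temp("{sample}.sorted.bam")
    shell:
        "bwa mem {config[genome]} {input.r1} {input.r2} "
        "| samtools sort -o {output} -"

# Mark duplicates and index the resulting bam file.
rule markduplicates:
    input:
        "{sample}.sorted.bam"
    output:
        bam="{sample}.bam",
        bai="{sample}.bai"
    shell:
        "picard MarkDuplicates I={input} O={output.bam} M={output.bam}.metrics "
        "&& samtools index {output.bam} {output.bai}"
```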
And a file named config.json
. This file will hold the path to the data and the reference genome we like to align against. After completion of the workflow definition, this will be the only file we have to edit to run the workflow again on other data.
{
    "data": "/media/RunData/tutorialData",
    "genome": "/media/Databases/Genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFASTA/genome.fa"
}
A quick walk-through the Snakefile
snakemake extends the python language by adding syntax to define rules and other control structures. By default it reads our file from the top, executing every piece of python code it finds, until it reaches the first rule definition. It then has a look at the input files and checks which other rules need to run to produce these files.
Besides the input directive there are others we can/must define, like output or shell. I don't want to go into details here; I recommend having a look at the tutorial on snakemake's homepage.
At the very beginning we load our config.json
. Within this file we have defined in which folder our data is located and against which reference genome we like to align.
The fastq files follow a specific pattern, so we can extract the sample's name from them. This is done by SAMPLES, = glob_wildcards(config['data']+"/{id}_L001_R1_001.fastq.gz").
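What glob_wildcards does here can be sketched in plain Python (using the re module instead of snakemake; the file names are those from the data folder shown above):

```python
import re

# File names as in the tutorial's data folder:
files = [
    "Sample1_S1_L001_R1_001.fastq.gz",
    "Sample1_S1_L001_R2_001.fastq.gz",
    "Sample2_S2_L001_R1_001.fastq.gz",
    "Sample2_S2_L001_R2_001.fastq.gz",
    "Sample3_S3_L001_R1_001.fastq.gz",
    "Sample3_S3_L001_R2_001.fastq.gz",
]

# The {id} wildcard corresponds to a capture group; only R1 files match
# the pattern, so every sample name appears exactly once:
pattern = re.compile(r"^(?P<id>.+)_L001_R1_001\.fastq\.gz$")
samples = [m.group("id") for f in files if (m := pattern.match(f))]
print(samples)  # ['Sample1_S1', 'Sample2_S2', 'Sample3_S3']
```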
The first rule in the Snakefile ("move") is our target rule. snakemake will replace the placeholder {sample} with each sample name it found in the folder in the step before. So we will then have Sample1_S1.bam, Sample1_S1.bai, Sample2_S2.bam, Sample2_S2.bai and so on. snakemake will check whether these files already exist; if not, it tries to find a rule that describes how to generate them, and will find it in the rule markduplicates. This is repeated until the files given in input are found in the filesystem. So in our case the generated workflow will be align_sort, markduplicates and move.
You see that some of the defined output files are marked as temporary (by temp(filename)). Doing this, snakemake will clean up these files once they are no longer used by any other rule.
Run the workflow
Make sure you are in the conda environment we've created before. If not already done, step into it.
$ conda activate bio
Open the config.json and edit it so it fits your needs:
$ nano config.json
First let's try a dry-run and print out the commands that will be used:
$ snakemake -np
If everything looks fine, start the workflow:
$ snakemake
Sharing the workflow
Let's summarize what we have done so far. We've created an environment which contains snakemake and the programs we need to run. We've created a Snakefile
which defines our workflow. And we have a config file for this workflow.
So sharing is straightforward. We have to export our environment definition and put this file along the others. On the new location one can build the identical environment and start running the workflow.
Let's export our environment:
$ conda activate bio
$ conda env export > bio.yml
If you take a look inside bio.yml you will see that it now contains information about each program that was installed. For sharing you can tidy this up a bit, keeping only the necessary programs along with their versions. In our case the file could look like this:
name: bio
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- bwa=0.7.17=ha92aebf_3
- picard=2.18.11=0
- samtools=1.9=h46bd0b3_0
- snakemake=5.2.1=0
On the new location you can import it now by:
$ conda env create -f bio.yml
Step into the environment:
$ conda activate bio
And run snakemake:
$ snakemake
It is good practice to put your Snakefile (and other files used in the workflow) together with the environment definition into a git repository. Doing so, you have version control over the workflow and the programs used in it.
Multiple environments
Let's go a little bit further. Imagine you want to use a program whose dependencies are incompatible with the other programs. snakemake depends on python3; maybe you want to use a program that still depends on python2.
Let's assume we want to extend the workflow above by a step that removes soft-clipped bases from our aligned reads. A program that can do this is bamutils removeclipping. If you try to install it via conda in our existing environment, conda will tell you it can't, because it needs python2 while snakemake needs python3.
Luckily we can provide an environment file to each rule in snakemake. snakemake will then create this environment and execute the rule in it. bamutils is part of ngsutils. Let's create a file called bio2.yml:
name: bio2
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- ngsutils=0.5.9=py27_1
And this is what our new Snakefile looks like:
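The relevant addition is a new rule carrying the conda directive; a sketch of that rule (the exact bamutils command form is an illustrative assumption) could be:

```python
# New rule between align_sort and markduplicates; markduplicates would now
# take the clipping-removed bam as its input. The conda directive points
# to the environment file we just created.
rule removeClipping:
    input:
        "{sample}.sorted.bam"
    output:
        temp("{sample}.noclip.bam")
    conda:
        "bio2.yml"
    shell:
        "bamutils removeclipping {input} {output}"
```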
Pay attention to the conda directive in the rule removeClipping. If we want to run snakemake now, we have to tell it to use this:
$ snakemake --use-conda
The first time you start it, it will take some time, as snakemake needs to download the packages and create the environment. The next time you start it, everything runs as fast as usual.
Summary and notes on the workflow
snakemake lets us create workflows for recurrent tasks. In combination with conda we can create an environment we can share, which allows us to get reproducible results on other computers. Using the conda directive in the rule definition and running snakemake with the --use-conda parameter makes it possible to have more than one environment per workflow, and to use programs whose dependencies cannot otherwise be resolved by conda.
If we store all these files in a git repository, we additionally have versioning of the workflow and the programs used.
The workflow I described here is just for demonstration purposes; it is not optimized for performance. Furthermore, ngsutils is a very outdated package, and removing soft-clipped bases is normally not needed.
fin swimmer
Really great! - thank you finswimmer.
Hmm. So I suppose this is great for running many different linear workflows on your local machine.
But if you wanted to put a workflow system (that supports parallel execution and containers) into production on a cluster - what would you use?
Snakemake actually works on clusters and submits parallel jobs.
Read more in this tutorial https://hpc-carpentry.github.io/hpc-python/17-cluster/
I have a question. If you have a script that already gets the files needed to run the analysis, but you have to pass multiple parameters:
How do you pass this many parameters to an argparse script file?
Thanks