Tutorial:Use a workflow management tool to manage your computational pipelines
1
16
Entering edit mode
8.1 years ago
Eric Lim ★ 2.2k

Our group recently had to take over the development and maintenance of a collaborator's computational workflow. I received an email from the PI this morning with an attachment of what had been done. It is, I kid you not, 900 lines of pure torture in a single bash script. From nested if-else and for loops to bash scripts making bash scripts to more pipe than Mario can possibly handle, I wonder if the use of any workflow management tool is uncommon in our field.

I only started hanging out here a couple of weeks ago and I seldom see questions about the use of any workflow management system. I thought perhaps I could start one and hope the rest of the community would post code snippets to encourage others to start using one.

I have used Ruffus, Luigi/SciLuigi, NextFlow, and Make, but ultimately settled with Snakemake.

Here is a good collection of existing frameworks and libraries. Jeremy Leipzig, a veteran here, my ex officemate, recently published this paper.

I'm going to start by including a very simple Snakemake's workflow to turn RNASeq data from fastqs to sorted bam using STAR with mostly default parameters. It'd really be great if the community will start posting some examples of your favorite frameworks.

"""
wf.py
A simple Snakemake's workflow to process RNA-Seq using STAR.
"""
import os

# define an ENCODE's library name and their associated accessions
# as an example for this workflow
samples = {
    'ENCLB555AOP': {
        '1': 'ENCFF000FXU',
        '2': 'ENCFF000FYF'
    }
}

localrules: run_STAR
shell.prefix("source ~/.bash_profile; set -euo pipefail;")

rule run_STAR:
    """
    Run simple STAR workflow.
    """
    input: expand('results/{samples}/{samples}.star.sorted.bam', samples=samples.keys())

rule download_sample:
    """
    Download RNA-Seq data from ENCODE.
    """
    output: fq = '{prefix}/{sample}_{read}.fq.gz'
    run:
        acc = samples[wildcards.sample][wildcards.read]
        shell('wget -O {output.fq} https://www.encodeproject.org/files/{acc}/@@download/{acc}.fastq.gz')

rule download_ensembl_gtf:
    """
    Download GRCh37 GTF from ensembl.
    """
    output: gtf = 'references/ensembl.gtf'
    params: url = 'ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/Homo_sapiens.GRCh37.82.gtf.gz'
    run:
        shell("""
              wget -O {output.gtf}.gz {params.url}
              pigz -d {output.gtf}.gz
              """)

rule download_ensembl_dna:
    """
    Download genomic DNA sequences from ensembl.
    """
    output: fa ='references/dna.fa'
    params: url = 'ftp://ftp.ensembl.org/pub/grch37/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz'
    run:
        shell("""
              wget -O {output.fa}.gz {params.url}
              pigz -d {output.fa}.gz
              """)

rule index_star:
    """
    Index genomic DNA sequences using STAR.
    """
    input: fa = rules.download_ensembl_dna.output.fa,
           gtf = rules.download_ensembl_gtf.output.gtf
    output: Genome = 'references/Genome'
    threads: 8
    run:
        res_dir = os.path.dirname(output.Genome)
        shell("""
              STAR --runMode genomeGenerate        \
                   --runThreadN {threads}          \
                   --genomeFastaFiles {input.fa}   \
                   --sjdbGTFfile {input.gtf}       \
                   --genomeDir {res_dir}/
              """)

rule align_star:
    """
    Align sequencing reads using STAR.
    """
    input: fq1 = '{prefix}_1.fq.gz',
           fq2 = '{prefix}_2.fq.gz',
           idx = rules.index_star.output.Genome
    output: bam = '{prefix}.star.bam',
            sj = '{prefix}.star.sj'
    threads: 8
    run:
        res_dir = os.path.dirname(output.bam)
        idx_dir = os.path.dirname(input.idx)
        shell("""
              STAR --runThreadN {threads}                 \
                   --runMode alignReads                   \
                   --readFilesCommand pigz -d -c          \
                   --outSAMtype BAM Unsorted              \
                   --genomeDir {idx_dir}/                 \
                   --outFileNamePrefix {res_dir}/         \
                   --readFilesIn {input.fq1} {input.fq2}
              mv {res_dir}/Aligned.out.bam {output.bam}
              mv {res_dir}/SJ.out.tab {output.sj}   
              """)

rule sort_bam:
    """
    Sort bam by coordinates using sambamba.
    """
    input: bam = '{prefix}.bam'
    output: bam = '{prefix}.sorted.bam'
    params: mem = '35G'
    threads: 8
    run:
        tmp_dir = os.path.dirname(output.bam)
        shell('sambamba sort --tmpdir {tmp_dir} -t {threads} -m {params.mem} -o {output.bam} {input.bam}')

Assuming you have a standalone machine with 32 cores.

snakemake -s wf.py run_STAR -j32

To submit the workflow to a SGE scheduler,

snakemake -s wf.py run_STAR -j999 -c "qsub -pe smp {threads}"

This example only uses some basic features of Snakemake. To learn more, visit here.

workflow-management snakemake pipeline • 6.2k views
ADD COMMENT
3
Entering edit mode
ADD REPLY
3
Entering edit mode
6.1 years ago
alaincoletta ▴ 170

For Nextflow, I always recommend to anybody get started with implementations patterns. There is a really nice repository with plenty of examples:

https://github.com/nextflow-io/patterns

And maybe follow up with one of the workflows:

https://nextflow-io.github.io/hack17-varcall/

This post shows an example running a pipeline to process RNA-Seq Data (Download, align, count reads) for more than 40 samples (70 GB) in around 2h.

https://medium.com/@alaincoletta/how-to-create-a-nextflow-pipeline-with-all-necessary-tools-for-a-series-of-repetitive-35cf5b3b19a8

ADD COMMENT

Login before adding your answer.

Traffic: 2397 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6