Align split reads with bowtie2 using snakemake
2.8 years ago
Ivan ▴ 60

I want to use snakemake to map split read files to a reference genome with bowtie2, and I'd like that rule to be integrated into the overall workflow. For that purpose, I first defined a rule to build a bowtie2 index:

rule build_bowtie_index:
    input:
        referenceGenomeFasta
    output:
        expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
        expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
    params:
        btindex = referenceGenomeName
    shell:
        "bowtie2-build {input} {params.btindex}"

referenceGenomeFasta is the reference genome fasta file, while referenceGenomeName is the same path without the .fna suffix. The goal was to take a foo.fna reference genome file and produce the index files foo.1.bt2, foo.2.bt2, ..., foo.rev.1.bt2, foo.rev.2.bt2, because bowtie2 requires that the index passed via the -x flag be the file name up to the final dot (the index basename), so I figured I'd store referenceGenomeName as a global configurable variable.
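For reference, a minimal sketch of how such a global variable could be derived at the top of the Snakefile (foo.fna is just an example path, not my actual file):

import os

# reference genome fasta, e.g. read from a config file
referenceGenomeFasta = "foo.fna"
# bowtie2 index basename: the same path without the .fna suffix -> "foo"
referenceGenomeName = os.path.splitext(referenceGenomeFasta)[0]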

Now I have a list of files containing split reads, named like sampleFoo_1.fastq, sampleFoo_2.fastq, sampleBar_1.fastq, sampleBar_2.fastq, etc. When I used bash scripts for mapping, the command was:

bowtie2 -x {myindex} -1 {sample}_1.fastq -2 {sample}_2.fastq | samtools view -bS > {sample}.aligned.bam

I also used parallel processing via Sun Grid Engine to speed up the computation, as I have a lot of samples. Now I want to port the workflow to snakemake. My rule for bowtie2 mapping must take my samples and the outputs of the build_bowtie_index rule as input. The whole thing looks like this:

SAMPLES = ["Foo", "Bar"]
rule align_via_bowtie2: 
    input:
        firstPair=expand("{rawSampleFolder}/{samples}_1.fastq",samples=SAMPLES, rawSampleFolder=rawSampleFolder),
        secondPair=expand("{rawSampleFolder}/{samples}_2.fastq",samples=SAMPLES, rawSampleFolder=rawSampleFolder),
        btindex1=expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
        btindex2=expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
    params:
        btindex = referenceGenomeName
    output:
        expand("{proccSampleFolder}/{samples}.aligned.bam", samples=SAMPLES, proccSampleFolder=proccSampleFolder)
    shell:
        "bowtie2 -x {params.btindex} -1 {input.firstPair} -2 {input.secondPair} | samtools view -bS -> {output}"

When I run this with SAMPLES = ["Foo"], the script works perfectly, and the command that gets run is:

 bowtie2 -x ref_genome -1 raw_samples/Foo_1.fastq -2 raw_samples/Foo_2.fastq | samtools view -bS -> proccessed_samples/Foo.aligned.bam

But when I run the script with SAMPLES=["Foo","Bar"], what snakemake executes is:

 bowtie2 -x ref_genome -1 raw_samples/Foo_1.fastq raw_samples/Bar_1.fastq -2 raw_samples/Foo_2.fastq raw_samples/Bar_2.fastq | samtools view -bS -> proccessed_samples/Foo.aligned.bam proccessed_samples/Bar.aligned.bam

In other words, instead of mapping Foo first and Bar second, it passes both samples to a single bowtie2 call and I get an error. What I want is for each sample, i.e. each pair of split-read files, to be processed as a separate job. What am I doing wrong?

snakemake bowtie2
2.8 years ago
Ivan ▴ 60

After some tinkering, I found a working solution. I'm typing it below:

SAMPLES = ["Foo", "Bar"]  # a defined Python list of sample names
referenceGenomeName = "ref_genome"

rule all:
    input:
        expand("processed_files/{sample}.bam.bai",sample=SAMPLES),
rule align_via_bowtie2:
    input:
        "raw_samples/{sample}_1.fastq",
        "raw_samples/{sample}_2.fastq",
        expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
        expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
    params:
        btindex = referenceGenomeName
    output:
        temp("processed_files/{sample}.aligned.bam")
    shell:
        "bowtie2 -x {params.btindex} -1 {input[0]} -2 {input[1]} | samtools view -bS -> {output}"
rule samtools_fixmate:
    input:
        "processed_files/{sample}.aligned.bam"
    output:
        temp("processed_files/{sample}.fixmated.bam")
    shell:
        "samtools fixmate -m -O {input} {output}"
rule samtools_sort:
    input:
        "processed_files/{sample}.fixmated.bam"
    output:
        temp("processed_files/{sample}.sorted.bam")
    shell:
        "samtools sort {input} -o {output}"
rule samtools_markduplicates:
    input:
        "processed_files/{sample}.sorted.bam"
    output:
        "processed_files/{sample}.markduped.bam"
    shell:
        "samtools markdup {input} {output}"
rule samtools_index:
    input:
        "processed_files/{sample}.markduped.bam"
    output:
        "processed_files/{sample}.bam.bai",
    shell:
        """
        samtools index {input}
        """

So far I have only run the first rule - it produced the correct files in the processed_files directory; the remaining rules are just further processing of the BAMs. I've used the following command to start parallel jobs on a cluster:

snakemake -j 5 -p --cluster "qsub -V -cwd -now y"
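A possible further tweak (not part of the working version above, just a sketch): give the mapping rule a threads directive and pass it to bowtie2's -p option, so each cluster job can use several cores. On SGE the --cluster string would then also need the appropriate parallel environment (something like -pe smp {threads}, depending on how the cluster is configured).

rule align_via_bowtie2:
    input:
        "raw_samples/{sample}_1.fastq",
        "raw_samples/{sample}_2.fastq",
        expand("{name}.{index}.bt2", index=range(1,5), name=referenceGenomeName),
        expand("{name}.rev.{index}.bt2", index=range(1,3), name=referenceGenomeName)
    params:
        btindex = referenceGenomeName
    output:
        temp("processed_files/{sample}.aligned.bam")
    threads: 4
    shell:
        "bowtie2 -p {threads} -x {params.btindex} -1 {input[0]} -2 {input[1]} "
        "| samtools view -bS - > {output}"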