I want to use snakemake
to do bowtie2
mapping of split read files to a reference genome, and I'd like that rule to be integrated in the general workflow.
For that purpose, I first defined a rule to create a bowtie index
rule build_bowtie_index:
input:
referenceGenomeFasta
output:
expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
params:
btindex = referenceGenomeName
shell:
"bowtie2-build {input} {params.btindex}"
referenceGenomeFasta
is reference genome fasta file, while referenceGenomeName
is the same file without .fna suffix. The goal was to take foo.fna
reference genome file and produce a list of index files foo.1.bt2, foo.2.bt2., foo.rev.bt1
etc. because bowtie2
requires that index (-X
flag) be the name of the folder up to final dot, so I figured I'd store the referenceGenomeName
as a global configurable variable.
Now I have a list of files containing split reads in the form of sampleFoo_1.fastq,sampleFoo_2.fastq, sampleBar_1.fastq,sampleBar_2.fastq
etc. When I used bash scripts for mapping, I used the command:
The rule for creating bowtie2 index
is the following:
bowtie2 -x {myindex} -1 {sample}_1.fastq -2 {sample}_2.fastq | samtools view -bS > {sample}.aligned.bam
I also used parallel processing via Sun Grid Engine to speed up the computation as I have a lot of samples. Now I want to transfer the workflow into snakemake. My rule for bowtie2 mapping must include my samples and the outputs of build_bowtie_index
rule. The total thing looks like this:
SAMPLES = ["Foo", "Bar"]
rule align_via_bowtie2:
input:
firstPair=expand("{rawSampleFolder}/{samples}_1.fastq",samples=SAMPLES, rawSampleFolder=rawSampleFolder),
secondPair=expand("{rawSampleFolder}/{samples}_2.fastq",samples=SAMPLES, rawSampleFolder=rawSampleFolder),
btindex1=expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
btindex2=expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
params:
btindex = referenceGenomeName
output:
expand("{proccSampleFolder}/{samples}.aligned.bam", samples=SAMPLES, proccSampleFolder=proccSampleFolder)
shell:
"bowtie2 -x {params.btindex} -1 {input.firstPair} -2 {input.secondPair} | samtools view -bS -> {output}"
When I run this script with SAMPLES=["Foo"]
the script works perfectly, and what's run is:
bowtie2 -x ref_genome -1 raw_samples/Foo_1.fastq -2 raw_samples/Foo_2.fastq | samtools view -bS -> proccessed_samples/Foo.aligned.bam
But when I run the script with SAMPLES=["Foo","Bar"]
, what snakemake executes is:
bowtie2 -x ref_genome -1 raw_samples/Foo_1.fastq raw_samples/Bar_1.fastq -2 raw_samples/Foo_2.fastq raw_samples/Bar_2.fastq | samtools view -bS -> proccessed_samples/Foo.aligned.bam proccessed_samples/Bar.aligned.bam
In other words, instead of mapping first Foo
, and Bar
second, it maps Foo Bar
and I get an error. My problem is that I want to process a list of files as a separate job - or two separate lists of files containing split reads. What am I doing wrong?