I would like to use snakemake to analyze my data sets. As I am going to work with different organisms, I would like snakemake to create a folder for each of them when indexing the genome.
let's say I would like to work with human and mouse data.
In my config.yaml
file I have the following snippet:
organism:
"Dmel"
# "Dpse"
Each time I am working with an organism I can comment out the others in the list.
Below are the rules for getting the fastA files as well as the annotation and the indexing step (as an example) with STAR.
#### get reference genomic data for the mapping (fasta and gtf files). Links are added in the config file
rule get_genome:
output:
fastA="genome/genome.fa",
gtf="genome/genome.gtf"
shell:
"""
wget -nc -O - {config[fastA]} | gunzip -c - > {output.fastA}
wget -nc -O - {config[gtf]} | gunzip -c - > {output.gtf}
"""
### Indexing the reference genome
rule star_index:
input:
fasta="genome/genome.fa",
gtf="genome/genome.gtf"
output:
directory("genome/starIndex/")
threads: 16
params:
prefix = {config["organism"]}
shell:
"mkdir -p {output} && "
"STAR --runThreadN {threads} "
"--outFileNamePrefix {output}{params.prefix} "
"--runMode genomeGenerate "
"--genomeDir {output} "
"--limitGenomeGenerateRAM {config[RAM]} "
"--genomeSAindexNbases {config[SAindex]} "
"--genomeFastaFiles {input.fasta} "
"--sjdbGTFfile {input.gtf} "
"--sjdbOverhang 100"
I would like to know how to add the value from the config file in the organism, e.g. config[organism]
into the rules, so that, when I download the fastA and gtf they will be renamed accordingly.
I have tried to add config[organism]
to the path, but it gives me an error. I have also tried this for example:
rule get_genome:
output:
fastA="genome/{config[organism]}.fa",
gtf="genome/{config[organism]}.gtf"
But it just creates the files {config[organism]}.gtf
and {config[organism]}.fa
.
Can someone please help me understand how I can add the parameter from the config file to the different paths in the various rules?
thanks
thanks for this great example. I will try it immediately. I was just wondering what you mean in the
run
comment.Something in the way adding a list in the config file like:
But where does snakemake know how to get the link for each of these genomes?
That'd be completely up to your design in
config.yaml
. For instance, if your yaml looks like this,You can access the
url
viaconfig['organisms'][wildcards.organism]['url']
inget_genome
.Your recipe can then be defined with the following,
I don't know what happened, I thought I have got it, but it seems to be not working anymore.
This is how my
config.yaml
file look likeand than I have my
snakefile
But when I try to run it I get this error message:
But the path to my fasta file should be correct, as it seems from the command
which gives me this ftp://ftp.ensembl.org/pub/current_fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa.gz
Where do I go wrong here?
Sorry for the late response. I was away this weekend. Snakemake's bracket markup only replaces a variable with its string representation. In other words, it doesn't evaluate code. To accomplish what you're trying to do, I'd use something like a
param
.