how to index different genomes with snakemake with different parameters?
1
2
Entering edit mode
5.1 years ago
Assa Yeroslaviz ★ 1.9k

I would like to use snakemake to analyze my data sets. As I am going to work with different organisms, I would like snakemake to create a folder for each of them when indexing the genome.

let's say I would like to work with human and mouse data.

In my config.yaml file I have the following snippet:

organism:
 "Dmel"
# "Dpse"

Each time I am working with an organism I can comment out the others in the list.

Below are the rules for getting the fastA files as well as the annotation and the indexing step (as an example) with STAR.

#### get reference genomic data for the mapping (fasta and gtf files). Links are added in the config file
rule get_genome:
   output:
         fastA="genome/genome.fa",
         gtf="genome/genome.gtf"
   shell:
         """
         wget -nc -O - {config[fastA]} | gunzip -c - > {output.fastA}
         wget -nc -O - {config[gtf]}   | gunzip -c - > {output.gtf}
         """

### Indexing the reference genome

rule star_index:
   input:
          fasta="genome/genome.fa",
          gtf="genome/genome.gtf"
   output:
          directory("genome/starIndex/")
   threads: 16
   params:
         prefix = {config["organism"]}
   shell:
          "mkdir -p {output} && "
          "STAR --runThreadN {threads} "
          "--outFileNamePrefix {output}{params.prefix} "
          "--runMode genomeGenerate "
          "--genomeDir {output} "
          "--limitGenomeGenerateRAM {config[RAM]} "
          "--genomeSAindexNbases {config[SAindex]} "
          "--genomeFastaFiles {input.fasta} "
          "--sjdbGTFfile {input.gtf} "
          "--sjdbOverhang 100"

I would like to know how to add the value from the config file in the organism, e.g. config[organism] into the rules, so that, when I download the fastA and gtf they will be renamed accordingly.

I have tried to add config[organism] to the path, but it gives me an error. I have also tried this for example:

rule get_genome:
   output:
         fastA="genome/{config[organism]}.fa",
         gtf="genome/{config[organism]}.gtf"

But it just creates the files {config[organism]}.gtf and {config[organism]}.fa.

Can someone please help me understand how I can add the parameter from the config file to the different paths in the various rules?

thanks

snakemake indexing mapping • 2.8k views
ADD COMMENT
4
Entering edit mode
5.1 years ago
Eric Lim ★ 2.2k

You'd need wildcards. https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards

[~/Data/scratch/tmp/biostar]$ cat wildcards.py
rule:
    input: expand("{organisms}/starIndex/", organisms=config['organism'])

rule get_genome:
    output: fa = touch('{organism}/{organism}.fa'),
            gtf = touch('{organism}/{organism}.gtf')
    run:
        # remove `touch` and implement logic to download fa and gtf
        # you can access organism via wildcards, i.e. config[wildcards.organism]
        pass

rule star_index:
    input: fa = '{organism}/{organism}.fa',
           gtf = '{organism}/{organism}.gtf'
    output: directory("{organism}/starIndex/")
    shell: 'mkdir -p {output}'

[~/Data/scratch/tmp/biostar]$ snakemake -s wildcards.py --config organism=GRCh38
...
[~/Data/scratch/tmp/biostar]$ tree GRCh38/
GRCh38/
├── GRCh38.fa
├── GRCh38.gtf
└── starIndex

1 directory, 2 files

In this case, config['organism'] can be a list of organisms and the workflow will expand itself. Hope the example helps.

ADD COMMENT
0
Entering edit mode

thanks for this great example. I will try it immediately. I was just wondering what you mean in the run comment.

Something in the way adding a list in the config file like:

organism:
     ["GRCh38", "GRCm38", "BDGP6.22"]

But where does snakemake know how to get the link for each of these genomes?

ADD REPLY
2
Entering edit mode

That'd be completely up to your design in config.yaml. For instance, if your yaml looks like this,

[~/Data/scratch/tmp/biostar]$ cat config.yaml 
organisms:
  human:
    url: google.com
  mouse:
    url: yahoo.com

You can access the url via config['organisms'][wildcards.organism]['url'] in get_genome.

Your recipe can then be defined with the following,

rule:
    input: expand("{organisms}/starIndex/", organisms=config['organisms'].keys())
ADD REPLY
0
Entering edit mode

I don't know what happened, I thought I have got it, but it seems to be not working anymore.

This is how my config.yaml file look like

organism:
  Dmel:
    fasta: "ftp://ftp.ensembl.org/pub/current_fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa.gz"
    gtf: "ftp://ftp.ensembl.org/pub/current_gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.22.98.gtf.gz"
  Dpse:
    fasta: "ftp://ftp.ensemblgenomes.org/pub/current/metazoa/fasta/drosophila_pseudoobscura/dna/Drosophila_pseudoobscura.Dpse_3.0.dna.toplevel.fa.gz"
    gtf: "ftp://ftp.ensemblgenomes.org/pub/current/metazoa/gtf/drosophila_pseudoobscura/Drosophila_pseudoobscura.Dpse_3.0.45.gtf.gz"

and than I have my snakefile

configfile: "config.yaml"

rule all:
    input:
        expand("genome/{org}/starIndex/", org=config['organism']),

    #### get reference genomic data for the mapping (fasta and gtf files). Links are added in the config file
rule get_genome:
   output:
         fastA="genome/{org}.fa",
         gtf="genome/{org}.gtf"
   shell:
         """
         wget -nc -O - {config['organism'][wildcards.org]['fasta']} | gunzip -c - > {output.fastA}
         wget -nc -O - {config['organism'][wildcards.org]['gtf']}   | gunzip -c - > {output.gtf}
         """      
### Indexing the reference genome  
rule star_index:
   input:
          fasta="genome/{org}.fa",
          gtf="genome/{org}.gtf"
   output:
          directory("genome/{org}/starIndex/")
   threads: 16
   params:
         prefix = lambda wildcards: "{org}".format(org=wildcards.org)
   shell:
          "mkdir -p {output} && "
          "STAR --runThreadN {threads} "
   ...

But when I try to run it I get this error message:

$snakemake -s getGenome_IndexGenome.Snakefile  -np -j 30
Building DAG of jobs...
Job counts:
        count   jobs
        1       all
        2       get_genome
        2       star_index
        5

[Mon Nov 18 10:59:16 2019]
rule get_genome:
    output: genome/Dmel.fa, genome/Dmel.gtf
    jobid: 4
    wildcards: org=Dmel

RuleException in line 16 of /local/Assa/projects/automation/P135.automation/getGenome_IndexGenome.Snakefile:
NameError: The name "'organism'" is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}

But the path to my fasta file should be correct, as it seems from the command

print(config['organism']['Dmel']['fasta'])

which gives me this ftp://ftp.ensembl.org/pub/current_fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.22.dna.toplevel.fa.gz

Where do I go wrong here?

ADD REPLY
1
Entering edit mode

Sorry for the late response. I was away this weekend. Snakemake's bracket markup only replaces a variable with its string representation. In other words, it doesn't evaluate code. To accomplish what you're trying to do, I'd use something like a param.

rule get_genome:
    output: fa="genome/{org}.fa"
    params: url = lambda wildcards: config['organism'][wildcards.org]['fasta']
    shell: 'wget -nc -O - {params.url} | gunzip -c - > {output.fa}'
ADD REPLY

Login before adding your answer.

Traffic: 1758 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6