Question

Snakemake - generate an unkown file in input

0

Entering edit mode

5.5 years ago

regue.hadrien ▴ 40

Hello,

It's me again for a new snakemake question! This time, I want to read a table in input using a R script, to generate a list of reads to keep. But this list may came from multiple source, (genotype in this case).

I generate with my R script an output file {barecode}_list_{genotype}.txt .
barecode wildcard is already defined, and genotype can be either A/B/C/...

I tried something like this:

rule R_HBV_analysis:
input:
    R_data = datapath+"BLASTN/{barcode}_fmt.txt"
output:
    Rresult = datapath+"R_RESULT/{barcode}_list_{genotype}.txt",
wildcard_constraints:
    genotype =["GTA","GTB","GTC","GTD","GTE","GTF","GTG","GTH","GTI","GTJ"]

params:
    path = datapath  
shell:
    """
    if [ ! -d {params.path}RDATA ];then
        mkdir {params.path}RDATA
    fi
    if [ ! -d {params.path}R_RESULT ];then
        mkdir {params.path}R_RESULT
    fi
    Rscript script/HBV_analysis.R {input} {params.path}
    """

with

(GENOTYPE)={}
GENOTYPE['genotype'] =("GTA","GTB","GTC","GTD","GTE","GTF","GTG","GTH","GTI","GTJ")
rule all:
input:
    Rresults = expand(datapath+"R_RESULT/{barcode}_list_{genotype}.txt",barcode=BARCODE,genotype=GENOTYPE['genotype'])

But it return me an error, because snakemake expect a file for all genotypes, wich is not the case.

I hope I was clear in my question. thanks,

Hadrien

snakemake R • 1.9k views

ADD COMMENT • link updated 5.5 years ago by russhh 5.8k • written 5.5 years ago by regue.hadrien ▴ 40

1

Entering edit mode

Sorry, how does your R-script know which of the genotypes are to be present in the output file?

ADD REPLY • link 5.5 years ago by russhh 5.8k

0

Entering edit mode

In fact, I'm counting the most represented genotype in my dataset. So, depending on the dataset, it could be A, or B etc .....

ADD REPLY • link 5.5 years ago by regue.hadrien ▴ 40

1

Entering edit mode

Then why not have a single output file name used by each run of your Rscript, and encode the mode-genotype within the file, rather than within the filename?

ADD REPLY • link 5.5 years ago by russhh 5.8k

0

Entering edit mode

Well, you totally rigth! I didn't think about it at all. It should be more easier to handle this way with snakemake. Thanks for this advice and all of your tips on the second answer.

ADD REPLY • link 5.5 years ago by regue.hadrien ▴ 40

score 2 · Answer 1 · 2020-02-07

Although this isn't a code review site, do you mind if I rewrite your workflow? I'll add some comments to cover any confusing bits that might be the source of your problem.

I can't tell from your code whether datapath looks like "some/path/" or "some/path/some_prefix" (I've assumed the former), and as such I'd strongly recommend you use os.path.join(...) or expand("{prefix}/R_RESULT/{barcode}_list_{genotype}.txt", ...) to build your result paths more explicitly.

# You didn't define `datapath` and `barcodes`, so I'm assuming you've defined them in some
# config file so that your pipeline is reusable:
configfile: "my/config.yaml"

datapath = config["datapath"]
BARCODES = config["barcodes"]
# please put these into a config as well ...
GENOTYPES = ["GTA","GTB","GTC","GTD","GTE","GTF","GTG","GTH","GTI","GTJ"]

result_path = os.path.join(datapath, "R_RESULT")
blastn_path = os.path.join(datapath, "BLASTN")

# what does this rule tell the computer to do? For each combination of barcode and genotype
# that the user specifies,  generate the file "<datapath>/R_RESULT/<barcode>_list_<genotype>.txt"
rule all:
    input:
        expand(
            "{result_path}/{barcode}_list_{genotype}.txt",
            result_path=result_path,
            barcode=BARCODES,
            genotype=GENOTYPES
        )

# This rule tells the computer:
# If you want the file (singular) "<datapath>/R_RESULT/<barcode>_list_<genotype>.txt",
# then run this R-script and it will generate that file.
# 
# Since you have specified that you're interested in 10 genotypes for each barcode (see your
# `rule all`),  this rule will be ran 10 * |number of barcodes| times.
#
# But, for a given barcode, the input to this rule would be the same for each of those 10 runs,
# Should you have a {genotype} stub in your input filepath?
#
# Or, If this rule should run once for each barcode, but during that run it generates a file for each of the
# 10 genotypes, then you should change the output to
# `expand(join(datapath, "R_RESULT", "{{barcode}}_list_{genotype}.txt"), genotype = GENOTYPES)`
#
rule R_HBV_analysis:
    input:
        R_data = os.path.join(blastn_path, {barcode}_fmt.txt")
    output:
        R_result = os.path.join(result_path, "{barcode}_list_{genotype}.txt"),
    wildcard_constraints:
        # I hadn't seen lists of values passed in as wildcard_constraints before, I usually
        # use `genotype = "|".join(GENOTYPES)` to build the constraint regex
        # please check that the syntax is correct
        genotype = GENOTYPES
    params:
        path = datapath,
        result_path = result_path,
        # I don't quite get why this path is `mkdir`ed; if it's contents are made by your script, you could
        # add it as a `directory()` output of this rule.
        #
        # Nonetheless, I moved their definition to here, to reduce the duplication 
        # in your shell command
        data_path = os.path.join(datapath, "RDATA")
    shell:
        """
        if [ ! -d "{params.data_path}" ];then
          mkdir "{params.data_path}"
        fi
        if [ ! -d "{params.result_path}" ];then
          mkdir "{params.result_path}"
        fi
        Rscript script/HBV_analysis.R {input} {params.path}
        """