Snakemake - generate an unkown file in input
1
0
Entering edit mode
4.9 years ago

Hello,

It's me again for a new snakemake question! This time, I want to read a table in input using a R script, to generate a list of reads to keep. But this list may came from multiple source, (genotype in this case).

I generate with my R script an output file {barecode}_list_{genotype}.txt .
barecode wildcard is already defined, and genotype can be either A/B/C/...

I tried something like this:

rule R_HBV_analysis:
input:
    R_data = datapath+"BLASTN/{barcode}_fmt.txt"
output:
    Rresult = datapath+"R_RESULT/{barcode}_list_{genotype}.txt",
wildcard_constraints:
    genotype =["GTA","GTB","GTC","GTD","GTE","GTF","GTG","GTH","GTI","GTJ"]

params:
    path = datapath  
shell:
    """
    if [ ! -d {params.path}RDATA ];then
        mkdir {params.path}RDATA
    fi
    if [ ! -d {params.path}R_RESULT ];then
        mkdir {params.path}R_RESULT
    fi
    Rscript script/HBV_analysis.R {input} {params.path}
    """

with

(GENOTYPE)={}
GENOTYPE['genotype'] =("GTA","GTB","GTC","GTD","GTE","GTF","GTG","GTH","GTI","GTJ")
rule all:
input:
    Rresults = expand(datapath+"R_RESULT/{barcode}_list_{genotype}.txt",barcode=BARCODE,genotype=GENOTYPE['genotype'])

But it return me an error, because snakemake expect a file for all genotypes, wich is not the case.

I hope I was clear in my question. thanks,

Hadrien

snakemake R • 1.4k views
ADD COMMENT
1
Entering edit mode

Sorry, how does your R-script know which of the genotypes are to be present in the output file?

ADD REPLY
0
Entering edit mode

In fact, I'm counting the most represented genotype in my dataset. So, depending on the dataset, it could be A, or B etc .....

ADD REPLY
1
Entering edit mode

Then why not have a single output file name used by each run of your Rscript, and encode the mode-genotype within the file, rather than within the filename?

ADD REPLY
0
Entering edit mode

Well, you totally rigth! I didn't think about it at all. It should be more easier to handle this way with snakemake. Thanks for this advice and all of your tips on the second answer.

ADD REPLY
2
Entering edit mode
4.9 years ago
russhh 5.7k

Although this isn't a code review site, do you mind if I rewrite your workflow? I'll add some comments to cover any confusing bits that might be the source of your problem.

I can't tell from your code whether datapath looks like "some/path/" or "some/path/some_prefix" (I've assumed the former), and as such I'd strongly recommend you use os.path.join(...) or expand("{prefix}/R_RESULT/{barcode}_list_{genotype}.txt", ...) to build your result paths more explicitly.

# You didn't define `datapath` and `barcodes`, so I'm assuming you've defined them in some
# config file so that your pipeline is reusable:
configfile: "my/config.yaml"

datapath = config["datapath"]
BARCODES = config["barcodes"]
# please put these into a config as well ...
GENOTYPES = ["GTA","GTB","GTC","GTD","GTE","GTF","GTG","GTH","GTI","GTJ"]

result_path = os.path.join(datapath, "R_RESULT")
blastn_path = os.path.join(datapath, "BLASTN")

# what does this rule tell the computer to do? For each combination of barcode and genotype
# that the user specifies,  generate the file "<datapath>/R_RESULT/<barcode>_list_<genotype>.txt"
rule all:
    input:
        expand(
            "{result_path}/{barcode}_list_{genotype}.txt",
            result_path=result_path,
            barcode=BARCODES,
            genotype=GENOTYPES
        )

# This rule tells the computer:
# If you want the file (singular) "<datapath>/R_RESULT/<barcode>_list_<genotype>.txt",
# then run this R-script and it will generate that file.
# 
# Since you have specified that you're interested in 10 genotypes for each barcode (see your
# `rule all`),  this rule will be ran 10 * |number of barcodes| times.
#
# But, for a given barcode, the input to this rule would be the same for each of those 10 runs,
# Should you have a {genotype} stub in your input filepath?
#
# Or, If this rule should run once for each barcode, but during that run it generates a file for each of the
# 10 genotypes, then you should change the output to
# `expand(join(datapath, "R_RESULT", "{{barcode}}_list_{genotype}.txt"), genotype = GENOTYPES)`
#
rule R_HBV_analysis:
    input:
        R_data = os.path.join(blastn_path, {barcode}_fmt.txt")
    output:
        R_result = os.path.join(result_path, "{barcode}_list_{genotype}.txt"),
    wildcard_constraints:
        # I hadn't seen lists of values passed in as wildcard_constraints before, I usually
        # use `genotype = "|".join(GENOTYPES)` to build the constraint regex
        # please check that the syntax is correct
        genotype = GENOTYPES
    params:
        path = datapath,
        result_path = result_path,
        # I don't quite get why this path is `mkdir`ed; if it's contents are made by your script, you could
        # add it as a `directory()` output of this rule.
        #
        # Nonetheless, I moved their definition to here, to reduce the duplication 
        # in your shell command
        data_path = os.path.join(datapath, "RDATA")
    shell:
        """
        if [ ! -d "{params.data_path}" ];then
          mkdir "{params.data_path}"
        fi
        if [ ! -d "{params.result_path}" ];then
          mkdir "{params.result_path}"
        fi
        Rscript script/HBV_analysis.R {input} {params.path}
        """
ADD COMMENT
0
Entering edit mode

Thx for your feedback. I'm still not very familiar with snakemake, that explain why i'm sometime using unapropriate functions etc... I'll analyze this carefully, thanks.

The idea is not having 10 files per barcode, but only one file, determined by the Rscript. So it seems i've coded something I dont want to!
I should probably rework my rule all, but I dont know if it's possible to define someting more like :

expand(
    "{result_path}/{barcode}_list_{genotype}.txt",
    result_path=result_path,
    barcode=BARCODES,
    genotype in GENOTYPES
ADD REPLY
0
Entering edit mode

Could you explain in english what you want to do, please (rather than in approximate snakemake syntax). For each barcode, is there a single genotype associated with it?

ADD REPLY

Login before adding your answer.

Traffic: 1872 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6