Let's say I want to genotype a set of BAMs using GATK. A basic DSL2 Nextflow workflow would look like this:
workflow genotype {
    take:
    reference
    beds
    bams

    main:
    // one HaplotypeCaller job per (bam, bed) pair
    hc = haplotypecaller(reference, bams.combine(beds))
    // group the gVCFs by bed, then combine them per interval
    bed2vcf = combinegvcf(hc.groupTuple())
    // concatenate the per-bed VCFs into the final VCF
    vcf = gathervcfs(bed2vcf.collect())
}
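For completeness, a minimal sketch of how this sub-workflow might be invoked from the entry workflow; the params names (params.reference, params.beds, params.bams) are assumptions for the example, not part of the real pipeline:

// hypothetical entry point: build the channels and call the sub-workflow above
workflow {
    reference = params.reference                  // e.g. path to the reference FASTA
    beds      = Channel.fromPath(params.beds)     // one BED file per interval set
    bams      = Channel.fromPath(params.bams)     // one BAM file per sample
    genotype(reference, beds, bams)
}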
process haplotypecaller {
    input:
    val(reference)
    tuple path(bam), path(bed)

    output:
    tuple val(bed), path("sample.g.vcf.gz")

    script:
    """
    gatk HaplotypeCaller -R ${reference} -I ${bam} -L ${bed} -ERC GVCF -O sample.g.vcf.gz
    """
}
process combinegvcf {
    input:
    tuple val(bed), path(gvcfs)

    output:
    path("combined.vcf.gz")   // one combined VCF per bed; file name assumed here

    script:
    """
    (...)
    """
}
process gathervcfs {
    input:
    path(vcfs)

    output:
    path("final.vcf.gz")

    script:
    """
    (...)
    """
}
But then I'm asked to run this workflow on a set of BAMs that come from different mappers (bwa, bowtie). Hmm... OK, easy enough: I can add the mapper to the input tuple
tuple path(bam), path(bed), val(mapper)
and use a composite key when using operators like groupTuple()
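To make that concrete, here is a rough sketch of the grouping step; it assumes haplotypecaller has been changed to also emit the mapper alongside the bed, which is not shown above:

// sketch: group gVCFs by (bed, mapper) instead of bed alone
// assumes haplotypecaller now emits: tuple val(bed), val(mapper), path("sample.g.vcf.gz")
hc = haplotypecaller(reference, bams.combine(beds))
grouped = hc
    .map { bed, mapper, gvcf -> [ [bed, mapper], gvcf ] }   // composite key
    .groupTuple()                                           // -> [ [bed, mapper], [gvcf1, gvcf2, ...] ]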
But then I'm asked to run this new workflow with a ploidy parameter that changes with the sex (female/male) and with the BED region (PAR/X/Y/autosome). But then I'm asked to run it again with various values for --min-mapping-quality. But then... etc., etc.
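Just to illustrate where this naive approach ends up (a sketch of the problem, with made-up field names), the input declaration keeps growing:

// each new requirement adds another field to the tuple, and every downstream
// map()/groupTuple() has to know the exact position of every field
tuple path(bam), path(bed), val(mapper), val(sex), val(ploidy), val(min_mapping_quality)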
So my question is: what is the best practice to design and reuse a process/sub-workflow? My feeling is to use an associative map to store the parameters, but then how do I handle this map in the output? And how can I reuse things after groupTuple()?
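To be clear about what I mean by an associative map, here is a rough sketch of the idea (all keys, values and params names are made up, and the reference is passed via params just to keep it short); the part I can't figure out is how the map should travel through the outputs and through groupTuple():

// sketch of the idea: carry a 'meta' map alongside the files instead of a flat tuple
process haplotypecaller {
    input:
    tuple val(meta), path(bam), path(bed)

    output:
    tuple val(meta), path("sample.g.vcf.gz")   // does the whole map belong in the output?

    script:
    """
    gatk HaplotypeCaller -R ${params.reference} -I ${bam} -L ${bed} \\
        --sample-ploidy ${meta.ploidy} -ERC GVCF -O sample.g.vcf.gz
    """
}

workflow {
    bams = Channel.fromPath(params.bams)
        .map { bam -> [ [ sample: bam.baseName, mapper: 'bwa', sex: 'female', ploidy: 2 ], bam ] }
    beds = Channel.fromPath(params.beds)
    haplotypecaller(bams.combine(beds))
    // and this is where I get stuck: how do I groupTuple() by bed downstream
    // while keeping (or merging) the meta maps?
}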