Hi, I have multiple sample bams which are split over two lanes. e.g. for 2 samples:
flowcells: ["CCGB1ANXX-", "CCHNTANXX-"]
samples: ["3666-02-45-1_S2", "3666-03-45-1_S3"]
lanes: ["L005", "L006"]
I am trying to use snakemake to merge these bams. The problem I have is that it is not treating the sample ids separately in the merge rule but rather reading all sample bams and merging into one (for how ever many input samples there are so for two samples it is repeating the 'wrong' rule twice, for three samples, three times etc). How do I correct the rule so it only merges bams with the same sample id? Also, I have actually sorted these before this rule - should I sort after merging? My code is:
rule mergeBams:
input:
bams = expand(intermediateDirectory + "{flowcell}{sample}_{lane}.unsorted.bam",flowcell = flowcells, sample = samples, lane=lanes)
output:
pileup = intermediateDirectory + "{sample}.merged.bam"
log:
"logs/{sample}.germlinePileup.log"
threads:
8
resources:
mem = "8GB",
time = "10:00:00"
container:
singularityContainers + "samtools-1.9.simg"
shell:
"samtools merge -@ {threads} {output.pileup} {input.bams}"
The std output from snakemake when running (it runs the rule twice):
samtools merge -@ 8 3666-03-45-1_S3.merged.bam CCGB1ANXX-3666-02-45-1_S2_L005.unsorted.bam CCGB1ANXX-3666-02-45-1_S2_L006.unsorted.bam CCGB1ANXX-3666-03-45-1_S3_L005.unsorted.bam CCGB1ANXX-3666-03-45-1_S3_L006.unsorted.bam CCHNTANXX-3666-02-45-1_S2_L005.unsorted.bam CCHNTANXX-3666-02-45-1_S2_L006.unsorted.bam CCHNTANXX-3666-03-45-1_S3_L005.unsorted.bam CCHNTANXX-3666-03-45-1_S3_L006.unsorted.bam
.....
samtools merge -@ 8 3666-02-45-1_S2.merged.bam CCGB1ANXX-3666-02-45-1_S2_L005.unsorted.bam CCGB1ANXX-3666-02-45-1_S2_L006.unsorted.bam /CCGB1ANXX-3666-03-45-1_S3_L005.unsorted.bam CCGB1ANXX-3666-03-45-1_S3_L006.unsorted.bam CCHNTANXX-3666-02-45-1_S2_L005.unsorted.bam CCHNTANXX-3666-02-45-1_S2_L006.unsorted.bam CCHNTANXX-3666-03-45-1_S3_L005.unsorted.bam CCHNTANXX-3666-03-45-1_S3_L006.unsorted.bam
Many thanks! I'm a bit of a newbie....would you mind explaining how/why this works?
Sure, what expand function does is to replace the variable marked by {} with the provided list (flowcells and lanes in your case), it will try to find all unique combinations of the list elements. For example, if flowcells = [f1, f2] and lanes=[l1, l2], there will be "f1" + "l1", "f1" + "l2", "f2" + "l1", "f2" + "l2" four combinations.
However, expand function doesn't know which sample id is needed for each output file, it just includes all possible sample id, that's why you found that the input files are basically all the possible input files you can get given the three lists (samples, flowcells, lanes)
The right way to let snakemake know which sample id (the one specified in output file name) is wanted is to use wildcards. Unfortunately, snakemake uses the same format to specify wildcards "{}", so to prevent expand function trying to interpret wildcards, double bracket is used here to let expand function know no replacement is needed.
Ref: snakemake documentation: expand function