Question

snakemake to merge bams from multiple samples

2

Entering edit mode

3.8 years ago

samuel ▴ 260

Hi, I have multiple sample bams which are split over two lanes. e.g. for 2 samples:

flowcells: ["CCGB1ANXX-",  "CCHNTANXX-"]
samples: ["3666-02-45-1_S2", "3666-03-45-1_S3"]
lanes: ["L005", "L006"]

I am trying to use snakemake to merge these bams. The problem I have is that it is not treating the sample ids separately in the merge rule but rather reading all sample bams and merging into one (for how ever many input samples there are so for two samples it is repeating the 'wrong' rule twice, for three samples, three times etc). How do I correct the rule so it only merges bams with the same sample id? Also, I have actually sorted these before this rule - should I sort after merging? My code is:

rule mergeBams:
input:
    bams = expand(intermediateDirectory + "{flowcell}{sample}_{lane}.unsorted.bam",flowcell = flowcells, sample = samples, lane=lanes)
output:
    pileup = intermediateDirectory + "{sample}.merged.bam"
log:
    "logs/{sample}.germlinePileup.log"
threads:
    8
resources:
    mem = "8GB",
    time = "10:00:00"
container:
    singularityContainers + "samtools-1.9.simg"
shell:
    "samtools merge -@ {threads} {output.pileup} {input.bams}"

The std output from snakemake when running (it runs the rule twice):

samtools merge -@ 8 3666-03-45-1_S3.merged.bam CCGB1ANXX-3666-02-45-1_S2_L005.unsorted.bam CCGB1ANXX-3666-02-45-1_S2_L006.unsorted.bam CCGB1ANXX-3666-03-45-1_S3_L005.unsorted.bam CCGB1ANXX-3666-03-45-1_S3_L006.unsorted.bam CCHNTANXX-3666-02-45-1_S2_L005.unsorted.bam CCHNTANXX-3666-02-45-1_S2_L006.unsorted.bam CCHNTANXX-3666-03-45-1_S3_L005.unsorted.bam CCHNTANXX-3666-03-45-1_S3_L006.unsorted.bam

.....

samtools merge -@ 8 3666-02-45-1_S2.merged.bam CCGB1ANXX-3666-02-45-1_S2_L005.unsorted.bam CCGB1ANXX-3666-02-45-1_S2_L006.unsorted.bam /CCGB1ANXX-3666-03-45-1_S3_L005.unsorted.bam CCGB1ANXX-3666-03-45-1_S3_L006.unsorted.bam CCHNTANXX-3666-02-45-1_S2_L005.unsorted.bam CCHNTANXX-3666-02-45-1_S2_L006.unsorted.bam CCHNTANXX-3666-03-45-1_S3_L005.unsorted.bam CCHNTANXX-3666-03-45-1_S3_L006.unsorted.bam

snakemake • 2.5k views

ADD COMMENT • link updated 3.8 years ago by Jianyu ▴ 580 • written 3.8 years ago by samuel ▴ 260

score 4 · Accepted Answer · 2021-07-18

4

Entering edit mode

3.8 years ago

Jianyu ▴ 580

If i understand correctly, you want to merge sub-samples with the same sample id to one bam file right?

If that's the case, you should use double bracket "{{}}" to let expand function keep the "{sample}" string instead of replacing them. Try the following input rule

input:
   bams = expand(intermediateDirectory + "{flowcell}{{sample}}_{lane}.unsorted.bam",flowcell = flowcells, lane=lanes)

ADD COMMENT • link 3.8 years ago by Jianyu ▴ 580

0

Entering edit mode

Many thanks! I'm a bit of a newbie....would you mind explaining how/why this works?

ADD REPLY • link 3.8 years ago by samuel ▴ 260

1

Entering edit mode

Sure, what expand function does is to replace the variable marked by {} with the provided list (flowcells and lanes in your case), it will try to find all unique combinations of the list elements. For example, if flowcells = [f1, f2] and lanes=[l1, l2], there will be "f1" + "l1", "f1" + "l2", "f2" + "l1", "f2" + "l2" four combinations.

However, expand function doesn't know which sample id is needed for each output file, it just includes all possible sample id, that's why you found that the input files are basically all the possible input files you can get given the three lists (samples, flowcells, lanes)

The right way to let snakemake know which sample id (the one specified in output file name) is wanted is to use wildcards. Unfortunately, snakemake uses the same format to specify wildcards "{}", so to prevent expand function trying to interpret wildcards, double bracket is used here to let expand function know no replacement is needed.

Ref: snakemake documentation: expand function

ADD REPLY • link 3.8 years ago by Jianyu ▴ 580