Downsample BAM for targeted NGS panels
5.7 years ago
NB ▴ 960

Hello

I am downsampling BAMs down to 10% on control samples to check the minimum read depth at which we are able to detect a certain set of SNPs.

Which approach do people usually adopt for this?

1) downsample the same BAM 3 times with different seed options and then take the average of the read depths?

sambamba view -h -f bam -t 10 --subsampling-seed=3 -s 0.1 "$BAM" -o downsample_seed3_0.10.bam
sambamba view -h -f bam -t 10 --subsampling-seed=2 -s 0.1 "$BAM" -o downsample_seed2_0.10.bam
sambamba view -h -f bam -t 10 --subsampling-seed=1 -s 0.1 "$BAM" -o downsample_seed1_0.10.bam

2) do it just once

sambamba view -h -f bam -t 10 --subsampling-seed=34223 -s 0.1 "$BAM" -o downsample_0.10.bam

Is --subsampling-seed relevant for reproducibility, or is it just an arbitrary number?

downsample targeted-NGS sambamba

> is subsampling-seed relevant for reproducibility or just a number?

I speculate that you will get the same set of reads each time if you use the same seed.
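You can demonstrate this behaviour without sambamba at all. A minimal sketch with awk, using mock read IDs in place of a BAM (file names are illustrative): seeding the RNG with the same value reproduces the same subset, while a different seed gives a different one.

```shell
# Mock "reads": 1000 numeric IDs
seq 1 1000 > reads.txt

# Sample ~10% of lines; srand(seed) makes the draw deterministic
awk -v seed=42 'BEGIN { srand(seed) } rand() < 0.1' reads.txt > subset_a.txt
awk -v seed=42 'BEGIN { srand(seed) } rand() < 0.1' reads.txt > subset_b.txt
awk -v seed=7  'BEGIN { srand(seed) } rand() < 0.1' reads.txt > subset_c.txt

# Same seed -> identical subsets; a different seed -> (almost certainly) different reads
cmp -s subset_a.txt subset_b.txt && echo "same seed: identical subsets"
cmp -s subset_a.txt subset_c.txt || echo "different seed: different subsets"
```

The same logic should apply to sambamba's --subsampling-seed: a fixed seed makes the run reproducible, and varying the seed gives you independent replicate subsamples.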


I've edited my question a bit - so does this help with reproducibility as well?


You can easily test it by sampling a small number of reads.

You can also try reformat.sh from BBMap to do subsampling. I think it should work with a BAM file, and it gives you a rich set of sampling options.

5.7 years ago

At Sheffield Children's NHS Foundation Trust, we did this back in 2013/14 and found that a total position read depth of 18 was the minimum at which one should report [edit: germline] variants.

The general workflow was:

  1. obtain a few dozen patient samples that had matched NGS and Sanger data over our regions of interest
  2. downsample the aligned BAMs using Picard's DownsampleSam - I believe we chose 75%, 50%, and 25% random reads
  3. check the last known position read depth at which all Sanger-confirmed variants were called
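Step 3 can be sketched with synthetic data. Assuming per-position depths in the format produced by `samtools depth` (chrom, pos, depth) and a list of Sanger-confirmed variant positions, the quantity of interest is the minimum depth observed at any confirmed position — file names and values here are illustrative, not the original data:

```shell
# Synthetic 'samtools depth'-style output: chrom, pos, depth
printf 'chr1\t100\t30\nchr1\t200\t18\nchr1\t300\t45\n' > depth.txt
# Sanger-confirmed variant positions: chrom, pos
printf 'chr1\t100\nchr1\t200\n' > confirmed.txt

# Minimum depth at any confirmed variant position -> 18
awk 'NR==FNR { want[$1":"$2] = 1; next }
     ($1":"$2) in want && (min == "" || $3 < min) { min = $3 }
     END { print min }' confirmed.txt depth.txt
```

Running this per downsampled BAM, and checking which confirmed variants are still called at each level, gives the depth threshold described above.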

That was it. To obtain better precision, one could generate even more downsampled BAMs. Had we had time to publish, my plan was to downsample in 5% decrements, from 100% to 5%.

It was through this process that we also inadvertently 'recovered' missed GATK variants, i.e., we would frequently encounter Sanger-confirmed variants that were not called in the original BAM but were called in one of the downsampled BAMs.

Example for 75% random reads:

java -jar "${Picard_root}"picard.jar DownsampleSam \
  INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.75 \
  VALIDATION_STRINGENCY=SILENT ;

"${SAMtools_root}"samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam ;
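The 5% decrement sweep mentioned above could be scripted as a loop over probabilities. This is a dry-run sketch reusing the invocation from the example (commands are echoed rather than executed, and ${Picard_root} and the file names are illustrative):

```shell
# Echo one DownsampleSam command per fraction, from 95% down to 5%
for pct in $(seq 95 -5 5); do
  prob=$(awk -v p="$pct" 'BEGIN { printf "%.2f", p / 100 }')
  echo java -jar "${Picard_root}picard.jar" DownsampleSam \
    INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
    OUTPUT="Aligned_Sorted_PCRDuped_FiltMAPQ_${pct}pcReads.bam" \
    RANDOM_SEED=50 "PROBABILITY=${prob}" \
    VALIDATION_STRINGENCY=SILENT
done
```

Dropping the `echo` runs the 19 downsampling jobs for real; each output BAM would then be indexed as in the example above.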

Kevin


I guess adding a fixed RANDOM_SEED will help to reproduce the results.


That's pretty much the plan - to downsample from 50% down to 10% and then check the read depth. I was opting for my second option of doing it just once, but it was suggested that I downsample the same BAM three times and then take the average read depth.


So why did you use a random seed of 50 here?


Spun a coin? Not sure - that is another parameter to test. I believe that in a standard routine run it should be left null, so that the seed changes each time.
