Downsample BAM for targeted NGS panels
5.7 years ago
NB ▴ 960

Hello

I am downsampling BAMs from control samples down to 10% to check the minimum read depth at which we can still detect a certain set of SNPs.

What approach do people usually adopt for this?

1) downsample the same BAM 3 times with different seeds and then take the average of the read depth?

sambamba view -h -f bam -t 10 --subsampling-seed=3 -s 0.1 "$BAM" -o "${downsample}_0.10_seed3.bam"
sambamba view -h -f bam -t 10 --subsampling-seed=2 -s 0.1 "$BAM" -o "${downsample}_0.10_seed2.bam"
sambamba view -h -f bam -t 10 --subsampling-seed=1 -s 0.1 "$BAM" -o "${downsample}_0.10_seed1.bam"

2) do it just once

sambamba view -h -f bam -t 10 --subsampling-seed=34223 -s 0.1 "$BAM" -o "${downsample}_0.10.bam"

Is subsampling-seed relevant for reproducibility, or is it just an arbitrary number?

downsample targeted-NGS sambamba • 2.9k views

subsampling-seed relevant or just a number

I would expect you to get the same set of reads each time if you use the same seed.


I've edited my question a bit. So does the seed help with reproducibility as well?


You can easily test it with sampling a small number of reads.

You can also try reformat.sh from BBMap to do the subsampling. I think it should work with a BAM file. It gives you a rich set of options for sampling.
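The quick test suggested above could look something like the sketch below: run the same subsampling twice with one seed and compare which read names survive. The function names are made up for illustration, and the flags are just the ones from the question, so treat it as a starting point rather than a verified recipe.

```shell
# Sketch only: function names are hypothetical; flags are those used in the question.

# Print the sorted read names kept by one subsampling run.
subsample_names() {
  local bam="$1" seed="$2" fraction="$3"
  # Without "-f bam", sambamba view writes SAM records to stdout;
  # column 1 of each record is the read name.
  sambamba view --subsampling-seed="$seed" -s "$fraction" "$bam" | cut -f1 | sort
}

# Succeed (exit status 0) if two runs with the same seed keep the same reads.
same_seed_is_reproducible() {
  local bam="$1" seed="$2" fraction="$3" first second
  first=$(subsample_names "$bam" "$seed" "$fraction" | cksum)
  second=$(subsample_names "$bam" "$seed" "$fraction" | cksum)
  [ "$first" = "$second" ]
}
```

Something like `same_seed_is_reproducible "$BAM" 1 0.1 && echo "same reads both times"` would then answer the question for your own data. Using a small fraction keeps the test fast.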

5.7 years ago

At Sheffield Children's NHS Foundation Trust, we already did this in 2013/4 and found that a total position read depth of 18 was the minimum at which one should be reporting [edit: germline] variants.

The general workflow was:

  1. obtain a few dozen patient samples that had matched NGS and Sanger data over our regions of interest
  2. downsample the aligned BAMs using Picard's DownsampleSam - I believe we chose 75%, 50%, and 25% random reads
  3. check the last known position read depth at which all Sanger-confirmed variants were called

That was it. To obtain better precision, one could generate even more downsampled BAMs. Had we had time to publish, my plan was to downsample in 5% decrements, from 100% to 5%.
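Step 3 of the workflow can be sketched as a set comparison: for each downsampled BAM, list the Sanger-confirmed variants that the caller no longer reports. The helper name and the one-key-per-line file format (for example `chr1:12345:A>G`) are assumptions for illustration, not what we actually ran.

```shell
# Hypothetical helper: print the confirmed variants that are absent from a call set.
# Both files hold one variant key per line, e.g. "chr1:12345:A>G" (assumed format).
# Exit status is 0 when every confirmed variant was called, 1 otherwise.
missing_confirmed_variants() {
  local confirmed="$1" called="$2"
  awk 'FILENAME == ARGV[1] { seen[$0] = 1; next }   # first file: remember called variants
       !($0 in seen)       { print; missing = 1 }   # second file: report uncalled ones
       END                 { exit missing + 0 }' "$called" "$confirmed"
}
```

Running this for each downsampling level, from highest to lowest, shows where confirmed variants start to drop out. The depth at the last level with no missing variants is the figure of interest.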

It was through this process that we also inadvertently 'recovered' the missed GATK variants, i.e., we would frequently encounter Sanger-confirmed variants, not in the original BAM, but in one of the downsampled BAMs.

Example for 75% random reads:

# Keep a random 75% of reads; a fixed RANDOM_SEED makes the selection repeatable
java -jar "${Picard_root}"picard.jar DownsampleSam \
  INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.75 \
  VALIDATION_STRINGENCY=SILENT ;

# Index the downsampled BAM
"${SAMtools_root}"samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam ;
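The 5% decrements mentioned above could be scripted as a loop around the same Picard call. This is only a sketch: `downsample_series` is a hypothetical name, the output naming simply mirrors the 75% example, and the series starts at 95% because 100% is the original BAM.

```shell
# Sketch only: downsample_series is a hypothetical name; the Picard arguments
# are the ones used in the 75% example.
downsample_series() {
  local input="$1" seed="$2" pct=95 prob
  while [ "$pct" -ge 5 ]; do
    # Convert the percentage to the fraction PROBABILITY expects, e.g. 75 -> 0.75.
    prob=$(awk -v p="$pct" 'BEGIN { printf "%.2f", p / 100 }')
    java -jar "${Picard_root}"picard.jar DownsampleSam \
      INPUT="$input" \
      OUTPUT="${input%.bam}_${pct}pcReads.bam" \
      RANDOM_SEED="$seed" PROBABILITY="$prob" \
      VALIDATION_STRINGENCY=SILENT
    pct=$((pct - 5))
  done
}
```

For example, `downsample_series Aligned_Sorted_PCRDuped_FiltMAPQ.bam 50` would write 19 BAMs, from 95% down to 5%. Each one would still need indexing as in the example above.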

Kevin


I guess adding a fixed RANDOM_SEED will help to reproduce the results.


That's pretty much the plan: to downsample from 50% to 10% and then check the read depth at each level. I was leaning towards my second option of doing it just once, but it was suggested that I downsample the same BAM three times and take the average read depth.
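For the three-replicate option, the averaging itself is simple. Below is a rough sketch assuming the per-position depths come from something like `samtools depth` (three tab-separated columns: chromosome, position, depth). Both function names are made up for illustration.

```shell
# Mean depth from "samtools depth"-style input (chrom, position, depth) on stdin.
mean_depth() {
  awk '{ sum += $3 } END { if (NR > 0) printf "%.2f\n", sum / NR; else print "NA" }'
}

# Average several per-replicate mean depths given as arguments.
average() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}
```

Each replicate would be piped through `mean_depth` (for example `samtools depth -a -b targets.bed replicate.bam | mean_depth`, where `targets.bed` is a hypothetical panel BED file). The three results would then be passed to `average`. If the three means barely differ, that is a hint that a single downsampling run is enough.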


So why did you use a random seed of 50 here?


Spun a coin? Not sure. That is another parameter to test. I believe that, in a standard routine run, it should be set to null so that the seed changes between runs.
