Hello
I am downsampling BAMs from control samples to 10% to find the minimum read depth at which we can still detect a certain set of SNPs.
What approach do people usually adopt for this?
1) Downsample the same BAM 3 times with different seeds and then take the average of the read depth?
sambamba view -h -f bam -t 10 --subsampling-seed=3 -s 0.1 "$BAM" -o "${downsample}_seed3_0.10.bam"
sambamba view -h -f bam -t 10 --subsampling-seed=2 -s 0.1 "$BAM" -o "${downsample}_seed2_0.10.bam"
sambamba view -h -f bam -t 10 --subsampling-seed=1 -s 0.1 "$BAM" -o "${downsample}_seed1_0.10.bam"
2) Do it just once?
sambamba view -h -f bam -t 10 --subsampling-seed=34223 -s 0.1 "$BAM" -o "${downsample}_0.10.bam"
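For option 1, a minimal sketch of how the replicates could be run and summarised, assuming sambamba and samtools are on PATH and the input BAM is coordinate-sorted. The function names (mean_depth, downsample_replicates) and the output naming scheme are made up for illustration, and mean depth over all positions is only one possible summary; restricting to the SNP positions may be more informative.

```shell
#!/usr/bin/env bash
# Sketch only: downsample one BAM with several seeds and report mean depth per seed.
# Assumptions: sambamba and samtools are installed; the BAM is coordinate-sorted.

# Mean of column 3 (depth) of `samtools depth` output read from stdin.
mean_depth() {
    awk '{ sum += $3; n++ } END { if (n > 0) printf "%.2f\n", sum / n; else print "NA" }'
}

# Usage: downsample_replicates input.bam seed [seed ...]
# Writes one 10% subsample per seed (hypothetical naming scheme) and prints
# "seed<TAB>mean depth" for each of them.
downsample_replicates() {
    local bam=$1
    shift
    local seed out
    for seed in "$@"; do
        out="${bam%.bam}.sub0.10.seed${seed}.bam"
        sambamba view -h -f bam -t 10 --subsampling-seed="$seed" -s 0.1 "$bam" -o "$out"
        printf '%s\t%s\n' "$seed" "$(samtools depth -a "$out" | mean_depth)"
    done
}

# Example call (uncomment once BAM points at a real file):
# downsample_replicates "$BAM" 1 2 3
```

Each seed gets its own output file, so the replicates do not overwrite each other, and the per-seed depths can then be averaged or compared for spread.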
Is --subsampling-seed relevant for reproducibility, or is it just an arbitrary number?
I expect that you will get the same set of reads every time if you use the same seed.
I've edited my question a bit. So does the seed help with reproducibility as well?
You can easily test it by sampling a small number of reads twice with the same seed and comparing the results.
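A rough sketch of such a test, assuming sambamba is on PATH. The helper names (checksum, subsample_checksum, same_subsample) and the 0.1% sampling fraction are arbitrary choices for illustration. It draws a small subsample twice with the same seed and compares checksums of the SAM records.

```shell
#!/usr/bin/env bash
# Sketch only: check whether a fixed seed yields the same subsample on repeated runs.
# Assumption: sambamba is installed; without -f it writes SAM records to stdout.

# Print a checksum of whatever arrives on stdin.
checksum() {
    cksum | awk '{ print $1 }'
}

# Usage: subsample_checksum input.bam seed
# Draws a 0.1% subsample with the given seed and prints a checksum of its records.
subsample_checksum() {
    local bam=$1 seed=$2
    sambamba view --subsampling-seed="$seed" -s 0.001 "$bam" | checksum
}

# Usage: same_subsample input.bam seed
# Succeeds (exit status 0) if two runs with the same seed give identical output.
same_subsample() {
    local bam=$1 seed=$2
    [ "$(subsample_checksum "$bam" "$seed")" = "$(subsample_checksum "$bam" "$seed")" ]
}

# Example call (uncomment once BAM points at a real file):
# same_subsample "$BAM" 42 && echo "same reads" || echo "different reads"
```

Running it with two different seeds in place of the repeated one should, in contrast, give different checksums.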
You can also try reformat.sh from BBMap to do the subsampling. I think it should work with a BAM file, and it gives you a rich set of options for sampling.