Hi, I have insane coverage in my BAM files (thanks to only 12-way multiplexing of a tiny genomic region). Since some programs can't handle such extreme coverage, I would like to downsample, i.e. randomly pick, say, 1000 aligned reads for each position (e.g. with
bamtools random: select random alignments from existing BAM file(s)), but I need to keep the mate pairs intact. Another way of putting it: I want to achieve a maximal read depth of 1000 at each position (a rough sketch of what I mean is below).
Does anyone have experience with downsampling? Are there any pitfalls I might not be aware of?
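To make the goal a bit more concrete, here is a rough, untested pysam sketch of the kind of thing I'm after. The file names are placeholders, and it caps reads per start coordinate rather than true per-base depth, so it will overshoot a little whenever a kept mate lands on an already-full position:

```python
# Rough sketch: cap coverage at ~1000 reads per start position while keeping
# mate pairs together. Illustrative only, not a finished tool.
import pysam
from collections import defaultdict

MAX_DEPTH = 1000

kept_names = set()               # names already kept, so mates always follow
start_counts = defaultdict(int)  # kept reads per (reference, start position)

with pysam.AlignmentFile("in.sorted.bam", "rb") as inp, \
     pysam.AlignmentFile("capped.bam", "wb", template=inp) as out:
    for read in inp:
        if read.is_unmapped:     # unmapped reads skipped for brevity
            continue
        key = (read.reference_id, read.reference_start)
        # Keep the read if its mate was already kept (preserve the pair),
        # or if this start position is still below the cap.
        if read.query_name in kept_names or start_counts[key] < MAX_DEPTH:
            start_counts[key] += 1
            kept_names.add(read.query_name)
            out.write(read)
```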
Well, it is a targeted resequencing experiment with multiple overlapping PCR fragments, so I have spikes of coverage going into the 100,000+ range and then areas which hardly have any reads at all. Conceptually I'm after the functionality GATK offers with -dcov, but I want the result written as a BAM file.
Downsampling support has been added to "samtools view" in SVN. It always keeps both ends of a pair using a very simple strategy (<10 lines of additional source code). Credit to Hideo Imamura at Sanger.
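If you can't update samtools yet, the same kind of trick is easy to mimic with pysam: hash the read name so that both mates of a pair get the same keep/drop decision. This is only an illustration of the idea, not the actual samtools code; the file names, seed, and fraction are made up:

```python
# Keep a fixed fraction of read pairs by hashing the read name; both mates
# share the name, so they always get the same decision. Illustrative only.
import hashlib
import pysam

FRACTION = 0.01   # keep roughly 1% of the pairs
SEED = 42

def keep(read_name):
    digest = hashlib.md5(f"{SEED}{read_name}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < FRACTION * 10_000

with pysam.AlignmentFile("in.bam", "rb") as inp, \
     pysam.AlignmentFile("downsampled.bam", "wb", template=inp) as out:
    for read in inp:
        if keep(read.query_name):
            out.write(read)
```

Note that this thins the whole file by a fixed fraction rather than capping each position, so it only flattens coverage spikes proportionally.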
Perhaps a lot of your reads are PCR artifacts; maybe you can consider removing them using samtools (the rmdup feature). Since your reads are paired-end, it will consider pairs A-B and C-D to be duplicates if A and C map to the same coordinate and B and D also map to the same coordinate. As an added bonus, it retains the pair with the highest quality (a toy illustration of this criterion is sketched below).
Even if you choose not to use this for your main downsampling, it might be a good first step to reduce the size of your BAM in a rational way.
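Here is a toy two-pass pysam version of that duplicate criterion, just to make it explicit. This is not samtools' actual rmdup code: it uses the first mate's mapping quality as a stand-in for "quality", ignores unpaired and unmapped reads, and the file names are placeholders:

```python
# Pass 1: for each (own position, mate position) key, remember the name of
# the pair whose first mate has the highest mapping quality.
import pysam

best = {}  # (ref, start, mate ref, mate start) -> (quality, read name)

with pysam.AlignmentFile("in.bam", "rb") as inp:
    for read in inp:
        if read.is_unmapped or not read.is_paired or not read.is_read1:
            continue
        key = (read.reference_id, read.reference_start,
               read.next_reference_id, read.next_reference_start)
        if key not in best or read.mapping_quality > best[key][0]:
            best[key] = (read.mapping_quality, read.query_name)

keep_names = {name for _, name in best.values()}

# Pass 2: write only the surviving pairs (both mates share the read name).
with pysam.AlignmentFile("in.bam", "rb") as inp, \
     pysam.AlignmentFile("dedup.bam", "wb", template=inp) as out:
    for read in inp:
        if read.query_name in keep_names:
            out.write(read)
```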
I have done this for dealing with similar coverages. The pitfall is that when downsampling you don't want to lose the signal from either allele (this is diploid?) or from any sample. 1000/12 ≈ 83 reads per sample on average; the naive binomial probability of missing an allele is about 1e-25 (0.5^83), so nothing to worry about there. However, I would split the file by multiplex, downsample each part, and rejoin, to ensure equal reads from each sample (see the sketch below).
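One way to get that effect without literally splitting the file, assuming each sample carries its own RG read-group tag, is to apply the per-position cap per sample. A rough pysam sketch, where the per-sample cap of 83 comes from the 1000/12 figure above and everything else is illustrative:

```python
# Sketch of "split by sample, downsample, rejoin" done in one pass, keyed on
# the RG tag. Assumes every read carries an RG tag; mate pairs are kept.
import pysam
from collections import defaultdict

PER_SAMPLE_CAP = 83          # roughly 1000 total reads / 12 samples

kept_names = set()           # keep the mate of any already-kept read
counts = defaultdict(int)    # kept reads per (sample, reference, start)

with pysam.AlignmentFile("in.sorted.bam", "rb") as inp, \
     pysam.AlignmentFile("balanced.bam", "wb", template=inp) as out:
    for read in inp:
        if read.is_unmapped:
            continue
        key = (read.get_tag("RG"), read.reference_id, read.reference_start)
        if read.query_name in kept_names or counts[key] < PER_SAMPLE_CAP:
            counts[key] += 1
            kept_names.add(read.query_name)
            out.write(read)
```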
A problem you might have with this approach is that you might see high levels of noise/allelic imbalance due to polymerase errors. I would therefore prefer to use all the data wherever I can. I have been investigating chemware solutions to these problems; you might want to get in touch to discuss. Too late for these data, though.
Yes, using all the data is a sensible thing to do, although it does not protect against polymerase biases, because programs usually go for the majority feature they see anyway; that is maybe why downsampling is on by default in GATK (to 250 reads). Hence, doing the downsampling yourself might be a better way of handling things, especially since you can downsample each sample separately and then combine, as you suggested. But my question is still: what program should I use for the downsampling? :)
Could you go into more detail on what chemware's solution offers?
Three answers, and none chosen as correct or even upvoted! I'm only answering for the karma ;)
When you say "insane", can you quantify the range please?