Hello all, Am using samtools to remove duplicates. Here is my workflow for removing duplicates from a *.bam file. 1. Sort bam file by readname. 2. Run the fixmate command on the sorted file. This apparantly fills in mate coordinates, ISIZE and mate-related flags into the output .bam file. 3. Sort the output file from step 2, this time according to chromosomal cooridinates. 4. Run rmdup on this sorted file, that now has the correct ISIZE and mate information.
The problem is I seem to be removing the majority of the reads from my original sam file. As an example if I start with 791406 reads, I end up with 208724 reads at the end of step 4. Some spot checking indicates that I do indeed remove reads that are PCR duplicates (identical read sequence, aligned position, insert size and mate position). Am concerned that I may be removing some perfectly fine reads as well. Feedback from users of the "rmdup" option is welcome. Thanks.
you can probably double check your results by running Picard's MarkDuplicates
Are your reads from whole genome sequencing, exome, or a targeted panel? Do you expect to have high coverage in a small region?
@Razzmataz- this is a targeted sequencing panel, we do expect to see high coverage in a relatively small region.
Is it amplicon based? I have seen recommendations not to deduplicate your data when doing amplicon-based sequencing as you expect there to be lots of reads with the same ends by design.
@Dan. Yes, it is indeed amplicon based. We design primers to amplify specific regions in our genome of interest. These are then fragmented and used in the library prep process. Do you have any references for "recommendations not to dedup data when doing amplicon-based sequencing"? Many thanks.
Mostly just in various posts around the web. Here are a few:
http://gatkforums.broadinstitute.org/discussion/2934/call-variants http://weallseqtoseq.blogspot.ca/2013/10/gatk-best-practices-workshop-data-pre.html
Part of it depends on how big your amplicons are as well. But if they are basically the same size as your inserts the ends will be basically identical for all your sequencing reads coming from an amplicon.