Question

Remove Duplicates Workflow

1

Entering edit mode

11.6 years ago

Anjan ▴ 840

Hello all, Am using samtools to remove duplicates. Here is my workflow for removing duplicates from a *.bam file. 1. Sort bam file by readname. 2. Run the fixmate command on the sorted file. This apparantly fills in mate coordinates, ISIZE and mate-related flags into the output .bam file. 3. Sort the output file from step 2, this time according to chromosomal cooridinates. 4. Run rmdup on this sorted file, that now has the correct ISIZE and mate information.

The problem is I seem to be removing the majority of the reads from my original sam file. As an example if I start with 791406 reads, I end up with 208724 reads at the end of step 4. Some spot checking indicates that I do indeed remove reads that are PCR duplicates (identical read sequence, aligned position, insert size and mate position). Am concerned that I may be removing some perfectly fine reads as well. Feedback from users of the "rmdup" option is welcome. Thanks.

ngs samtools • 8.3k views

ADD COMMENT • link updated 11.6 years ago by Alastair Kerr 5.3k • written 11.6 years ago by Anjan ▴ 840

2

Entering edit mode

you can probably double check your results by running Picard's MarkDuplicates

ADD REPLY • link 11.6 years ago by Pavel Senin ★ 1.9k

1

Entering edit mode

Are your reads from whole genome sequencing, exome, or a targeted panel? Do you expect to have high coverage in a small region?

ADD REPLY • link 11.6 years ago by Razzmataz ▴ 30

0

Entering edit mode

@Razzmataz- this is a targeted sequencing panel, we do expect to see high coverage in a relatively small region.

ADD REPLY • link 11.6 years ago by Anjan ▴ 840

1

Entering edit mode

Is it amplicon based? I have seen recommendations not to deduplicate your data when doing amplicon-based sequencing as you expect there to be lots of reads with the same ends by design.

ADD REPLY • link 11.6 years ago by DG 7.3k

0

Entering edit mode

@Dan. Yes, it is indeed amplicon based. We design primers to amplify specific regions in our genome of interest. These are then fragmented and used in the library prep process. Do you have any references for "recommendations not to dedup data when doing amplicon-based sequencing"? Many thanks.

ADD REPLY • link 11.6 years ago by Anjan ▴ 840

0

Entering edit mode

Mostly just in various posts around the web. Here are a few:

http://gatkforums.broadinstitute.org/discussion/2934/call-variants http://weallseqtoseq.blogspot.ca/2013/10/gatk-best-practices-workshop-data-pre.html

Part of it depends on how big your amplicons are as well. But if they are basically the same size as your inserts the ends will be basically identical for all your sequencing reads coming from an amplicon.

ADD REPLY • link 11.6 years ago by DG 7.3k

score 1 · Answer 1 · 2014-01-21

It could be a problem with saturation of reads globally or in specific regions. Check that the regions deduped are not just repeats such as rDNA. If so, I would exclude these regions from your analysis (assuming you are not interested in them). Also normalising between samples becomes much easier then.

If you are interested in such repeat regions, or if the possible saturation is experiment-wide, I would do analysis in parallel, with and without dedup. If both data supports your hypothesis then great. If not, or you are wish to look at SNP data, I would check out these threads from SeqAnswers that address some of the many issues