Question

Samtools + Picard Markduplicates

23

14.4 years ago

Nery ▴ 180

Hello,

I'm using Samtools to call variants and I am using Picard MarkDuplicates to mark duplicates in my bam file. My question is: Is it enough to just mark the duplicates using Picard and then use the marked bam file to call variants using Samtools or do the duplicates actually need to be removed from the bam file using the REMOVE_DUPLICATES=true argument in Picard? In other words, does Samtools pileup recognize the marked duplicates? Also, for calling the consensus sequence using Samtools, do I need to worry about duplicates or can I just use the original bam file without duplicates marked?

Thanks in advance for your help. (This group is great and I've learned a lot from it).

samtools picard markduplicates variant consensus • 36k views

updated 14.4 years ago by lh3 33k • written 14.4 years ago by Nery ▴ 180

score 20 · Answer 1 · 2010-11-30

It is safe to use MarkDuplicates+mpileup/pileup. By default, duplicates are ignored.
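To make the difference between marking and removing concrete: MarkDuplicates sets the duplicate bit (0x400, decimal 1024) in the SAM FLAG field, so a downstream tool can skip those reads without them being deleted from the file. Below is a minimal Python sketch of that idea, assuming plain-text SAM records. The helper names and the toy records are hypothetical, and this is not samtools' actual code.

```python
DUPLICATE_FLAG = 0x400  # SAM spec: read is a PCR or optical duplicate


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def non_duplicate_records(sam_lines):
    """Yield alignment lines whose duplicate bit is not set.

    Header lines (starting with '@') are passed through untouched.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not is_duplicate(flag):
            yield line


# Toy records (hypothetical): only the first few mandatory columns matter here.
records = [
    "@HD\tVN:1.0\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t250\t200\t*\t*",
    "read2\t1123\tchr1\t100\t60\t50M\t=\t250\t200\t*\t*",  # 99 + 1024
]
kept = list(non_duplicate_records(records))
```

Here `read2` stays in the input but is skipped by the filter, which is the behaviour you rely on when you mark rather than remove.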

MarkDuplicates is "more correct" in the strict sense. Rmdup is more efficient simply because it does not handle those tough cases. Rmdup works for single-end reads, too, but it cannot process paired-end and single-end reads at the same time. It does not work properly for mate-pair reads if the read lengths differ.
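The idea both tools share can be sketched roughly: read pairs whose two ends map to the same positions and orientations are treated as copies of one fragment, and all but the best one are flagged. The following is a toy illustration only. The tuple layout, the `mark_duplicates` name, and the use of summed base quality to pick the survivor are assumptions made for the example, and neither Picard nor samtools is implemented this simply. Keying on both mates' chromosomes is what lets such a scheme catch pairs whose ends land on different chromosomes.

```python
from collections import defaultdict


def mark_duplicates(pairs):
    """Return the set of pair names to flag as duplicates.

    Each pair is a tuple:
    (name, chrom1, pos1, strand1, chrom2, pos2, strand2, quality_sum)
    """
    groups = defaultdict(list)
    for name, c1, p1, s1, c2, p2, s2, qual in pairs:
        # Pairs with identical end coordinates and orientations share a key.
        groups[(c1, p1, s1, c2, p2, s2)].append((qual, name))

    duplicates = set()
    for members in groups.values():
        # Keep the highest-quality pair, flag the rest.
        members.sort(reverse=True)
        duplicates.update(name for _, name in members[1:])
    return duplicates


# Toy input (hypothetical coordinates and quality sums).
pairs = [
    ("pairA", "chr1", 100, "+", "chr1", 300, "-", 1800),
    ("pairB", "chr1", 100, "+", "chr1", 300, "-", 1500),  # same ends as pairA
    ("pairC", "chr1", 100, "+", "chr2", 500, "-", 1700),  # mate on another chromosome
    ("pairD", "chr1", 100, "+", "chr2", 500, "-", 1600),  # same ends as pairC
]
flagged = mark_duplicates(pairs)
```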

As for SNP calling, possibly the best strategy is to do realignment with GATK, cap base quality with samtools, call SNPs with GATK, and then recalibrate variant quality with GATK (i.e., you still need samtools for one step). In addition, for now, GATK is (arguably) less accurate than samtools for multi-sample low-coverage SNP calling, but the next version of UnifiedGenotyper is coming out, which completely closes the gap because GATK and samtools will be using the same SNP calling model. As for variant quality recalibration, I am a little cautious of its use of ts/tv. This is just a personal opinion, though, and perhaps biased.
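For readers unfamiliar with the ts/tv figure mentioned here: it is the ratio of transitions (A<->G, C<->T) to transversions (all other single-base changes) in a call set. A minimal sketch of how it can be computed follows; the function name and the toy input are hypothetical.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Compute the transition/transversion ratio for (ref, alt) base pairs."""
    ts = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = sum(1 for ref, alt in snps if ref != alt and (ref, alt) not in TRANSITIONS)
    if tv == 0:
        raise ValueError("no transversions in the call set; ratio is undefined")
    return ts / tv


# Toy call set: 3 transitions and 2 transversions.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
ratio = ts_tv_ratio(snps)
```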

score 9 · Answer 2 · 2010-11-29

From my experience, the removal of duplicates does help with the consensus calling by removing the "artificial" coverage brought on by PCR dups.

Of course, if you have a sample with very deep coverage (>200x), then the duplicates might not actually be PCR duplicates... but that's a whole other set of problems.
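A back-of-the-envelope calculation shows why depth matters here. The sketch below assumes single-end reads, uniform coverage, Poisson-distributed read starts, and no PCR duplication at all; the function name is hypothetical and real libraries will deviate from this model.

```python
import math


def chance_duplicate_fraction(coverage, read_length):
    """Expected fraction of reads that share a start and strand by chance.

    Assumes single-end reads whose starts are Poisson distributed and uniform
    along the genome, with no PCR duplication at all.
    """
    # Mean number of reads starting at one position on one strand.
    lam = coverage / (2.0 * read_length)
    # Expected occupied (position, strand) slots per slot is 1 - exp(-lam);
    # every read beyond the first in a slot would be called a duplicate.
    return 1.0 - (1.0 - math.exp(-lam)) / lam


for depth in (30, 200):
    print(depth, round(chance_duplicate_fraction(depth, 100), 3))
```

Under these assumptions, with 100 bp reads, roughly 7% of reads look like duplicates by chance at 30x and roughly 37% at 200x. Paired-end data lowers the figure considerably, because both ends have to coincide.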

As for rmdup vs MarkDuplicates: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_What_is_the_difference_between_MarkDuplicates_and_samtools_rmdup.3F

"samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates."

I'm not sure rmdup works on single (non-paired) reads. At least, that's what the manual stated a while ago.

Ram · Answer 3 · 2010-11-29

2

14.4 years ago

Allpowerde ★ 1.3k

Given that the new samtools function rmdup (version 0.1.11) removes the duplicates, it's reasonable to assume that removing the duplicates with Picard is a safer option than just labeling them when doing variant detection with samtools. But why are you using samtools rather than GATK?

2

Why should we use GATK rather than samtools for this process? Is there a bug in samtools?

14.4 years ago by Pierre Lindenbaum 166k

0

Hi Pierre, I have not used samtools for SNP calling myself, but as far as I know, with samtools the user has to apply hard thresholds (coverage > X, strand bias < Y) to separate true SNPs from sequencing errors, while GATK offers a Gaussian mixture model trained on external data (dbSNP, arrays, ...), which might be better. Do you have more information about samtools and SNP calling?
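The hard-threshold approach described here amounts to something like the following sketch. The field names (`depth`, `strand_bias`) and the cutoff values are hypothetical placeholders, not samtools defaults.

```python
def passes_hard_filters(call, min_depth=10, max_strand_bias=0.9):
    """Keep a candidate SNP only if it clears every fixed cutoff."""
    return call["depth"] > min_depth and call["strand_bias"] < max_strand_bias


# Toy candidate calls (hypothetical values).
calls = [
    {"pos": 1001, "depth": 35, "strand_bias": 0.10},
    {"pos": 2002, "depth": 4, "strand_bias": 0.05},   # too shallow
    {"pos": 3003, "depth": 50, "strand_bias": 0.99},  # reads almost all on one strand
]
kept = [c["pos"] for c in calls if passes_hard_filters(c)]
```

A model-based recalibration replaces such fixed cutoffs with a score learned from sites that are already known.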

14.4 years ago by Allpowerde ★ 1.3k

0

Hi all, I need some information regarding the removal of PCR duplicates. I want to build a pipeline for exome sequencing data analysis, so I am doing a test run that combines several pipelines described on different blogs and trying to optimize it. For now I am working with paired-end reads from a single sample taken from the 1000 Genomes Project to see how it goes. So far I have converted the SAM file to BAM format and sorted the BAM file. I am a bit confused about what to do next. As far as I know, we can index the sorted BAM file directly, then use the GATK tools to identify the target regions for realignment, and then call indels and SNPs. But some pipelines first mark the PCR duplicates and then remove them. Should I run MarkDuplicates, remove the duplicates, and then go on to the GATK realigner and the other steps, or should I proceed directly with the GATK steps? Also, I do not have any normal samples in my test case, since it is a sample from the 1000 Genomes Project, so it would be nice if you could tell me how to do the variant calling with GATK. Should my VCF file be compared with the dbSNP VCF and the other public VCFs that are available? Please reply; it would be a great help.

updated 5.6 years ago by Ram 45k • written 11.5 years ago by ivivek_ngs ★ 5.2k