Samtools + Picard Markduplicates
3
23
Entering edit mode
14.0 years ago
Nery ▴ 180

Hello,

I'm using Samtools to call variants and I am using Picard MarkDuplicates to mark duplicates in my bam file. My question is: Is it enough to just mark the duplicates using Picard and then use the marked bam file to call variants using Samtools or do the duplicates actually need to be removed from the bam file using the REMOVE_DUPLICATES=true argument in Picard? In other words, does Samtools pileup recognize the marked duplicates? Also, for calling the consensus sequence using Samtools, do I need to worry about duplicates or can I just use the original bam file without duplicates marked?

Thanks in advance for your help. (This group is great and I've learned a lot from it).

samtools picard markduplicates variant consensus • 35k views
ADD COMMENT
20
Entering edit mode
14.0 years ago
lh3 33k

It is safe to use MarkDuplicates+mpileup/pileup. By default, duplicates are ignored.

MarkDuplicates is "more correct" in the strict sense. Rmdup is more efficient simply because it does handle those tough cases. Rmdup works for single-end, too, but it cannot do paired-end and single-end at the same time. It does not work properly for mate-pair reads if read lengths are different.

As to SNP calling, the possibly best strategy is to do realignment with GATK, cap base quality by samtools, call SNPs with GATK and then recalibrate variant quality with GATK afterwards (i.e., you still need samtools for one step). In addition, for now, GATK is (arguably) less accurate than samtools for multi-sample low-coverage SNP calling, but the next version of UnifiedGenotyper is coming out which completely closes the gap because GATK and samtools will be using the same SNP calling model. As to variant quality recalibration, I am little cautious of its use of ts/tv. This is just a personal opinion, though, and perhaps biased.

ADD COMMENT
9
Entering edit mode
14.0 years ago

From my experience, the removal of duplicates does help with the consensus calling by removing the "artificial" coverage brought on by PCR dups.

Of course if you have a sample that has very deep coverage (>200x), then the duplicates might actually not be PCR duplicates...But that's a whole other set of problems.

As for rmdup vs MarkDuplciates: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_What_is_the_difference_between_MarkDuplicates_and_samtools_rmdup.3F

"samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates."

I'm not sure rmdup works on single reads (non-paired). That's what it stated awhile ago in the manual.

ADD COMMENT
0
Entering edit mode

The 1000 genomes procedure seems to be duplicate removal by default and they have low coverage. It would be good to really see a specific comparison...

ADD REPLY
2
Entering edit mode
14.0 years ago
Allpowerde ★ 1.3k

Given that the new samtool function rmdup (Version: 0.1.11) removes the duplicates, it's reasonable to assume that removing the duplicates with Picard is a saver option than just labeling them when doing variant detection with samtools. But why are you using samtools rather than GATK ?

ADD COMMENT
2
Entering edit mode

why should we use GATK rather than samtools for this process ? is there a bug in samtools ?

ADD REPLY
0
Entering edit mode

HI Pierre, I have not used samtools for SNP calling myself but as far as I know the user has to employ hard thresholds to classify true SNPs from sequencing error (coverage>X, strandbias<Y) with samtools, while GATK offers a gaussian mixture model, trained on external data (dbSNP, Array,...), which might be better. Do you have more infos about samtools and SNP calling?

ADD REPLY
0
Entering edit mode

Hi all I need some information regarding the removal of PCR duplicates. I want to make a pipeline for exome sequencing data analysis. So I am doing a test run combining several pipelines that are available in the different blogs and trying to optimize it. As of now am working with paired end reads and I have selected a sample from 1000 genome project and trying to see how it goes. I have just used a single sample. And till now I have proceeded to the step quickly sort the bam files after I convert the sam to bam format for the sample. I am a bit confused what to do next? As far as I know we can directly index the sorted bam file and then work with the GATK packages for identifying the target regions for realignment and then do the various calls for INDEL and SNPs. But in some pipelines I see they do remove the PCR duplicates by marking it first. Should I use the MarkPCR duplicates and then remove the duplicates and then go for the GATK realigner and other stuffs or shall I directly proceed with the GATK steps? Also since I donot have any normal samples in my test case. Its a sample from the 1000 genomes project so it should be nice to tell me how to do the variant calling with GATK , should my vcf file be compared with the dbSNP vcf and other public vcf that are present ? Please reply it would be of great help.

ADD REPLY

Login before adding your answer.

Traffic: 1986 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6