Question About The Concordance Between GATK And Samtools

2

11.8 years ago

sxzhuxu ▴ 20

Hi, I'm using GATK and samtools to call SNPs.

For GATK: java -Xmx6g -jar GenomeAnalysisTK.jar -R example.fa -T UnifiedGenotyper -I sampleSort.bam -o GATK.vcf -mbq 30 -glm BOTH

For samtools:
samtools mpileup -Q 13 -ugf example.fa sampleSort.bam | bcftools view -bvcg > sample.raw.bcf
bcftools view sample.raw.bcf | vcfutils.pl varFilter -d 10 -w 5 -D 100 > samtools.vcf

But I find that few SNPs are shared between GATK.vcf and samtools.vcf: about 50% or less. I don't know why. I have tried many default values and other parameter settings, but the agreement between GATK and samtools still falls short of my expectations.

How can I improve the concordance between GATK and samtools? Could you share your commands with me, or tell me what I should pay attention to?

Thanks for your answer.

gatk samtools snp • 4.7k views

ADD COMMENT • link updated 10.3 years ago by Biostar 20 • written 11.8 years ago by sxzhuxu ▴ 20

score 6 · Answer 1 · 2013-03-01

11.8 years ago

Brad Chapman 9.7k

There are several differences between your use of the two callers:

You use different base quality thresholds for GATK (30) and samtools (13). The value 13 is the samtools default, but 30 is much more stringent than the default of 17 that GATK uses.
You are filtering the samtools calls but not GATK calls. See GATK's best practice guidelines for the filtering approaches they recommend with UnifiedGenotyper calling.
Your samtools filtering only keeps calls with read depth support of 10 to 100, while you're keeping any callable variant independent of depth with GATK.

Standardizing the quality cutoffs you use (or not adjusting from default for GATK) and filtering equivalently by depth should help increase the overlap between the callers. Hope this helps.
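One way to filter both call sets equivalently by depth is to run the same depth window over each VCF. The following is only a rough sketch, not a replacement for varFilter: it assumes both files carry a `DP=<n>` tag in the INFO column, the function name `filter_vcf_by_depth` is made up for illustration, and the two callers do not define DP identically, so the results are comparable rather than exact.

```shell
# filter_vcf_by_depth MIN MAX FILE   (hypothetical helper, not part of samtools or GATK)
# Prints header lines unchanged, plus records whose INFO column (field 8)
# contains DP=<n> with MIN <= n <= MAX.
# Assumption of this sketch: records without a DP tag are dropped.
filter_vcf_by_depth() {
  awk -F'\t' -v min="$1" -v max="$2" '
    /^#/ { print; next }
    {
      n = split($8, tags, ";")
      for (i = 1; i <= n; i++) {
        if (tags[i] ~ /^DP=[0-9]+$/) {
          dp = substr(tags[i], 4) + 0
          if (dp >= min && dp <= max) print
          break
        }
      }
    }
  ' "$3"
}
```

With the file names from the question, `filter_vcf_by_depth 10 100 GATK.vcf > GATK.depth.vcf` would apply the same 10 to 100 window that the varFilter command uses for the samtools calls.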

0

Thank you for your reply. I have already tried using the same minimum base quality for both callers and filtering both results with samtools's varFilter using the same parameters. The results seem better with a low minimum base quality (13 for both) but worse with a high one (30 for both). I think there are false positives in the results. I have tried standardizing the cutoffs, but the results still fall short of my expectations.

ADD REPLY • link 11.8 years ago by sxzhuxu ▴ 20
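To put numbers on "better" and "worse" across cutoffs, the overlap between two call sets can be counted directly. The sketch below is one possible approach using standard Unix tools; the function names `vcf_site_keys` and `vcf_overlap` are invented for illustration. It assumes a site counts as shared only when CHROM, POS, REF and ALT all match exactly, so genotype differences are ignored and variants written in different representations are counted as disagreements.

```shell
# vcf_site_keys FILE   (hypothetical helper)
# Prints one sorted, unique CHROM:POS:REF:ALT key per non-header record.
vcf_site_keys() {
  awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | LC_ALL=C sort -u
}

# vcf_overlap A B   (hypothetical helper)
# Prints three counts on one line: shared, only in A, only in B.
# The body runs in a subshell so its variables do not leak into the caller.
vcf_overlap() (
  a=$(mktemp) || exit 1
  b=$(mktemp) || exit 1
  vcf_site_keys "$1" > "$a"
  vcf_site_keys "$2" > "$b"
  shared=$(LC_ALL=C comm -12 "$a" "$b" | wc -l)
  only_a=$(LC_ALL=C comm -23 "$a" "$b" | wc -l)
  only_b=$(LC_ALL=C comm -13 "$a" "$b" | wc -l)
  rm -f "$a" "$b"
  # Arithmetic expansion strips any padding that wc adds on some systems.
  echo "$((shared)) $((only_a)) $((only_b))"
)
```

Running `vcf_overlap GATK.vcf samtools.vcf` after each parameter change gives three counts; shared divided by the sum of all three is one simple concordance figure to track between runs.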

score 3 · Answer 2 · 2013-03-02

11.8 years ago

Istvan Albert 102k

A good description of the challenges and expectations can be found on Brad Chapman's blog, Blue Collar Bioinformatics, which has a series of relevant posts such as "An automated ensemble method for combining and evaluating genomic variants from multiple callers".