Question About The Concordance Between GATK And Samtools
11.7 years ago
sxzhuxu ▴ 20

Hi, I'm using GATK and samtools to call SNPs.

For GATK: java -Xmx6g -jar GenomeAnalysisTK.jar -R example.fa -T UnifiedGenotyper -I sampleSort.bam -o GATK.vcf -mbq 30 -glm BOTH

For samtools:
samtools mpileup -Q 13 -ugf example.fa sampleSort.bam | bcftools view -bvcg > sample.raw.bcf
bcftools view sample.raw.bcf | vcfutils.pl varFilter -d 10 -w 5 -D 100 > samtools.vcf

But I find that GATK.vcf and samtools.vcf have few SNPs in common: about 50% or less. I don't know why. I have tried many combinations of default values and other parameters, but the concordance between GATK and samtools is still lower than I expected.
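For reference, one rough way to put a number on the overlap is to compare the variant sites in the two files. The sketch below is only an illustration: the function names `vcf_sites` and `vcf_overlap` are made up for this example, and it assumes uncompressed VCFs and matches records on CHROM and POS only, ignoring alleles and genotypes.

```bash
# Hypothetical helpers for counting shared variant sites between two VCFs.
# Assumption: plain-text VCF input; sites are matched on CHROM and POS only.

# List variant sites as "CHROM:POS", one per line, sorted and de-duplicated.
vcf_sites() {
    grep -v '^#' "$1" | awk -F'\t' '{ print $1 ":" $2 }' | LC_ALL=C sort -u
}

# Print how many sites are shared and how many are unique to each file.
vcf_overlap() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    vcf_sites "$1" > "$a"
    vcf_sites "$2" > "$b"
    # comm needs sorted input; -12 keeps lines present in both files,
    # -23 lines only in the first, -13 lines only in the second.
    echo "shared=$(LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' ')"
    echo "only_first=$(LC_ALL=C comm -23 "$a" "$b" | wc -l | tr -d ' ')"
    echo "only_second=$(LC_ALL=C comm -13 "$a" "$b" | wc -l | tr -d ' ')"
    rm -f "$a" "$b"
}
```

It could be called as `vcf_overlap GATK.vcf samtools.vcf`. Matching on position alone will count a site as shared even when the two callers report different alternate alleles, so the result is an upper bound on true concordance.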

How can I improve the concordance between GATK and samtools? Could you share the commands you use, or tell me what I should pay attention to?

Thanks for your answer.

gatk samtools snp • 4.7k views
11.7 years ago

There are several differences between your use of the two callers:

  • You use different base quality thresholds for GATK (30) and samtools (13). The value of 13 is the samtools default, but 30 is much more stringent than the GATK default of 17.
  • You are filtering the samtools calls but not GATK calls. See GATK's best practice guidelines for the filtering approaches they recommend with UnifiedGenotyper calling.
  • Your samtools filtering only keeps calls with read depth support of 10 to 100, while with GATK you keep every callable variant regardless of depth.

Standardizing the quality cutoffs you use (or not adjusting from default for GATK) and filtering equivalently by depth should help increase the overlap between the callers. Hope this helps.
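As a rough illustration of filtering both call sets equivalently by depth, the sketch below applies the same depth window to any VCF. The function name `filter_by_depth` is hypothetical, and it assumes each record carries a `DP=` tag in the INFO column; it is not a reimplementation of `vcfutils.pl varFilter`, only a simple depth window.

```bash
# Hypothetical helper: keep header lines and records whose INFO DP value
# lies within [min, max]. Assumption: depth is stored as DP=<int> in INFO.
# Records without a DP tag are dropped.
filter_by_depth() {
    local vcf=$1 min=$2 max=$3
    awk -F'\t' -v min="$min" -v max="$max" '
        /^#/ { print; next }                  # pass header lines through
        {
            n = split($8, info, ";")          # INFO is column 8
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^DP=/) {
                    dp = substr(info[i], 4) + 0
                    if (dp >= min && dp <= max) print
                    break
                }
            }
        }
    ' "$vcf"
}
```

For example, `filter_by_depth GATK.vcf 10 100 > GATK.depth.vcf` and the same call on the unfiltered samtools output would put both call sets through an identical depth window before they are compared.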


Thank you for your reply. I have already tried using the same minimum base quality for both callers and filtering both result sets with samtools' varFilter using the same parameters. The concordance is better with a low minimum base quality (13 for both) but worse with a high one (30 for both). I think there are false positives in the results. Even with standardized cutoffs, the results still fall short of my expectations.

11.7 years ago

A good description of the challenges and expectations can be found on Brad Chapman's blog, Blue Collar Bioinformatics, which has a series of relevant posts such as "An automated ensemble method for combining and evaluating genomic variants from multiple callers".


That's a good article. I will read it carefully. Thank you, Albert.
