I am interested in finding the difference in algorithms used by the tools gatk and samtools.Although both of them use a bayesian model for genotype and SNP calling, the results are slightly different, as they produce different number of variants. Does anyone know, where this difference comes from?
I have got the following email. I will put the answer here.
I have been using both GATK and Samtools for variant calling in
individual samples. Both these tools uses a bayesian approach to call
the variants but still produces slightly different number of variants.
Is the difference between the two algorithms arise because of prior
probabilities or likelihood calculation they take into account to
calculate the posterior probability or any other cutoff values used in
the algorithm?
I have read the following papers to figure out the difference between
the algorithms. Unfortunately i am only able to find out the common
methods carried by both the algorithms but not the difference leading
to different number of SNP calls between the two.
Framework for variation discovery and genotyping from
next-generation DNA sequencing (GATK).
Mapping short DNA sequencing reads and calling variants using
mapping quality scores (Samtools)
The Genome Analysis Toolkit: a MapReduce framework for analyzing
next-generation DNA sequencing data
Preprocess of alignment. GATK drops reads with low mapping quality (if I am right), but samtools uses all reads by default. Samtools caps base quality by alignment quality. GATK did not do this at least initially. Mark and I used to talk about this. I do not know if GATK does the same as samtools nowadays. GATK also collapses overlapping ends in a read pair. This is the right thing to do. However, my impression is this will not greatly improve the accuracy from their analysis, so I have not implemented this bit in samtools. Furthermore, samtools uses BAQ by default. GATK has the BAQ feature, but probably not using it by default.
SNP genotype likelihood model. GATK assumes sequencing errors are independent while samtools believes the second error comes at a higher chance. This difference has little effect unless the depth is beyond hundreds of folds.
Indel genotype likelihood model. This is one of the major differences between samtools and GATK. Samtools' model is derived from BAQ. It employs quite a few ad hoc heuristic to identify alleles and so on. GATK's model was derived from Dindel's model initially, but the model has been subjected to various changes with time. I don't know how exactly it works now.
Variant calling model. The model behind samtools and GATK is nearly identical. This model was first proposed by Rasmum Nielsen and Richard Durbin independently and then more formally described in my samtools methodology paper. Before the publication, GATK has also implemented it using identical formulae. GATK later augmented the model to work with multi-allelic cases. Samtools does not include that bit.
Filtering. Another major difference. Samtools uses hand-tuned filters, while GATK learns filters from data. Of course GATK's approach is more convenient and powerful at least for human variant calling where you have enough data to train the model and you do not need non-polymorphic sites. In addition, for effective filtering, GATK collects more statistic about the alignment (e.g. haplotype score) than samtools. They are quite useful for the diagnostic purposes.
Other GATK specific features: haplotype caller, indel realignment among many others. Haplotype caller is the most important feature in GATKv2. It has been shown to have much higher sensitivity to longer indels.
Other samtools specific features: genotype-free analysis, physical phasing and so on.
Technical difference. GATK is clearly the more sophisticated and the more complicated. Nonetheless, samtools is equally scalable to huge data sets in spite of its much simpler framework.
GATK and samtools are similar overall, but those who have implemented a large project like samtools or GATK would know that subtle details, such as the thresholds in use, heuristic employed to reduce complexity, tricks for better performance and so on, can be very different, which is likely to lead to observable difference in the final results. In general, samtools focuses more on a multi-purpose tool. It tends to use filters producing less bias. It produces the consensus in a more uniform (and the right) way and has the ability to infers population parameters not doable with GATK. I would say GATK focuses more on variant calling for human data. It is indeed a better (I would not say not much better) variant caller than samtools for 1000g types of data, largely thanks to its advanced filtering and the bulk of training data available to human. On non-human data and other data types, probably they differ little in SNP accuracy.
Although an old post, I wanted to add a few helpful links for anyone visiting this post again and wanting a more mathematical explanation of the samtools/bcftools algorithms.
Thanks for answering. Here, i would like to say that the bam files which i had used for SNP and Indel calling with GATK and Samtools has undegone same pre-processing steps like realignment around indels, base quality recalibration...The only difference is, instead of calling variants with only gatk, i have also sent the bam file to samtools for variant calling. So which of the above reasons would be affecting in this case?
Another question is, i have two case-control pools of 10 samples each and i have used both gatk and samtools. GATK identified most of the variants in both the pools whereas samtools identified few homozygous variants with good depth and quality only in one pool but not in the other. Could you comment on this issue!!
Well, there have been several sources which are trying to describe the Bayesian genotype models used in both GATK and samtools. Could someone give a source of finding the SNP calling models used in both these algorithm?
Although an old post, I wanted to add a few helpful links for anyone visiting this post again and wanting a more mathematical explanation of the samtools/bcftools algorithms.
Multiallelic calling algorithm in bcftools
Mathematical notes on samtools algorithms
I found these links in the publication section of the htslib webpage of all places. I have not yet found a clear mathematical explanation for GATK.
Thanks for answering. Here, i would like to say that the bam files which i had used for SNP and Indel calling with GATK and Samtools has undegone same pre-processing steps like realignment around indels, base quality recalibration...The only difference is, instead of calling variants with only gatk, i have also sent the bam file to samtools for variant calling. So which of the above reasons would be affecting in this case?
Another question is, i have two case-control pools of 10 samples each and i have used both gatk and samtools. GATK identified most of the variants in both the pools whereas samtools identified few homozygous variants with good depth and quality only in one pool but not in the other. Could you comment on this issue!!