I am trying to find SNP/Indel in a dataset using both "GATK" and "bcftools call". I find two big differences for generating VCF files with these tools and I appreciate your help and experience:
1- "GATK" result has higher "DP" value than "bcftools", any clue?
e.g.
GATK.vcf ==> NC_000 747 . A G 4054 . AC=1;AF=1.00;AN=1;DP=113;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.98;MQ0=0;QD=28.10;SOR=0.769 GT:AD:DP:GQ:PL 1:0,108:108:99:4084,0
bcftools.vcf ==> NC_000913.3 747 . A G 228 . DP=76;VDB=0.987864;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,20,56;MQ=59 GT:PL 1/1:255,229,0
2- Why does GATK take much longer time(30-60 mins difference) compared to bcftools? is it normal or someoothing wrong with my usage?
Yes, Haplotypecaller is quite slow. You could try to use a target file (in case you're not doing genome sequencing, depending on project). use -L targets.bed -ip 50 (ip: intervalpadding).
Well I am working on a bacterial genome that is why dbsnp doesn't work for it. And I read that GATK is not good to be used for bacterial genome. But still, it is strange to me that GATK show more Read Depth (DP) compared to the other tool Bcftools.
Yes, I used BWA MEM for both alignment. do you feel something wrong?
I also updated the post to add the whole two lines of VCF files in case it is more informative. One more things that seems strange, for Diploid the GT has two value looks like this (0/1, 0/0). But for Haploid the GT has one value. So my data is Bacteria which is Haploid and true for GATK which is one value while in Bcftools it is two values! that might tells us something too.
HaplotypeCaller from GATK is a haplotype based caller. It takes the aligned *.bam file and then realigns it again while calling the polymorphisms simulataneously at the loci; by default it calls maximum of 6 alternate alleles at any locus and output 2 best ones.
Bcftools isn't a haplotype based caller, it takes the aligned bam file and output polymporhisms without doing any realignment. If you happen to use UnifiedGenotyper from GATK its way faster since its not a haplotype based caller. For the samples I have worked HaplotypeCaller took almost 3 days and UnifiedGenotyper would take on average 30-50 minutes to complete the jobs.
But, polymorphisms data from HaplotypeCaller are considered to be of greater confidence.
Thanks. Yes HaplotypeCaller takes much longer but based on GATK documentation, this one is a newer version and better to be used. And I think it is also used for both Diploid and non-Diploid organism(data).
Which one do you personally use, GATK or Bcftools?
Well, some searching around paid off.
My suggestions has always been GATK - due to its good capabilities to detect greater amount of true positives.
I don't think you will need to remove orphan reads - the reason being that the other mate pair may not have mapped due to large indels and it can actually happen when you align datasets from other population than the one reference was generated from.
Yes, right. But I also saw suggestions around that it is recommended to remain only proper pair reads and remove orphans. I have to also think about this to make a decision :)
GATK report DP values twice: one under INFO which generally is the filered DP values and one for the FORMAT which generally is the unfiltered DP values for that allele.
You have different DP due the different statisticial methods employed by different variant callers.
Also they have different filtering parameters they might have by default. If you ran the variant calling using default parameters, please check what are default filtering parameter between GATK and bcftools to understand your DP values more clearly.
Thanks again. The answer makes sense.
As I also asked above, if one needs to use DP value for a type of research, as we discussed there are differences between DP value of the two tools, so which one to be used the research? e.g. if for a SNP from GATK DP is 125 but from bcftools is 97. Just in case if you have an idea.
I think I know why, so I was wondering in my first comment if the aligners where the same. I am going to assume that they are the same aligner and if you did your workflow like this it would explain the difference.
Clean Reads
|
V
Map Reads
| |
V V
1. Samtools 2. GATK
1. bcftools 2.GATK Haplotype
This is because the GATK pipeline realigns its Indels, so you could have less error in those regions and more reads would be remapped back to them, whereas the first mapping is all you work with in the samtools pipeline.
Thanks. Yes I used the same aligner. In experiment, both results were the same, is it also fine? so it is possible sometimes to have the same DP value and sometimes not?
One more important question, based on your experience, if one needs to use DP value for a type of project, as we discussed there are differences between DP value of the two tools, so which one to be used? just in case if you have an idea.
For more true positive data GATK is my preference, actually I have almost always used GATK for calling variants.
Regarding the DP value you will need to do some research by yourself. You will need to check the global coverage, local coverage and the coverage (DP value) you want to use as a cutoff to pull variants from that particular loci.
Yes, Haplotypecaller is quite slow. You could try to use a target file (in case you're not doing genome sequencing, depending on project). use -L targets.bed -ip 50 (ip: intervalpadding).
Well I am working on a bacterial genome that is why dbsnp doesn't work for it. And I read that GATK is not good to be used for bacterial genome. But still, it is strange to me that GATK show more Read Depth (DP) compared to the other tool Bcftools.
I agree with you it is weird that they have different DPs for the same SNP did you use BWA MEM for both alignments or different aligners?
Yes, I used BWA MEM for both alignment. do you feel something wrong? I also updated the post to add the whole two lines of VCF files in case it is more informative. One more things that seems strange, for Diploid the GT has two value looks like this (0/1, 0/0). But for Haploid the GT has one value. So my data is Bacteria which is Haploid and true for GATK which is one value while in Bcftools it is two values! that might tells us something too.