Question

Why GATK and bcftools SNP calling different?

6

Entering edit mode

8.8 years ago

fernardo ▴ 180

Hi all,

I am trying to find SNP/Indel in a dataset using both "GATK" and "bcftools call". I find two big differences for generating VCF files with these tools and I appreciate your help and experience:

1- "GATK" result has higher "DP" value than "bcftools", any clue? e.g.

GATK.vcf ==> NC_000 747 . A G 4054 . AC=1;AF=1.00;AN=1;DP=113;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.98;MQ0=0;QD=28.10;SOR=0.769 GT:AD:DP:GQ:PL 1:0,108:108:99:4084,0

bcftools.vcf ==> NC_000913.3 747 . A G 228 . DP=76;VDB=0.987864;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,20,56;MQ=59 GT:PL 1/1:255,229,0

2- Why does GATK take much longer time(30-60 mins difference) compared to bcftools? is it normal or someoothing wrong with my usage?

Pipelines I used:

 GATK pipeline ==> java -jar ./GATK3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam -stand_call_conf 30 -stand_emit_conf 10 -o GATK.vcf 

bcftools pipeline ==> samtools mpileup -uf reference.fasta sample.bam | ./bcftools/bcftools call -m -v -o bcftools.vcf

Thanks in advance

SNP next-gen GATK bcftools variant calling • 12k views

ADD COMMENT • link updated 2.6 years ago by Kevin Blighe 88k • written 8.8 years ago by fernardo ▴ 180

0

Entering edit mode

Yes, Haplotypecaller is quite slow. You could try to use a target file (in case you're not doing genome sequencing, depending on project). use -L targets.bed -ip 50 (ip: intervalpadding).

ADD REPLY • link 8.8 years ago by WouterDeCoster 47k

0

Entering edit mode

Well I am working on a bacterial genome that is why dbsnp doesn't work for it. And I read that GATK is not good to be used for bacterial genome. But still, it is strange to me that GATK show more Read Depth (DP) compared to the other tool Bcftools.

ADD REPLY • link 8.8 years ago by fernardo ▴ 180

0

Entering edit mode

I agree with you it is weird that they have different DPs for the same SNP did you use BWA MEM for both alignments or different aligners?

ADD REPLY • link 8.8 years ago by skbrimer ▴ 740

0

Entering edit mode

Yes, I used BWA MEM for both alignment. do you feel something wrong? I also updated the post to add the whole two lines of VCF files in case it is more informative. One more things that seems strange, for Diploid the GT has two value looks like this (0/1, 0/0). But for Haploid the GT has one value. So my data is Bacteria which is Haploid and true for GATK which is one value while in Bcftools it is two values! that might tells us something too.

ADD REPLY • link 8.7 years ago by fernardo ▴ 180

score 6 · Answer 1 · 2016-03-25

6

Entering edit mode

8.8 years ago

kirannbishwa01 ★ 1.6k

HaplotypeCaller from GATK is a haplotype based caller. It takes the aligned *.bam file and then realigns it again while calling the polymorphisms simulataneously at the loci; by default it calls maximum of 6 alternate alleles at any locus and output 2 best ones.

Bcftools isn't a haplotype based caller, it takes the aligned bam file and output polymporhisms without doing any realignment. If you happen to use UnifiedGenotyper from GATK its way faster since its not a haplotype based caller. For the samples I have worked HaplotypeCaller took almost 3 days and UnifiedGenotyper would take on average 30-50 minutes to complete the jobs.

But, polymorphisms data from HaplotypeCaller are considered to be of greater confidence.

Thanks,

ADD COMMENT • link 8.8 years ago by kirannbishwa01 ★ 1.6k

1

Entering edit mode

Thanks. Yes HaplotypeCaller takes much longer but based on GATK documentation, this one is a newer version and better to be used. And I think it is also used for both Diploid and non-Diploid organism(data). Which one do you personally use, GATK or Bcftools?

ADD REPLY • link 8.8 years ago by fernardo ▴ 180

0

Entering edit mode

Using samtools / bcftools, though, provides higher sensitivity to Sanger sequencing.

ADD REPLY • link 2.6 years ago by Kevin Blighe 88k

score 3 · Answer 2 · 2016-04-02

3

Entering edit mode

8.7 years ago

fernardo ▴ 180

Solved!! After researching around this case, shortly, I came up with the solution as follows:

Doing "samtools tview" you can see three types of reads:

1- "." : positive strands.

2- "," : negative strands

3- "underlined reads" : Orphan reads (means the other mate-read not aligned).

GATK ==> sums all three of them including Orphan reads Bcftools ==> sums only 1 and 2. Doesn't count for Orphan reads.

A questions comes up which I will be asking in a different post: - Which one is recommended, removing orphan reads or not?

Note: to remove Orphan reads, samtools and bamtools can be used.

bamtools filter -isProperPair true -in File.bam -out File_no_orphan.bam

ADD COMMENT • link 8.7 years ago by fernardo ▴ 180

0

Entering edit mode

Well, some searching around paid off. My suggestions has always been GATK - due to its good capabilities to detect greater amount of true positives. I don't think you will need to remove orphan reads - the reason being that the other mate pair may not have mapped due to large indels and it can actually happen when you align datasets from other population than the one reference was generated from.

ADD REPLY • link 8.7 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

Yes, right. But I also saw suggestions around that it is recommended to remain only proper pair reads and remove orphans. I have to also think about this to make a decision :)

ADD REPLY • link 8.7 years ago by fernardo ▴ 180

score 2 · Answer 3 · 2016-03-25

2

Entering edit mode

8.8 years ago

kirannbishwa01 ★ 1.6k

GATK report DP values twice: one under INFO which generally is the filered DP values and one for the FORMAT which generally is the unfiltered DP values for that allele.

You have different DP due the different statisticial methods employed by different variant callers. Also they have different filtering parameters they might have by default. If you ran the variant calling using default parameters, please check what are default filtering parameter between GATK and bcftools to understand your DP values more clearly.

ADD COMMENT • link 8.8 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

Thanks again. The answer makes sense. As I also asked above, if one needs to use DP value for a type of research, as we discussed there are differences between DP value of the two tools, so which one to be used the research? e.g. if for a SNP from GATK DP is 125 but from bcftools is 97. Just in case if you have an idea.

ADD REPLY • link 8.8 years ago by fernardo ▴ 180

0

Entering edit mode

Click the "Add reply" button to reply to the comment, instead of making a new answer each time.

ADD REPLY • link 8.7 years ago by andrew.j.skelton73 6.6k

score 1 · Answer 4 · 2016-03-25

1

Entering edit mode

8.8 years ago

skbrimer ▴ 740

I think I know why, so I was wondering in my first comment if the aligners where the same. I am going to assume that they are the same aligner and if you did your workflow like this it would explain the difference.

    Clean Reads
         |
         V
     Map Reads
     |       |
     V       V
1. Samtools   2. GATK
1. bcftools   2.GATK Haplotype

This is because the GATK pipeline realigns its Indels, so you could have less error in those regions and more reads would be remapped back to them, whereas the first mapping is all you work with in the samtools pipeline.

ADD COMMENT • link 8.8 years ago by skbrimer ▴ 740

0

Entering edit mode

Thanks. Yes I used the same aligner. In experiment, both results were the same, is it also fine? so it is possible sometimes to have the same DP value and sometimes not? One more important question, based on your experience, if one needs to use DP value for a type of project, as we discussed there are differences between DP value of the two tools, so which one to be used? just in case if you have an idea.

ADD REPLY • link 8.8 years ago by fernardo ▴ 180

score 1 · Answer 5 · 2016-03-26

For more true positive data GATK is my preference, actually I have almost always used GATK for calling variants.

Regarding the DP value you will need to do some research by yourself. You will need to check the global coverage, local coverage and the coverage (DP value) you want to use as a cutoff to pull variants from that particular loci.

Hope that helps,

score 1 · Answer 6 · 2016-03-27

1

Entering edit mode

8.7 years ago

chen ★ 2.5k

Different tools give much difference for low-frequency mutations.

ADD COMMENT • link 8.7 years ago by chen ★ 2.5k