Hello, Recently, I used samtools to call variants and got the results in VCF format. I need to distinguish heterozygotes and homozygotes for each sample, then I seeked answers in forums and groups, and found several rules: 1) FQ; 2)AF1; 3)PL; 4)GT.
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1.fastq.gz.baf.sorted.bam
chr10 126278 . TAA TCTGAAA 5.8 . INDEL;DP=3;VDB=0.0404;AF1=1;AC1=2;DP4=0,0,1,0;MQ=42;FQ=-37.5 GT:PL:GQ 1/1:42,3,0:3
chr10 126402 . T C 3.98 . DP=2;VDB=0.0200;AF1=1;AC1=2;DP4=0,0,0,2;MQ=32;FQ=-33 GT:PL:GQ 1/1:33,6,0:5
1) FQ: Notably, given one sample, FQ is positive at hets and negative at homs. From: http://samtools.sourceforge.net/mpileup.shtml <FQ=->homs
2) AF1: <AF1=1>homs
3) PL: For example, suppose REF=C and ALT=A,G, PL=7,0,37,13,40,49 means for the sample we are looking at, P(D|CC)=10^{-0.7}, P(D|CA)=1, P(D|AA)=10^{-3.7}, P(D|CG)=10^{-1.3}, P(D|AG)=1e-4 and P(D|GG)=10^{-4.9}. From: http://samtools.sourceforge.net/mpileup.shtml <,0:>homs
4) GT: <1/1>homs
According these rules, I did a rough statistics of homs and hets using the following commands:
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"FQ=-\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"AF1=1\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \",0:\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"1/1\"|wc -l && echo "$1; }'|sh
And I got different numbers of homs and hets for each sample, e.g.
20344 sample1.vcf.gz # "FQ=-"; homs
16954 sample1.vcf.gz # "AF1=1"; homs
16565 sample1.vcf.gz # ",0:"; homs
15543 sample1.vcf.gz # "1/1"; homs
7600 sample1.vcf.gz # not "FQ=-"; hets
10990 sample1.vcf.gz # not "AF1=1"; hets
11379 sample1.vcf.gz # not ",0:"; hets
12401 sample1.vcf.gz # not "1/1"; hets
How could this happen? Which is the most accurate rule for distinguishing heterozygotes and homozygotes, and any more rules to do this job?
Any suggestions and discussions could help me. Thanks!
I have used snpEff and SeattleSeq Annotation tools to do variants' annotation, and found that they seems using the GT(Genotype) rule to distinguish homs and hets.