Is there any way to calculate the per variant heterozygosity (i.e. number of 1/0 or 0/1 genotypes observed at given variant site for set of individuals in VCF file) from VCF file?
I knew per individual heterozygosity can be calculated by --het tag from VCFtools
NB - this does NOT support multi-allelic records. If you have these, split them via bcftools norm -m-any
----------------------
Assuming that you're confident that your VCF is normalised and that you genuinely just want to count occurrences of heterozygous calls per line, then the following will work for either phased (0|1 or 1|0) or unphased (0/1 or 1/0) genotypes, or a mixture of these.
Here, I'm actually accessing a BCF and it's the imputed 1000 Genomes Phase III data. So, the genotypes are phased.
The gsub function in AWK conveniently returns the number of matched patterns.
paste <(bcftools view 1000Genomes.Norm.bcf |\
awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
!/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' 1000Genomes.Norm.bcf |\
awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}')
CHR POS ID REF ALT nHet
1 10177 rs367896724 A AC 1490
1 10235 rs540431307 T TA 6
1 10352 rs555500075 T TA 2025
1 10505 rs548419688 A T 1
1 10506 rs568405545 C G 1
1 10511 rs534229142 G A 1
1 10539 rs537182016 C A 3
1 10542 rs572818783 C T 1
1 10579 rs538322974 C A 1
1 10616 rs376342519 CCGCCGTTGCAAAGGCGCGCCG C 35
1 10642 rs558604819 G A 21
1 11008 rs575272151 C G 403
1 11012 rs544419019 C G 403
1 11063 rs561109771 T G 15
1 13011 rs574746232 T G 3
1 13110 rs540538026 G A 134
1 13116 rs62635286 T G 414
1 13118 rs200579949 A G 414
This is almost perfect. The problem are sites like this:
0/1:1,1:2:26:0|1:27975_T_C:39,0,26
This site gives me two heterozygous counts instead of 1, I am trying to solve it, adding to gsub something like this but is not working
gsub(/^0\|1|^1\|0|^0\/1|^1\/0/,"")
Hey, that will require an extra if statement. The code above is designed for multi-allelic records that are encoded as 1/2, 2/2, 1/0, etc. I do not exactly have time to edit this, so, my recommendation would be to just split your multiallelic records via bcftools norm -m-any and then run the above AWK code.
I am not surprised (seriously) - this is an issue that re-occurs every now and then. GATK's programs can produce VCFs in a format that is difficult to use with other scripts/programs.
Yep, thats True, I this case, because my file is not phased, the solution was to use your script only with the nonphases counts. like this gsub(/0\/1|1\/0/,"").
It worked perfectly fine.. thanks for the help.. @Kevin
You're welcome, Sohail
This is almost perfect. The problem are sites like this:
0/1:1,1:2:26:0|1:27975_T_C:39,0,26
This site gives me two heterozygous counts instead of 1, I am trying to solve it, adding to gsub something like this but is not working gsub(/^0\|1|^1\|0|^0\/1|^1\/0/,"")
Any Idea?
Hey, that will require an extra if statement. The code above is designed for multi-allelic records that are encoded as
1/2
,2/2
,1/0
, etc. I do not exactly have time to edit this, so, my recommendation would be to just split your multiallelic records viabcftools norm -m-any
and then run the above AWK code.How did you produce your file?
I made the SNP calling with GATK.
I am not surprised (seriously) - this is an issue that re-occurs every now and then. GATK's programs can produce VCFs in a format that is difficult to use with other scripts/programs.
Yep, thats True, I this case, because my file is not phased, the solution was to use your script only with the nonphases counts. like this gsub(/0\/1|1\/0/,"").
Thank you for your script It help me a lot.
Hey! Thanks. It works fine. Just a small update to use this command with a VCF file including multiallelic sites (up to four).
Thank you. Can you show some sample output just to help users?