calculate Per variant Heterozygosity from VCF file
1
4
Entering edit mode
6.9 years ago
SOHAIL ▴ 410

Hi everybody,

Is there any way to calculate the per variant heterozygosity (i.e. number of 1/0 or 0/1 genotypes observed at given variant site for set of individuals in VCF file) from VCF file?

I knew per individual heterozygosity can be calculated by --het tag from VCFtools

Thanks!

-sohail

ngs vcf • 11k views
ADD COMMENT
14
Entering edit mode
6.9 years ago

NB - this does NOT support multi-allelic records. If you have these, split them via bcftools norm -m-any

----------------------

Assuming that you're confident that your VCF is normalised and that you genuinely just want to count occurrences of heterozygous calls per line, then the following will work for either phased (0|1 or 1|0) or unphased (0/1 or 1/0) genotypes, or a mixture of these.

Here, I'm actually accessing a BCF and it's the imputed 1000 Genomes Phase III data. So, the genotypes are phased.

The gsub function in AWK conveniently returns the number of matched patterns.

paste <(bcftools view 1000Genomes.Norm.bcf |\
    awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
      !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
    \
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' 1000Genomes.Norm.bcf |\
    awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}')

CHR POS ID  REF ALT nHet
1   10177   rs367896724 A   AC  1490
1   10235   rs540431307 T   TA  6
1   10352   rs555500075 T   TA  2025
1   10505   rs548419688 A   T   1
1   10506   rs568405545 C   G   1
1   10511   rs534229142 G   A   1
1   10539   rs537182016 C   A   3
1   10542   rs572818783 C   T   1
1   10579   rs538322974 C   A   1
1   10616   rs376342519 CCGCCGTTGCAAAGGCGCGCCG  C   35
1   10642   rs558604819 G   A   21
1   11008   rs575272151 C   G   403
1   11012   rs544419019 C   G   403
1   11063   rs561109771 T   G   15
1   13011   rs574746232 T   G   3
1   13110   rs540538026 G   A   134
1   13116   rs62635286  T   G   414
1   13118   rs200579949 A   G   414
ADD COMMENT
1
Entering edit mode

It worked perfectly fine.. thanks for the help.. @Kevin

ADD REPLY
0
Entering edit mode

You're welcome, Sohail

ADD REPLY
0
Entering edit mode

This is almost perfect. The problem are sites like this:

0/1:1,1:2:26:0|1:27975_T_C:39,0,26

This site gives me two heterozygous counts instead of 1, I am trying to solve it, adding to gsub something like this but is not working gsub(/^0\|1|^1\|0|^0\/1|^1\/0/,"")

Any Idea?

ADD REPLY
0
Entering edit mode

Hey, that will require an extra if statement. The code above is designed for multi-allelic records that are encoded as 1/2, 2/2, 1/0, etc. I do not exactly have time to edit this, so, my recommendation would be to just split your multiallelic records via bcftools norm -m-any and then run the above AWK code.

How did you produce your file?

ADD REPLY
0
Entering edit mode

I made the SNP calling with GATK.

ADD REPLY
0
Entering edit mode

I am not surprised (seriously) - this is an issue that re-occurs every now and then. GATK's programs can produce VCFs in a format that is difficult to use with other scripts/programs.

ADD REPLY
1
Entering edit mode

Yep, thats True, I this case, because my file is not phased, the solution was to use your script only with the nonphases counts. like this gsub(/0\/1|1\/0/,"").

Thank you for your script It help me a lot.

ADD REPLY
0
Entering edit mode

Hey! Thanks. It works fine. Just a small update to use this command with a VCF file including multiallelic sites (up to four).

paste <(awk -F"\t" 'BEGIN {print "CHR\tPOS\tREF\tALT"} !/^#/ {print $1"\t"$2"\t"$4"\t"$5}' file.vcf) \
<(awk 'BEGIN {print "nHet"} !/^#/ {print gsub(/0\|1|1\|0|0\/1|1\/0|0\|2|2\|0|0\/2|2\/0|0\|3|3\|0|0\/3|3\/0|0\|4|4\|0|0\/4|4\/0/, "")}'  file.vcf)
ADD REPLY
0
Entering edit mode

Thank you. Can you show some sample output just to help users?

ADD REPLY

Login before adding your answer.

Traffic: 2431 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6