I'm trying to extract allele frequency information for all variants in the latest ExAC data release. I'm currently using vcftools version 0.1.13, and I'm running into errors that I think relate to VCF versions.
When I validate the ExAC data VCF file, I get an error message for every INFO line in the file, along the lines of the following:
>> vcf-validator ExAC.r0.3.1.sites.vep.vcf
INFO field at 1:13404 .. INFO tag [AC_Het=0,1,0] expected different number of values (expected 2, found 3)
FILTER field at 1:13418 .. The filter(s) [AC_Adj0_Filter] not listed in the header.
This suggests, I believe, that the ExAC data VCF file is not in a format vcftools accepts, which is v4.0, v4.1 or v4.2. If this is true, how do I make the ExAC data VCF file vcftools-compatible?
The real error I'm trying to overcome comes from when I try to extract the frequency information:
>> vcftools --gzvcf ExAC.r0.3.1.sites.vep.vcf.gz --freq --chr 1 --out chr1_freq
VCFtools - v0.1.13
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--gzvcf ExAC.r0.3.1.sites.vep.vcf.gz
--chr 1
--freq
--out chr1_freq
Using zlib version: 1.2.8
After filtering, kept 0 out of 0 Individuals
Error: Require Genotypes in VCF file in order to output Frequency Statistics.
Can anyone help me out?
Apologies - this is indeed a bug in ExAC. AC_Het should be declared as Number=. in the header. Unfortunately this is the best way to do it at the moment, since it's a semi-complicated field: in reality, it's something like Number="G minus R", but since nothing like that exists in the spec, we'll have to settle for '.' and one will have to parse it manually.
We will fix the VCF with this and other minor header issues shortly.
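For anyone who needs to parse AC_Het by hand in the meantime, here is a minimal Python sketch. It assumes diploid genotypes, so "G minus R" works out to n(n+1)/2 values for n ALT alleles (3 values for the 2-ALT site at 1:13404 above). The function names are made up for illustration, and the sketch only checks the number of values; it does not assume which heterozygous pair each value belongs to.

```python
def expected_het_count(n_alt):
    """Number of heterozygous diploid genotypes for n_alt ALT alleles ("G minus R")."""
    n_alleles = n_alt + 1                            # R: REF plus every ALT
    n_genotypes = n_alleles * (n_alleles + 1) // 2   # G: all unordered allele pairs
    return n_genotypes - n_alleles                   # drop the R homozygous genotypes


def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style tags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_ac_het(alt, info):
    """Return AC_Het as a list of ints, or None if the tag is absent.

    Raises ValueError when the number of values is not n(n+1)/2.
    """
    raw = parse_info(info).get("AC_Het")
    if raw is None:
        return None
    counts = [int(x) for x in raw.split(",")]
    n_alt = len(alt.split(","))
    if len(counts) != expected_het_count(n_alt):
        raise ValueError(
            "AC_Het has %d values, expected %d for %d ALT alleles"
            % (len(counts), expected_het_count(n_alt), n_alt)
        )
    return counts
```

Here alt is the ALT column of a record and info is its INFO column, both as raw strings.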
I just found the version number of the ExAC file, and it is actually v4.2! So this cannot be a version issue. Can you explain how it's possible for ExAC to have a bug? For example, why did you paste the line above? Sorry if I'm missing something obvious!
A: What is the reason for most software errors in Bioinformatics according to you?
Thanks for writing an issue! Hopefully we hear from them soon!
It may be counting the 3 possible heterozygous states: G/A, G/T, and A/T
@rbagnall yes, but then ID=AC_Het,Number=A shouldn't be used.
UPDATE: the bug is now fixed: https://github.com/konradjk/exac_browser/issues/256#issuecomment-282719065
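On the original --freq error: the log line "kept 0 out of 0 Individuals" shows that the file has no genotype columns, so vcftools has nothing to compute frequencies from, whatever happens with the header. A possible workaround is to read the counts straight from the INFO column. The Python sketch below assumes each record carries AC (one value per ALT allele) and AN tags in INFO; the function name is made up for illustration, and you may prefer other tags (for example adjusted counts) if your file has them.

```python
def site_frequencies(lines, chrom=None):
    """Yield (chrom, pos, ref, alt, af) per ALT allele, with af = AC / AN.

    `lines` is any iterable of VCF text lines; `chrom` optionally restricts
    the output to one chromosome. Records lacking AC or AN are skipped.
    """
    for line in lines:
        if line.startswith("#"):          # skip meta and header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if chrom is not None and cols[0] != chrom:
            continue
        # key=value INFO items only; flag-style tags are ignored here
        info = dict(item.split("=", 1) for item in cols[7].split(";") if "=" in item)
        if "AC" not in info or "AN" not in info:
            continue
        an = int(info["AN"])
        alts = cols[4].split(",")
        acs = info["AC"].split(",")
        for alt, ac in zip(alts, acs):
            af = int(ac) / an if an else None   # avoid dividing by zero
            yield cols[0], int(cols[1]), cols[3], alt, af
```

To run it on the compressed file from the question, open it with gzip.open("ExAC.r0.3.1.sites.vep.vcf.gz", "rt") and pass the file handle as lines, with chrom="1" to mirror --chr 1.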