How to Calulate Allele Frequency from a VCF File?
0
1
Entering edit mode
22 months ago
Emoji ▴ 10

I have a VCF file with 200 samples (mitochondrial genome of Plasmodium falciparum). Here is a pic to take a look at: enter image description here

And a few relevant lines from the actual file:

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the 
same order as listed">
 ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order 
 as listed">
 ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  FP0008-C   FP0009-C
 Pf_M76611       19      .       A       T       33196.8 MissingVQSLOD;Mitochondrion     
 AC=9;AF=0.0003019;AN=29816;ANN=T|intragenic_variant|MODIFIER|PF3D7_MIT00100|PF3D7_MIT00100|gene_variant|PF3D7_MIT00100|||n.19A>T||||||;AS_InbreedingCoeff=0.0437;AS_QD=20.96;BaseQRankSum=0.141;C  
DS;DP=559138;ExcessHet=0.0026;FS=2.892;InbreedingCoeff=0.52;MLEAC=8;MLEAF=0.0002683;MQ=60;MQR 
ankSum=0;QD=20.96;ReadPosRankSum=0.856;RegionType=Mitochondrion;SOR=0.789;set=genotypegvcfs  
GT:AD:DP:GQ:PL  0/0:52,0:52:99:0,120,1800  0/0:54,0:54:99:0,120,1800

A small subset of samples I am working on can be found here: https://www.mediafire.com/file/6dkct6guk5zx83m/sample.vcf/file

I am trying to calculate allele frequency (AF) for each variant in the dataset. The population is assumed to be all the samples in the file.

In other words, even though the VCF file contains AF statistics from population data, I would like to know how to recalculate AF only from the VCF file. I believe I need to know how AC and AN are calculated first.

AF is calculated this way: allele count (AC) / allele number (AN)

I tried to calculate them from the information from GT (counting 0/0, 1/1 and 0/1 in all the 200 samples), but then I learned that this calculation is not suitable for mithocondrial data.

All lines in the VCF file contain an AF statistic:

$ grep -v 'AF=' sample.vcf | grep -v '^#'
<No output>

The information is provided under INFO column, but how were they calculated? If I do not have AC and AN to begin with, what should I do? Is there a way to calculate AC and AN from the other information provided in the dataset? such as the data mentioned in the FORMAT column? GT: genotype AD: unfiltered allele depth DP : read depth at this position for this sample (Integer) GQ : conditional genotype quality, encoded as a phred quality PL : the phred-scaled genotype likelihoods rounded to the closest integer

Thanks

frequency allele VCF • 6.0k views
ADD COMMENT
0
Entering edit mode

pretty much all vcftools will enable this for you; vcftools, plink2 etc

ADD REPLY
0
Entering edit mode

for mithocondrial data ?

ADD REPLY
0
Entering edit mode

Thanks for the reply. I am not just looking for a tool to quickly calculate something. I checked the documentation for vfctools, but it just described how to use it to get the result. Plus, i am not sure if it covers mithocondrial genome. I would like to know about the calculation. How can i do this manually?

ADD REPLY
2
Entering edit mode
bcftools +fill-tags in.vcf -- -t AN,AC,AF

AC= COUNT(ALT-ALLELE)

AN= COUNT(ALT-ALLELE)+COUNT(REF-ALLELE)

AF=AC/AN

ADD REPLY
0
Entering edit mode

Thanks, but it looks like when i tried to calculate them from GT (counting 0/0, 1/1 and 0/1 in all the 200 samples). I do not think it works that way, at least not for the mithocondrial genome.

ADD REPLY
1
Entering edit mode

That's correct. You don't have biallelic alleles in mitochondria, so you can use either the Allele depth (AD) or depth of coverage (DP) for each individual. DP should be a bit more raw than AD.

ADD REPLY
0
Entering edit mode

Thanks, but how do i do that?

ADD REPLY
1
Entering edit mode

for AD it's that second element 52,0 for FP0008-C then 54,0 for FP0009-C. When you have more subjects it should be interesting. You can add those between individuals and divide each allele by the total.

ADD REPLY
0
Entering edit mode

Thanks, aplogy for follow up questions, as i am new to this field. I am just trying to clarify your explanation. What you are saying is to add numbers, for these two samples, 52 and 54 which would be the total and then divide each allele count (counts for 0/0, 0/1 and 1/1) by that total?

ADD REPLY
1
Entering edit mode

Imagine it was more heteroplasmy in these two, so like 42,10 and 50,4 then the frequencies would be 92/106=.868 and 14/106=.132

ADD REPLY
0
Entering edit mode

Awesome! I get it now. However, there is still a small problem. All the ADs are like this: some number (between 1 and 60):0. So, the number on the right side of any AD read is always 0. That would mean that my fractions are alway 0/total and total/total whih would result in 0 and 1. How is this possible?

ADD REPLY
1
Entering edit mode

ok, what do the DP's look like? ever more than one number?

ADD REPLY
0
Entering edit mode

For each sample, there is only one number.

ADD REPLY
0
Entering edit mode

Hmm i would expect some heteroplasmy to show up in AD or DP at some positions

ADD REPLY

Login before adding your answer.

Traffic: 1973 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6