I have a VCF file with 200 samples (mitochondrial genome of Plasmodium falciparum). Here is a pic to take a look at:
And a few relevant lines from the actual file:
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the
same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order
as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT FP0008-C FP0009-C
Pf_M76611 19 . A T 33196.8 MissingVQSLOD;Mitochondrion
AC=9;AF=0.0003019;AN=29816;ANN=T|intragenic_variant|MODIFIER|PF3D7_MIT00100|PF3D7_MIT00100|gene_variant|PF3D7_MIT00100|||n.19A>T||||||;AS_InbreedingCoeff=0.0437;AS_QD=20.96;BaseQRankSum=0.141;C
DS;DP=559138;ExcessHet=0.0026;FS=2.892;InbreedingCoeff=0.52;MLEAC=8;MLEAF=0.0002683;MQ=60;MQR
ankSum=0;QD=20.96;ReadPosRankSum=0.856;RegionType=Mitochondrion;SOR=0.789;set=genotypegvcfs
GT:AD:DP:GQ:PL 0/0:52,0:52:99:0,120,1800 0/0:54,0:54:99:0,120,1800
A small subset of samples I am working on can be found here: https://www.mediafire.com/file/6dkct6guk5zx83m/sample.vcf/file
I am trying to calculate allele frequency (AF) for each variant in the dataset. The population is assumed to be all the samples in the file.
In other words, even though the VCF file contains AF statistics from population data, I would like to know how to recalculate AF only from the VCF file. I believe I need to know how AC and AN are calculated first.
AF is calculated this way: allele count (AC) / allele number (AN)
I tried to calculate them from the information from GT (counting 0/0, 1/1 and 0/1 in all the 200 samples), but then I learned that this calculation is not suitable for mithocondrial data.
All lines in the VCF file contain an AF statistic:
$ grep -v 'AF=' sample.vcf | grep -v '^#'
<No output>
The information is provided under INFO column, but how were they calculated? If I do not have AC and AN to begin with, what should I do? Is there a way to calculate AC and AN from the other information provided in the dataset? such as the data mentioned in the FORMAT column? GT: genotype AD: unfiltered allele depth DP : read depth at this position for this sample (Integer) GQ : conditional genotype quality, encoded as a phred quality PL : the phred-scaled genotype likelihoods rounded to the closest integer
Thanks
pretty much all vcftools will enable this for you; vcftools, plink2 etc
for mithocondrial data ?
Thanks for the reply. I am not just looking for a tool to quickly calculate something. I checked the documentation for vfctools, but it just described how to use it to get the result. Plus, i am not sure if it covers mithocondrial genome. I would like to know about the calculation. How can i do this manually?
AC= COUNT(ALT-ALLELE)
AN= COUNT(ALT-ALLELE)+COUNT(REF-ALLELE)
AF=AC/AN
Thanks, but it looks like when i tried to calculate them from GT (counting 0/0, 1/1 and 0/1 in all the 200 samples). I do not think it works that way, at least not for the mithocondrial genome.
That's correct. You don't have biallelic alleles in mitochondria, so you can use either the Allele depth (AD) or depth of coverage (DP) for each individual. DP should be a bit more raw than AD.
Thanks, but how do i do that?
for AD it's that second element
52,0
for FP0008-C then54,0
for FP0009-C. When you have more subjects it should be interesting. You can add those between individuals and divide each allele by the total.Thanks, aplogy for follow up questions, as i am new to this field. I am just trying to clarify your explanation. What you are saying is to add numbers, for these two samples, 52 and 54 which would be the total and then divide each allele count (counts for 0/0, 0/1 and 1/1) by that total?
Imagine it was more heteroplasmy in these two, so like
42,10
and50,4
then the frequencies would be 92/106=.868 and 14/106=.132Awesome! I get it now. However, there is still a small problem. All the ADs are like this: some number (between 1 and 60):0. So, the number on the right side of any AD read is always 0. That would mean that my fractions are alway 0/total and total/total whih would result in 0 and 1. How is this possible?
ok, what do the DP's look like? ever more than one number?
For each sample, there is only one number.
Hmm i would expect some heteroplasmy to show up in AD or DP at some positions