I have a VCF with the following lines:
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=G,Type=Integer,Description="Allelic Depths of REF and ALT(s) in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL
1 15557977 . TG CA . . . GT:AD:DP 0/1:11,5:16 0/0:21,1:22
1 146728217 . G A . . . GT:AD:DP 0/1:19,21:40 0/0:42,0:42
I am under the impression that to calculate the minor allele frequency (AF), I need to divide AD by DP. I need clarification for this specific calculation since the AD attribute has two comma-separated values. Do the two comma-separated values indicate the major and minor alleles? Does that mean that for the calculation of the minor AF I only care about the smaller of the two numbers?
Looking at the first line, under the tumor column: AD = 11,5 & DP = 16. Would it be 5/16 = 0.3125?
This is what I am thinking, but I was having trouble finding distinct confirmation in my searches.
Additionally, sometimes VCF files do not have multiple AD values -- does that mean to calculate the minor AF that I just use the single AD value? Or do I need to subtract the provided AD value from the DP value and then take the smaller value of those two (AD, DP - AD), to calculate it?
Edit for @2nelly -- Example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL
chr1 2993807 . C G . PASS AC=4;ADP=211;AN=4;HET=0;HOM=1;NC=0;SF=0,1;WT=0 GT:RDF:DP:ADF:ABQ:FA:RBQ:GQ:ADR:PVAL:AD:RDR:RD:SDP:FREQ 1/1:0:211:178:52:0.9905:36:255:31:2.0028E-123:209:1:1:211:99.05% 1/1:0:211:178:52:0.9905:36:255:31:2.0028E-123:209:1:1:211:99.05%
In this line, under the tumor column: AD = 209, DP = 211. To calculate the minor AF, I assume it would actually be 211 - 209 = 2 for the minor AD, and then 2/211 ≈ 0.0095?
Thank you in advance for clarifying this for me!
Dear cookersjs,
You are right about the division. In the tumor column of the first record:
16 is the total depth (DP) in tumor
11 is the depth of REF in tumor
5 is the depth of ALT in tumor
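As a rough illustration of the division above, here is a minimal Python sketch that splits one sample column by its FORMAT keys and divides each AD value by DP. The function name `allele_fractions` is made up for this example, and the sketch assumes AD lists REF first and then each ALT, that DP is greater than zero, and that no field is missing (`.`).

```python
def allele_fractions(format_field, sample_field):
    """Return (ref_fraction, [alt_fractions]) for one sample column.

    Assumes AD is ordered REF, ALT1, ALT2, ... and that DP > 0.
    """
    keys = format_field.split(":")
    values = dict(zip(keys, sample_field.split(":")))
    depths = [int(x) for x in values["AD"].split(",")]
    total = int(values["DP"])
    fractions = [depth / total for depth in depths]
    return fractions[0], fractions[1:]


# Tumor column of the first record in the question
ref_frac, alt_fracs = allele_fractions("GT:AD:DP", "0/1:11,5:16")
print(ref_frac, alt_fracs)  # 0.6875 [0.3125]
```

For real files, a dedicated VCF parser is usually a safer choice than splitting strings by hand.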
Regarding the last part of your question, it would be better to post an example of the whole line. Maybe these variants should be filtered out. How did you produce the VCF file? You should normally get more information than this.
Thanks for confirming! I have added a vcf sample that illustrates the second case I was describing
This line surely comes from a different VCF file, and something is wrong with it. The control and case samples are both homozygous with exactly the same values! How did you call these variants? Can you please post the header of the VCF?
The file was provided to me, so I'm not sure how it was generated. That at least clears up that the file was the problem here, thanks!
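For the single-AD record quoted in the question's edit, the arithmetic can still be checked against the line itself. The sketch below is only a sanity check: it reads AD, DP and FREQ from the tumor column and compares AD/DP with the reported FREQ. That the single AD value counts ALT-supporting reads is an inference from these numbers, not something confirmed by the (missing) header.

```python
def parse_sample(format_field, sample_field):
    """Map FORMAT keys to the values of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


fmt = "GT:RDF:DP:ADF:ABQ:FA:RBQ:GQ:ADR:PVAL:AD:RDR:RD:SDP:FREQ"
tumor = "1/1:0:211:178:52:0.9905:36:255:31:2.0028E-123:209:1:1:211:99.05%"

fields = parse_sample(fmt, tumor)
ad = int(fields["AD"])
dp = int(fields["DP"])

ad_fraction = ad / dp                # fraction of reads counted in AD
remainder_fraction = (dp - ad) / dp  # fraction of reads not counted in AD
reported = float(fields["FREQ"].rstrip("%")) / 100

print(round(ad_fraction, 4))         # 0.9905
print(round(remainder_fraction, 4))  # 0.0095
print(abs(ad_fraction - reported) < 1e-3)  # True: AD/DP agrees with FREQ
```

Since AD/DP reproduces the FREQ field, (DP - AD)/DP here would be the fraction of reads not supporting the ALT call, assuming that reading of AD is right.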
Hello cookersjs,
could you please explain your definition of "minor allele frequency"? My understanding is that this is the frequency of the second most common allele in a given population (and this can be the reference allele as well!).
What you are calculating in your example is the fraction of reads supporting the alternate allele. As described in the header, the first value in the AD field is the number of reads supporting the REF allele, and the second is the number of reads supporting the allele in the ALT column.
fin swimmer
Hi finswimmer,
My understanding of minor allele frequency is that it is the frequency at which the alternate allele occurs in the sample. Based on the replies from 2nelly and on my own interpretation, the AD (allelic depth) attribute contains two comma-separated values.
The sum of those two values is equal to the single DP value. In the first example, the DP was 16, and the ADs for ref and alt were 11 and 5, respectively. Since I am interested in the "minor" allele frequency, that would mean the smaller AD value is the one I am interested in. I can get the frequency by dividing the AD(minor) by DP, or 5/16 = 0.3125.
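The two readings discussed in this thread (the fraction of reads supporting ALT versus the smaller of the two AD values over DP) agree for the first record but not for the second one in the question. A minimal Python sketch of the comparison, with made-up helper names and the assumption that AD is ordered REF, ALT:

```python
def alt_fraction(ad, dp):
    """Fraction of reads supporting the ALT allele (second AD value)."""
    return ad[1] / dp


def smaller_fraction(ad, dp):
    """Fraction of reads for whichever allele has fewer reads."""
    return min(ad) / dp


# Tumor AD and DP values from the two records in the question
records = {
    "1:15557977": ([11, 5], 16),
    "1:146728217": ([19, 21], 40),
}

for name, (ad, dp) in records.items():
    print(name, alt_fraction(ad, dp), smaller_fraction(ad, dp))
# 1:15557977 0.3125 0.3125
# 1:146728217 0.525 0.475
```

In the second record the ALT allele has more reads than REF, so taking the smaller AD value gives the REF read fraction rather than the ALT one.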