How do I find allele frequencies from GATK output?
1
0
Entering edit mode
3 months ago
mgranada3 ▴ 50

I have outputs from my GATK pipeline using the SnpEff step. I need to produce a Muller plot for my time series experiment conducted at 0, 50, 100, 150, and 200 generations.

Pipeline: https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/

Table:

In the header it says:

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">

enter image description here

However all my "allele frequencies are either 0.5 or 1.0. Is this the file necessary to create a Muller plot or do I need to perform additional steps?

allele-frequency GATK muller-plot • 870 views
ADD COMMENT
1
Entering edit mode

AF here is the frequency within the samples of your VCF. As you have only one sample here... If you know a database of AF in the general population you can use , for example, bcfools annotate to insert this information...

ADD REPLY
0
Entering edit mode

I have 3 conditions, 5 timepoints, 5 replicates per timepoint, is there anyway to combine these VCF files so the AF reflects the frequency in all 5 replicates? So I could see how gene frequencies change over time?

ADD REPLY
3
Entering edit mode
3 months ago
2nelly ▴ 350

AF reflects the number of alleles that were detected in this position. If the genotype (GT) is homozygous it takes the value 1.00 where as if it is heterozygous (2 different alleles) becomes 0.5. In case you have multiallelic sites, AF values can be further reduced if I am not mistaken.

If you want to look for the allelic frequency of every mutation you can just use the values DP (total depth) and AD (allelic depth, AD_REF and AD_ALT). DP is the sum of all AD values (separated by comma). For instance in the first line POS 251395 G>A, I see that there is a homozygous mutation. The DP is 51 and the AD is 0 (REF) + 51 (ALT). So the allele frequency is (51/(0+51))100 e.g.100% In a heterozygous scenario like this in the second line POS 437550, the DP is 1137 and the AD 851 for REF and 286 for ALT. That means that the frequency of this heterozygous mutation is (286/(851+286))100 = 25.15%

To summarize the formula for the calculation you want should be (AD_ALT/DP) * 100

I hope this is helpful.

ADD COMMENT
0
Entering edit mode

Thank you this was extremely helpful. If I have 5 samples for a single timepoint, can I just average the frequencies?

ADD REPLY
1
Entering edit mode

Well, theoretically you could if they are replicates (either biological or technical). However, keep in mind potential batch effect and sample variability.

ADD REPLY
0
Entering edit mode

One of my samples displays 3 values here, how do I calculate this?

GT:AD:DP:GQ:PL

1/2:23,88,16:127:73:3125,293,73,2140,0,2177

ADD REPLY
0
Entering edit mode

GT 1/2 means that you have 2 alternative alleles in that position. In total you have 127 reads in that position. To find the frequency of each allele you divide by 127: 23/127 REF 88/127 ALT1 16/127 ALT2 One of these should be the reference allele. I suspect it is the first number (23), like in regular heterozygous genotypes 0/1, not sure though. If someone can clarify this please go ahead.

ADD REPLY
0
Entering edit mode

Does this equation work for haploid organisms? Because I saw that during haplotype caller I should have used a "-ploidy" which I definitely didn't do, rendering all this data probably useless.

ADD REPLY
0
Entering edit mode

By default ploidy is set to 2 in HaplotypeCaller, so no worries.

ADD REPLY

Login before adding your answer.

Traffic: 1640 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6