I have outputs from my GATK pipeline using the SnpEff step. I need to produce a Muller plot for my time series experiment conducted at 0, 50, 100, 150, and 200 generations.
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
However, all my allele frequencies are either 0.5 or 1.0. Is this the file I need to create a Muller plot, or do I need to perform additional steps?
AF here is the frequency within the samples of your VCF. As you have only one sample here... If you know a database of AF in the general population you can use, for example, bcftools annotate to insert this information...
I have 3 conditions, 5 timepoints, and 5 replicates per timepoint. Is there any way to combine these VCF files so that AF reflects the frequency across all 5 replicates? That way I could see how gene frequencies change over time.
AF reflects the proportion of ALT alleles in the genotype called at this position. If the genotype (GT) is homozygous for the ALT allele, AF is 1.00, whereas if it is heterozygous (2 different alleles), AF is 0.5. At multiallelic sites, AF values can be lower still, if I am not mistaken.
If you want to look for the allelic frequency of every mutation you can just use the values DP (total depth) and AD (allelic depth, AD_REF and AD_ALT).
DP is the sum of all AD values (separated by commas). For instance, in the first line, POS 251395 G>A, I see that there is a homozygous mutation. The DP is 51 and the AD is 0 (REF) + 51 (ALT). So the allele frequency is (51/(0+51)) * 100, i.e. 100%.
In a heterozygous scenario like the one in the second line, POS 437550, the DP is 1137 and the AD is 851 for REF and 286 for ALT. That means the frequency of this heterozygous mutation is (286/(851+286)) * 100 = 25.15%.
To summarize, the formula for the calculation you want is (AD_ALT/DP) * 100.
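In case it helps, here is a minimal Python sketch of that calculation. It assumes a biallelic site and that you have already pulled the AD value out of the sample column as a string like "851,286". The function name is made up for illustration, and the depth is taken as the sum of the AD values, as described above.

```python
def alt_allele_frequency(ad_field):
    """Return the ALT allele frequency in percent from a biallelic AD string.

    ad_field is assumed to look like "REF_depth,ALT_depth", e.g. "851,286".
    """
    ref_depth, alt_depth = (int(x) for x in ad_field.split(","))
    total = ref_depth + alt_depth  # depth taken as the sum of the AD values
    if total == 0:
        raise ValueError("no reads support either allele at this site")
    return 100.0 * alt_depth / total


# The two sites discussed above
print(round(alt_allele_frequency("0,51"), 2))     # POS 251395
print(round(alt_allele_frequency("851,286"), 2))  # POS 437550
```

For the two positions above this should reproduce the 100% and 25.15% worked out by hand.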
Well, theoretically you could if they are replicates (either biological or technical).
However, keep in mind potential batch effect and sample variability.
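One possible way to do this, sketched in Python: collect the (REF, ALT) read depths for the same site from each replicate at a given timepoint, then report both the pooled frequency and the mean and spread of the per-replicate frequencies, so the variability between replicates stays visible. The function name and the read counts below are invented purely for illustration; they are not from your data.

```python
from statistics import mean, stdev


def summarize_replicates(ad_pairs):
    """Summarize ALT frequency at one site across replicates.

    ad_pairs: list of (ref_depth, alt_depth) tuples, one per replicate,
    all for the same site, condition and timepoint.
    """
    # Skip replicates with no coverage at this site
    covered = [(ref, alt) for ref, alt in ad_pairs if ref + alt > 0]
    if not covered:
        raise ValueError("no replicate has coverage at this site")

    per_replicate = [alt / (ref + alt) for ref, alt in covered]
    pooled = sum(alt for _, alt in covered) / sum(ref + alt for ref, alt in covered)
    spread = stdev(per_replicate) if len(per_replicate) > 1 else 0.0
    return {"pooled": pooled, "mean": mean(per_replicate), "sd": spread}


# Hypothetical read counts for three replicates (illustration only)
example = [(90, 10), (80, 20), (70, 30)]
print(summarize_replicates(example))
```

Pooling weights each replicate by its coverage, while the mean treats replicates equally. A large standard deviation is a hint that batch effects or sample variability matter at that site.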
GT 1/2 means that you have 2 alternative alleles in that position.
In total you have 127 reads in that position.
To find the frequency of each allele you divide by 127:
23/127 REF
88/127 ALT1
16/127 ALT2
One of these should be the reference allele. I suspect it is the first number (23), like in regular heterozygous genotypes 0/1, not sure though.
If someone can clarify this please go ahead.
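On the ordering: as far as I know, GATK describes AD in the VCF header as the allelic depths for the ref and alt alleles in the order listed, so the first number should be REF even when GT is 1/2. Under that assumption, a small Python sketch for the multiallelic case (the function name is just for illustration):

```python
def allele_frequencies(ad_field):
    """Return per-allele frequencies from an AD string such as "23,88,16".

    Assumes the first value is the REF depth and the rest are the ALT
    depths in the order the ALT alleles are listed in the record.
    """
    depths = [int(x) for x in ad_field.split(",")]
    total = sum(depths)
    if total == 0:
        raise ValueError("no reads support any allele at this site")
    return [depth / total for depth in depths]


freqs = allele_frequencies("23,88,16")  # 127 reads in total
labels = ["REF"] + ["ALT%d" % i for i in range(1, len(freqs))]
for label, freq in zip(labels, freqs):
    print("%s: %.3f" % (label, freq))
```

The frequencies always sum to 1, so each one can be read directly as the share of the 127 reads supporting that allele.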
Does this equation work for haploid organisms? I ask because I saw that when running HaplotypeCaller I should have used "-ploidy", which I definitely didn't do, probably rendering all this data useless.