Calculating empirical allele frequency
0
0
Entering edit mode
4.3 years ago
drowl1 ▴ 30

Hello!

It looks like the GATK forums site is down this whole week so I am unable to get an answer to a somewhat basic question.

I have generated several VCF files of individual haploid samples using GATK 4.1.7.0 and the variants have their allele frequencies as 1. I would like to calculate their empirical allele frequencies for subsequent plotting of their site frequency spectrum. How do I go about this?

Thanks in advance!

allelle frequency VCF GATK Haploid • 2.3k views
ADD COMMENT
1
Entering edit mode

If I understood your request correctly, I'd merge VCF files and sum up 0 if the variant is in reference state (0/0), 1 if it is heterozygous (0/1) and 2 if it is homozygous. Important to not count (./.) variants since they were not called. Then divide by double number of samples where the variant was called. / should be replaced with | for phased data.

ADD REPLY
0
Entering edit mode

Hi German,

I actually want to see how the allele frequency spectrum looks like for every individual sample before any further processing and thus I wish to calculate the empirical alternate allele frequencies of the variants in the individual samples/VCF file.

For example, in Freebayes, one may use the AO and RO values to do so(AO/(AO + RO), but I cannot figure out the equivalent in GATK. This is what I want to know.

ADD REPLY
1
Entering edit mode

Hi @drowl1

in the results I have from GATK I have them in format GT:AD:DP:GQ:PL. So the 2nd column should contain depths from both alleles, I believe. So it should be something like 0/1:45,37:82:etc which shows you number of reads supporting reference (45) and alternative (37).

However it is not mandatory to have AD in the output of GATK, as far as I know, so if you don't have this field, you have to re-call your samples...

ADD REPLY
0
Entering edit mode

Well noted, German, and I now understand that I can use the Alt values in the AD and DP.

But going by GATK's definitions of annotations, whatever I get from this ( Alt value in AD/DP) is the allele fraction. It's great that you have worked with GATK so perhaps you may be able to explain to me if this value is the same as the alternate allele frequency?

These two terms are quite confusing.

ADD REPLY
1
Entering edit mode

Oh, this one I think is easy. DP contains one number - it is a coverage depth of this position. AD is number of reads supporting each variant - several numbers, divided by commas. Sum of AD should, ideally, be equal to DP (I guess does not work for indels but not sure about that). So I'd take a value from AD and divide it by the sum of AD - that's how you get allele frequency.

I also worked with FreeBayes - indeed, the name of fields is different there, but the principle is the same.

ADD REPLY
0
Entering edit mode

I see. In the case of GATK, most of the time, their DP doesn't match the sum of the AD. I am looking at haploid SNPs so, in the AD field, I only have two values. I read somewhere some time ago, can't remember the exact place, but the GATK team explain that the DP value may not sometimes match the sum of AD because the algorithm considers all the reads when computing the DP, while it only takes account of useful/informative reads when computing the AD values.

In my SNP set, this is true and the sum of AD is not equal to DP. for example:

GT:AD:DP:GQ:PL. 1:33,49:88:99:720,0 -AD ≠ DP

So in such a case, how would you get allele frequency?

ADD REPLY
1
Entering edit mode

Personally I would use 33 / ( 33 + 49 ). There are also multi-allelic sites - if e.g. one allele was not called, you will see that sum(AD) != DP. Same for errors in calling/mapping (maybe there are several soft-clipped reads - not sure if GATK counts them for DP). I would use AD and check the distribution - is the median of heterozygous variants AF will be much different from 0.47-0.49 - then there is a mistake and it has to be revisited.

ADD REPLY
0
Entering edit mode

Hi German,

Thanks a lot for your suggestions and explanation, much appreciated. In a previous post in the GATK forums, I have seen that the authors recommend using AD/DP to calculate empirical allele frequency so I guess I'll go with that.

The discussion is here: https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2015-11-25-2015-07-16/6202-VCF-file-and-allele-frequency

ADD REPLY
1
Entering edit mode

Hi, I agree, in this case (as I understood from the description) DP is equal to sum of AD

It says DP=51, but then later GT:AD:DP:GQ:PL 0/1:36,11:47:99:210,0,1046 lists DP=47

and then of curse AD/DP is the right way as well as AD / sum AD

ADD REPLY

Login before adding your answer.

Traffic: 2459 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6