I would like to get the alternate allele counts (AC) and the total allele counts (AN) for any variant in each of the five 1000 Genomes super-populations (AFR, AMR, EAS, EUR, SAS) as well as the global population (ALL).
1000 Genomes offers an Allele Frequency Calculator, which gives output for the global population (ALL) and each sub-population (ACB, ASW, BEB, etc.) like the following:
CHR POS ID REF ALT ALL_POP_TOTAL_CNT ALL_POP_ALT_CNT ALL_POP_FRQ ...
1 10177 . A AC 5008 2130 0.43 ...
This gives me exactly what I need, but ideally I would like a solution that I can implement in a pipeline (i.e. independent of the online interface), perhaps using vcftools or bcftools. I know I can sum the values for the sub-populations to get the values for each respective super-population, but I also wonder if there is a simpler or faster way that I'm missing.
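To make the summing approach concrete, here is a minimal sketch of rolling sub-population counts up into super-population and global totals. It assumes the phase 3 grouping of the 26 populations into the five super-populations; the function name `sum_super_populations` and the input layout (population code mapped to an `(AN, AC)` pair) are hypothetical, and the alternate allele counts in the usage lines are made up for illustration.

```python
from collections import defaultdict

# Assumed phase 3 grouping of populations into super-populations.
SUPER_POP = {
    "AFR": ["YRI", "LWK", "GWD", "MSL", "ESN", "ASW", "ACB"],
    "AMR": ["MXL", "PUR", "CLM", "PEL"],
    "EAS": ["CHB", "JPT", "CHS", "CDX", "KHV"],
    "EUR": ["CEU", "TSI", "FIN", "GBR", "IBS"],
    "SAS": ["GIH", "PJL", "BEB", "STU", "ITU"],
}
POP_TO_SUPER = {pop: sup for sup, pops in SUPER_POP.items() for pop in pops}


def sum_super_populations(sub_counts):
    """Roll up {population: (AN, AC)} into {group: (AN, AC, AF)}.

    Groups are the super-populations present in the input plus "ALL".
    AF is recomputed from the summed integers, so no rounding error
    is inherited from a truncated per-population frequency.
    AF is None when AN is zero.
    """
    totals = defaultdict(lambda: [0, 0])
    for pop, (an, ac) in sub_counts.items():
        for group in (POP_TO_SUPER[pop], "ALL"):
            totals[group][0] += an
            totals[group][1] += ac
    return {
        group: (an, ac, ac / an if an else None)
        for group, (an, ac) in totals.items()
    }


# Illustrative input only: the AC values below are invented.
example = {"ACB": (192, 50), "ASW": (122, 30), "BEB": (172, 10)}
result = sum_super_populations(example)
print(result["AFR"][:2], result["SAS"][:2], result["ALL"][:2])
```

Because only integer AN and AC values are added, the super-population AF can be derived at the very end without touching any pre-rounded frequency.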
What I've tried already:
- I can easily get AF for the global and super-populations using ANNOVAR, but I still need AC and AN.
- I can get AC and AF from dbNSFP 2, but this limits the variants to non-synonymous SNPs only. Technically, I could calculate AN by dividing AC by AF, but this introduces rounding errors because AF has been truncated. Additionally, if AC and AF are zero, then I won't be able to calculate AN at all.
- I have tried the fill-an-ac script for VCF files, using the technique suggested here. It updates the AN and AC fields just fine, but for some reason it doesn't update the AF field.
- I've dabbled in the idea of adding up the genotypes (e.g. 0|0, 0|1, 1|1, etc.) in the VCF/BCF files, but I was hoping to avoid this if possible.
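For reference, adding up the genotypes directly could look roughly like the sketch below. It assumes a tab-separated VCF data line with a GT entry in the FORMAT column and a mapping from sample name to super-population (for example, one built from a sample panel file). The function name `count_alleles`, the sample names S1 to S4, and the example line are all hypothetical.

```python
def count_alleles(vcf_line, sample_names, sample_to_super, alt_index=1):
    """Count AN and AC per super-population (and "ALL") for one VCF line.

    alt_index selects which ALT allele to count (1 = first ALT).
    Missing alleles (".") are excluded from both AN and AC.
    Returns {group: (AN, AC)}.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    gt_pos = fields[8].split(":").index("GT")  # position of GT in FORMAT
    counts = {}
    for name, call in zip(sample_names, fields[9:]):
        genotype = call.split(":")[gt_pos]
        # Treat phased (|) and unphased (/) separators the same way.
        alleles = genotype.replace("|", "/").split("/")
        for group in (sample_to_super[name], "ALL"):
            an_ac = counts.setdefault(group, [0, 0])
            for allele in alleles:
                if allele == ".":
                    continue
                an_ac[0] += 1
                if int(allele) == alt_index:
                    an_ac[1] += 1
    return {group: (an, ac) for group, (an, ac) in counts.items()}


# Hypothetical four-sample line; real files carry thousands of columns.
samples = ["S1", "S2", "S3", "S4"]
panel = {"S1": "AFR", "S2": "AFR", "S3": "EUR", "S4": "EUR"}
line = "1\t10177\t.\tA\tAC\t100\tPASS\t.\tGT\t0|0\t0|1\t1|1\t.|1"
print(count_alleles(line, samples, panel))
```

This reads the counts straight from the calls, so AF can be computed afterwards as AC / AN for whichever group is needed, without going through the sub-populations first.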
Question:
How can I get the AC, AN, and AF of any variant for each of the five 1000 Genomes super-populations as well as the global population? Can I do this without first calculating the sub-populations?
NOTE: I know AF is included in the 1000 Genomes VCF/BCF files, but if someone knows how to get AC, AN, and AF in one fell swoop (similar to Allele Frequency Calculator) then it would be greatly appreciated.
Jorge, I like this answer, but I'm still afraid that multiplying Nsamples * AF to get AC will introduce rounding errors because AF is truncated. As an additional note, if I were to do it your way, I could actually skip ANNOVAR and pull the super-population frequencies directly from the VCF/BCF file. I came up with a solution that worked for my purposes, if you'd like to check it out. Thanks for your input.
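The rounding concern can be checked with the example row from the question (AN = 5008, AC = 2130, reported frequency 0.43). The sketch below assumes the reported AF is the exact ratio rounded to two decimals, and it takes the total allele count as the multiplier; the variable names are illustrative.

```python
an, ac = 5008, 2130              # counts from the example output row

af_exact = ac / an               # about 0.4253
af_reported = round(af_exact, 2) # 0.43, as shown in the calculator output

# Reconstructing AC from the reported AF overshoots the true count.
ac_from_af = round(an * af_reported)

# Reconstructing AN from AC and the reported AF undershoots it.
an_from_af = round(ac / af_reported)

print(f"true AC = {ac}, reconstructed AC = {ac_from_af}")
print(f"true AN = {an}, reconstructed AN = {an_from_af}")
```

With these numbers the reconstructed AC is off by 23 alleles and the reconstructed AN by 55, which is why working from the integer counts is preferable to back-calculating from a two-decimal frequency.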