I need the global minor allele frequencies (MAF) for all SNPs in the 1000 Genomes Phase 1 low-coverage data. I have the 1000 Genomes .vcf files. What is the easiest way to get these values? Is there a suggested tool?
If you already have the 1000 Genomes Phase 1 VCFs downloaded, the global allele frequency is in the AF field of the INFO column. The VCF format is complex, so don't try to write your own code to parse out INFO/AF. If you need a flattened tab-delimited format, use a tool like bcftools query. Download and install bcftools as explained here, and then you can use a command like this:
bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t%AF\t%AC\t%AN\n' ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz
This generates output that looks like this:
22 16050408 T C 0.06 134 2184
22 16050984 C G 0.0023 5 2184
22 16051722 TA T 0.01 32 2184
22 16052239 A G 0.46 1010 2184
22 16053659 A C 0.76 1655 2184
Where the columns are CHROM, POS, REF, ALT, AF, AC, AN. The ALT allele count (AC) and total allele count (AN) are useful to know even if you only need AF, which is equal to AC/AN.
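As a quick sanity check on the AC/AN relationship, here is a minimal Python sketch that recomputes AF from the counts in the sample rows above. The function name allele_frequency is just an illustrative helper, not part of bcftools. It suggests that the AF value stored in the file is a rounded version of AC/AN, so recomputing from the counts gives a little more precision.

```python
# Sample rows copied from the bcftools query output above: (POS, AF, AC, AN)
rows = [
    (16050408, 0.06, 134, 2184),
    (16050984, 0.0023, 5, 2184),
    (16051722, 0.01, 32, 2184),
    (16052239, 0.46, 1010, 2184),
    (16053659, 0.76, 1655, 2184),
]


def allele_frequency(ac, an):
    """Recompute the ALT allele frequency from allele counts (hypothetical helper)."""
    if an <= 0:
        raise ValueError("AN must be a positive allele count")
    return ac / an


for pos, reported_af, ac, an in rows:
    recomputed = allele_frequency(ac, an)
    # Print the reported (rounded) AF next to the value recomputed from AC/AN
    print(f"{pos}\treported={reported_af}\trecomputed={recomputed:.4f}")
```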
An important point is that AF is the frequency of the ALT allele, but sometimes the REF allele is the minor allele. For example, if AF is 0.76, the ALT allele is too common in the population to be called the "minor" allele. So if AF is greater than 0.50, set MAF = 1 - AF; otherwise MAF = AF. I'm not very experienced in germline genetics, so there may be other caveats. But this should be enough to get you started.
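The MAF rule above can be sketched in a few lines of Python that post-process the tab-delimited bcftools query output. This is only a sketch under some assumptions: the helper names maf_from_af and add_maf_column are made up for illustration, the input has AF in the fifth column as in the command above, and every site is treated as biallelic. Rows with a missing AF or a comma-separated AF (multi-allelic sites) are simply skipped here, which may not be what you want.

```python
def maf_from_af(af):
    """Return the minor allele frequency for a biallelic site.

    AF is the ALT allele frequency; if it exceeds 0.5, the REF allele
    is the minor one, so MAF = 1 - AF. Taking the minimum covers both cases.
    """
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"AF out of range: {af}")
    return min(af, 1.0 - af)


def add_maf_column(lines):
    """Yield each row's fields with a MAF value appended (hypothetical helper).

    Assumes columns CHROM, POS, REF, ALT, AF, AC, AN separated by tabs.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        af = fields[4]
        if af == "." or "," in af:
            continue  # assumption: skip missing or multi-allelic AF values
        yield fields + [f"{maf_from_af(float(af)):.4g}"]


# Two rows from the sample output above, tab-delimited as bcftools emits them
sample = [
    "22\t16050408\tT\tC\t0.06\t134\t2184\n",
    "22\t16053659\tA\tC\t0.76\t1655\t2184\n",
]
for row in add_maf_column(sample):
    print("\t".join(row))
```

To run this on a real file, you could redirect the bcftools query output to a text file and pass the open file handle to add_maf_column in place of the sample list.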
Hi Cyriac
You explained this very well. I'm not able to get MAF in my VCF annotation from either VEP or Annovar. Instead, I'm getting 1000G_All. Is the formula you mentioned above [MAF = 1-AF] correct? I haven't been able to find any documentation on it. It would be really helpful if you could send a link or a document that describes this formula.
Thanks
So, the question is, how do I get the MAF from a VCF? It doesn't matter that this VCF is the 1000 genomes vcf, right?
Perhaps you could clarify this question? Just so you know, 1000 genomes does not strictly refer to humans. I assume you are talking about humans, though.
Anyway, for the purposes of the forum, it would be useful if you explained what you are trying to accomplish, why you are trying to accomplish it, what you have tried, and... well, what organism you are working with is always helpful.
Explaining what you mean by "phase 1" would also be helpful, so that people don't have to look it up. I looked it up, and read about it here. But, well... it's not clear to me why anyone would care about that. As far as I can tell, phase 1 is a preliminary, inaccurate part of the human 1000 genomes project. Why would you want to use that for anything, when there are subsequent, more accurate phases?
Doesn't it? As far as I know, 1000 genomes project is humans only, while the 10K genome project also includes other species.
So, I used to think "thousand genomes" applied to humans only. Then I started working at JGI, and found out that there are other projects called "thousand genomes" that are related to other organisms (such as Aspergillus). My co-workers were baffled when I assumed that when they said "thousand genomes", they were talking about human genomes.
I'm not sure how important this is in most of the world. When I was at UT Southwestern, "thousand genomes" strictly meant a human project. But at JGI, it strictly means not a human project. So, I think it is useful to specify the organism, and also to provide a link to the project, to prevent unnecessary confusion.
Alright - I wasn't aware of other thousand genome projects.
I see similar "vocabulary bubbles" in people who study cancer genetics (like me) vs people who spent their careers studying genetics of germline/mendelian diseases. Every research institute invents its own vocabulary. Clearly, we don't get out much. :)