I would like to understand the reason why Phases 1 and 3 of the 1000 Genomes data have very different allele frequencies for certain SNPs.
I have been comparing SNP allele frequencies among a certain group of individuals ("cases") with the allele frequencies reported by the 1000 Genomes project (specifically the EUR superpopulation). For a small group of about 20 SNPs I have found extreme differences between these two allele frequencies. One example is the SNP rs533515.
However, the allele frequency differences were so extreme that I became suspicious. Looking a bit further, I noticed that this SNP has vastly different allele frequencies reported in the different phases of the 1000 Genomes data. Initially I had been working with Phase 3, assuming it was more up to date and therefore "better." For rs533515 in particular, I can find the Phase 3 allele frequency as follows:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20130502/ALL.chr11.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz 11:64,497,189-64,497,189 | grep -v '^#' | awk -F'\t' '{if ($3=="rs533515") print $4 FS $5 FS $8}'
with the result
A C AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||
Note that the European allele frequency given by Phase 3 is very small, 0.001.
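To avoid reading a single value out of a long INFO string by eye, a small shell helper along these lines could be used. This is only a sketch: `info_get` is a hypothetical name of my own, not part of tabix or any 1000 Genomes tooling, and the INFO string is pasted in from the output above rather than fetched again.

```shell
# Hypothetical helper: print the value of one key from a VCF INFO string.
# $1 = INFO string (semicolon-separated KEY=VALUE pairs), $2 = key to look up.
info_get() {
    printf '%s\n' "$1" | tr ';' '\n' | awk -F'=' -v key="$2" '$1 == key { print $2 }'
}

# INFO column of the Phase 3 record, copied from the tabix output above.
phase3_info='AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||'

info_get "$phase3_info" EUR_AF
```

Because the key is matched exactly, asking for `AF` returns the global frequency and not `EUR_AF` or any other population-specific field.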
Now, I can get the same information from Phase 1:
tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20110521/ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 11:64,497,189-64,497,189 | grep -v '^#' | awk -F'\t' '{if ($3=="rs533515") print $4 FS $5 FS $8}'
and the result is
A C AVGPOST=0.9963;AC=1503;SNPSOURCE=LOWCOV,EXOME;AN=2184;ERATE=0.0047;VT=SNP;THETA=0.0006;AA=A;RSQ=0.9941;LDAF=0.6883;AF=0.69;ASN_AF=0.64;AMR_AF=0.68;AFR_AF=0.47;EUR_AF=0.87
Here, the European allele frequency is 0.87. This is in fact much closer to the allele frequency among my "cases." I found this to be the case for all the SNPs for which my "case" frequencies differed wildly from the Phase 3 allele frequencies.
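One sanity check on the two records is whether the reported AF agrees with AC/AN within each file. The sketch below recomputes that ratio from the two INFO strings shown above. `ac_over_an` is a hypothetical helper name, and it assumes a biallelic record with a single AC value.

```shell
# Hypothetical helper: recompute the alternate allele frequency as AC/AN
# from a VCF INFO string. Assumes a biallelic record (one AC value).
ac_over_an() {
    printf '%s\n' "$1" | awk -F';' '{
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "AC") ac = kv[2]
            if (kv[1] == "AN") an = kv[2]
        }
        if (an > 0) printf "%.4f\n", ac / an
    }'
}

# INFO columns copied from the two tabix outputs above.
phase3_info='AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||'
phase1_info='AVGPOST=0.9963;AC=1503;SNPSOURCE=LOWCOV,EXOME;AN=2184;ERATE=0.0047;VT=SNP;THETA=0.0006;AA=A;RSQ=0.9941;LDAF=0.6883;AF=0.69;ASN_AF=0.64;AMR_AF=0.68;AFR_AF=0.47;EUR_AF=0.87'

ac_over_an "$phase3_info"   # 12 / 5008
ac_over_an "$phase1_info"   # 1503 / 2184
```

This gives about 0.0024 for Phase 3 and 0.6882 for Phase 1, matching the AF value each record reports (0.00239617 and 0.69). So each record is internally consistent, and the disagreement is between the two releases, not an arithmetic slip inside either file.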
What is the reason for this large difference between the allele frequencies according to Phases 1 and 3 of 1000 Genomes? Should I be using only Phase 1 data at this point (this is what the NCBI 1000 Genomes Browser does)?
Since I added my comment, I'm actually seeing more of these issues, which is very concerning because I spent a lot of time adding the Phase 3 VCFs to my annotation pipeline. Here's another locus that is missing from the Phase 3 calls but has a high allele frequency in Phase 1.
Phase1:
Phase3: