Discrepancies between 1000 Genomes Phase 1 vs. Phase 3 allele frequencies
10.1 years ago
Greg P ▴ 70

I would like to understand the reason why Phases 1 and 3 of the 1000 Genomes data have very different allele frequencies for certain SNPs.

I have been comparing SNP allele frequencies among a certain group of individuals ("cases") with the allele frequencies reported by the 1000 Genomes project (specifically the EUR super population). For a small group of about 20 SNPs I have found extreme differences between these two allele frequencies. An example was the SNP rs533515.

However, the allele frequency differences were so extreme that I was suspicious. Looking a bit further, I noticed that this SNP has vastly different allele frequencies reported in the different Phases of the 1000 Genomes data. Initially I had been working with Phase 3, assuming it was more up to date and therefore "better." For rs533515 in particular, I can find the Phase 3 allele frequency as follows:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20130502/ALL.chr11.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz 11:64,497,189-64,497,189 | grep -v '^#' | awk -F'\t' '{if ($3=="rs533515") print $4 FS $5 FS $8}'

with the result

A    C    AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||

Note that the European allele frequency given by Phase 3 is very small, 0.001.

Now, I can get the same information from Phase 1:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp//release/20110521/ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 11:64,497,189-64,497,189 | grep -v '^#' | awk -F'\t' '{if ($3=="rs533515") print $4 FS $5 FS $8}'

and the result is

A    C    AVGPOST=0.9963;AC=1503;SNPSOURCE=LOWCOV,EXOME;AN=2184;ERATE=0.0047;VT=SNP;THETA=0.0006;AA=A;RSQ=0.9941;LDAF=0.6883;AF=0.69;ASN_AF=0.64;AMR_AF=0.68;AFR_AF=0.47;EUR_AF=0.87

Here, the European allele frequency is 0.87. This is in fact much closer to the allele frequency among my "cases." I found this to be the case for all the SNPs for which my "case" frequencies differed wildly from the Phase 3 allele frequencies.
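To make the comparison concrete, here is a minimal sketch that pulls AC, AN and EUR_AF out of the two INFO strings quoted above and recomputes AC/AN for each release. The helper names `info_get` and `report` are made up for this illustration; the INFO strings are copied verbatim from the two tabix results, so no network access is needed.

```shell
# Pull one KEY=value entry out of a VCF INFO string (entries separated by ';').
# info_get is a hypothetical helper name, not part of tabix or any VCF toolkit.
info_get() {
    # $1 = INFO string, $2 = key name (matched exactly, so AF does not match EUR_AF)
    printf '%s\n' "$1" | awk -v key="$2" -F';' '{
        for (i = 1; i <= NF; i++) {
            n = index($i, "=")
            if (n > 0 && substr($i, 1, n - 1) == key) { print substr($i, n + 1); exit }
        }
    }'
}

# INFO strings copied from the Phase 3 and Phase 1 results for rs533515.
phase3='AC=12;AF=0.00239617;AN=5008;NS=2504;DP=14408;EAS_AF=0.001;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.001;SAS_AF=0.0082;AA=A|||'
phase1='AVGPOST=0.9963;AC=1503;SNPSOURCE=LOWCOV,EXOME;AN=2184;ERATE=0.0047;VT=SNP;THETA=0.0006;AA=A;RSQ=0.9941;LDAF=0.6883;AF=0.69;ASN_AF=0.64;AMR_AF=0.68;AFR_AF=0.47;EUR_AF=0.87'

# Print the recomputed global frequency (AC/AN) next to the reported EUR_AF.
report() {
    # $1 = label, $2 = INFO string
    ac=$(info_get "$2" AC)
    an=$(info_get "$2" AN)
    eur=$(info_get "$2" EUR_AF)
    awk -v l="$1" -v ac="$ac" -v an="$an" -v eur="$eur" \
        'BEGIN { printf "%s\tAC/AN=%.4f\tEUR_AF=%s\n", l, ac / an, eur }'
}

report phase1 "$phase1"
report phase3 "$phase3"
```

In both releases the recomputed AC/AN agrees with the reported AF, which suggests the discrepancy comes from the genotype calls themselves rather than from how the frequency was derived from them.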

What is the reason for this large difference between the allele frequencies according to Phases 1 and 3 of 1000 Genomes? Should I be using only Phase 1 data at this point (this is what the NCBI 1000 Genomes Browser does)?


Since adding my comment I have seen more of these cases, which is concerning because I spent a lot of time adding the Phase 3 VCFs to my annotation pipeline. Here is another locus with a high allele frequency in Phase 1 that is missing from the Phase 3 calls.

Phase1:

tabix ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz '17:21319121-21319121'
[get_local_version] downloading the index file...
17    21319121    rs1714864    C    T    100    PASS    AA=C;AC=1091;AF=0.50;AFR_AF=0.50;AMR_AF=0.50;AN=2184;ASN_AF=0.50;AVGPOST=0.9988;ERATE=0.0004;EUR_AF=0.50;LDAF=0.4995;RSQ=0.3613;SNPSOURCE=EXOME;THETA=0.0002;VT=SNP

Phase3:

tabix ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.autosomes.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz '17:21319121-21319121'
[get_local_version] downloading the index file...

No SNP
10.1 years ago
Vivek ★ 2.7k

The only difference I can see is that an indel is called in the vicinity of this SNP in the Phase 3 call set, with a substantial allele count, which might affect the allele counts for the SNP.

11    64497185    .    CAAAA    AAAAA,AAAAAA,CAAAAA,CAAAAAA,CAAA,C    100    PASS    AC=257,44,570,3,73,20;AF=0.0513179,0.00878594,0.113818,0.000599042,0.0145767,0.00399361;AN=5008;NS=2504;DP=14380;EAS_AF=0.0179,0,0.1915,0.001,0,0;AMR_AF=0.013,0,0.2651,0,0,0.0101;AFR_AF=0.0673,0.0189,0.0862,0.0015,0.0552,0.0023;EUR_AF=0.0149,0,0.0596,0,0,0.008;SAS_AF=0.1288,0.0194,0.0194,0,0,0.002
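The indel record is multiallelic, so each per-population field holds one comma-separated value per ALT allele, in the same order as the ALT column. A rough sketch for lining these up, using the ALT and EUR_AF values copied from the record above (`per_allele` is a hypothetical helper name):

```shell
# ALT alleles and per-allele EUR_AF values copied from the Phase 3 indel record.
alt='AAAAA,AAAAAA,CAAAAA,CAAAAAA,CAAA,C'
eur_af='0.0149,0,0.0596,0,0,0.008'

# Pair each ALT allele with its frequency and print the total.
per_allele() {
    # $1 = comma-separated ALT alleles, $2 = comma-separated per-allele frequencies
    awk -v alt="$1" -v af="$2" 'BEGIN {
        n = split(alt, a, ",")
        m = split(af, f, ",")
        if (n != m) exit 1            # the two lists must have the same length
        total = 0
        for (i = 1; i <= n; i++) {
            printf "%s\t%s\n", a[i], f[i]
            total += f[i]
        }
        printf "sum\t%.4f\n", total
    }'
}

per_allele "$alt" "$eur_af"
```

Under this simple tally the ALT alleles of the indel record sum to roughly 0.08 in EUR, so the indel record on its own would probably not account for the whole gap to the Phase 1 value of 0.87.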

Can you elaborate? Is it possible for there to be an ambiguity between a SNP and an indel? Or perhaps between several simultaneous variants and an indel, or something of that kind?


Depending on how they count alleles over the population, there is potential for ambiguity here. What appears to have been counted as an A>C change in the Phase 1 call set might be getting counted as a deletion of consecutive As in the Phase 3 calls.

The reference sequence in this region:

>11:64497185-64497198
CAAAAAAAAAAAAC
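
To see why the two records can compete for the same reads, here is a small sketch that checks whether the SNP position falls inside the reference span of the indel record. It assumes rs533515 sits at 11:64497189, which is the position used in the tabix queries in the question; the function names `base_at` and `overlaps` are made up for this example.

```shell
# Reference sequence for 11:64497185-64497198, as quoted above.
region_start=64497185
ref_seq='CAAAAAAAAAAAAC'

# Phase 3 indel record: POS and REF copied from the VCF line quoted earlier.
indel_pos=64497185
indel_ref='CAAAA'

# Assumption: rs533515 is at the position used in the tabix query region.
snp_pos=64497189

# Return the reference base at a 1-based genomic position inside the region.
base_at() {
    printf '%s\n' "$ref_seq" | cut -c "$(( $1 - region_start + 1 ))"
}

# Last position covered by the indel's REF allele.
indel_end=$(( indel_pos + ${#indel_ref} - 1 ))

# Succeed if the given position lies within the indel's REF span.
overlaps() {
    [ "$1" -ge "$indel_pos" ] && [ "$1" -le "$indel_end" ]
}

echo "reference base at $snp_pos: $(base_at "$snp_pos")"
if overlaps "$snp_pos"; then
    echo "SNP position lies inside the indel REF span $indel_pos-$indel_end"
fi
```

The SNP position is the last base of the indel's REF allele and sits inside a run of As, which is the kind of context where a caller could represent the same reads either as a substitution or as a length change in the homopolymer.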
10.1 years ago
Ram 44k

AFAIK the genotypes have been classified into categories differently in Phase 3 compared to Phase 1. Are we sure the subtypes for EUR have not changed between the phases?

EDIT: I dug a bit deeper, and it seems I was mistaken. While the number of alleles sampled has increased (AN is 2184 in the Phase 1 record and 5008 in the Phase 3 record), I don't think the classification basis has changed. The larger sample alone should not result in such a huge change in AF values. Let's see what others have to say about this.
