Hi,
I am having trouble interpreting the genotype calls and the respective allele frequency information in the most recent 1000 genomes data release.
Let's take a look at an example from phase 1 integrated call sets - it's a SNP with the id rs3748597.
Reference allele is T and the reported alternate allele is C at chromosome 1, position 888659. The corresponding functional change at the protein sequence level is a change from Isoleucine to Valine at amino acid position 300.
This is the raw VCF from the integrated call sets for this SNP reported for 1092 individuals:
1 888659 rs3748597 T C 100 PASS AVGPOST=1.0000;AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;THETA=0.0005;LDAF=0.9282;VT=SNP;AC=2027;RSQ=1.0000;ERATE=0.0003;AF=0.93;ASN_AF=0.92;AMR_AF=0.92;AFR_AF=0.90;EUR_AF=0.95 GT:DS:GL 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5,-2.3279,-0.002046 1|1:2.000:-5.00,-5.00,0.00 1|1:2.000:-5.00,-3.74,-0.00 0|1:1.000:-5.00,0.00,-5.00 1|1:2.000:-5.00,-5.00,0.00
Although I removed a substantial portion of the genotype information for the sake of space, the observation still holds: all of the 1092 individuals carry this variant - most are even homozygous for it, that is, they carry the alternate allele on both chromosomes.
There are more such examples.
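For anyone who wants to check the truncated record above, here is a minimal Python sketch that tallies the genotypes in the 16 sample columns shown. The variable names are my own, and it only covers the excerpt, not the full 1092-sample line:

```python
from collections import Counter

# The 16 sample columns from the truncated record above (FORMAT is GT:DS:GL).
samples = (
    ["1|1:2.000:-5.00,-5.00,0.00"] * 11
    + [
        "1|1:2.000:-5,-2.3279,-0.002046",
        "1|1:2.000:-5.00,-5.00,0.00",
        "1|1:2.000:-5.00,-3.74,-0.00",
        "0|1:1.000:-5.00,0.00,-5.00",
        "1|1:2.000:-5.00,-5.00,0.00",
    ]
)

# GT is the first colon-separated subfield; "|" marks a phased genotype.
genotypes = [s.split(":")[0] for s in samples]
genotype_counts = Counter(genotypes)

# Count alternate alleles (coded "1") over both chromosomes of every sample.
alt_alleles = sum(gt.split("|").count("1") for gt in genotypes)
total_alleles = 2 * len(genotypes)

print(genotype_counts)
print(f"{alt_alleles} of {total_alleles} alleles are the alternate allele C")
```

In this excerpt, 15 samples are homozygous for the alternate allele and one is heterozygous.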
Could you please help me understand:
Has this observation - that some variants have incredibly high frequencies, in fact some "alternate" alleles might well be the true reference - already been reported? Am I missing something obvious or understanding the genotype and frequency information incorrectly? (Explained: I understand that some reference alleles are in fact true minor alleles - I am simply surprised to come across cases where the alternate allele can reach frequencies as high as 93%.)
dbSNP reports MAF/MinorAlleleCount: T=0.072/156. I understand that there may be discrepancies regarding the allele frequency due to conceptual or methodological reasons; however, I am completely puzzled about the observed MAF=1.0/2184 and the dbSNP MAF=0.072/156. Any explanation? (Explained: dbSNP correctly reports the true minor allele, which happens to be the reference allele. The reference call was possibly made based on individuals carrying the true minor allele.)
Thank you.
Maybe I misunderstand your question but ...
It's not necessary that the reference allele be the "major" allele. In this case, apparently, the reference is based on someone who carries the true "minor" allele.
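To make that concrete, here is a rough Python sketch that pulls AC and AN out of the INFO field of the record quoted in the question and works out which allele is the minor one. The variable names are mine, and it assumes a biallelic site, as in this record:

```python
# INFO column copied from the record in the question.
info = (
    "AVGPOST=1.0000;AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;THETA=0.0005;"
    "LDAF=0.9282;VT=SNP;AC=2027;RSQ=1.0000;ERATE=0.0003;AF=0.93;"
    "ASN_AF=0.92;AMR_AF=0.92;AFR_AF=0.90;EUR_AF=0.95"
)
ref_allele, alt_allele = "T", "C"

# INFO is a semicolon-separated list of KEY=VALUE pairs.
fields = dict(item.split("=", 1) for item in info.split(";"))

ac = int(fields["AC"])  # number of alternate alleles observed
an = int(fields["AN"])  # total number of called alleles (2 x 1092)

alt_freq = ac / an
ref_count = an - ac
ref_freq = ref_count / an

# The minor allele is simply whichever allele is rarer in the sample.
minor_allele = ref_allele if ref_freq < alt_freq else alt_allele
maf = min(ref_freq, alt_freq)

print(f"ALT {alt_allele}: {ac}/{an} = {alt_freq:.3f}")
print(f"REF {ref_allele}: {ref_count}/{an} = {ref_freq:.3f}")
print(f"minor allele: {minor_allele}, MAF = {maf:.3f}")
```

With AC=2027 and AN=2184, the reference allele T comes out as the minor allele at roughly 0.072, which is in line with the dbSNP figure of T=0.072.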
Hey brentp - that is indeed the case. In fact, dbSNP correctly reports the minor allele in this example, which happens to be the reference allele. (I will modify that part accordingly.) I am just surprised that, in light of such high frequencies for a non-trivial number of alternate alleles, the reference managed to find the true minor allele. Thanks.
I'm a little confused by your use of "reference". The human population has 7 billion people in it. There is no Platonic 'true reference' sequence. We just pick one sequence and call it the reference, knowing the limitations of that approach.
Perhaps you could add the answer separately as well; it would help new readers.
done