Why might ancestral allele states in 1000G be wrong?
6.5 years ago
hyanwong ▴ 70

According to ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README, ancestral alleles for human SNPs in the 1000 Genomes data are determined by comparison with chimp, orangutan, and macaque. Here's an example from the VCF for chromosome 1 (e.g. from http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/):

1       527169  rs563246443     A       G       100     PASS    AC=4;AF=0.000798722;AN=5008;NS=2504;DP=10410;EAS_AF=0;AMR_AF=0;AFR_AF=0.003;EUR_AF=0;SAS_AF=0;AA=g

Which says that the ancestral allele (AA) is "g". But when I look at the alignments in Ensembl (e.g. https://www.ensembl.org/Homo_sapiens/Variation/Compara_Alignments?align=1098&db=core&r=1%3A591289-592289&v=rs563246443&vdb=variation&vf=95730370), I find that the other primate species all have "A" at that locus:

rs563246443 SNP

Human › chromosome:GRCh38:1:591779:591799:1 Chimpanzee › chromosome:Pan_tro_3.0:17:83132247:83132267:1

                     R          
Human      ATCATAGTTGACAATTGCCTA
Chimpanzee ATCATAGTTGACAGTTGCCTA

Human › chromosome:GRCh38:1:591779:591799:1 Orangutan › chromosome:PPYG2:1:229887820:229887840:1

                    R          
Human     ATCATAGTTGACAATTGCCTA
Orangutan ATCATAGTTGACAATTGTCTA

Human › chromosome:GRCh38:1:591779:591799:1 Macaque › chromosome:Mmul_8.0.1:16:77192000:77192020:-1

                  R          
Human   ATCATAGTTGACAATTGCCTA
Macaque CTCATAGTTGACAGTTGTCTA

What gives? Does anyone know why this might have gone wrong in 1000G, and how general the problem might be?
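For anyone wanting to reproduce the comparison, here is a minimal Python sketch. The VCF record and the 21-base windows are copied from this post; the helper names (`info_field`, `centre_base`) are made up for illustration, and the code assumes the SNP sits at the centre of each alignment window, where Ensembl prints the `R` marker.

```python
# VCF record quoted above (hg19 coordinates); split on any whitespace here
# because the copy in this post is space-separated (real VCFs use tabs).
record = ("1 527169 rs563246443 A G 100 PASS "
          "AC=4;AF=0.000798722;AN=5008;NS=2504;DP=10410;"
          "EAS_AF=0;AMR_AF=0;AFR_AF=0.003;EUR_AF=0;SAS_AF=0;AA=g")

# 21-base windows from the Ensembl pairwise alignments quoted above.
windows = {
    "Human": "ATCATAGTTGACAATTGCCTA",
    "Chimpanzee": "ATCATAGTTGACAGTTGCCTA",
    "Orangutan": "ATCATAGTTGACAATTGTCTA",
    "Macaque": "CTCATAGTTGACAGTTGTCTA",
}


def info_field(vcf_line, key):
    """Return the value of `key` in the INFO column (column 8), or None."""
    info = vcf_line.split()[7]
    for item in info.split(";"):
        name, sep, value = item.partition("=")
        if name == key and sep:
            return value
    return None


def centre_base(seq):
    """Middle base of an odd-length window (assumed to be the SNP position)."""
    return seq[len(seq) // 2]


aa = info_field(record, "AA")
outgroups = {sp: centre_base(s) for sp, s in windows.items() if sp != "Human"}
# Compare case-insensitively: the case of AA encodes confidence, not the base.
disagree = [sp for sp, base in outgroups.items() if base.upper() != aa.upper()]

print("1000G AA:", aa)
print("Outgroup bases at the SNP:", outgroups)
print("Outgroups disagreeing with AA:", disagree)
```

With the sequences as quoted, all three outgroups carry "A" at the central position, so every one of them disagrees with the "g" recorded in the AA tag.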

1000 genomes ancestral alleles • 2.3k views

I'm not sure. Looking at the data, I'd suggest that the ancestral allele is indeed A. The G variant is a rare allele and is only present in the African 1000 Genomes population, as judged by the dbSNP record: https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=563246443

*in fact, the dbSNP record does not even list the ancestral allele


Yes, although from what I’ve read, dbSNP only uses the chimp sequence as the ancestral state, which is much less sophisticated than the 1000G method. I wondered whether the alignments for this region with other species have improved since the 1000G calculation, or whether there’s a bug in the 1000G AA estimation pipeline.


This confuses me too. How can we annotate a VCF file with the correct ancestral allele?

5.3 years ago
darink ▴ 10

Look instead at the EPO multi-species primate alignment (that is what 1000 Genomes uses for ancestral calls). There's a "G" there (sorry, Ensembl is currently having problems, so I cannot share the link).

This multi-species alignment is now quite dated, so it's possible that the lower-confidence (i.e. lower-case letter) ancestral allele calls in 1000 Genomes are incorrect. I would trust the current primate assemblies more than the EPO data.
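As a rough sketch of how one might act on this, the Python below treats a lower-case AA value as low confidence and, when an alternative ancestral base is supplied (for example one read off a newer alignment), rewrites the AA tag. The function names and the `replacement` argument are hypothetical, the example INFO string is abbreviated, and this is not the 1000 Genomes pipeline; how to derive the replacement base is left open.

```python
def is_low_confidence(aa):
    """True if the AA call is a lower-case base (the low-confidence form)."""
    return aa in ("a", "c", "g", "t")


def set_aa(vcf_line, replacement):
    """Return the tab-separated VCF line with INFO AA set to `replacement`.

    Only missing or low-confidence (lower-case) AA tags are overwritten;
    any other existing AA value is left untouched.
    """
    line = vcf_line.rstrip("\n")
    cols = line.split("\t")
    items = [] if cols[7] == "." else cols[7].split(";")
    current = next((i[3:] for i in items if i.startswith("AA=")), None)
    if current is not None:
        # Assumption: tolerate values with trailing pipe-separated fields
        # (e.g. "g|||") by keeping only the part before the first "|".
        current = current.split("|")[0]
        if not is_low_confidence(current):
            return line
    items = [i for i in items if not i.startswith("AA=")]
    items.append("AA=" + replacement)
    cols[7] = ";".join(items)
    return "\t".join(cols)


# Abbreviated version of the record discussed in this thread.
line = "\t".join(["1", "527169", "rs563246443", "A", "G", "100", "PASS",
                  "AC=4;AN=5008;AA=g"])
print(set_aa(line, "A"))
```

Leaving upper-case calls alone is a deliberate choice here: only the calls flagged as lower confidence are replaced, so the bulk of the original annotation is preserved.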
