According to the README at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README, ancestral alleles for human SNPs in the 1000 Genomes data are determined by comparison with chimp, orangutan, and macaque. Here's an example from the VCF for chromosome 1 (e.g. from http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/):
1 527169 rs563246443 A G 100 PASS AC=4;AF=0.000798722;AN=5008;NS=2504;DP=10410;EAS_AF=0;AMR_AF=0;AFR_AF=0.003;EUR_AF=0;SAS_AF=0;AA=g
This says that the ancestral allele (AA) is "g". But when I look at the alignments in Ensembl (e.g. https://www.ensembl.org/Homo_sapiens/Variation/Compara_Alignments?align=1098&db=core&r=1%3A591289-592289&v=rs563246443&vdb=variation&vf=95730370), I find that the other primate species all have "A" at that locus:
rs563246443 SNP
Human › chromosome:GRCh38:1:591779:591799:1 Chimpanzee › chromosome:Pan_tro_3.0:17:83132247:83132267:1
R
Human ATCATAGTTGACAATTGCCTA
Chimpanzee ATCATAGTTGACAGTTGCCTA
Human › chromosome:GRCh38:1:591779:591799:1 Orangutan › chromosome:PPYG2:1:229887820:229887840:1
R
Human ATCATAGTTGACAATTGCCTA
Orangutan ATCATAGTTGACAATTGTCTA
Human › chromosome:GRCh38:1:591779:591799:1 Macaque › chromosome:Mmul_8.0.1:16:77192000:77192020:-1
R
Human ATCATAGTTGACAATTGCCTA
Macaque CTCATAGTTGACAGTTGTCTA
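To make the mismatch explicit, here is a minimal Python sketch that pulls AA out of the INFO field of the record above and compares it with the base each species carries at the SNP. It assumes the SNP sits at the centre of each 21-base window (which is how the Ensembl coordinates 591779-591799 appear to be laid out); `parse_info` is just a helper name made up for this example.

```python
# VCF record quoted above (whitespace-separated here, as pasted).
record = (
    "1 527169 rs563246443 A G 100 PASS "
    "AC=4;AF=0.000798722;AN=5008;NS=2504;DP=10410;"
    "EAS_AF=0;AMR_AF=0;AFR_AF=0.003;EUR_AF=0;SAS_AF=0;AA=g"
)


def parse_info(info):
    """Turn 'K=V;K=V;FLAG' into a dict (bare flags map to True)."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


fields = record.split()
ref, alt = fields[3], fields[4]
aa = parse_info(fields[7])["AA"]

# 21-base windows copied from the Ensembl alignment view above.
windows = {
    "Human": "ATCATAGTTGACAATTGCCTA",
    "Chimpanzee": "ATCATAGTTGACAGTTGCCTA",
    "Orangutan": "ATCATAGTTGACAATTGTCTA",
    "Macaque": "CTCATAGTTGACAGTTGTCTA",
}

# ASSUMPTION: the SNP is the middle base of each window.
centre = {species: seq[len(seq) // 2] for species, seq in windows.items()}
outgroup_bases = {base for species, base in centre.items() if species != "Human"}

print("REF:", ref, "ALT:", alt, "AA:", aa)
print("Centre base per species:", centre)
print("AA agrees with outgroups:", {aa.upper()} == outgroup_bases)
```

The comparison is case-insensitive because the AA value in the record is lowercase.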
What gives? Does anyone know why this might have gone wrong in 1000G, and how general the problem might be?
I'm not sure. Looking at the data, I'd suggest that the ancestral allele is indeed A. The G variant is a rare allele and is only present in the African 1000 Genomes population, as judged by the dbSNP record: https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=563246443
*In fact, the dbSNP record does not even list the ancestral allele.
Yes, although from what I’ve read, dbSNP only uses the chimp sequence as the ancestral state, which is much less sophisticated than the 1000G method. I wondered whether the alignments for this region with other species have improved since the 1000G calculation, or whether there’s a bug in the 1000G AA estimation pipeline.
This confuses me too. How can we annotate the correct ancestral allele in a VCF file?
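One possible approach, sketched below in Python, is to re-annotate AA yourself from an ancestral sequence you trust. This is only a sketch under assumptions: `annotate_aa` is a hypothetical helper, `ancestral_seq` is assumed to be the ancestral sequence of the relevant chromosome already loaded as a string in the same coordinate system as the VCF, and only single-base lookups are handled (no indels). It simply copies whatever the ancestral sequence says, so the result is only as good as that source.

```python
def annotate_aa(vcf_line, ancestral_seq):
    """Return a tab-separated VCF data line with AA set from ancestral_seq.

    POS is 1-based in VCF, so the base is read at index POS - 1.
    Positions outside the sequence get AA=. (unknown).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    pos = int(fields[1])
    if 0 < pos <= len(ancestral_seq):
        base = ancestral_seq[pos - 1]
    else:
        base = "."
    # Drop any existing AA entry and the "." placeholder for an empty INFO.
    info = [
        item
        for item in fields[7].split(";")
        if item and item != "." and not item.startswith("AA=")
    ]
    info.append("AA=" + base)
    fields[7] = ";".join(info)
    return "\t".join(fields)


# Toy example with a made-up 5-base "ancestral chromosome".
toy_line = "1\t3\trsX\tA\tG\t100\tPASS\tAC=4;AA=g"
print(annotate_aa(toy_line, "ccAtt"))
```

For a real file you would apply this to every non-header line and keep the `##INFO` header entry for AA in place.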