What is the most complete VCF file of per-population allele frequencies that can currently be built or downloaded from public datasets?
About five years ago, the answer was the latest release of the 12 CSHL HapMap populations, which were also part of the 1000 Genomes Project. These were:
CHB CHD MEX GIH TSI LWK ASW MKK HCB JPT CEU YRI
For example, the EnsEMBL project currently provides these 12 populations as one VCF file per population here:
http://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/
CSHL-HAPMAP-HAPMAP-CHB.vcf.gz
CSHL-HAPMAP-HAPMAP-CHD.vcf.gz
CSHL-HAPMAP-HAPMAP-MEX.vcf.gz
CSHL-HAPMAP-HAPMAP-GIH.vcf.gz
CSHL-HAPMAP-HAPMAP-TSI.vcf.gz
CSHL-HAPMAP-HAPMAP-LWK.vcf.gz
CSHL-HAPMAP-HAPMAP-ASW.vcf.gz
CSHL-HAPMAP-HAPMAP-MKK.vcf.gz
CSHL-HAPMAP-HapMap-HCB.vcf.gz
CSHL-HAPMAP-HapMap-JPT.vcf.gz
CSHL-HAPMAP-HapMap-CEU.vcf.gz
CSHL-HAPMAP-HapMap-YRI.vcf.gz
Each SNP has an AF entry, so a multi-population VCF with rsIDs, alleles and frequencies can be built from these files, with one frequency field per population (AF_CHB, AF_CHD, etc.).
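To make the merge concrete, here is a minimal sketch in plain Python. It assumes each single-population file stores the frequency under an `AF` key in the INFO column and that records can be matched on rsID, REF and ALT. The function names (`read_af`, `merge_populations`) and the toy records are made up for illustration and are not taken from the Ensembl files.

```python
from collections import defaultdict


def read_af(lines):
    """Yield (rsid, ref, alt, af) from the lines of one single-population VCF."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the column header and blank lines
        fields = line.rstrip("\n").split("\t")
        rsid, ref, alt, info = fields[2], fields[3], fields[4], fields[7]
        # INFO is a ';'-separated list of KEY=VALUE pairs (flags have no '=')
        info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        if "AF" in info_map:  # assumption: the frequency is stored as AF=...
            yield rsid, ref, alt, info_map["AF"]


def merge_populations(per_pop_lines):
    """Map (rsid, ref, alt) -> {"AF_<POP>": value} across all populations."""
    merged = defaultdict(dict)
    for pop, lines in per_pop_lines.items():
        for rsid, ref, alt, af in read_af(lines):
            merged[(rsid, ref, alt)]["AF_" + pop] = af
    return dict(merged)


# Toy records (invented values), one list of lines per population
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
toy = {
    "CHB": [HEADER,
            "1\t1000\trs1\tA\tG\t.\t.\tAF=0.25\n",
            "1\t2000\trs2\tC\tT\t.\t.\tAF=0.10\n"],
    "CEU": [HEADER,
            "1\t1000\trs1\tA\tG\t.\t.\tAF=0.50\n"],
}

merged = merge_populations(toy)
for (rsid, ref, alt), afs in sorted(merged.items()):
    info = ";".join(f"{key}={value}" for key, value in sorted(afs.items()))
    print(rsid, ref, alt, info, sep="\t")
```

For the real files, each value in the dictionary could be an open handle such as `gzip.open(path, "rt")` instead of a list. Frequencies are kept as strings so that multi-allelic entries pass through unchanged, and the whole table is held in memory, which may not scale to genome-wide data.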
The 1000 Genomes Project population documentation describes more than these 12 populations, but for the remaining ones I haven't seen equivalent per-population VCFs with AFs computed from the individuals within each population. The original HapMap populations are marked with a * below:
### Populations and codes
| HapMap | Code | Population | Description |
|---|---|---|---|
| * | CHB | Han Chinese | Han Chinese in Beijing, China |
| * | JPT | Japanese | Japanese in Tokyo, Japan |
| | CHS | Southern Han Chinese | Han Chinese South |
| | CDX | Dai Chinese | Chinese Dai in Xishuangbanna, China |
| | KHV | Kinh Vietnamese | Kinh in Ho Chi Minh City, Vietnam |
| * | CHD | Denver Chinese | Chinese in Denver, Colorado (pilot 3 only) |
| * | CEU | CEPH | Utah residents (CEPH) with Northern and Western European ancestry |
| * | TSI | Tuscan | Toscani in Italia |
| | GBR | British | British in England and Scotland |
| | FIN | Finnish | Finnish in Finland |
| | IBS | Spanish | Iberian populations in Spain |
| * | YRI | Yoruba | Yoruba in Ibadan, Nigeria |
| * | LWK | Luhya | Luhya in Webuye, Kenya |
| | GWD | Gambian | Gambian in Western Division, The Gambia |
| | MSL | Mende | Mende in Sierra Leone |
| | ESN | Esan | Esan in Nigeria |
| * | ASW | African-American SW | African Ancestry in Southwest US |
| | ACB | African-Caribbean | African Caribbean in Barbados |
| | MXL | Mexican-American | Mexican Ancestry in Los Angeles, California |
| | PUR | Puerto Rican | Puerto Rican in Puerto Rico |
| | CLM | Colombian | Colombian in Medellin, Colombia |
| | PEL | Peruvian | Peruvian in Lima, Peru |
| * | GIH | Gujarati | Gujarati Indian in Houston, TX |
| | PJL | Punjabi | Punjabi in Lahore, Pakistan |
| | BEB | Bengali | Bengali in Bangladesh |
| | STU | Sri Lankan | Sri Lankan Tamil in the UK |
| | ITU | Indian | Indian Telugu in the UK |
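
For the populations without a ready-made frequency file, the "AF built from the individuals within the population" could in principle be computed from a multi-sample genotype VCF. The following is only a sketch under stated assumptions: a sample-to-population mapping is available as a dictionary, genotypes sit in a `GT` FORMAT field, and sites are biallelic (any non-reference allele is counted as ALT). The function name `population_af` and the sample names and genotypes are invented.

```python
import re
from collections import defaultdict


def population_af(vcf_lines, sample_to_pop):
    """Return {variant_id: {"AF_<POP>": ALT allele frequency}} from genotypes."""
    samples = []
    result = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        variant_id = fields[2]
        gt_index = fields[8].split(":").index("GT")
        alt_count = defaultdict(int)  # non-reference alleles per population
        called = defaultdict(int)     # called (non-missing) alleles per population
        for sample, entry in zip(samples, fields[9:]):
            pop = sample_to_pop.get(sample)
            if pop is None:
                continue  # sample not assigned to any population
            genotype = entry.split(":")[gt_index]
            for allele in re.split(r"[/|]", genotype):
                if allele == ".":
                    continue  # missing call
                called[pop] += 1
                if allele != "0":
                    alt_count[pop] += 1
        result[variant_id] = {
            "AF_" + pop: alt_count[pop] / n for pop, n in called.items()
        }
    return result


# Toy input: hypothetical samples S1 and S2 in GBR, S3 in FIN
panel = {"S1": "GBR", "S2": "GBR", "S3": "FIN"}
toy_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n",
    "1\t1000\trs1\tA\tG\t.\t.\t.\tGT\t0|1\t1|1\t0|0\n",
]

frequencies = population_af(toy_vcf, panel)
print(frequencies)
```

In the toy record, GBR has three ALT alleles out of four called alleles and FIN has none out of two, so the sketch reports 0.75 and 0.0 respectively.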
I presume more public data is available nowadays that could be used to build a more complete VCF file of per-population allele frequencies, covering as many populations as possible, to serve as a reference dataset for new data.
Thanks in advance.
Hello 14134125465346445!
It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/8838
This is typically not recommended as it runs the risk of annoying people in both communities.