What datasets are people using nowadays for genotype imputation and ancestry estimation? HapMap and 1000 Genomes are good, but it has been some years since their release, and both are limited in the number of populations included and in resolution (especially HapMap, which is a few genome builds behind and requires lifting over).
I know of several much larger studies like:
- 100 000 Genomes (UK-centric)
- UK Biobank (UK-centric)
- GenomeAsia100K (Asia-centric, in a pilot phase)
and all three require pre-approved access (which may not be granted for simple use in QC and imputation due to concerns about the privacy of participants).
An answer from 5 years ago mentions the Simons Genome Diversity Project (SGDP) and the Estonian Biocentre Human Genome Diversity Panel (EGDP), which indeed cover more populations but have smaller sample sizes.
Are there any other international projects like HapMap or 1000 Genomes that could be used for ancestry estimation and genotype imputation? Or is there anything in a pilot phase?
Or, for ancestry estimation, has any of the large studies released an open subset of the SNPs that are most predictive of population-level ancestry, for use in this way?
It seems that UK Biobank provides principal component loadings: https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=22009 (more precisely here: https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=149744 or here: biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/snp_pca_map.txt). I assume one can use these to overlay one's own results on the PCA plots on pages 11 and 24 of the QC report (https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/genotyping_qc.pdf), but this is only useful for visual inspection. I am confused, though, because there is another set here: https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=1955. Figures S6 and S7 and Extended Data Figure 3 from https://doi.org/10.1038/s41586-018-0579-z are also useful for interpretation.
But one should not use the UK Biobank principal components beyond 16-18, as mentioned in the QC report and in https://academic.oup.com/bioinformatics/article/36/16/4449/5838185.
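In general, overlaying your own samples on a published PCA means multiplying standardized genotypes by the SNP loadings. Here is a minimal Python sketch of that step. It makes several assumptions that are not taken from the UK Biobank documentation: the function name `project_onto_loadings` is hypothetical, the loadings are assumed to be already parsed into a SNPs-by-PCs array (the layout of `snp_pca_map.txt` is not assumed here), the genotypes are assumed to count the same allele as the loadings, and the `(g - 2p) / sqrt(2p(1 - p))` standardization is a common convention that may not match the scaling used for the published loadings. Check all of these before trusting the result.

```python
import numpy as np

def project_onto_loadings(genotypes, loadings, ref_allele_freq, n_pcs=16):
    """Project samples onto precomputed SNP loadings (hypothetical helper).

    genotypes       : (n_samples, n_snps) dosages 0/1/2, NaN where missing
    loadings        : (n_snps, n_total_pcs) SNP loadings from the reference PCA
    ref_allele_freq : (n_snps,) frequency of the counted allele in the reference
    n_pcs           : number of leading PCs to keep (default 16, following the
                      caution above about the higher components)
    """
    g = np.asarray(genotypes, dtype=float)
    p = np.asarray(ref_allele_freq, dtype=float)
    w = np.asarray(loadings, dtype=float)[:, :n_pcs]

    # Assumed standardization: centre on the expected dosage 2p and
    # scale by the binomial standard deviation sqrt(2p(1-p)).
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

    # Missing genotypes become 0, i.e. they are imputed to the reference mean.
    z = np.nan_to_num(z, nan=0.0)

    # Scores: one row per sample, one column per retained PC.
    return z @ w
```

Usage would look like `scores = project_onto_loadings(G, W, p)`, after restricting `G`, `W` and `p` to the SNPs shared between your data and the loadings file, in the same order and with alleles harmonized.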
I'm not sure that very large sample sizes are necessary for ancestry estimation (imputation is a different matter). I think you could get pretty accurate results if you merged SGDP, HGDP, and 1000 Genomes.
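As a rough illustration of how a merged reference panel could be used, here is a minimal Python sketch: run PCA on the reference genotypes, project the query samples into the same space, and assign each one to the nearest population centroid. Nearest-centroid assignment is a simple stand-in chosen for brevity, not a method proposed in this thread. The function names are hypothetical, and the sketch assumes the panels have already been merged on shared SNPs with harmonized alleles and no missing genotypes.

```python
import numpy as np

def fit_reference_pca(ref_genotypes, n_pcs=2):
    """Return per-SNP means and the leading PC axes of the reference panel."""
    g = np.asarray(ref_genotypes, dtype=float)
    mean = g.mean(axis=0)
    # SVD of the centred matrix; rows of vt are PC axes in SNP space.
    _, _, vt = np.linalg.svd(g - mean, full_matrices=False)
    return mean, vt[:n_pcs].T  # axes has shape (n_snps, n_pcs)

def assign_ancestry(ref_genotypes, ref_labels, query_genotypes, n_pcs=2):
    """Label each query sample with the nearest reference-population centroid."""
    mean, axes = fit_reference_pca(ref_genotypes, n_pcs)
    ref_scores = (np.asarray(ref_genotypes, dtype=float) - mean) @ axes
    query_scores = (np.asarray(query_genotypes, dtype=float) - mean) @ axes

    labels = np.asarray(ref_labels)
    pops = sorted(set(labels.tolist()))
    # One centroid per population, computed in PC space.
    centroids = np.stack([ref_scores[labels == pop].mean(axis=0) for pop in pops])

    # Euclidean distance from every query sample to every centroid.
    dists = np.linalg.norm(query_scores[:, None, :] - centroids[None, :, :], axis=2)
    return [pops[i] for i in dists.argmin(axis=1)]
```

On real data you would want many more SNPs (LD-pruned), more than two PCs, and some handling of admixed samples, which a hard nearest-centroid label cannot represent.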