Question

New datasets for ancestry estimation and imputation?

0

3.2 years ago

krassowski.michal ▴ 180

What datasets are people using nowadays for genotype imputation and ancestry estimation? HapMap and 1000 Genomes are good, but it has been some years since their release, and both are limited in the number of populations included and in resolution (especially HapMap, which is a few genome builds behind and requires lifting over).

I know of several much larger studies like:

100 000 Genomes (UK-centric)
UK Biobank (UK-centric)
GenomeAsia100K (Asia-centric, in a pilot phase)

All three require pre-approved access, which may not be granted for simple use in QC and imputation because of concerns about participant privacy.

An answer from 5 years ago mentions the Simons Genome Diversity Project (SGDP) and the Estonian Biocentre Human Genome Diversity Panel (EGDP), which indeed cover more populations but have smaller sample sizes.

Are there any other international projects like HapMap or 1000 Genomes that could be used for ancestry estimation and genotype imputation? Or is there anything in a pilot phase?

plink hapmap 1000genomes imputation genotype • 1.6k views

updated 3.2 years ago by curious ▴ 820 • written 3.2 years ago by krassowski.michal ▴ 180

0

Or, for ancestry estimation, has any of the large studies released an open subset of the SNPs that are most predictive of population-level ancestry, made available for this purpose?

3.2 years ago by krassowski.michal ▴ 180

0

It seems that UK Biobank provides principal component loadings: https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=22009 (more precisely here: https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=149744 or here: biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/snp_pca_map.txt). I assume one can use them to overlay one's own results on the PCA plots on pages 11 and 24 (https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/genotyping_qc.pdf), but this is only useful for visual inspection. I am confused, though, because there is another one here: https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=1955. Figures S6 and S7 and Extended Data Figure 3 from https://doi.org/10.1038/s41586-018-0579-z are also useful for interpretation.
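To make the overlay idea concrete, here is a minimal sketch of projecting one's own samples onto precomputed per-SNP loadings. It runs on toy arrays, not the UK Biobank file. The function name `project_onto_loadings`, the input layout, and the standardisation by allele frequency are all assumptions for illustration; the scaling the published loadings actually expect should be checked against the UK Biobank documentation.

```python
import numpy as np

def project_onto_loadings(dosages, loadings, allele_freqs):
    """Project samples onto precomputed PC loadings (hypothetical helper).

    dosages:      (n_samples, n_snps) allele counts in {0, 1, 2}
    loadings:     (n_snps, n_pcs) per-SNP loadings
    allele_freqs: (n_snps,) frequencies of the counted allele in the reference
    """
    dosages = np.asarray(dosages, dtype=float)
    loadings = np.asarray(loadings, dtype=float)
    allele_freqs = np.asarray(allele_freqs, dtype=float)
    # Assumed standardisation: centre by 2p and scale by sqrt(2p(1-p)).
    mean = 2.0 * allele_freqs
    sd = np.sqrt(2.0 * allele_freqs * (1.0 - allele_freqs))
    standardized = (dosages - mean) / sd
    # Each column of the result is one principal component score.
    return standardized @ loadings

# Toy data: 2 samples, 3 SNPs, 2 PCs (values invented for illustration).
toy_dosages = [[0, 1, 2], [2, 1, 0]]
toy_freqs = [0.5, 0.5, 0.5]
toy_loadings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = project_onto_loadings(toy_dosages, toy_loadings, toy_freqs)
```

The SNPs and the counted allele must match between your data and the loadings file; any strand or allele flip would silently distort the scores.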

3.2 years ago by krassowski.michal ▴ 180

0

But one should not use the UK Biobank principal components beyond 16-18, as mentioned in the QC report and in https://academic.oup.com/bioinformatics/article/36/16/4449/5838185.

3.2 years ago by krassowski.michal ▴ 180

0

I'm not sure that it's necessary to have very large sample sizes for ancestry estimation (imputation is a different matter). I think you could get pretty accurate results if you merged the SGDP, HGDP, and 1000 Genomes.
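As a rough illustration of why a modest reference panel can suffice, the sketch below fits a PCA on a small labelled reference, projects query samples into the same space, and assigns each to the nearest population centroid. Everything here is an assumption for illustration: the data are simulated, the population labels "A" and "B" are invented, and the helper names are hypothetical. A real analysis would first intersect the SNPs across the merged panels and prune for LD, which this sketch does not do.

```python
import numpy as np

def fit_reference_pca(ref_dosages, n_pcs=2):
    """Return per-SNP means and loadings from a PCA of the reference panel."""
    ref = np.asarray(ref_dosages, dtype=float)
    mean = ref.mean(axis=0)
    # Rows of vt are the principal axes; transpose to get (n_snps, n_pcs).
    _, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
    return mean, vt[:n_pcs].T

def assign_ancestry(query_dosages, ref_dosages, ref_labels, n_pcs=2):
    """Label each query sample with the nearest reference centroid in PC space."""
    mean, loadings = fit_reference_pca(ref_dosages, n_pcs)
    ref_scores = (np.asarray(ref_dosages, dtype=float) - mean) @ loadings
    query_scores = (np.asarray(query_dosages, dtype=float) - mean) @ loadings
    labels = np.asarray(ref_labels)
    groups = sorted(set(labels.tolist()))
    centroids = np.stack([ref_scores[labels == g].mean(axis=0) for g in groups])
    # Euclidean distance from every query sample to every centroid.
    dists = np.linalg.norm(query_scores[:, None, :] - centroids[None, :, :], axis=2)
    return [groups[i] for i in dists.argmin(axis=1)]

# Simulated panel: two populations with very different allele frequencies.
rng = np.random.default_rng(0)
n_snps = 200
pop_a = rng.binomial(2, 0.1, size=(20, n_snps))
pop_b = rng.binomial(2, 0.9, size=(20, n_snps))
reference = np.vstack([pop_a, pop_b])
labels = ["A"] * 20 + ["B"] * 20
query = rng.binomial(2, 0.1, size=(3, n_snps))  # drawn like population A
calls = assign_ancestry(query, reference, labels)
```

With only 20 reference samples per group, the centroids are already well separated in this toy setting; real continental groups are less cleanly separated, and admixed samples need more than a nearest-centroid rule.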

3.2 years ago by 4galaxy77 2.9k

score 1 · Answer 1 · 2021-10-21

1

3.2 years ago

curious ▴ 820

The TOPMed Imputation Server will give great imputation for EUR and AFR samples. A lot of people still use HGDP or 1KGP for continental ancestry estimation. You can always make a custom panel too.

ADD COMMENT • link 3.2 years ago by curious ▴ 820