Question

Genotype Imputation And Population Substructure

13.2 years ago

Darren J. Fitzpatrick ★ 1.1k

Hi

I have a set of SNPs (~500,000) genotyped in 1000 individuals, but the ethnicity of a large proportion of these individuals is unknown.

I wish to impute ungenotyped SNPs from HapMap data. Given that I don't know the ethnicity, I am unsure which HapMap population to use - in fact, I am really unsure how to proceed. Currently, my thinking is as follows:

1. Infer the ethnicity of the individuals, perhaps using STRUCTURE
2. Divide the SNP data based on ethnicity
3. Use the different ethnicity-based subsets to impute unknown genotypes using the relevant HapMap population, e.g., the CEU population for those individuals who are Caucasian

Does this seem reasonable? Would you have any suggestions on how to tackle this problem?

imputation population • 4.9k views

Updated 13.2 years ago by Genotepes ▴ 950 • written 13.2 years ago by Darren J. Fitzpatrick ★ 1.1k

Ram · Answer 1 · 2012-02-13

Hi,

Actually, one approach would be to identify population substructure (using PCA rather than STRUCTURE, I guess).

If clear patterns emerge, you can divide your population into more homogeneous subsets.

As for the imputation, there are several "schools". A very "orthodox" approach would be to put HapMap3 (or 1000 Genomes) data into your PCA for common SNPs, in order to find the ethnically closest panel for each of your subset populations, as you are suggesting. You would then impute genotypes in each of your subpopulations with its closest panel.
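To make the "closest panel" step concrete, here is a minimal sketch. It assumes the PCA has already been run on the merged study + reference data and that you have the first two PC coordinates per individual. The function names and all coordinates below are made-up toy values for illustration, not real HapMap output, and nearest-centroid assignment is just one simple way to pick a panel.

```python
from math import dist
from statistics import mean


def panel_centroids(ref_pcs):
    """Average the PC coordinates of the individuals in each reference panel."""
    return {
        panel: tuple(mean(axis) for axis in zip(*coords))
        for panel, coords in ref_pcs.items()
    }


def closest_panel(sample_pc, centroids):
    """Return the panel whose centroid is nearest (Euclidean) to the sample."""
    return min(centroids, key=lambda panel: dist(sample_pc, centroids[panel]))


# Toy PC1/PC2 coordinates (hypothetical values, for illustration only).
ref_pcs = {
    "CEU": [(0.10, 0.02), (0.12, 0.01), (0.11, 0.03)],
    "YRI": [(-0.20, 0.05), (-0.22, 0.04), (-0.21, 0.06)],
    "CHB": [(0.02, -0.25), (0.03, -0.24), (0.01, -0.26)],
}
study_pcs = {
    "ind1": (0.09, 0.00),
    "ind2": (-0.19, 0.03),
    "ind3": (0.00, -0.20),
}

centroids = panel_centroids(ref_pcs)
assignment = {ind: closest_panel(pc, centroids) for ind, pc in study_pcs.items()}
print(assignment)
```

In practice you would probably also want a distance cutoff, so that individuals far from every panel (e.g. admixed samples) are flagged rather than forced into a group.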

Nevertheless, a more flexible approach was recently developed by Howie and Marchini.

In this approach, the program (IMPUTE) searches a large, ethnically mixed panel, for each small chromosomal region that you want to impute, for the chromosome chunks that are closest to the chromosome being imputed.

If your data show clear ethnic separation - your individuals are 100% European and very divergent from any other panel population - then you will automatically be back to imputing with a 100% European panel. However, if some regions show less divergence between populations, then, for these regions, the imputation will use a larger panel. For me, this approach is theoretically appealing because it is a kind of generalisation of the basic population-specific approach, where you have to impose a threshold. It seems that in practice it also works quite well - but it is very new, so I cannot guarantee it 100%.
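A toy illustration of the idea, not IMPUTE's actual algorithm: for each window, rank the haplotypes of a mixed panel by Hamming distance to the target haplotype and keep the k closest. The function names, haplotype labels and allele strings below are all invented for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length allele strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_reference_haplotypes(target, panel, start, end, k):
    """Return the names of the k panel haplotypes closest to the target
    within the window [start, end); ties are broken by name."""
    window = slice(start, end)
    ranked = sorted(
        panel,
        key=lambda name: (hamming(target[window], panel[name][1][window]), name),
    )
    return ranked[:k]


# Hypothetical mixed panel: name -> (population, alleles at 8 SNPs).
panel = {
    "CEU_1": ("CEU", "01100101"),
    "CEU_2": ("CEU", "01101101"),
    "YRI_1": ("YRI", "10010101"),
    "YRI_2": ("YRI", "10110100"),
}
target = "01100101"

# Region where populations are divergent: only CEU haplotypes are selected.
first = closest_reference_haplotypes(target, panel, 0, 4, k=2)
# Region with less divergence: haplotypes from both populations are selected.
second = closest_reference_haplotypes(target, panel, 4, 8, k=2)

print(first, {panel[name][0] for name in first})
print(second, {panel[name][0] for name in second})
```

The point is that no population threshold is imposed up front: the effective panel is whatever is locally closest, so it collapses to a single population only where the data say so.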

Beware that IMPUTE now strongly advises pre-phasing before running the imputation. For this pre-phasing, it can be useful to divide your own data into homogeneous populations. But I wouldn't advise populations of fewer than 200 individuals, because you need enough individuals for phasing.
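As a small sketch of that size check, assuming you already have one cluster label per individual (e.g. from the PCA step): groups with at least 200 individuals are kept for separate pre-phasing and the rest are only flagged. The function name and labels are hypothetical, and what to do with the leftover individuals is left open here.

```python
from collections import Counter

MIN_PHASING_GROUP = 200  # minimum group size suggested above


def split_for_phasing(labels, min_size=MIN_PHASING_GROUP):
    """Split individuals into groups large enough to phase on their own,
    plus a list of individuals whose group is too small."""
    sizes = Counter(labels.values())
    large = {group for group, n in sizes.items() if n >= min_size}
    groups = {group: [] for group in large}
    leftover = []
    for individual, group in labels.items():
        if group in large:
            groups[group].append(individual)
        else:
            leftover.append(individual)
    return groups, leftover


# Toy labels: 250 individuals in cluster "A", 50 in cluster "B".
labels = {f"ind{i}": ("A" if i < 250 else "B") for i in range(300)}
groups, leftover = split_for_phasing(labels)
print({group: len(members) for group, members in groups.items()}, len(leftover))
```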

Best

Christian

Check this reference for more (and clearer) information:

http://www.g3journal.org/content/1/6/457.full