Genotype Imputation And Population Substructure
12.8 years ago

Hi

I have a set of ~500,000 SNPs genotyped in 1000 individuals, but the ethnicity is unknown for a large proportion of them.

I wish to impute ungenotyped SNPs from HapMap data. Given that I don't know the ethnicity, I am unsure which HapMap population to use - in fact, I am really unsure how to proceed. Currently, my thinking is as follows:

  1. Infer ethnicity of individuals, perhaps using STRUCTURE

  2. Divide SNP data based on ethnicity

  3. Use the different ethnicity-based subsets to impute unknown genotypes using the relevant HapMap population, e.g., the CEU population for those individuals who are Caucasian

Does this seem reasonable? Would you have any suggestions on how to tackle this problem?

imputation population • 4.6k views
12.8 years ago
Genotepes ▴ 950

Hi,

Actually, one approach would be to identify population substructure (using PCA rather than STRUCTURE, I guess).

If clear patterns emerge, you can divide your sample into more homogeneous subsets.

As for the imputation, there are several "schools". A very "orthodox" approach would be to include HapMap3 (or 1000 Genomes) data in your PCA, restricted to the SNPs shared with your data, in order to find the ethnically closest panel for each of your subsets, as you are suggesting. You would then impute genotypes in each subset using its closest panel.
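To make the "closest panel" step concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under stated assumptions: genotypes are coded as 0/1/2 allele counts on SNPs shared between the study and the reference, the principal axes are fitted on the reference panel alone, and each study sample is assigned to the panel with the nearest centroid. The function name `assign_closest_panel`, the labels "A" and "B", and the simulated data are all hypothetical; real analyses would normally use dedicated tools and LD-pruned SNPs.

```python
import numpy as np


def assign_closest_panel(study, ref, ref_labels, n_components=2):
    """Project study samples onto reference PCs and pick the nearest panel.

    study, ref: (n_samples, n_snps) arrays of 0/1/2 genotype counts on shared SNPs.
    ref_labels: panel label for each reference sample (hypothetical names).
    """
    ref = np.asarray(ref, dtype=float)
    study = np.asarray(study, dtype=float)
    ref_labels = np.asarray(ref_labels)

    # Centre both data sets on the reference allele-count means.
    mean = ref.mean(axis=0)
    # Rows of vt are the principal axes of the centred reference genotypes.
    _, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
    axes = vt[:n_components].T

    ref_pcs = (ref - mean) @ axes
    study_pcs = (study - mean) @ axes

    # One centroid per reference panel in PC space.
    panels = np.unique(ref_labels)
    centroids = np.array([ref_pcs[ref_labels == p].mean(axis=0) for p in panels])

    # Euclidean distance from every study sample to every centroid.
    dists = np.linalg.norm(study_pcs[:, None, :] - centroids[None, :, :], axis=2)
    return panels[dists.argmin(axis=1)], study_pcs


# Simulated example: two strongly diverged populations (an assumption, not real data).
rng = np.random.default_rng(0)
n_snps = 300
freq_a = rng.uniform(0.05, 0.95, n_snps)
freq_b = rng.uniform(0.05, 0.95, n_snps)


def simulate(freqs, n):
    # Each genotype is the number of copies of one allele (0, 1 or 2).
    return rng.binomial(2, freqs, size=(n, freqs.size))


ref = np.vstack([simulate(freq_a, 40), simulate(freq_b, 40)])
ref_labels = np.array(["A"] * 40 + ["B"] * 40)
study = np.vstack([simulate(freq_a, 5), simulate(freq_b, 5)])

labels, pcs = assign_closest_panel(study, ref, ref_labels)
print(labels)
```

Fitting the axes on the reference only and projecting the study samples keeps the reference clusters fixed, but running a joint PCA on both data sets, as described above, is equally common.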

Nevertheless, a more flexible approach was developed recently by Howie and Marchini.

In this approach, the program (IMPUTE) searches a large, ethnically mixed panel, for each small chromosomal region that you want to impute, for the chromosome chunks that are closest to the chromosome being imputed.

If your data show clear ethnic separation (say your individuals are 100% European and very divergent from every other panel population), then you will automatically be back to imputing with a 100% European panel. However, if some regions show less divergence between populations, then for these regions the imputation will draw on a larger panel. For me, this approach is theoretically appealing because it is a kind of generalisation of the basic population-specific approach, where you have to impose a threshold. It seems to work quite well in practice too, but it is very new, so I cannot guarantee it 100%.
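As a toy illustration of that idea (and explicitly not IMPUTE's actual algorithm or interface), the sketch below picks, for each window along a chromosome, the `k` haplotypes in a mixed panel that have the fewest mismatches with the haplotype being imputed. The function name `closest_reference_haplotypes`, the window size and the tiny 0/1 panel are all made up for the example.

```python
import numpy as np


def closest_reference_haplotypes(target, panel, window, k):
    """For each window, return indices of the k panel haplotypes closest to target.

    target: (n_snps,) array of 0/1 alleles for one phased chromosome.
    panel:  (n_haplotypes, n_snps) array of 0/1 alleles from a mixed reference panel.
    """
    target = np.asarray(target)
    panel = np.asarray(panel)
    picks = []
    for start in range(0, target.size, window):
        stop = start + window
        # Hamming distance between the target and every panel haplotype in this window.
        mismatches = (panel[:, start:stop] != target[start:stop]).sum(axis=1)
        # A stable sort breaks ties in favour of the lower haplotype index.
        picks.append(np.argsort(mismatches, kind="stable")[:k])
    return picks


# Hypothetical panel: haplotypes 0-1 from one population, 2-3 from another.
panel = np.array([
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
])
# The target resembles the first population in window 1 and the second in window 2.
target = np.array([0, 0, 0, 0, 0, 0, 1, 0])

picks = closest_reference_haplotypes(target, panel, window=4, k=2)
for i, idx in enumerate(picks):
    print(f"window {i}: closest panel haplotypes {idx.tolist()}")
```

With a clearly separated sample, the same population would be picked in every window, which is the "back to a population-specific panel" case described above.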

Beware that the IMPUTE authors now strongly advise pre-phasing before running the imputation. For this pre-phasing, it can be useful to divide your own data into homogeneous populations. But I wouldn't advise using populations of fewer than 200 individuals, because you need enough individuals for phasing.

Best

Christian

Check this reference for more (and clearer) information:

http://www.g3journal.org/content/1/6/457.full


Nice answer. Would PCA actually separate out the populations nicely?


Hi. This would separate them quite nicely indeed. There is a minor problem with choosing the number of dimensions, but for intercontinental differences the first two axes will do.


@genotepes: Thanks for that!


I agree with Genotepes.

You can run the imputation with IMPUTE2 using the complete HapMap reference panel and not worry about population structure at all!

IMPUTE2 includes algorithms that choose the optimal subset of the reference panel for you. This is better than subsetting the population yourself, since it is very easy to introduce bias that way.

Check this: http://mathgen.stats.ox.ac.uk/impute/using_multi_population_reference_panels.html#how_does_it_work
