Hi,
Actually, one approach would be to identify substructure in your population (using PCA rather than STRUCTURE, I guess).
If clear patterns emerge, you can divide your population into more homogeneous subsets.
As for the imputation, there are several "schools". A very "orthodox" approach, as you are suggesting, would be to put HapMap3 data (or 1000 Genomes data) in your PCA for the common SNPs in order to find the closest (ethnically) panel for each of your subset populations.
You would then impute genotypes in each of your sub-populations with the closest panel.
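The panel-matching step can be sketched as follows. This is only a minimal illustration, assuming the PC coordinates have already been computed by whatever PCA tool you use and that your study samples have already been grouped into clusters. The function names, cluster names, and toy coordinates are all made up for the example, and "closest" is taken here to mean the smallest Euclidean distance between centroids, which is just one possible choice.

```python
from math import dist

def centroids(points_by_label):
    """Mean PC coordinate of each labelled group of samples."""
    result = {}
    for label, points in points_by_label.items():
        n = len(points)
        result[label] = tuple(sum(axis) / n for axis in zip(*points))
    return result

def closest_panel(study_clusters, reference_panels):
    """Map each study cluster to the reference panel with the nearest centroid."""
    ref_centroids = centroids(reference_panels)
    study_centroids = centroids(study_clusters)
    return {
        name: min(ref_centroids, key=lambda panel: dist(centre, ref_centroids[panel]))
        for name, centre in study_centroids.items()
    }

# Toy (PC1, PC2) coordinates, invented for illustration only.
reference_panels = {
    "CEU": [(0.9, 0.1), (1.0, 0.0), (1.1, -0.1)],
    "YRI": [(-1.0, 0.9), (-1.1, 1.0), (-0.9, 1.1)],
    "CHB": [(-1.0, -1.0), (-0.9, -1.1), (-1.1, -0.9)],
}
study_clusters = {
    "cluster_A": [(0.8, 0.0), (0.95, 0.05)],
    "cluster_B": [(-0.95, -0.9), (-1.05, -1.0)],
}

assignment = closest_panel(study_clusters, reference_panels)
print(assignment)
```

With these toy values, cluster_A is matched to CEU and cluster_B to CHB. With real data you would want to look at the PCA plot as well, rather than trust centroid distances alone.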
Nevertheless, a more flexible approach was developed recently by Howie and Marchini.
In this approach, for each small chromosomal region that you want to impute, the program (IMPUTE) searches a large, ethnically mixed panel for the chromosome chunks that are closest to the chromosome being imputed.
If your data show clear ethnic separation - your individuals are 100% Europeans and very divergent from any other panel population - then you will automatically be back to imputation using a 100% European panel.
However, if some regions show less divergence between populations, then, for these regions, the imputation will use a larger panel.
For me, this approach is theoretically appealing because it is a kind of generalisation of the basic population-specific approach, where you have to impose a threshold. It seems to work quite well in practice too - but it is very new, so I cannot guarantee it 100%.
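To make the region-by-region idea concrete, here is a toy sketch. It is not IMPUTE's actual implementation: it simply ranks the haplotypes of a mixed panel by Hamming distance to the study haplotype at the typed sites of each window and keeps the k closest. The panel labels, haplotype strings, window names, and the value of k are all invented for the example.

```python
def hamming(a, b):
    """Number of typed sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def closest_haplotypes(study_hap, panel, k):
    """Labels of the k panel haplotypes nearest to study_hap in one window."""
    ranked = sorted(panel, key=lambda item: hamming(study_hap, item[1]))
    return [label for label, _ in ranked[:k]]

# One mixed reference panel, given window by window (alleles coded 0/1).
panel_by_window = {
    "window_1": [("EUR_1", "00110"), ("EUR_2", "00111"),
                 ("AFR_1", "11000"), ("ASN_1", "10101")],
    "window_2": [("EUR_1", "0101"), ("EUR_2", "1010"),
                 ("AFR_1", "0101"), ("ASN_1", "0100")],
}
study_by_window = {"window_1": "00110", "window_2": "0101"}

selected = {
    window: closest_haplotypes(study_by_window[window], panel, k=2)
    for window, panel in panel_by_window.items()
}
print(selected)
```

In window_1 the populations are well separated and only the two European haplotypes are kept. In window_2 a European and an African haplotype are equally close, so the selection mixes populations, which is the behaviour described above for regions with less divergence.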
Beware that IMPUTE now strongly advises pre-phasing before running the imputation. For this pre-phasing, it can be useful to have your own data divided into homogeneous populations. But I wouldn't advise populations of < 200 individuals because you need enough individuals for phasing.
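A quick way to check the group sizes before pre-phasing could look like the sketch below. The 200 threshold is only the rule of thumb given above, and the function name, sample IDs, and cluster labels are hypothetical. What to do with groups that are too small (merge them, or phase them together with a larger group) is left open here.

```python
from collections import Counter

MIN_PHASING_SIZE = 200  # rule of thumb from the text above, not a hard limit

def split_by_size(cluster_of_sample, min_size=MIN_PHASING_SIZE):
    """Return (large_enough, too_small) dicts mapping cluster label to sample count."""
    sizes = Counter(cluster_of_sample.values())
    large_enough = {c: n for c, n in sizes.items() if n >= min_size}
    too_small = {c: n for c, n in sizes.items() if n < min_size}
    return large_enough, too_small

# Toy assignment of sample IDs to PCA clusters: 250 samples in A, 50 in B.
cluster_of_sample = {f"sample_{i}": "A" for i in range(250)}
cluster_of_sample.update({f"sample_{i}": "B" for i in range(250, 300)})

large_enough, too_small = split_by_size(cluster_of_sample)
print(large_enough, too_small)
```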
Best
Christian
Check this reference for more (and clearer) information,
http://www.g3journal.org/content/1/6/457.full
Nice answer. Would PCA actually separate out the populations nicely?
Hi. This would separate them quite nicely indeed. There is a minor problem with the number of dimensions, but for intercontinental differences the first two axes will do.
@genotepes: Thanks for that!
I agree with Genotepes.
You can do the imputation with IMPUTE2 using the complete HapMap reference panel, without worrying about the population structure at all!
IMPUTE2 includes algorithms that choose the optimal subset of the reference panel for you. This is better than making the subset yourself, since it is very easy to introduce some bias that way.
Check this: http://mathgen.stats.ox.ac.uk/impute/using_multi_population_reference_panels.html#how_does_it_work