Hi, as part of a homework assignment I have been given a set of genotyped human SNPs for many individuals, and I am supposed to identify their ancestral populations. (Disclaimer: this is a bonus question that I won't submit, and it is likely an example from the assigned readings, so I am not asking for a turnkey solution.) Using STRUCTURE, I can estimate the number of ancestral populations in my sample and their allele frequencies.
My SNPs are coded as 0 or 1 (ancestral or derived state), but I know their labels (rsXXXXX). Some SNPs are found in one state at a much higher frequency in some populations, which makes them informative: I can manually cross-check my frequency estimates against the population frequencies found in HapMap.
For example, for SNP rs924201, populations A and C carry the derived state in around 50% of cases, while B carries the ancestral state in more than 80%. HapMap tells me that ~80% of Africans share the same variant of this SNP, but only roughly 50% of Europeans and Asians do. I can guess that B is likely from Africa, given that my sample is rather large.
Is there a way to fit my estimated allele frequencies to real-world frequencies? For instance, I could find all markers whose frequencies differ noticeably between my predicted groups and match them to real-world populations. I would need a way to retrieve the frequencies of the most common variant for each of the HapMap populations. I could then maybe try some kind of lasso on the frequencies of the most common allele, to find out which real-world population is closest to each of my predicted ancestral populations.
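To make the idea concrete, here is a rough Python sketch of the kind of matching described above. It is not a lasso: it swaps in a much simpler nearest-neighbour comparison (root-mean-square distance between frequency vectors) over the SNPs that differ most between the predicted groups. All numbers are toy values made up for illustration. Only the rs924201 entries loosely follow the example in this post, and `snp_2`, `snp_3`, `snp_4` and the `AFR`/`EUR`/`ASN` labels are placeholders. It also assumes both tables are expressed as derived-allele frequencies, so the allele coding would have to be aligned with HapMap before real use.

```python
from math import sqrt

# TOY DATA, made up for illustration: derived-allele frequency per SNP.
# Cluster frequencies as they might come out of STRUCTURE (hypothetical values).
estimated = {
    "A": {"rs924201": 0.50, "snp_2": 0.10, "snp_3": 0.60, "snp_4": 0.30},
    "B": {"rs924201": 0.15, "snp_2": 0.70, "snp_3": 0.55, "snp_4": 0.30},
    "C": {"rs924201": 0.50, "snp_2": 0.15, "snp_3": 0.10, "snp_4": 0.32},
}
# Reference population frequencies (hypothetical values, NOT real HapMap data).
reference = {
    "AFR": {"rs924201": 0.20, "snp_2": 0.75, "snp_3": 0.50, "snp_4": 0.30},
    "EUR": {"rs924201": 0.50, "snp_2": 0.10, "snp_3": 0.65, "snp_4": 0.30},
    "ASN": {"rs924201": 0.50, "snp_2": 0.20, "snp_3": 0.10, "snp_4": 0.30},
}


def informative_snps(clusters, min_spread=0.2):
    """Return SNPs whose frequency differs by at least min_spread between clusters."""
    snps = next(iter(clusters.values())).keys()
    keep = []
    for snp in snps:
        freqs = [cluster[snp] for cluster in clusters.values()]
        if max(freqs) - min(freqs) >= min_spread:
            keep.append(snp)
    return keep


def rms_distance(freqs_a, freqs_b, snps):
    """Root-mean-square difference between two frequency tables over the given SNPs."""
    return sqrt(sum((freqs_a[s] - freqs_b[s]) ** 2 for s in snps) / len(snps))


def closest_reference(clusters, references, snps):
    """Map each predicted cluster to the reference population with the smallest distance."""
    matches = {}
    for name, freqs in clusters.items():
        matches[name] = min(
            references,
            key=lambda ref: rms_distance(freqs, references[ref], snps),
        )
    return matches


snps = informative_snps(estimated)
matches = closest_reference(estimated, reference, snps)
print(snps)
print(matches)
```

With these toy values, `snp_4` is dropped because it barely differs between clusters, and the matching relies on the remaining three markers. On real data the threshold `min_spread` and the choice of distance are judgment calls, and a single SNP such as rs924201 cannot separate A from C on its own.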
Thanks for your advice!