Given an exome or targeted human VCF of one or more samples, I need a program to determine the "superpopulation" of each sample, as listed here:
http://www.1000genomes.org/category/frequently-asked-questions/population
ASN EUR AFR AMR SAN
The program should return a single three-letter code for each sample.
Submissions will be judged on speed using 10 randomly selected subsets of 1KG samples - you cannot count on any "crucial" regions being covered.
Each "miss" will result in a penalty that is effectively 50% of the best time for the next best tier (a single missed call will tack on half the entire time it took to call all 10 correctly).
So what am I allowed to assume, if I cannot count on any specific region being there? How targeted could it be? Clearly some target regions will be uninformative...
Sometimes we receive targeted resequencing samples that are, for example, just a bunch of cardiac genes. I would still like to make a guess as to the superpopulation.
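
Since no particular region can be relied on, one possible approach is to score each superpopulation using only the sites that happen to be present in the sample. Below is a minimal sketch in Python, not a tuned or complete entry. It assumes a hypothetical lookup table FREQS of alt-allele frequencies per superpopulation (the numbers are made up for illustration; real values would have to come from a 1KG frequency resource), assumes genotypes have already been parsed from the VCF into alt-allele counts (0, 1 or 2), and assumes Hardy-Weinberg proportions. The names FREQS, log_genotype_prob and classify are invented here. Note also that a VCF usually lists only variant sites, so a missing site cannot be told apart from an uncovered one and is simply skipped.

```python
import math

# Hypothetical alt-allele frequencies keyed by (chrom, pos, alt).
# These numbers are invented for illustration only.
FREQS = {
    ("1", 1000, "A"): {"AFR": 0.80, "EUR": 0.10, "ASN": 0.05, "AMR": 0.30, "SAN": 0.15},
    ("2", 2000, "T"): {"AFR": 0.10, "EUR": 0.70, "ASN": 0.20, "AMR": 0.40, "SAN": 0.30},
    ("3", 3000, "G"): {"AFR": 0.05, "EUR": 0.15, "ASN": 0.85, "AMR": 0.25, "SAN": 0.40},
}


def log_genotype_prob(alt_count, p):
    """Log-probability of a genotype under Hardy-Weinberg, given alt frequency p."""
    # Clamp the frequency so a single site can never give log(0).
    p = min(max(p, 1e-3), 1.0 - 1e-3)
    q = 1.0 - p
    if alt_count == 0:
        return 2.0 * math.log(q)
    if alt_count == 1:
        return math.log(2.0 * p * q)
    return 2.0 * math.log(p)


def classify(genotypes, freqs=FREQS):
    """Return the best-scoring superpopulation code, or None if no site is informative.

    genotypes maps (chrom, pos, alt) to an alt-allele count of 0, 1 or 2.
    """
    scores = {}
    for site, alt_count in genotypes.items():
        site_freqs = freqs.get(site)
        if site_freqs is None:
            continue  # site not in the table: uninformative, skip it
        for pop, p in site_freqs.items():
            scores[pop] = scores.get(pop, 0.0) + log_genotype_prob(alt_count, p)
    if not scores:
        return None
    return max(scores, key=scores.get)


if __name__ == "__main__":
    sample = {("1", 1000, "A"): 2, ("2", 2000, "T"): 0}
    print(classify(sample))
```

With a small panel such as a handful of cardiac genes, very few sites may overlap the table, and classify returns None when none do. How to turn that into a forced guess is left open here.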