I'm doing a project on reference bias, and it would be convenient to assign regions of the GRCh38 reference assembly to the 1000 Genomes superpopulation(s) that they most closely resemble. With that assignment, I could look for mapping bias in each region towards the superpopulation(s) that the region's assembled sequence best represents.
I've been told that there is existing work that infers the ethnicity or source population of different parts of the reference assembly. Has anyone seen such a paper? I can't find anything on Google Scholar. I can find references saying that the majority of the assembly comes from RP11, a donor thought to be of African-American ancestry (who would probably have been placed in the AFR superpopulation if "Americans of African Ancestry in Buffalo, New York" were a population covered by 1000 Genomes), but I don't have any information for the parts that aren't RP11 clones.
Would I be better off using something like STRUCTURE to try and pull out actual shared haplotypes between the reference and the 1000 Genomes samples?
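For comparison, here is a minimal sketch of the much simpler alternative I could fall back on. It is not STRUCTURE and does not look at haplotypes at all. It assumes I already have, for each variant site, the frequency of the reference allele in each superpopulation (extracted some other way), and it scores each fixed-size window by the summed log-frequency of the reference allele. The function name, window size, and frequency floor are all hypothetical choices of mine.

```python
import math
from collections import defaultdict


def assign_windows(sites, window_size=100_000, floor=1e-3):
    """Assign each window to the superpopulation(s) in which the
    reference alleles it carries are jointly most probable.

    sites: iterable of (position, {superpop: reference_allele_frequency}).
    Returns {window_start: [best superpopulation(s)]}.
    Assumption: sites are treated as independent, so linkage is ignored.
    """
    # Per window, per superpopulation: summed log-frequency of the reference allele.
    scores = defaultdict(lambda: defaultdict(float))
    for pos, ref_freqs in sites:
        window = pos // window_size
        for pop, freq in ref_freqs.items():
            # Floor the frequency so an allele unseen in a panel does not give log(0).
            scores[window][pop] += math.log(max(freq, floor))

    assignments = {}
    for window, pop_scores in sorted(scores.items()):
        best = max(pop_scores.values())
        # Keep every superpopulation tied for the best score.
        assignments[window * window_size] = sorted(
            pop for pop, s in pop_scores.items() if math.isclose(s, best)
        )
    return assignments


# Toy input with made-up frequencies, purely to show the shape of the data.
toy_sites = [
    (100, {"AFR": 0.9, "EUR": 0.2}),
    (200, {"AFR": 0.8, "EUR": 0.3}),
    (150_000, {"AFR": 0.1, "EUR": 0.7}),
]
print(assign_windows(toy_sites))
```

Because this treats every site independently, it cannot distinguish a genuinely shared haplotype from a run of individually common alleles, which is why I am asking whether a haplotype-aware approach would be the better choice.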