I want to determine the ancestry of 1000 individuals using SNPs from Complete Genomics whole-genome sequencing. I'm using the Complete Genomics 1000 Genomes (1KG) samples as a reference to assign ancestry to EUR, AFR, ASN, SAS, or AMR. As a sanity check, I first ran plink and fastSTRUCTURE on the 1KG samples alone. My problem is that the SAS samples (labelled SAN below) are not assigned to the SAS cluster with high probability, while samples from all other populations have very high posterior probabilities for their own clusters. The SAS samples are mistakenly being assigned to the EUR cluster with higher probabilities than those computed for the SAS cluster. Example probabilities for the SAS cluster:
HG02491_blood 0.382268 SAN
HG02601_blood 0.384698 SAN
HG02735_blood 0.389515 SAN
HG02733_blood 0.398396 SAN
HG02662_blood 0.398821 SAN
HG02657_blood 0.413964 SAN
HG02658_blood 0.415986 SAN
HG02600_blood 0.419449 SAN
HG02602_blood 0.447606 SAN
HG02786_blood 0.451546 SAN
HG02659_blood 0.458110 SAN
HG02787_blood 0.463829 SAN
HG02724_blood 0.493687 SAN
HG02790_blood 0.497987 SAN
HG02688_blood 0.509672 SAN
HG02783_blood 0.517858 SAN
HG02784_blood 0.518656 SAN
HG02685_blood 0.521347 SAN
HG02725_blood 0.523252 SAN
HG02727_blood 0.523703 SAN
HG02728_blood 0.525215 SAN
HG02684_blood 0.530562 SAN
HG02789_blood 0.543813 SAN
HG02687_blood 0.553202 SAN
HG02726_blood 0.556683 SAN
HG02686_blood 0.568139 SAN
HG02785_blood 0.574065 SAN
HG02729_blood 0.580628 SAN
HG02791_blood 0.594059 SAN
HG02689_blood 0.597515 SAN
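A table like this can be summarized per population with a few lines of awk. This is only a sketch: it assumes the rows are saved as three whitespace-separated columns (sample ID, membership probability, population label), and the function name summarize_membership, the file name sas_probs.txt, and the 0.5 cutoff are placeholders chosen for illustration.

```shell
# Summarize cluster membership per population label.
# Input: a file with three whitespace-separated columns:
#   sample ID, posterior membership probability, population label.
# Output (one line per label): sample count, mean probability,
# and the number of samples with probability >= 0.5 (an arbitrary cutoff).
summarize_membership() {
    awk '
        NF == 3 {
            n[$3]++
            sum[$3] += $2
            if ($2 >= 0.5) hi[$3]++
        }
        END {
            for (p in n)
                printf "%s n=%d mean=%.4f n_ge_0.5=%d\n", p, n[p], sum[p] / n[p], hi[p] + 0
        }
    ' "$1"
}
```

It would be called as `summarize_membership sas_probs.txt`, which makes it easy to compare the SAS group against the other populations in a single view.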
I wonder if my methods are flawed. After making gen and sample files from the 1KG samples, I run plink to filter the SNPs down to 28681 positions, and then run fastSTRUCTURE:
plink --noweb --maf 0.01 --hwe 0.05 --geno 0.01 --recode --make-bed --gen {input} --sample {input} --out {outputBed}
plink --bfile {outputBed} --indep-pairwise 1000 50 0.05 --exclude range {rangeFile} --out {outputPrune}
plink --bfile {outputBed} --extract {outputPrune.in} --make-bed --out {filteredBed}
structure.py -K 5 --input={filteredBed} --output={structureOut} --full --seed=100
Does anyone spot a problem, or have a suggestion for making the SAS samples cluster correctly?
With K=10, I can place more than half of the SAS samples confidently.
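To compare assignments across a range of K rather than only K=5 and K=10, the fastSTRUCTURE step can be looped. A minimal sketch, assuming structure.py and chooseK.py from the fastSTRUCTURE distribution are on the PATH; the helper names run and sweep_k, the DRY_RUN switch, and the K range are placeholders for illustration, not part of either tool.

```shell
# Run a command, or just print it when DRY_RUN=1 (useful for checking the loop).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run fastSTRUCTURE for every K in a range, then run chooseK.py on the results.
# $1: PLINK bed prefix, $2: output prefix, $3: smallest K, $4: largest K
sweep_k() {
    k=$3
    while [ "$k" -le "$4" ]; do
        run structure.py -K "$k" --input="$1" --output="$2" --full --seed=100
        k=$((k + 1))
    done
    # chooseK.py takes the same output prefix that structure.py wrote to
    run chooseK.py --input="$2"
}
```

For example, `sweep_k "$filteredBed" "$structureOut" 2 10` would cover K=2 through K=10, with filteredBed and structureOut standing in for the placeholders used in the commands above.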