Question

fastSTRUCTURE failing to cluster 1000 Genomes South Asians

10.6 years ago

samesense ▴ 50

I want to determine the ancestry of 1000 individuals using SNPs from Complete Genomics whole-genome sequencing. I'm using the Complete Genomics 1000 Genomes (1KG) samples to help assign ancestry to EUR, AFR, ASN, SAS, and AMR. As a sanity check, I first ran plink and fastSTRUCTURE on only the 1KG samples. My problem is that the SAS samples (labelled SAN below) are not assigned to the SAS cluster with high probability, while all other populations have very high posterior probabilities for their clusters. The SAS samples are instead being assigned to the EUR cluster with higher probabilities than those computed for the SAS cluster. Example probabilities for the SAS cluster:

HG02491_blood    0.382268    SAN
HG02601_blood    0.384698    SAN
HG02735_blood    0.389515    SAN
HG02733_blood    0.398396    SAN
HG02662_blood    0.398821    SAN
HG02657_blood    0.413964    SAN
HG02658_blood    0.415986    SAN
HG02600_blood    0.419449    SAN
HG02602_blood    0.447606    SAN
HG02786_blood    0.451546    SAN
HG02659_blood    0.458110    SAN
HG02787_blood    0.463829    SAN
HG02724_blood    0.493687    SAN
HG02790_blood    0.497987    SAN
HG02688_blood    0.509672    SAN
HG02783_blood    0.517858    SAN
HG02784_blood    0.518656    SAN
HG02685_blood    0.521347    SAN
HG02725_blood    0.523252    SAN
HG02727_blood    0.523703    SAN
HG02728_blood    0.525215    SAN
HG02684_blood    0.530562    SAN
HG02789_blood    0.543813    SAN
HG02687_blood    0.553202    SAN
HG02726_blood    0.556683    SAN
HG02686_blood    0.568139    SAN
HG02785_blood    0.574065    SAN
HG02729_blood    0.580628    SAN
HG02791_blood    0.594059    SAN
HG02689_blood    0.597515    SAN
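
A quick way to summarise a listing like this is to take, for each sample, the cluster with the largest membership proportion and then count how many samples per population label clear a confidence cutoff. The sketch below assumes the membership matrix is already loaded as a list of rows (fastSTRUCTURE writes it to a `.meanQ` file, one row per sample). The function names, the 0.9 cutoff, and the toy values are illustrative assumptions, not output from the run above.

```python
from collections import defaultdict


def best_cluster(q_row):
    """Return (cluster_index, proportion) for the largest entry of one Q row."""
    idx = max(range(len(q_row)), key=lambda k: q_row[k])
    return idx, q_row[idx]


def confident_fraction(q_rows, populations, threshold=0.9):
    """Per population label, the fraction of samples whose top proportion >= threshold."""
    totals = defaultdict(int)
    confident = defaultdict(int)
    for row, pop in zip(q_rows, populations):
        _, top = best_cluster(row)
        totals[pop] += 1
        if top >= threshold:
            confident[pop] += 1
    return {pop: confident[pop] / totals[pop] for pop in totals}


# Toy Q matrix (K=3) with hypothetical values; each row sums to 1.
q_rows = [
    [0.98, 0.01, 0.01],
    [0.55, 0.40, 0.05],
    [0.02, 0.97, 0.01],
    [0.45, 0.50, 0.05],
]
populations = ["EUR", "SAN", "AFR", "SAN"]

for pop, frac in sorted(confident_fraction(q_rows, populations).items()):
    print(pop, frac)
```

With real data, the rows would come from the `.meanQ` file and the labels from a sample-to-population table in the same sample order.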

I wonder if my methods are flawed. After making gen and sample files from the 1KG samples, I run plink to filter the SNPs down to 28681 positions, and then run fastSTRUCTURE.

plink --noweb --maf 0.01 --hwe 0.05 --geno 0.01 --recode --make-bed --gen {input} --sample {input} --out {outputBed}
plink --bfile {outputBed} --indep-pairwise 1000 50 0.05 --exclude range {rangeFile} --out {outputPrune}
plink --bfile {outputBed} --extract {outputPrune.in} --make-bed --out {filteredBed}
structure.py -K 5 --input={filteredBed} --output={structureOut} --full --seed=100

Does anyone spot a problem, or have a suggestion for making the SAS samples cluster correctly?

gwas plink ancestry fastStructure • 3.6k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by samesense ▴ 50

Ram · Answer 1 · 2015-01-27

10.5 years ago

Brice Sarver ★ 3.8k

Things look okay at first glance.

Sometimes the analyses recover geographic substructure in one population before recovering additional clusters. I would look to see what happens across different values of K. You can select an 'appropriate' value of K using a variety of means, but these results need to be interpreted in a biological context to be meaningful.

As a first pass, see if SAS is recovered as a distinct cluster with a greater value of K.
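
A sweep over K could be set up roughly as follows. This is only a sketch: it builds the command lines rather than running them, reuses the `structure.py` flags from the question, and adds a call to `chooseK.py`, the model-selection script that ships with fastSTRUCTURE. The helper names, the range 2 to 10, and the `filtered` / `structure_out` prefixes are hypothetical placeholders.

```python
import shlex


def structure_commands(bed_prefix, out_prefix, k_values, seed=100):
    """Build one structure.py command per K, using the flags from the question."""
    return [
        ["structure.py", "-K", str(k), f"--input={bed_prefix}",
         f"--output={out_prefix}", "--full", f"--seed={seed}"]
        for k in k_values
    ]


def choose_k_command(out_prefix):
    """Build the chooseK.py call that compares the runs sharing this output prefix."""
    return ["chooseK.py", f"--input={out_prefix}"]


# Hypothetical prefixes; substitute the real bed and output prefixes.
commands = structure_commands("filtered", "structure_out", range(2, 11))
commands.append(choose_k_command("structure_out"))

# Print the commands for inspection; each list could be passed to subprocess.run.
for cmd in commands:
    print(shlex.join(cmd))
```

Because every run shares one output prefix, fastSTRUCTURE writes a separate set of files per K (for example `structure_out.5.meanQ`), which is what `chooseK.py` reads.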

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.5 years ago by Brice Sarver ★ 3.8k

With K=10, I can place more than half of the SAS samples confidently.

ADD REPLY • link 10.5 years ago by samesense ▴ 50