Hello everyone,
I have WGS data for 118 samples. Of these, 72 come from one country and represent the 11 breeds under consideration; I used the other 46 to distinguish/control. After building an NJ tree from the called SNP data, I ran a PCA, which separates the 72 from the 46. The 46 cluster clearly into 4 groups, but the 72 form only one scattered cluster. So in the next step, I took only these 72 and ran another PCA, which gave three clusters in total (1 breed, 1 partially hybrid breed, and the rest a scattered mush).
The MAF threshold was calculated as 1/(2n). These are the command lines I used to produce the PCA:
plink --bfile out.all --keep keep --maf 0.00423 --make-bed --chr-set 29 --out out
plink --bfile ./out.all --indep-pairwise 50 5 0.2 --chr-set 29 --out out
plink --bfile ./out.all --extract out.all.prune.in --make-bed --chr-set 29
Any help is very much appreciated. Awaiting.
How were these samples collected and processed before variant calling? What variant calling filters did you use? There could be alternative sources of variation that are taking over your first two principal components.
What percentage of the variation do your PCs explain? If this number is quite low, it is worth trying other methods for assessing composition (like tree building).
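To put a rough number on the question above, here is a minimal sketch. It assumes the PCA was run with PLINK 1.9's --pca flag, which writes a .eigenval file with one eigenvalue per line. The function name pc_variance and the file name out.eigenval are hypothetical. Note that PLINK only reports the top eigenvalues (20 by default), so these percentages are shares of the variance captured by the listed PCs, not of the total, and will overstate the true proportion.

```shell
# Print each PC's share of the variance captured by the eigenvalues in FILE.
# Assumes one eigenvalue per line, as in a PLINK 1.9 .eigenval file.
pc_variance() {
    awk '
        NF { ev[++n] = $1; total += $1 }   # store each eigenvalue and a running sum
        END {
            for (i = 1; i <= n; i++)
                printf "PC%d\t%.2f\n", i, 100 * ev[i] / total
        }
    ' "$1"
}

# Hypothetical usage, assuming a run with --pca and --out out:
# pc_variance out.eigenval
```

If the first two PCs together account for only a few percent even under this generous denominator, the scattered cluster is probably not well summarised by a two-dimensional plot.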