I am working on a bacterial GWAS with two batches. When I ran the GWAS on the first batch, I found 5 effective principal components (based on a scree plot of the eigenvalues of an MDS) to control for population stratification. When I ran it on the second batch, I found 3 effective components (again based on an MDS scree plot). When I merged the batches, I again found 3 components (based on an MDS scree plot). However, I expected to see around 8 components! What does that mean? Does it mean that some of my principal components got lost? How many components should I include in the merged GWAS analysis?
Please show all commands that you have used, and obviously mention how you are conducting PCA.
I applied MDS on a distance matrix for all samples based on phylogeny.
Thanks. I cannot see exactly which commands you are running, e.g., how you are merging your datasets and how many dimensions you are including to control for population stratification, so I am left to hypothesise in this regard. It is also important to know the percent explained variation for each dimension, and whether the dimensions actually segregate your samples in a bi-plot in the way that you think.
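No. If you merge 2 datasets, the primary sources of variation will change; therefore, so will an MDS analysis performed on the merged dataset. The dimensions are recomputed from scratch on the merged data, so the counts from the separate batches do not simply add up. You should be able to configure the program to output more or all dimensions.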
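To illustrate why the counts need not add up, here is a toy sketch in Python. It uses a simulated numeric matrix, not real genotypes or phylogenetic distances, and every name in it (`axes`, `batch1`, `batch2`, `n_axes`) is hypothetical. It assumes the two batches share some of the same underlying axes of structure, which may or may not hold for your data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_variants = 50

# Hypothetical latent axes of population structure (rows = axes, columns = variants).
axes = rng.normal(size=(5, n_variants))

# Assumption: batch 1 varies along all 5 axes; batch 2 varies along only the first 3 of them.
batch1 = rng.normal(size=(40, 5)) @ axes
batch2 = rng.normal(size=(30, 3)) @ axes[:3]
merged = np.vstack([batch1, batch2])


def n_axes(x):
    """Number of non-trivial axes of variation: rank of the column-centred matrix."""
    return int(np.linalg.matrix_rank(x - x.mean(axis=0)))


counts = {"batch1": n_axes(batch1), "batch2": n_axes(batch2), "merged": n_axes(merged)}
print(counts)
```

In this simulation the merged data have 5 axes rather than 8, because the 3 axes of the second batch are already among the 5 of the first. How many of those a scree plot would flag as "effective" is a separate question that depends on how much variation each one explains.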
Dear @Kevin Blighe, as I mentioned before, I added 5 dimensions to control for population structure in the first dataset and 3 in the second. The number of dimensions was chosen from the knee of the scree plot of the MDS components. For the merged dataset, the knee of the scree plot for the new MDS again suggests that 3 dimensions should be added to control for population structure. I do have all the dimensions available; I only used the knee of the scree plots to decide how many to add to the model. As for merging, I merge the datasets after quality control by intersecting the variants that appear in both, and then apply a linear/logistic model.
I see - thanks for explaining! It is still important to look at the actual cumulative percent explained variation along each successive dimension. The 'knee' / 'elbow' method may not be a good metric if used in isolation.
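As a rough sketch of what that check could look like, the Python below computes classical (metric) MDS eigenvalues from a distance matrix and the cumulative percent of variation along successive dimensions. It is not the asker's pipeline: the function names are hypothetical, the toy distance matrix is simulated, the 80% threshold is an arbitrary illustration, and setting negative eigenvalues (which can arise with non-Euclidean distances) to zero is just one possible convention.

```python
import numpy as np


def classical_mds_eigenvalues(dist):
    """Eigenvalues (descending) of the double-centred squared-distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring
    return np.linalg.eigvalsh(gram)[::-1]


def cumulative_percent_explained(eigvals):
    """Cumulative percent of variation, with negative eigenvalues set to zero."""
    pos = np.clip(eigvals, 0.0, None)
    return 100.0 * np.cumsum(pos) / pos.sum()


def n_dims_for_threshold(eigvals, threshold=80.0):
    """Smallest number of leading dimensions whose cumulative percent reaches the threshold."""
    cum = cumulative_percent_explained(eigvals)
    return min(int(np.searchsorted(cum, threshold)) + 1, len(cum))


# Toy example: 20 simulated samples in 2 dimensions, turned into a Euclidean distance matrix.
rng = np.random.default_rng(1)
points = rng.normal(size=(20, 2)) * np.array([3.0, 1.0])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

eigvals = classical_mds_eigenvalues(dist)
cum = cumulative_percent_explained(eigvals)
print(np.round(cum[:5], 2))
print(n_dims_for_threshold(eigvals, threshold=80.0))
```

With a real phylogenetic distance matrix in place of `dist`, comparing this cumulative curve with the scree-plot knee for each batch and for the merged data would show whether the chosen dimensions capture a comparable share of the variation.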