Analyzing PCA Clustering from PLINK Output
3 months ago
Smriti ▴ 40

I am working with PCA results derived from PLINK output files for a dataset comprising 674 samples. After performing K-Means clustering on the PCA data, I observed the following:

  1. K-Means Clustering Results: The K-Means algorithm identified three well-defined clusters in the PCA plot. The clusters appear distinct and separated.

  2. Severity-Based Coloring: When I colored the PCA plot based on severity categories (Mild, Moderate, Severe), I noticed that the clusters include samples from all severity groups. Specifically:

    • Each of the three clusters contains samples across the severity categories.
    • No specific clustering pattern emerges related to the severity groups.

Question

Given the observation that severity groups are distributed across the identified clusters, what could be the reasons for not observing severity-specific clustering? Could it indicate that severity does not directly influence the PCA clusters, or might there be other factors at play?

How can I further analyze or adjust my approach to potentially uncover any severity-specific clustering patterns? Are there additional methods or considerations that might help in understanding the relationship between severity and clustering in this context?
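One way to put a number on the observation that every cluster contains all severity groups is to cross-tabulate cluster labels against severity and test for association. The following is a minimal sketch, not the poster's actual code: the function names are hypothetical, the labels are made up, and a permutation test on the Pearson chi-square statistic is used so that only NumPy is needed.

```python
import numpy as np

def contingency_table(cluster_labels, severity_labels):
    """Cross-tabulate cluster assignments (rows) against severity (columns)."""
    clusters, ci = np.unique(cluster_labels, return_inverse=True)
    severities, si = np.unique(severity_labels, return_inverse=True)
    table = np.zeros((clusters.size, severities.size), dtype=int)
    np.add.at(table, (ci, si), 1)  # count each (cluster, severity) pair
    return clusters, severities, table

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

def permutation_p_value(cluster_labels, severity_labels, n_perm=999, seed=0):
    """Shuffle severity labels to build a null distribution of the statistic."""
    rng = np.random.default_rng(seed)
    severity_labels = np.asarray(severity_labels)
    observed = chi_square_statistic(
        contingency_table(cluster_labels, severity_labels)[2])
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(severity_labels)
        stat = chi_square_statistic(
            contingency_table(cluster_labels, shuffled)[2])
        hits += int(stat >= observed)
    return observed, (hits + 1) / (n_perm + 1)

# Made-up labels purely for illustration; these are not the study data.
rng = np.random.default_rng(42)
demo_clusters = rng.integers(0, 3, size=60)
demo_severity = rng.choice(["Mild", "Moderate", "Severe"], size=60)
print(contingency_table(demo_clusters, demo_severity)[2])
print(permutation_p_value(demo_clusters, demo_severity, n_perm=199))
```

A large p-value here would be consistent with cluster membership being unrelated to severity, which is what the colored PCA plot suggests visually.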

plink • 634 views

You could try looking into other PC combinations. You are only looking at the first 2 PC axes at the moment, so the severity-based result, if actually significant, is likely small and showing up elsewhere.
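To look beyond PC1 and PC2, it helps to load all eigenvectors and enumerate every PC pair. The sketch below assumes the usual PLINK text layouts: PLINK 1.9 writes `FID IID PC1 PC2 ...` without a header by default, while PLINK 2 adds a header line starting with `#`. The function names are hypothetical and the demo rows are made up.

```python
import io
import itertools
import numpy as np

def read_eigenvec(handle):
    """Parse PLINK .eigenvec text into (sample_ids, pcs)."""
    ids, rows = [], []
    n_id_cols = 2  # assumed PLINK 1.9 default: FID IID PC1 PC2 ...
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("#") or fields[0] in ("FID", "IID"):
            # Header line: every column that is not named PC* is an ID column.
            n_id_cols = sum(not f.upper().startswith("PC") for f in fields)
            continue
        ids.append(" ".join(fields[:n_id_cols]))
        rows.append([float(x) for x in fields[n_id_cols:]])
    return ids, np.array(rows)

def variance_share(eigenvalues):
    """Share of each eigenvalue relative to the eigenvalues that were kept.

    This is relative to the retained PCs only, not to total genetic variance.
    """
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return eigenvalues / eigenvalues.sum()

def pc_pairs(n_pcs):
    """All unordered PC pairs, 1-based to match the names PC1, PC2, ..."""
    return list(itertools.combinations(range(1, n_pcs + 1), 2))

# Tiny made-up example in the headerless PLINK 1.9 layout.
demo = io.StringIO("FAM1 S1 0.01 -0.02 0.03\nFAM2 S2 -0.04 0.05 0.06\n")
ids, pcs = read_eigenvec(demo)
print(ids, pcs.shape)
print(pc_pairs(pcs.shape[1]))
```

With 10 PCs this yields 45 pairs, so each pair can be passed to whatever scatter-plot routine is already in use, colored by severity.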


Thanks for the suggestion, dthorbur. [image not available]

Also, when I applied ANOVA to each PC pair individually (I have 10 PC eigenvectors in total), I found no significant difference between severity groups for any pair (all ANOVA p-values were > 0.10 in my case). Given that, how do I interpret the result? Should the severity aspect be dropped from the story entirely? Is it all down to population stratification? This is a GWAS study, but all samples are from one particular hospital.
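For reference, one-way ANOVA is a univariate test, so a common way to run it on PCA output is one PC at a time rather than per pair. The sketch below is an assumption about how such a check could look, not the code used in this thread: it computes the F statistic and eta-squared (the share of each PC's variance explained by severity) for every PC, with permutation p-values so that only NumPy is required. The data are synthetic.

```python
import numpy as np

def anova_f_per_pc(pcs, groups):
    """One-way ANOVA F statistic and eta-squared for every PC column."""
    pcs = np.asarray(pcs, dtype=float)
    groups = np.asarray(groups)
    levels = np.unique(groups)
    n, k = pcs.shape[0], levels.size
    grand_mean = pcs.mean(axis=0)
    ss_between = np.zeros(pcs.shape[1])
    for level in levels:
        block = pcs[groups == level]
        ss_between += block.shape[0] * (block.mean(axis=0) - grand_mean) ** 2
    ss_total = ((pcs - grand_mean) ** 2).sum(axis=0)
    ss_within = ss_total - ss_between
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    eta_sq = ss_between / ss_total
    return f_stat, eta_sq

def permutation_p_values(pcs, groups, n_perm=999, seed=0):
    """Per-PC p-values from shuffling the group labels."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    observed, _ = anova_f_per_pc(pcs, groups)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        permuted, _ = anova_f_per_pc(pcs, rng.permutation(groups))
        exceed += permuted >= observed
    return (exceed + 1) / (n_perm + 1)

# Synthetic data for illustration only: 60 samples, 10 PCs, no real effect.
rng = np.random.default_rng(7)
demo_pcs = rng.normal(size=(60, 10))
demo_severity = np.repeat(["Mild", "Moderate", "Severe"], 20)
f_stat, eta_sq = anova_f_per_pc(demo_pcs, demo_severity)
p_values = permutation_p_values(demo_pcs, demo_severity, n_perm=199)
for i, (f, e, p) in enumerate(zip(f_stat, eta_sq, p_values), start=1):
    print(f"PC{i}: F={f:.2f} eta^2={e:.3f} p={p:.3f}")
```

Because 10 PCs means 10 tests, a multiple-testing correction (for example Bonferroni, comparing each p-value against 0.05 / 10) is worth applying before calling any single PC significant.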

Any comments/suggestions are appreciated!


I think you should use a PERMANOVA instead of a series of ANOVAs, given that it's built for this use case.

If you have data on population stratification, then you can analyse the data with stratification as a random effect in a mixed effects model, but to me it sounds like severity is not an important factor in your study. This can still be a reported result to show you've looked into this.
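As a rough illustration of the PERMANOVA idea, the sketch below tests all PCs jointly: it computes the standard pseudo-F statistic from pairwise squared Euclidean distances in PC space and compares it against label permutations. This is a minimal NumPy sketch under those assumptions, with hypothetical function names and synthetic data; established implementations such as `adonis2` in R's vegan package or `permanova` in scikit-bio are the better choice for a real analysis.

```python
import numpy as np

def pseudo_f(sq_dist, groups):
    """PERMANOVA pseudo-F from a matrix of squared pairwise distances."""
    groups = np.asarray(groups)
    levels = np.unique(groups)
    n, k = groups.size, levels.size
    # The full matrix counts every pair twice, hence the factor of 2.
    ss_total = sq_dist.sum() / (2 * n)
    ss_within = 0.0
    for level in levels:
        mask = groups == level
        ss_within += sq_dist[np.ix_(mask, mask)].sum() / (2 * mask.sum())
    ss_between = ss_total - ss_within
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permanova(pcs, groups, n_perm=999, seed=0):
    """Pseudo-F and permutation p-value using Euclidean distance on the PCs."""
    pcs = np.asarray(pcs, dtype=float)
    groups = np.asarray(groups)
    diff = pcs[:, None, :] - pcs[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    rng = np.random.default_rng(seed)
    observed = pseudo_f(sq_dist, groups)
    hits = 0
    for _ in range(n_perm):
        hits += int(pseudo_f(sq_dist, rng.permutation(groups)) >= observed)
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic data for illustration only: 90 samples, 10 PCs, no group effect.
rng = np.random.default_rng(3)
demo_pcs = rng.normal(size=(90, 10))
demo_severity = np.repeat(["Mild", "Moderate", "Severe"], 30)
print(permanova(demo_pcs, demo_severity, n_perm=199))
```

A single joint test avoids the multiple-testing issue that arises from running a separate ANOVA for each PC or PC pair.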


The output PCA results: [two images not available]

3 months ago

It indicates that Severity is not the primary driver of variation in your dataset.

I would utilise my PCAtools package to try to uncover what are the primary sources of variation in your dataset. It is clear that Severity is not the primary source.

I would also conduct a standard differential expression analysis comparing Mild vs Moderate vs Severe: do the genes make sense? Do you achieve statistically significant p-values?

Kevin
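The general idea of hunting for the primary sources of variation can also be sketched in Python by correlating each PC with the available sample metadata. This is not PCAtools itself, only an illustrative NumPy sketch; the covariates named here (age, sex coded 0/1, genotyping batch) are hypothetical placeholders for whatever metadata the study actually has, and the values are synthetic.

```python
import numpy as np

def pc_covariate_correlations(pcs, covariates):
    """Pearson r between every PC (rows) and every numeric covariate (columns)."""
    pcs = np.asarray(pcs, dtype=float)
    cov = np.asarray(covariates, dtype=float)
    # Standardize with the population standard deviation so that the
    # mean of the products of z-scores equals Pearson's r.
    pcs_z = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)
    cov_z = (cov - cov.mean(axis=0)) / cov.std(axis=0)
    return pcs_z.T @ cov_z / pcs.shape[0]

# Synthetic example: PC1 is constructed to track a hypothetical batch variable.
rng = np.random.default_rng(11)
n = 100
age = rng.uniform(20, 80, size=n)      # hypothetical covariate
sex = rng.integers(0, 2, size=n)       # hypothetical covariate, coded 0/1
batch = rng.integers(0, 3, size=n)     # hypothetical covariate
demo_pcs = rng.normal(size=(n, 4))
demo_pcs[:, 0] += 3.0 * batch          # make PC1 depend on batch
r = pc_covariate_correlations(demo_pcs, np.column_stack([age, sex, batch]))
names = ["age", "sex", "batch"]
for i, row in enumerate(r, start=1):
    best = int(np.argmax(np.abs(row)))
    print(f"PC{i}: strongest correlation with {names[best]} (r={row[best]:.2f})")
```

A covariate that correlates strongly with the leading PCs is a candidate explanation for the clusters. Categorical variables with more than two levels are better handled with an ANOVA-style test than by coding them as integers, as is done here for brevity.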


Hi Kevin, thanks for your response, but this is actually part of a GWAS study. I have around 670 COVID-19 patients, all from one particular hospital. Based on clinical parameters, they have been segregated into mild, moderate, and severe, and the intent of the PCA was to see whether there is any severity-specific clustering. The PCA output (eigenvectors and eigenvalues) was generated with PLINK and then passed to Python code to produce the PCA biplot. For PC1 and PC2, no severity-wise clusters were observed. As dthorbur suggested, I repeated this for all PC combinations and ran ANOVA tests to check for any significant influence of severity that might have been captured in other PC combinations. However, no significant p-values were obtained for any of the combinations.

So the query remains: when PCA was intended to reveal severity clusters, none were observed, yet when K-Means clustering was applied to the PCA data, it found three well-defined clusters driven by some unknown variable. I am therefore unable to interpret my results.

What variable did K-Means clustering take into account? Or, on what basis is my data clustered into three small clusters? Since population stratification is important to analyze in a GWAS study, I am unable to proceed further.

Figures are inserted above.

Any comments/ suggestions please...
