I need to do a PCC (Pearson Correlation Coefficient) test to see the correlation (if any) between genes and clinical parameters.
I have sample-wise clinical data (for parameters like TLC, Platelet, Neutrophils, Lymphocytes, SGOT, Total protein etc., each with its own measuring unit). I also have genes and their normalised counts in each sample.
I found a few studies where such associations have been looked into, but the detailed methodology is not mentioned anywhere. One way could be to normalize the clinical data (by Z score) to bring each value into a comparable range, as each parameter has its own measuring unit. The read counts of the genes are already normalised.
But how sensible is it to do PCC with two different kinds of attributes (read counts and clinical values), given that the read counts come from RNA-seq data and the clinical data comes directly from patients?
The Pearson correlation measures the linear component of the association between two variables, as the ratio of their covariance to the product of their standard deviations. It's not a matter of sensibility, but rather of what your research question is, which is not specified in your post. If your aim is to measure the linear association between a given gene and a clinical variable, there's nothing wrong with it. Remember that any PCC you find is completely insufficient to exclude other driving associations among genes and clinical features, so be careful about drawing any conclusion based on correlation alone.
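To make the definition concrete, here is a minimal sketch in Python (numpy only). The toy data and the names gene_counts, platelets, pearson_r and zscore are made up for illustration and do not come from your dataset. It computes r as covariance over the product of standard deviations, and also shows that Z-scoring a clinical variable leaves r unchanged, because Pearson correlation is invariant to shifting and positive rescaling of either variable.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

def zscore(v):
    """Centre to mean 0 and scale to standard deviation 1."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

# Hypothetical toy data: normalised counts of one gene and a clinical
# measurement (here called "platelets") for 8 samples.
rng = np.random.default_rng(0)
gene_counts = rng.normal(loc=100.0, scale=15.0, size=8)
platelets = 2.5 * gene_counts + rng.normal(loc=0.0, scale=20.0, size=8)

r_raw = pearson_r(gene_counts, platelets)        # clinical values in original units
r_z = pearson_r(gene_counts, zscore(platelets))  # clinical values Z-scored

print(r_raw, r_z)  # the two values agree up to floating-point error
```

So the differing units are not a problem for PCC itself; Z-scoring matters more for methods such as regularized regression, where the penalty depends on the scale of the variables.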
If you want to find a set of genes that could be considered related to a particular clinical feature, you could try a linear regression against that feature using regularization (Lasso or, better, ElasticNet). Here is a guide on how to implement it in sklearn.
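As a rough sketch of what that could look like, assuming scikit-learn is installed: the data below are simulated (60 samples, 200 genes, of which the first five are made to drive the clinical value), and the settings for l1_ratio, cv and max_iter are arbitrary choices for illustration, not recommendations. ElasticNetCV picks the penalty strength by cross-validation, and genes with a non-zero coefficient are the ones the model retains.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for a samples x genes expression matrix (hypothetical data).
rng = np.random.default_rng(42)
n_samples, n_genes = 60, 200
X = rng.normal(size=(n_samples, n_genes))

# Assumption for the simulation: only the first 5 genes affect the clinical value.
true_coef = np.zeros(n_genes)
true_coef[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Standardize genes so the penalty treats them on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Cross-validated ElasticNet; l1_ratio values closer to 1 behave more like Lasso.
model = ElasticNetCV(l1_ratio=[0.5, 0.9], cv=5, random_state=0, max_iter=10000)
model.fit(X_scaled, y)

# Indices of genes kept by the model (non-zero coefficients).
selected = np.flatnonzero(model.coef_)
print(len(selected), "genes retained out of", n_genes)
```

With real data you would fit one such model per clinical parameter and check how stable the retained gene set is across resamples before interpreting it.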
Thanks for your reply.
So, the aim is to get a set of genes highly associated with any clinical parameter.
But why is the first one, which used WGCNA, obsolete? Or are you referring to a better approach?
Indeed, it is a robust approach, used by thousands of researchers every year. However, its simplicity has led to misuse by many researchers, who completely miss the theoretical concept behind it and just replicate the vignette as is, without understanding what a scale-free network is and what that implies. I'd leave two relevant papers raising warnings on the power-law assumption of scale-free networks.
The first article used WGCNA, which is a completely different and obsolete technique. The correlation is against gene modules, not single genes.