Hi,
I have 150 samples of expression data for two different conditions (or experiments).
Can correlation (e.g. cor() in R) be enough to tell relations between random or pre-selected sets of genes?
Also, what extra work is required, or would be better to do, to validate it further (computationally only) or to take it further?
NOTE: let's assume we keep only correlations between genes with p-value < 0.005 and cor > 0.70 or < -0.70. If you think another correlation cutoff is better, please tell.
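The filter described in the note could be sketched roughly as follows. This is only an illustration on simulated data: the matrix `expr`, the gene names, and the Benjamini-Hochberg adjustment step are assumptions added here (the note itself only mentions a raw p-value cutoff), and Pearson correlation is assumed because it is the default of cor.test().

```r
# Simulated stand-in for the real data: 6 genes (rows) x 150 samples (columns).
set.seed(1)
expr <- matrix(rnorm(6 * 150), nrow = 6,
               dimnames = list(paste0("gene", 1:6), NULL))
# Build one correlated pair so the filter has something to find.
expr["gene2", ] <- expr["gene1", ] + rnorm(150, sd = 0.3)

# All unordered gene pairs (one row per pair).
gene_pairs <- t(combn(rownames(expr), 2))
res <- data.frame(gene_a = gene_pairs[, 1],
                  gene_b = gene_pairs[, 2],
                  r = NA_real_, p = NA_real_,
                  stringsAsFactors = FALSE)

# Pearson r and its p-value for each pair.
for (i in seq_len(nrow(res))) {
  ct <- cor.test(expr[res$gene_a[i], ], expr[res$gene_b[i], ])
  res$r[i] <- unname(ct$estimate)
  res$p[i] <- ct$p.value
}

# Assumption: adjust for the number of pairs tested before thresholding.
res$p_adj <- p.adjust(res$p, method = "BH")

# Thresholds taken from the note: |r| > 0.70 and p < 0.005.
hits <- res[abs(res$r) > 0.70 & res$p_adj < 0.005, ]
print(hits)
```

With many genes the number of pairs grows quadratically, so some form of multiple-testing adjustment is probably worth considering before trusting a fixed p-value cutoff.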
Edit:
Data type: expression
Species: Human
Samples per condition: A = 90 samples, B = 60 samples, Total = 150
Genes: a set of desired genes
Aim: find correlation between those genes.
I appreciate your comments.
Could you specify which type of data? Expression?
Yes, expression data.
How many samples / genes / conditions, and what is the species? And more important, what is the hypothesis you want to test? You should edit your question to add this information.
Now I added some more info.
OK, thanks. IMO you should try hierarchical clustering on the genes. In R:
if A is your expression matrix (columns = samples; rows = genes)
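The snippet that went with this comment is not shown in the thread, so the following is only a minimal sketch of what hierarchical clustering of genes could look like, not necessarily what was originally posted. The simulated matrix `A`, the `1 - correlation` distance, and average linkage are all assumptions chosen for illustration.

```r
# Hypothetical expression matrix A: rows = genes, columns = samples.
set.seed(42)
A <- matrix(rnorm(8 * 150), nrow = 8,
            dimnames = list(paste0("gene", 1:8), paste0("s", 1:150)))

# cor() correlates columns, so transpose to get gene-by-gene correlations.
gene_cor <- cor(t(A))

# Convert correlation to a dissimilarity: 0 when r = 1, 2 when r = -1.
gene_dist <- as.dist(1 - gene_cor)

# Cluster the genes and draw the dendrogram.
hc <- hclust(gene_dist, method = "average")
plot(hc, main = "Genes clustered by 1 - Pearson correlation")
```

Using `1 - abs(gene_cor)` instead would place strongly anti-correlated genes close together, which may or may not be what you want.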
You could also try WGCNA: https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/
Yes, WGCNA is a good way to find co-expressed genes and then plot them to see how they behave in the two conditions, or how strongly they are correlated. However, the correlation plot, or even the dendrogram, will only reveal correlation if the selected genes are a validated set, either known in the lab or taken from published literature. Randomly selecting genes for correlation might not work out, depending on the criteria you use to select them. It is better to use published data that provides a set of genes, or to try WGCNA as mentioned by NicoBxl.
Thanks. I used hclust() and it gave me an unexpected result. E.g. if the correlation between geneX and geneY is 0.60 with p-value < 0.005, in the hclust() output geneX and geneY would be farther from each other than other, less correlated genes. Any comment? I am going to use WGCNA too.
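One possible explanation, assuming hclust() was run on the default dist() output (Euclidean distance on the raw values) rather than on a correlation-based distance: Euclidean distance is driven by expression level and variance, so two well-correlated genes on different scales can end up far apart. The toy genes below are invented purely to illustrate that effect.

```r
set.seed(7)
x <- rnorm(150)
toy <- rbind(
  geneX = 10 + 3 * x,                           # high level, large variance
  geneY = 2 + 0.5 * x + rnorm(150, sd = 0.3),   # correlated with geneX, low level
  geneZ = 10 + 3 * rnorm(150)                   # same scale as geneX, unrelated
)

# Pearson correlations of geneX with the other two genes.
cor_xy <- cor(toy["geneX", ], toy["geneY", ])
cor_xz <- cor(toy["geneX", ], toy["geneZ", ])

# Two candidate distances between genes (rows).
euc  <- as.matrix(dist(toy))                     # Euclidean on raw values
cord <- as.matrix(as.dist(1 - cor(t(toy))))      # 1 - Pearson correlation

# Compare how far geneX is from geneY and geneZ under each distance.
print(euc["geneX", c("geneY", "geneZ")])
print(cord["geneX", c("geneY", "geneZ")])
```

If this is the cause, clustering on `as.dist(1 - cor(t(A)))`, or scaling each gene before calling dist(), should bring the dendrogram closer to the pairwise correlations.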
There is a difference between correlation and co-expression.
Then your suggestion of using WGCNA is for co-expression, which is a different purpose for me, as it tracks genes that go up and down together, while I want to do a correlation plot for another purpose. Thanks.
You can use classical unsupervised clustering of your genes of interest on the normalized expression values across samples from both conditions. This will help you find the correlation coefficient and show whether these genes also cluster your samples into 2 classes by condition. So try that. Then take a look at this link here to see how WGCNA is used and how it can be applied to your hypothesis. See also here, and you can take a look at how to select specific gene modules as well. Take a look at different clustering methods as well.
Thank you. Let's say 2 clusters emerge; how do I compute the correlation coefficient for them? Is it embedded inside the unsupervised method, or do you mean to do it separately?
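For what it is worth, hclust() and cutree() return only the tree and the cluster labels, not correlation coefficients, so with that approach the correlation would be computed as a separate step. A rough sketch on simulated data follows; the matrix `expr`, the artificial shift between conditions, and the choice of average linkage are assumptions made for the example.

```r
set.seed(3)
genes <- paste0("gene", 1:5)
condition <- rep(c("A", "B"), times = c(90, 60))   # 90 + 60 = 150 samples
expr <- matrix(rnorm(5 * 150), nrow = 5,
               dimnames = list(genes, paste0("s", 1:150)))
# Toy effect: shift condition B so the two groups separate clearly.
expr[, condition == "B"] <- expr[, condition == "B"] + 5

# Step 1: cluster the samples (columns) and cut the tree into 2 groups.
sample_hc <- hclust(dist(t(expr)), method = "average")
cluster <- cutree(sample_hc, k = 2)

# Check how the clusters line up with the known conditions.
tab <- table(cluster, condition)
print(tab)

# Step 2: compute gene-gene correlations separately within each cluster.
cor_by_cluster <- lapply(split(seq_len(ncol(expr)), cluster),
                         function(idx) cor(t(expr[, idx, drop = FALSE])))
print(round(cor_by_cluster[[1]], 2))
```

Since the condition labels are already known here, splitting the samples by `condition` instead of by `cluster` would give the per-condition correlation matrices directly.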