Hi All,
This is a follow-up to my previous post here
I intend to cluster tissues based on gene expression levels. I am trying to replicate figure 1 of this paper
Following the suggestions in my previous post, I converted the input data to the format below, which encodes the expression levels of more than 1000 genes as categorical values. For illustration, only two columns (Ensembl gene IDs) are shown.
               ENSG00000000003  ENSG00000000419  ....
adrenal gland  1.000000         4.000000         ...
appendix       2.000000         3.500000         ...
bone marrow    1.000000         3.000000         ...
breast         2.000000         3.000000         ...
bronchus       4.000000         3.000000         ...
caudate        1.000000         2.500000         ...
From the above data, I'd like to compute the Spearman's rho correlation matrix between tissues and convert it to a distance measure for clustering.
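For concreteness, here is a minimal sketch of the step I have in mind. It assumes the table is held in a numeric matrix with tissues as rows and genes as columns. The name `expr`, the third gene column `GENE_C` and its values are made up for illustration. Using 1 - rho as the distance and average linkage for the clustering are assumptions on my part, not necessarily what the paper did.

```r
# Toy matrix: tissues in rows, genes in columns.
# The first two columns echo the excerpt above; GENE_C is invented.
expr <- matrix(
  c(1, 4.0, 2,
    2, 3.5, 1,
    1, 3.0, 3,
    2, 3.0, 4,
    4, 3.0, 1,
    1, 2.5, 2),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("adrenal gland", "appendix", "bone marrow",
      "breast", "bronchus", "caudate"),
    c("ENSG00000000003", "ENSG00000000419", "GENE_C")
  )
)

# cor() correlates columns, so transpose to get a tissue x tissue matrix
rho <- cor(t(expr), method = "spearman")

# Turn the correlation into a dissimilarity (assumption: 1 - rho)
d <- as.dist(1 - rho)

# Hierarchical clustering (assumption: average linkage)
hc <- hclust(d, method = "average")
print(hc)
```

Calling `plot(hc)` afterwards should draw the dendrogram of tissues.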
Could someone explain how Spearman's rho is computed? (I looked at the built-in R functions suggested in my previous post, but I would like to understand the computation itself.)
Many thanks for the link. I read through the explanation. I'd like to ask for clarification on how to interpret the computed correlation matrix.
The following sample data is considered:
Using
corr <- cor(df,method = "spearman")
the following output is obtained:
From what I understand, the above matrix is constructed as df^T * df (where ^T denotes the transpose), which gives a tissue x tissue correlation matrix with variances on the diagonal and covariances in the off-diagonal entries. Could you please explain how this matrix should be interpreted?
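To test my reading, I tried the small check below on a made-up data frame (`toy` and its columns `g1`, `g2`, `g3` are hypothetical, not my real data). As far as I can tell, it suggests three things: `cor()` correlates the columns of its input with each other, the Spearman result matches a Pearson correlation computed on the column-wise ranks, and the cross-product form only reproduces it after the ranks are centered and scaled. The diagonal is then 1, not a variance.

```r
# Hypothetical toy data: 6 observations (rows) of 3 variables (columns)
toy <- data.frame(
  g1 = c(1, 2, 1, 2, 4, 1),
  g2 = c(4, 3.5, 3, 3, 3, 2.5),
  g3 = c(2, 1, 3, 4, 1, 2)
)

# Built-in Spearman correlation: one row/column per COLUMN of toy
rho <- cor(toy, method = "spearman")

# Same thing by hand: rank each column, then take the Pearson correlation
ranks <- sapply(toy, rank)              # ties get average ranks by default
rho_manual <- cor(ranks, method = "pearson")

# Cross-product form: standardize the ranks first, then Z^T Z / (n - 1)
z <- scale(ranks)
rho_cross <- crossprod(z) / (nrow(z) - 1)

print(all.equal(rho, rho_manual))
print(all.equal(rho, rho_cross, check.attributes = FALSE))
print(diag(rho))                        # all ones
```

If this is right, a tissue x tissue matrix would need tissues in the columns of the input, i.e. `cor(t(df), method = "spearman")` when tissues are stored as rows.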
Also, the above-mentioned link gives a formula for the case where all ranks are distinct. Could you please explain how to assign ranks when the values of a variable are not distinct (e.g., the data stored in df)?
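To make the question concrete, here is what I see with the two gene columns from the excerpt above. It assumes the usual convention that tied values share the average of the ranks they would otherwise occupy, which is the default `ties.method = "average"` of R's `rank()`. The shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)) is the one for distinct ranks. With ties it no longer seems to agree with the Pearson correlation of the ranks, which is what `cor(..., method = "spearman")` returns.

```r
# The two gene columns shown in the excerpt (six tissues)
x <- c(1, 2, 1, 2, 4, 1)        # ENSG00000000003
y <- c(4, 3.5, 3, 3, 3, 2.5)    # ENSG00000000419

# Average ("mid") ranks: tied values share the mean of their positions
rx <- rank(x)   # 2.0 4.5 2.0 4.5 6.0 2.0
ry <- rank(y)   # 6 5 3 3 3 1

# Spearman's rho as the Pearson correlation of the ranks (valid with ties)
rho_ties <- cor(rx, ry, method = "pearson")

# Shortcut formula, only exact when all ranks are distinct
n <- length(x)
rho_shortcut <- 1 - 6 * sum((rx - ry)^2) / (n * (n^2 - 1))

print(rho_ties)
print(cor(x, y, method = "spearman"))   # same value as rho_ties
print(rho_shortcut)                     # differs because of the ties
```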