Clustering with Jaccard Distance
4.6 years ago
bazok ▴ 40

Dear colleagues, I have a couple of samples that I am comparing for similarity: I want to know which samples are close to, or cluster with, which other samples. I took the top 500 expressed genes across the samples and computed a Jaccard distance matrix from them. I then drew a heatmap with a dendrogram in two ways: using the distance matrix directly, and using 1 - entries of symnum(cor(jaccard_distance)). The two approaches gave different clusterings, and I was wondering which one is correct: plotting the Jaccard distance matrix directly, or plotting 1 - entries of symnum(cor(jaccard_distance))?

Thanks
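
For readers following along, here is a minimal sketch of how a Jaccard distance matrix of this kind could be built. It assumes each sample is represented by the set of its top expressed genes, which may not be exactly how the matrix in the question was computed; `top_genes` and `jaccard_distance()` are hypothetical names and the gene sets are toy data.

```r
# Hypothetical toy data: each sample is the set of its top expressed genes
top_genes <- list(
  S1 = c("A", "B", "C", "D"),
  S2 = c("A", "B", "C", "E"),
  S3 = c("A", "F", "G", "H")
)

# Jaccard distance = 1 - |intersection| / |union|
jaccard_distance <- function(x, y) {
  1 - length(intersect(x, y)) / length(union(x, y))
}

# Fill a symmetric samples-by-samples distance matrix
n <- length(top_genes)
jd <- matrix(0, n, n, dimnames = list(names(top_genes), names(top_genes)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    jd[i, j] <- jaccard_distance(top_genes[[i]], top_genes[[j]])
  }
}
jd
```

S1 and S2 share three of five distinct genes, so their distance is 1 - 3/5 = 0.4.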

RNA-Seq R gene • 3.1k views
4.6 years ago
venu 7.1k

Why not take variably expressed genes across all samples and do clustering?

I am not sure of the exact reasons, but your approach doesn't seem suited to the results you are expecting. For example, in RNA-seq some genes, such as ribosomal or housekeeping genes, always have high read counts. So when you take the top expressed genes from all samples, those genes are always shared between samples and bias the Jaccard score.
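
A rough sketch of the variable-gene suggestion, under the assumption that a normalized (e.g. log-transformed) expression matrix with genes in rows and samples in columns is available. The matrix `expr` below is simulated, and the cut-off of 50 genes is arbitrary.

```r
set.seed(1)

# Hypothetical expression matrix: 200 genes (rows) x 6 samples (columns)
expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("S", 1:6)))

# Rank genes by their variance across samples and keep the most variable ones
gene_var <- apply(expr, 1, var)
top_var <- names(sort(gene_var, decreasing = TRUE))[1:50]

# dist() compares rows, so transpose to cluster samples rather than genes
hc <- hclust(dist(t(expr[top_var, ])), method = "average")
plot(hc)
```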


Thanks venu, I filtered out housekeeping genes. I didn't use variably expressed genes because I think they can produce false similarity: genes that are highly expressed in one sample but lowly expressed in another may create spurious proximity.

4.6 years ago

Also suspicious to me is cor(jaccard_distance). Are you working with the correlation between the Jaccard distances? I am also not sure what you mean by 1 - entries of symnum(). Don't do 1 - symnum(), since symnum() produces symbols, not numbers.
You may want to show the actual code so that what you're doing is clear.
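
To illustrate the symnum() point, a small sketch on simulated data (the matrix `m` is hypothetical): the correlation matrix is numeric, while the symnum() result is a character matrix of symbols, so any arithmetic should be done on the correlations themselves.

```r
set.seed(1)
m <- matrix(rnorm(40), nrow = 10)  # hypothetical numeric data, 4 columns

cU <- cor(m)        # numeric correlation matrix
sym <- symnum(cU)   # symbolic summary, intended for printing only

is.numeric(cU)      # TRUE
is.character(sym)   # TRUE: symbols, not numbers

# Convert the numeric correlations, not the symbols, into a dissimilarity
d <- as.dist(1 - cU)
```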


Hi Jean-Karim, the code is below:

library(RColorBrewer)

# `dist` is the Jaccard distance matrix (it shares its name with R's dist() function)
coul <- colorRampPalette(brewer.pal(8, "PiYG"))(256)
blues <- colorRampPalette(brewer.pal(9, "Blues"))(25)
row_labs <- gsub("^Paper_", "", rownames(dist))
col_labs <- gsub("^Paper_", "", colnames(dist))

# Approach 1: heatmap of the distance matrix itself
heatmap(as.matrix(dist), col = coul)
heatmap(as.matrix(dist), symm = TRUE, margins = c(15, 1), labRow = row_labs,
        labCol = col_labs, cexRow = 0.5, cexCol = 0.5, col = blues)

# Approach 2: heatmap of the correlations between columns of the distance matrix
symnum(cU <- cor(dist))  # prints a symbolic summary; cU keeps the numeric correlations
hU <- heatmap(cU, Rowv = FALSE, symm = TRUE, labRow = row_labs, labCol = col_labs,
              distfun = function(c) as.dist(1 - c), keep.dendro = TRUE,
              margins = c(20, 15), col = blues)

Please use code formatting to make your code easier to read (that's the small button with 0s and 1s).
Your two heatmaps differ because they use different matrices. Approach 1 is the heatmap of the distances (presumably 1 - Jaccard index), whereas approach 2 uses the matrix of correlations between the distances. The latter means that you treat the distances as features (i.e. a data point is represented by the vector of its distances to all the other points). While this is sometimes done, is there a particular motivation for it in this case?
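
A sketch of the difference on a hypothetical 4 x 4 distance matrix `jd` (toy values, not the poster's data). One related thing worth checking: heatmap() defaults to distfun = dist, so unless a distfun is supplied it builds its dendrograms from Euclidean distances between the rows of the matrix it is given rather than from the Jaccard distances themselves. The "average" linkage below is an arbitrary choice for illustration.

```r
# Hypothetical symmetric Jaccard distance matrix for four samples
jd <- matrix(c(0.0, 0.2, 0.7, 0.8,
               0.2, 0.0, 0.6, 0.9,
               0.7, 0.6, 0.0, 0.3,
               0.8, 0.9, 0.3, 0.0),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("S", 1:4), paste0("S", 1:4)))

# Clustering on the Jaccard distances themselves
hc_direct <- hclust(as.dist(jd), method = "average")

# Approach 2: each sample is described by its vector of distances,
# and samples are compared by correlating those vectors
hc_cor <- hclust(as.dist(1 - cor(jd)), method = "average")

# What heatmap(jd) does by default: Euclidean distance between rows of jd
hc_default <- hclust(dist(jd))

# Telling heatmap() to use the supplied distances as they are
heatmap(jd, symm = TRUE, distfun = as.dist,
        hclustfun = function(d) hclust(d, method = "average"))
```

Comparing plot(hc_direct), plot(hc_cor) and plot(hc_default) side by side shows which samples each variant groups together.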


Thanks Jean-Karim. When I looked at the raw data, the clustering from approach 1 wasn't what I was expecting, so I tried approach 2. Can you elaborate on approach 2 and when it is usually used?


Something like approach 2 is sometimes used for classification or dimensionality reduction, where each data point is represented as a vector of its distances to other points. For more on the topic, see "The dissimilarity space: Bridging structural and statistical pattern recognition" by Duin and Pękalska.
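
As a rough illustration of the idea (not the specific method from that paper), each row of a distance matrix can be treated as a feature vector and passed to a standard method such as PCA. The matrix `jd` below is hypothetical toy data.

```r
# Hypothetical symmetric distance matrix for four samples
jd <- matrix(c(0.0, 0.2, 0.7, 0.8,
               0.2, 0.0, 0.6, 0.9,
               0.7, 0.6, 0.0, 0.3,
               0.8, 0.9, 0.3, 0.0),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("S", 1:4), paste0("S", 1:4)))

# Each row (a sample's distances to all samples) is used as a feature vector
pca <- prcomp(jd, center = TRUE, scale. = FALSE)

# Coordinates of the samples on the first two principal components
coords <- pca$x[, 1:2]
plot(coords, type = "n")
text(coords, labels = rownames(coords))
```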
