Question

Clustering with Jaccard Distance

5.0 years ago

bazok ▴ 40

Dear Colleagues, I have a few samples that I am comparing for similarity: I want to know which samples are close to, or cluster with, which other samples. I took the top 500 genes expressed across them and computed a Jaccard distance matrix from these genes. I then drew a heatmap with a dendrogram in two ways: using the distance matrix directly, and using 1 - the entries of symnum(cor(jaccard_distance)). The two approaches gave different clusterings, and I am wondering which one is correct: plotting the Jaccard distance matrix directly, or plotting 1 - the entries of symnum(cor(jaccard_distance))?

Thanks

RNA-Seq R gene • 3.5k views

5.0 years ago by bazok ▴ 40

score 1 · Answer 1 · 2020-05-04

5.0 years ago

venu 7.1k

Why not take variably expressed genes across all samples and do clustering?

I am not sure of the exact reasons, but your approach doesn't seem suited to the results you are expecting. For example, in RNA-seq some genes, such as ribosomal or housekeeping genes, always have high read counts. So when you take the top expressed genes from all samples, those genes are always shared and bias the Jaccard score calculation.

5.0 years ago by venu 7.1k
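
For readers who want to try this suggestion, here is a minimal sketch that is not taken from the thread. It assumes a hypothetical genes x samples matrix called `expr` (simulated here), keeps the 500 most variable genes, and clusters the samples. The log transform, the cutoff of 500, the 1 - Pearson correlation distance and the average linkage are all illustrative assumptions, not choices that venu specified.

```r
set.seed(1)

# Hypothetical expression matrix: 1000 genes (rows) x 6 samples (columns)
expr <- matrix(rexp(6000, rate = 0.1), nrow = 1000,
               dimnames = list(paste0("gene", 1:1000), paste0("S", 1:6)))
log_expr <- log2(expr + 1)

# Keep the 500 genes with the highest variance across samples
gene_var <- apply(log_expr, 1, var)
top_var <- names(sort(gene_var, decreasing = TRUE))[1:500]

# Cluster samples on those genes, using 1 - Pearson correlation as the distance
sample_dist <- as.dist(1 - cor(log_expr[top_var, ]))
hc_var <- hclust(sample_dist, method = "average")

# plot(hc_var)  # uncomment to draw the dendrogram
```

With real data, `expr` would be replaced by a normalized count matrix. Whether variance-based selection suits a given experiment is a judgment call, as the reply below discusses.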

Thanks venu, I filtered out housekeeping genes. I didn't use variably expressed genes because I think they can produce false similarity: genes that are highly expressed in one sample but lowly expressed in another may create spurious proximity between the two.

5.0 years ago by bazok ▴ 40

Jean-Karim Heriche · Answer 2 · 2020-05-04

5.0 years ago

Jean-Karim Heriche 27k

Also suspicious to me is cor(jaccard_distance). Are you working with the correlation between the Jaccard distances? I am also not sure what you mean by 1 - entries of symnum(). Don't do 1 - symnum(), since symnum() produces symbols, not numbers.
You may want to show the actual code so that what you're doing is clear.

5.0 years ago by Jean-Karim Heriche 27k
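
The point about symnum() can be checked with a small sketch on simulated data (the matrix `m` is hypothetical, not the poster's data). symnum() returns a character-based symbolic summary intended for printing, so arithmetic such as 1 - symnum(...) fails, while 1 - cor(...) on the numeric matrix works.

```r
set.seed(1)

# Hypothetical data: 10 observations x 4 variables
m <- matrix(rnorm(40), nrow = 10)

cm <- cor(m)       # numeric correlation matrix
sym <- symnum(cm)  # symbolic summary of the correlations, meant for display

is.numeric(cm)     # the correlations are numbers
is.character(sym)  # the symnum() result holds symbols stored as text

# Arithmetic on the symbols raises an error, captured here with try()
res <- try(1 - sym, silent = TRUE)
inherits(res, "try-error")

# A correlation-based dissimilarity has to come from the numeric matrix
d_cor <- as.dist(1 - cm)
```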

Hi Jean-Karim, the code is below:

library(RColorBrewer)

# `dist` is the Jaccard distance matrix computed earlier, not the base R function
coul <- colorRampPalette(brewer.pal(8, "PiYG"))(256)
blues <- colorRampPalette(brewer.pal(9, "Blues"))(25)
row_labs <- gsub("^Paper_", "", rownames(dist))
col_labs <- gsub("^Paper_", "", colnames(dist))

# Approach 1: heatmap of the Jaccard distance matrix itself
heatmap(as.matrix(dist), col = coul)
heatmap(as.matrix(dist), symm = TRUE, margins = c(15, 1), labRow = row_labs,
        labCol = col_labs, cexRow = 0.5, cexCol = 0.5, col = blues)

# Approach 2: heatmap of the correlations between columns of the distance matrix
symnum(cU <- cor(dist))
hU <- heatmap(cU, Rowv = FALSE, symm = TRUE, labRow = row_labs, labCol = col_labs,
              distfun = function(c) as.dist(1 - c), keep.dendro = TRUE,
              margins = c(20, 15), col = blues)

updated 5.0 years ago by Jean-Karim Heriche 27k • written 5.0 years ago by bazok ▴ 40

Please use code formatting to make for easier reading (that's the small button with 0s and 1s).
Your two heatmaps end up being different because the matrices they use are different. Approach 1 is the heatmap of the distance (presumably 1-Jaccard index) whereas approach 2 uses the matrix of correlations between the distances. The latter means that you treat the distances as features (i.e. a data point is represented by the vector of its distances to all the other points). While this is sometimes done, is there a particular motivation for it in this case?

5.0 years ago by Jean-Karim Heriche 27k
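
To make the difference between the two matrices concrete, here is a minimal sketch that is not from the thread. It builds a Jaccard distance matrix from simulated top-500 gene sets (the names `gene_sets`, `jaccard_dist` and `jd` are hypothetical) and clusters it both ways with hclust().

```r
set.seed(1)

# Hypothetical top-500 gene sets for 6 samples, drawn from 2000 genes
genes <- paste0("gene", 1:2000)
gene_sets <- lapply(1:6, function(i) sample(genes, 500))
names(gene_sets) <- paste0("S", 1:6)

# Jaccard distance: 1 - (size of intersection) / (size of union)
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

n <- length(gene_sets)
jd <- matrix(0, n, n, dimnames = list(names(gene_sets), names(gene_sets)))
for (i in seq_len(n)) for (j in seq_len(n)) {
  jd[i, j] <- jaccard_dist(gene_sets[[i]], gene_sets[[j]])
}

# Approach 1: cluster on the Jaccard distances themselves
hc_direct <- hclust(as.dist(jd))

# Approach 2: cluster on 1 - correlation between columns of the distance matrix,
# i.e. each sample is described by its vector of distances to all samples
hc_cor <- hclust(as.dist(1 - cor(jd)))
```

One detail worth checking in the posted code: heatmap() has a default of distfun = dist, so unless distfun is supplied it derives its dendrogram from Euclidean distances between the rows of whatever matrix it receives. To order the heatmap by the Jaccard distances themselves, one option would be heatmap(jd, symm = TRUE, distfun = as.dist).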

Thanks Jean-Karim. When I looked at the raw data, the clustering from approach 1 wasn't what I was expecting, so I tried approach 2. Can you elaborate on approach 2 and when it is usually used?

5.0 years ago by bazok ▴ 40

Something like approach 2 is sometimes used to do classification or dimensionality reduction where data points are represented as a vector of distances. For more on the topic, see "The dissimilarity space: Bridging structural and statistical pattern recognition" by Duin and Pękalska.

5.0 years ago by Jean-Karim Heriche 27k
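
As a rough illustration of the idea, and not of any specific method from the cited paper, the sketch below treats each row of a dissimilarity matrix as the feature vector of a sample. It then applies PCA with prcomp() and, separately, clusters the samples by the Euclidean distance between their distance profiles. The data are simulated and the object names are hypothetical.

```r
set.seed(1)

# Hypothetical symmetric dissimilarity matrix for 6 samples
pts <- matrix(rnorm(12), ncol = 2, dimnames = list(paste0("S", 1:6), NULL))
d <- as.matrix(dist(pts))

# Dissimilarity-space view: row i of d is the feature vector of sample i
pca <- prcomp(d, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]  # first two principal components, one row per sample

# Samples can also be clustered by the distance between their distance profiles
hc_profile <- hclust(dist(d))
```

This differs from clustering on `d` directly: two samples end up close when they have similar distances to all the other samples, which is the same principle that the correlation matrix in approach 2 relies on.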